Assembly QA Jupyter

QAing Assemblies with Jupyter Notebook

This page documents the Jupyter notebook that streamlines the steps in the Assembly QA wiki (http://genomewiki.ucsc.edu/genecats/index.php/Assembly_QA_Part_1_DEV_Steps). Jupyter Notebook provides a live web environment that can run code, such as bash shell commands or Python. First, copy the latest version of the notebook to a safe location in your hive directory. From its current location:

cp /cluster/home/lrnassar/Random_Test_Files/Assembly_QA_Streamline.ipynb /hive/users/$user/jupyter/

Starting the notebook

Jupyter notebooks need to be started from the directory your notebook is located in. Following from the example above, you will first want to be in the directory:

cd /hive/users/$user/jupyter/

You will then want to start the notebook with the following command:

jupyter-notebook --ip 128.114.198.32 --no-browser --port 8085

You can specify any port between 8081 and 8090. This should start Jupyter and give you a URL to open in your web browser. At that URL you should see a list of the files in the directory, including the copied file, Assembly_QA_Streamline. Clicking on it will open the notebook.
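If the port you picked is already taken by someone else's notebook, Jupyter may not start on the port you expect. The following is a minimal sketch, not part of the notebook itself, that looks for an unused port in the 8081 to 8090 range. The function name first_free_port is hypothetical, and the check assumes it is run on the same machine where you will start Jupyter.

```python
import socket


def first_free_port(host="", ports=range(8081, 8091)):
    """Return the first port in `ports` that can be bound on `host`, or None.

    An empty host means "all interfaces", so a port already bound on any
    local address is treated as taken.
    """
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port in use (or otherwise not bindable): try the next one
            return port
    return None


if __name__ == "__main__":
    # Prints a port number to pass to --port, or None if all ten are taken.
    print(first_free_port())
```

The port is released again as soon as the check finishes, so another user could still claim it before you start Jupyter. Treat the result as a hint rather than a reservation.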

Specifying your variables

The script is organized into 5 separate sections.

  1. Auto Dev Steps - http://genomewiki.ucsc.edu/genecats/index.php/Assembly_QA_Part_1_DEV_Steps
  2. Manual Dev Steps - http://genomewiki.ucsc.edu/genecats/index.php/Assembly_QA_Part_1_DEV_Steps
  3. Track Steps - http://genomewiki.ucsc.edu/genecats/index.php/Assembly_QA_Part_2_Track_Steps
  4. Beta Steps - http://genomewiki.ucsc.edu/genecats/index.php/Assembly_QA_Part_3_BETA_Steps
  5. RR Steps - http://genomewiki.ucsc.edu/genecats/index.php/Assembly_QA_Part_4_RR_Steps

Each 'cell' is run independently. The first one, 'Auto Dev Steps', mostly performs automatic checks such as checking for minimum browser criteria, seeing if a BLAT server exists, etc.

Each cell currently takes 2 variables, with the cells for the later parts taking 3. These variables are located at the top of each cell. Currently they are:

  1. assembly
  2. prev_assembly
  3. RedmineNumber

You will have to fill these out at the top of each cell: each Jupyter cell works independently, so each cell requires its own variables. Enter the assembly in UCSC syntax, e.g. 'equCab3'; the cells contain examples as well. If this is a new assembly, 'prev_assembly' should be "None". When ready, run the cell by clicking the play button at the top of the notebook or by using the shortcut "control + enter".
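As an illustration, the top of a cell might look something like the sketch below. The variable names come from the list above, but the values, the helper functions, and the name pattern are assumptions made for this example. The regular expression is only a loose sanity check, not an official UCSC naming rule, and the notebook may validate its inputs differently or not at all.

```python
import re

# Example values only; replace them with your own at the top of each cell.
assembly = "equCab3"
prev_assembly = "None"    # the string "None" marks a brand-new assembly
RedmineNumber = "12345"   # hypothetical ticket number; assumed here to be the third variable


def looks_like_ucsc_db(name):
    """Loose check: a lowercase letter, more letters, then trailing digits (e.g. equCab3)."""
    return re.fullmatch(r"[a-z][A-Za-z]*\d+", name) is not None


def check_cell_variables(assembly, prev_assembly, redmine_number=None):
    """Return a list of human-readable problems; an empty list means nothing looks wrong."""
    problems = []
    if not looks_like_ucsc_db(assembly):
        problems.append("assembly does not look like a UCSC name: %r" % assembly)
    if prev_assembly != "None" and not looks_like_ucsc_db(prev_assembly):
        problems.append("prev_assembly should be a UCSC name or \"None\": %r" % prev_assembly)
    if redmine_number is not None and not str(redmine_number).isdigit():
        problems.append("RedmineNumber should be numeric: %r" % redmine_number)
    return problems


if __name__ == "__main__":
    print(check_cell_variables(assembly, prev_assembly, RedmineNumber))
```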

Saving your progress

The notebook automatically writes to a progress file whenever it completes a step. This way, if there is an error, or if the process needs to be paused for some time, it can be resumed from where it left off. Progress is not saved, however, if the cell is "stopped" or interrupted. To save progress instead, enter anything that is not "Done" at the notebook prompts. Any errors found will also save progress automatically.
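The general idea of saving progress after each step can be sketched as follows. This is not the notebook's actual code: the one-step-name-per-line file format and the function names load_completed and run_steps are assumptions made for illustration.

```python
import os


def load_completed(progress_path):
    """Return the set of step names already recorded in the progress file."""
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path) as fh:
        return {line.strip() for line in fh if line.strip()}


def run_steps(steps, progress_path):
    """Run (name, function) pairs in order, skipping steps already recorded.

    A step name is appended to the progress file only after its function
    returns, so a step that raises an error is retried on the next run.
    """
    done = load_completed(progress_path)
    for name, func in steps:
        if name in done:
            continue  # finished in an earlier run
        func()
        with open(progress_path, "a") as fh:
            fh.write(name + "\n")
```

With this layout, deleting the file re-runs every step, and removing a single line re-runs only that step.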

To re-run previous steps, delete or alter the progress file. This file (as well as other generated files, such as push lists) can be found in the directory:

/hive/users/$user/Assemblies/$assembly/

The files to be altered/deleted to re-run steps are as follows:

  1. Auto Dev Steps - '$assembly'_P1dev.txt
  2. Manual Dev Steps - '$assembly'_P1devMan.txt
  3. Track Steps - '$assembly'_P2tracks.txt
  4. Beta Steps - '$assembly'_P3Beta.txt
  5. RR Steps - '$assembly'_P4RR.txt
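The naming scheme above can be put into a small helper for locating or deleting a section's progress file. This is a sketch under stated assumptions: the function names progress_file and reset_section are hypothetical, and the directory layout is taken from the path shown earlier on this page.

```python
import os

# Suffixes from the list above, keyed by section name.
PROGRESS_SUFFIXES = {
    "Auto Dev Steps": "_P1dev.txt",
    "Manual Dev Steps": "_P1devMan.txt",
    "Track Steps": "_P2tracks.txt",
    "Beta Steps": "_P3Beta.txt",
    "RR Steps": "_P4RR.txt",
}


def progress_file(user, assembly, section, base="/hive/users"):
    """Build the path to the progress file for one section of the notebook."""
    filename = assembly + PROGRESS_SUFFIXES[section]
    return os.path.join(base, user, "Assemblies", assembly, filename)


def reset_section(user, assembly, section, base="/hive/users"):
    """Delete a section's progress file so its steps run again.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    path = progress_file(user, assembly, section, base=base)
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```

For example, progress_file("someuser", "equCab3", "Track Steps") gives /hive/users/someuser/Assemblies/equCab3/equCab3_P2tracks.txt, where someuser stands in for your own user name.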