Assembly QA Part 1 DEV Steps

From Genecats
Revision as of 23:42, 24 May 2017 by Cath

See also: Releasing an assembly (old steps)


Navigation Menu

Home Page
Assembly QA Part 1: DEV Steps
Assembly QA Part 2: Track Steps
Assembly QA Part 3: BETA Steps
Assembly QA Part 4: RR Steps

Setup: Create a Google spreadsheet checklist from a template

Steps:

  1. Open a new Google Spreadsheet.
  2. Go to the Google spreadsheet template: Assembly Release Checklist
  3. Copy the template: File > Make a copy
  4. Give your new spreadsheet a title, like "manPen1 Assembly Release Checklist".
  5. Move your spreadsheet to a good folder on your Google Drive so that you can easily find it later.
  6. All set! You can now use your checklist.

Tips:

  1. Note: This system works best when you create one spreadsheet per assembly.
  2. See the tab, "README" for more info.
  3. If a wiki section is an h4 heading (written as "====Wiki Section====", with exactly 4 equal signs on each side), then it will appear as a step in your checklist.
  4. To add a new step to your checklist, do not add it directly to your spreadsheet. Instead, add a new h4 section to the wiki. Just copy an existing h4 and edit it!
  5. To see your change, toggle the "#" character in your formula. The "#" is not actually needed in the formula; removing it or adding it back in will reload the page.

Setup: Make a directory in your hive

During this assembly release process, you will generate a lot of output and will need a place to put it. The "hive" directory is the recommended location because it has ample space.

 
mkdir /hive/users/userName/assemblies/assemblyName  

e.g.:  mkdir /hive/users/cath/assemblies/manPen1

Setup: Create an alias to your new dir

When you add an alias to your .bashrc file, you can type that alias on the command line as a shortcut for the associated command. Creating a "shortcut" alias gives you fast access to your hive directory for this assembly.

To do this, follow the steps below:

  1. In your terminal, connect to hgwdev and type "cd" (go to your home directory).
  2. Confirm the location of .bashrc. Type "ls -a" in your home directory to list all files, including hidden files whose names begin with ".". This way you can confirm the location of your .bashrc file.
  3. Open your .bashrc file for editing. If you're using the vi editor, you can type "vi .bashrc" to edit the file. Add an alias by typing in the line below, then save your changes.
alias hive='cd /hive/users/yourUserName/assemblies/yourAssembly'

e.g., alias hive='cd /hive/users/cath/assemblies/manPen1'
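
The two setup steps above (make the directory, add the alias) can also be scripted. The following is a minimal sketch, not an official tool: the function name setup_assembly_dir and its arguments are hypothetical, and it assumes a POSIX shell. It creates the directory and appends the alias only if no "hive" alias is already defined.

```shell
# Hypothetical helper: create the assembly work dir and register a "hive" alias.
# Arguments: base dir (e.g., /hive/users), user name, assembly name, path to .bashrc
setup_assembly_dir() (
    dir="$1/$2/assemblies/$3"
    bashrc="$4"
    mkdir -p "$dir" || exit 1
    # Append the alias only if no "hive" alias is defined yet.
    if ! grep -q "^alias hive=" "$bashrc" 2>/dev/null; then
        printf "alias hive='cd %s'\n" "$dir" >> "$bashrc"
    fi
)

# Example (not run here): setup_assembly_dir /hive/users cath manPen1 ~/.bashrc
```

If you already have a "hive" alias from an earlier assembly, the sketch leaves it alone; edit that line by hand to point to the new assembly.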


Redmine: Review "Redmine as PushQ" wiki

  • As of March 2017, the PushQ has been replaced with Redmine to track and release new assemblies.
  • Review the Redmine as the pushQ replacement wiki page.
  • Go to Redmine > GB > Issues > Filter: "Ready for QA"
  • Find the assembly you will QA/Release


Redmine: Set assignee as yourself

Redmine: Set the engineer as "watcher" if they are not the developer

Redmine: Set Status to Reviewing

Dev: Check minimal browser criteria

Does this assembly have the required tracks?

Visit this page to check that the assembly contains the required tracks to be considered a minimal browser on the RR.

To add explanation: GenBank mRNAs & ESTs (/cluster/data/genbank/data/organism.lst). How to view/interpret the file.

Dev: Check that BLAT Server is running

To check whether your organism has blat servers set up, run the following command:

hgwdev > copyHgcentral test $db blatServers dev beta

The developer has often already requested that the blat servers be set up for the new assembly. If not, or if entries for your assembly are missing from hgcentraltest.blatServers, make a note in the Redmine ticket and ask the assembly builder to 1) request the setup of the blat servers and 2) manually add the entries to hgcentraltest.blatServers. Make sure that this assembly is not hosted on the "blatx" BLAT server. That server is less stable and is therefore reserved for assemblies that are not destined for the RR. For more information about where the blat servers for different machines should be hosted, go to Updating blat servers.

You should see results like this (below), since this should only be set up on dev so far:


copyHgcentral test manPen1 blatServers dev beta

--------------------------------------------------
--------------------------------------------------
<<< blatServers >>>

hgcentraltest
-------------
manPen1	blat1b	17878	1	0
manPen1	blat1b	17879	0	1

hgcentralbeta
-------------


hgcentral
-------------


*** There are blatServers differences between dev and beta ***

*** The blatServers data on beta and rr is identical ***
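
If you would rather not read the output by eye, the check can be sketched as a small filter. This is only a sketch under the assumption that the output is laid out exactly like the sample above (a "hgcentraltest" header line followed by one row per blat server); the function name count_dev_blat_rows is hypothetical. It counts the rows for your db in the hgcentraltest section.

```shell
# Hypothetical helper: count blatServers rows for a db in the hgcentraltest
# section of copyHgcentral output (read from stdin).
# Assumes the section layout shown in the sample output.
count_dev_blat_rows() {
    awk -v db="$1" '
        $1 == "hgcentraltest" { in_dev = 1; next }   # start of the dev section
        $1 == "hgcentralbeta" || $1 == "hgcentral" { in_dev = 0 }
        in_dev && $1 == db { n++ }
        END { print n + 0 }
    '
}

# Example (not run here):
#   copyHgcentral test manPen1 blatServers dev beta | count_dev_blat_rows manPen1
```

For the sample output above, this would report 2 rows; a result of 0 means the entries are missing from hgcentraltest.blatServers.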

Dev: Do a BLAT search: DNA

From BLAT tool on dev:

  1. Go to your browser and copy some DNA sequence
  2. Go to BLAT: Home > Tools > Blat
  3. Paste in sequence
  4. Change query type to DNA and press submit
  5. Click on various blat results to make sure they look as expected
  6. Make a custom track of blat results and then look at them in the browser.

Dev: Do a BLAT search: protein

From BLAT tool on dev:

  1. Go to your browser and copy some protein (amino acid) sequence, e.g., the predicted protein sequence from a gene track's details page
  2. Go to BLAT: Home > Tools > Blat
  3. Paste in sequence
  4. Change query type to "protein" (amino acid) and press submit
  5. Click on various blat results to make sure they look as expected
  6. Make a custom track of blat results and then look at them in the browser.

Dev: PCR test

  • Go to dev's PCR Tool and test a PCR search for your assembly.

Dev: Compare chrom sizes

Skip this if your assembly is the first for a species (hosted by UCSC); there will be no chrom sizes to compare to!
For a new assembly version, compare the chrom sizes from the last assembly to this new assembly version. Chrom names were changed for some assemblies, so be aware of this when comparing. You are not checking annotations on the reference sequence; you are just checking the number of base pairs per chrom/contig and making sure that nothing has changed drastically (i.e., millions of base pairs different). Also look for general differences, such as chrom labels or the number of chroms/contigs.
To compare on the command line, output the chrom sizes into two sorted files and then compare the files.
There are two ways to compare chromosomes:
1. Navigate to http://hgwdev.cse.ucsc.edu/cgi-bin/hgGateway, find your assembly, and click on the "View Sequences" button. Bring up 2 windows side by side to view both the old and new assemblies and compare the chromosome sizes.

or

2. Open a terminal window and enter the following commands, where $oldDb is the previous assembly (e.g., "panTro4") and $newDb is the new assembly (e.g., "panTro5"):

hgwdev > hgsql -Ne "select chrom, size from chromInfo" $oldDb | sort > oldChromSizes
hgwdev > hgsql -Ne "select chrom, size from chromInfo" $newDb | sort > newChromSizes
hgwdev > sdiff -s oldChromSizes newChromSizes

You may want to preview each file first (e.g., "head oldChromSizes") and clean up the output in both the old and new chromSizes files; we are only concerned with the "chr# ####" labels.
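
For assemblies with many contigs, the sdiff output can be long. The sketch below shows one way to list only the drastic changes. It is illustrative: the function name big_size_changes and the default threshold of 1,000,000 bp are assumptions, and it only reports chroms whose names are identical in both files, so renamed chroms still need a manual look.

```shell
# Hypothetical helper: list chroms present in both files whose sizes differ
# by more than a threshold (default 1,000,000 bp, an assumed cutoff).
# Input files: two whitespace-separated columns, chrom name and size.
# Output columns: chrom, old size, new size.
big_size_changes() {
    awk -v max="${3:-1000000}" '
        NR == FNR { old[$1] = $2; next }      # first file: remember old sizes
        ($1 in old) {
            d = $2 - old[$1]; if (d < 0) d = -d
            if (d > max) print $1, old[$1], $2
        }
    ' "$1" "$2"
}

# Example (not run here): big_size_changes oldChromSizes newChromSizes
```

An optional third argument overrides the threshold, e.g., "big_size_changes oldChromSizes newChromSizes 50000".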

Dev: Gateway: Check the tree

On hgGateway, make sure your db appears in the tree.

  1. Type the first few letters of your assembly name ("m-a-n-P-e...") in the search field above the "Represented Species" tree, and the rest should populate.
  2. Your assembly should now be highlighted in the tree, and the tree should have moved so that it is centered on the position for your org.
  3. Hover over the name of your org within the tree; you should see the scientific name.
  4. Hover over the horizontal branch leading to your org; you should see the genus - family - order.
  5. Hover over the vertical branch leading to your org; you should see the superorder.
  6. Go to a different organism on hgGateway. Then scroll down the tree and find your organism. Click on the name of your organism in the tree, and you should go to the default assembly for your organism.

Dev: Gateway: Check default position

  1. Go to gateway page
  2. Reset all user settings (Home > Genome Browser > Reset All User Settings)
  3. Press "Go" on hgGateway
  4. You will be taken to the default position for your assembly.
  5. Make sure that the resulting area is scientifically interesting and aesthetically pleasing!
  6. You can edit the default location here: hgcentralbeta.dbDb.defaultPos

Dev: Gateway: Check default tracks

  • Each assembly has certain tracks that are hidden or visible by default.
  • You can edit the default tracks here: ~/kent/src/hg/makeDb/trackDb/$organism/$db/trackDb.ra (e.g., ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra).

Below is an example for turning on a default gene track that was off when the developer released the assembly to dev.

Resource: https://genome.ucsc.edu/goldenpath/help/trackDb/trackDbDoc.html

  • manPen1 has no gene tracks on by default.
  • I want to turn on the augustus track (on by default, pack visibility).
  • Looking at ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra, I see that there is no stanza for the augustus track, because it is inheriting the parent *.ra file's configuration, which makes it hidden.
  • I need to override the parent config in the manPen1 .ra file.

Steps:

  • go to dev, Genome Browser > Reset All User Settings
  • note which track you would like to turn on, see if you want it in 'pack' or 'full' etc.
  • vim ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra
  • Add something like this:
# Local declaration so that augustus genes is picked up.
track augustusGene override
visibility pack
  • cd ~/kent/src/hg/makeDb/trackDb
  • make alpha DBS=manPen1
  • refresh your dev hgTracks browser and see that your track is now on, with the visibility set in your override (pack, in this case).
  • if all looks good, add, commit, push your .ra file.
  • make beta DBS=manPen1
  • make public DBS=manPen1
  • Push request to admins: Make trackDb & friends for manPen1
  • Check the rr/euro/asia for your newly visible track.

Dev: Gateway: Organism image check

From your file.list from Redmine, make sure a scientificName.jpg image is listed, and check that the image exists on dev.

The image file that appears on the gateway page should reside in the kent source tree in:

~/kent/src/hg/htdocs/images/

and a copy should exist at:

hgwdev > /usr/local/apache/htdocs/images/

Dev: Gateway: Accession ID check


Assemblies/sequences, from various organizations, are submitted to the mother ship GenBank.
Those assemblies might be included in RefSeq if criteria are met.

The QA check should be to go out to NCBI and double check that the accessionID is correct.

RefSeq assemblies:

  • use accession IDs that begin with GCF_, e.g., GCF_000002315.4 (galGal5)
  • are delivered with chrMt (if it exists)
  • are delivered with NCBI gene predictions

GenBank assemblies:

  • use accession IDs that begin with GCA_, e.g., GCA_000001305.2
  • are delivered without a chrMt
  • do not have gene predictions

For the UCSC Genome Browser, it is preferable to use RefSeq assemblies (in part due to 'more data'). This is a "learn as we go" direction; historically GenBank was preferred.

Helpful article: Nature, 2012 A beginner's guide to eukaryotic genome annotation

Dev: Gateway: Check the NCBI assembly version link

Check that there is an NCBI link to the exact assembly.version http://www.ncbi.nlm.nih.gov/assembly/organism/

Dev: Verify make doc for all tracks

  • The make doc(s) for your assembly describe the browser build.
  • Location should be here: ~/kent/src/hg/makeDb/doc/$db/*

You can create a shortcut to this file from your hive dir, e.g.:

ln -s ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt
  • e.g., the make doc for manPen1 is in: ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt
  • In the table list from Redmine (e.g., redmine.manPen1.table.list), tracks are listed as:
hg38.chainManPen1
hg38.chainManPen1Link
hg38.netManPen1
manPen1.augustusGene
manPen1.author
manPen1.cds
manPen1.cell
manPen1.chainHg38

Let's create a file of just the track names (table names), which we can then use with grep to find them in the make doc.

You can make a clean list of tracks and save this file in your hive dir

From hive: cat redmine.manPen1.table.list | cut -d . -f2 > cleanTableList

chainManPen1
chainManPen1Link
netManPen1
augustusGene
author
cds
cell
chainHg38

Some tables might not be listed in the make doc if they are automatically generated, such as:

  • nestedRepeats
  • rmsk
  • simpleRepeat
  • tableDescriptions
  • genbank tables (xenoRefGene, etc)
  • supporting tables

The following tables don't need to be referenced in the make doc:

  • chromInfo
  • gap
  • gc5BaseBw
  • gold
  • grp
  • hgFindSpec
  • history
  • trackDb

Use your cleanTableList as the grep search string list, and look in the make doc:

grep -f cleanTableList ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt
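
Note that grep -f prints the make doc lines that match any name, so it does not directly tell you which tables were never mentioned, and short names such as "cds" can match inside longer words. The sketch below is one way to invert the check. The function name tables_missing_from_makedoc is hypothetical, and the whole-word matching (-w, as supported by GNU grep) is an assumption about how table names appear in the make doc.

```shell
# Hypothetical helper: print each table name from a list (arg 1) that never
# appears as a whole word in the make doc (arg 2).
tables_missing_from_makedoc() {
    while IFS= read -r table; do
        [ -n "$table" ] || continue                     # skip blank lines
        grep -qwF -- "$table" "$2" || printf '%s\n' "$table"
    done < "$1"
}

# Example (not run here):
#   tables_missing_from_makedoc cleanTableList ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt
```

Any name it prints is a candidate to compare against the two lists above (automatically generated tables, and tables that do not need to be referenced) before you ask the developer about it.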

Dev: Review downloads dir

View the contents of the downloads directory.

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/

LiftOver files and vs* directories are for the chain/net tracks; and the multiz*way, phastCons*way and phyloP*way directories are for conservation tracks.

Note that $db/database dir will be empty except for README.txt. This directory will contain a dump of the database on the RR, but will always remain empty on hgwdev.

Also note that these files:

est.fa.gz      mrna.fa.gz      refMrna.fa.gz      xenoMrna.fa.gz
est.fa.gz.md5  mrna.fa.gz.md5  refMrna.fa.gz.md5  xenoMrna.fa.gz.md5

will not be present on hgwdev. They are generated automatically and rsync'ed to hgdownload after an assembly is added to hgwbeta.dbs and "make etc-update-server" is run in the kent/src/hg/makeDb/genbank/ directory on hgwbeta.

Dev: Run dbCheck

Run the following command to check that all MySQL tables are in good repair:

hgwdev > sudo dbCheck.sh $db

Dev: Alignment files are to valid assemblies

In Redmine for your assembly, the engineer should have provided a path to redmine.$db.file.list E.g., /hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list

From hive, copy the file list to your assembly dir:

cp /hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list .

Take a look at the alignment "To" and "From" files, and make sure they are to valid assemblies on the RR.

LiftOver Files

  • A liftOver file is a chain file; it is a subset of all chains used in creating the net file.
  • hg38ToAnoCar2.over.chain.gz: These files are required for the liftOver utility.
  • The file names reflect the assembly conversion data contained within, in the format <db1>To<Db2>.over.chain.gz. For example, the file hg38ToAnoCar2.over.chain.gz contains the liftOver data needed to convert hg38 coordinates to the anoCar2 assembly.

Chain Files

  • Chain files contain all possible chains generated by lastz, before a subset of "best" chains is filtered into the liftOver file.
  • hg38.anoCar2.all.chain.gz: chained blastz alignments.
  • The chain format is described on the chain help page.

Net Files

  • hg38.anoCar2.net.gz: "net" file.
  • This file describes rearrangements between the species and the best Lizard match to any part of the Human genome. The net format is described on the net help page.

Axt Files

  • hg38.anoCar2.net.axt.gz: chained and netted alignments, i.e., the best chains in the Human genome, with gaps in the best chains filled in by next-best chains where possible.
  • The axt format is described on the axt help page.

Dev: liftOver exists: old-to-new, new-to-old

Skip this if your assembly is the first version for the organism. Otherwise, check that the previous assembly version has a liftOver file to the new version, and that the new version has a reciprocal file to the previous version:

/gbdb/[your Db]/liftOver/[your Db]To[the older version of your org].over.chain.gz
/gbdb/[the older version of your org]/liftOver/[the older version of your org]To[your Db].over.chain.gz

Dev: liftOver exists: other orgs

Your assembly will probably also have liftOver files to/from other major orgs, such as the newer human and mouse assemblies. Check that liftOver files exist in BOTH directories,

/gbdb/[your database]/liftOver/
/gbdb/[some other org database]/liftOver

For example, if your assembly is manPen1, see what liftOver files are there. These should also match what is in your filelist from Redmine.

 ★  /gbdb/manPen1/liftOver
ls
manPen1ToHg38.over.chain.gz  manPen1ToMm10.over.chain.gz

Note that there are liftOver files to TWO other orgs, human and mouse. If this assembly was not the first, it should also have liftOver files to the previous assembly version.

Let's go look at liftOver files for hg38:

 ★  /gbdb/hg38/liftOver
ls | grep ManPen
hg38ToManPen1.over.chain.gz

and then we'll check mm10:

 ★  /gbdb/mm10/liftOver
ls | grep ManPen
mm10ToManPen1.over.chain.gz
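
The reciprocal check above can be sketched as a function that builds both expected file names from two db names. This is an illustration only: the function names upper_first and check_liftover_pair are hypothetical, and the optional third argument (a gbdb root other than /gbdb) exists just to make the sketch easy to try elsewhere.

```shell
# Hypothetical helper: capitalize the first letter of a db name (manPen1 -> ManPen1).
upper_first() {
    printf '%s%s' "$(printf '%s' "$1" | cut -c1 | tr '[:lower:]' '[:upper:]')" \
                  "$(printf '%s' "$1" | cut -c2-)"
}

# Hypothetical helper: check that reciprocal liftOver files exist for two dbs.
# Arguments: db A, db B, optional gbdb root (default /gbdb).
# Returns non-zero if either file is missing.
check_liftover_pair() (
    root="${3:-/gbdb}"
    status=0
    for f in "$root/$1/liftOver/$1To$(upper_first "$2").over.chain.gz" \
             "$root/$2/liftOver/$2To$(upper_first "$1").over.chain.gz"; do
        if [ -e "$f" ]; then echo "OK      $f"; else echo "MISSING $f"; status=1; fi
    done
    exit "$status"
)

# Example (not run here): check_liftover_pair manPen1 hg38
```

Run it once per partner assembly (human, mouse, and the previous version of your org, if there is one).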

Dev: Check Tools: LiftOver

  • Go to dev's LiftOver Tool and test lifts to & from other assembly versions and other organisms that you have liftOver files for.

Dev: Make temp in home dir for md5sum checks

Action item: make a dir named "temp" in your home dir.

Review the following section, which is a guide to verify that the download files exist and are not corrupt in the following directory:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/

Before we begin, create a junk folder in your home directory:

mkdir ~/temp

We will use the md5sum program to generate MD5 hashes and verify the integrity of the files, since any change to a file will cause its MD5 hash to change. The MD5 hashes for each file were generated and stored in the md5sum.txt file.

An easy way to compare the MD5 hashes of each file is to do a diff. This can be easily automated by running the following commands.

The first command is to run md5sum for all files in your current directory (these will be listed in the steps below), sort them, and then redirect the output to a file.

md5sum * | sort > ~/temp/filename_1

The second command sorts the md5sum.txt file and redirects the output to a different file.

sort md5sum.txt > ~/temp/filename_2

The final command compares the two files created and displays the lines that differ between the two files.

diff ~/temp/filename_1 ~/temp/filename_2

Note that the md5sum.txt file obviously does not contain md5sum.txt and it was created before there was a README.txt file, so your diff will show md5sum.txt and README.txt in the results. If everything is ok, those should be the only results. In the vs* directories, the XXXX.net.axt.gz file will show up as axtNet/XXXX.net.axt.gz.
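
As an alternative sketch, GNU md5sum can verify a checksum list directly with its -c (check) option. The wrapper below is hypothetical (the name verify_md5_dir is not an existing script) and assumes GNU coreutils. Unlike the diff approach, it only checks files that are listed in md5sum.txt, so it will not tell you about files that are present but unlisted.

```shell
# Hypothetical helper: verify files in a directory against its md5sum.txt.
# Prints only failures (--quiet); returns non-zero if any listed file is
# missing or has a different hash. Assumes GNU md5sum.
verify_md5_dir() (
    cd "$1" || exit 1
    md5sum -c --quiet md5sum.txt
)

# Example (not run here):
#   verify_md5_dir /usr/local/apache/htdocs-hgdownload/goldenPath/manPen1/bigZips
```

Because the listed paths are resolved relative to the directory, entries such as axtNet/XXXX.net.axt.gz in the vs* directories are checked in place.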


Dev: bigZips: check md5sum

Change your directory to:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips

then run the following command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

Dev: bigZips: check README

/usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

Dev: bigZips: check for corruption

Change your working directory to:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips

Run the following in the directory:

for file in *; do zcat $file | head; zcat $file | tail; done

Scroll the output and make sure all the text is ASCII.
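
Scrolling through the output shows whether the contents look sensible, but it can miss a file that is truncated in the middle. A possible supplement, sketched below, is to test each compressed file with "gzip -t". The function name find_bad_gz is hypothetical, and restricting the loop to *.gz is an assumption (README.txt, md5sum.txt and .2bit files are not gzip files).

```shell
# Hypothetical helper: test every *.gz file in a directory with "gzip -t"
# and print the names of the files that fail the integrity test.
find_bad_gz() (
    cd "$1" || exit 1
    for file in *.gz; do
        [ -e "$file" ] || continue            # the directory has no .gz files
        gzip -t "$file" 2>/dev/null || printf '%s\n' "$file"
    done
)

# Example (not run here):
#   find_bad_gz /usr/local/apache/htdocs-hgdownload/goldenPath/manPen1/bigZips
```

No output means every .gz file passed the test. This complements the visual check and does not replace it.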


Dev: database: check README

/usr/local/apache/htdocs-hgdownload/goldenPath/$db/database
  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)


Dev: liftOver: check md5sum

Change your directory to:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver

then run this command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

Dev: liftOver: check README

/usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

Dev: liftOver: corruption

Run the following in each directory and check the output:

for file in *; do zcat $file | head; zcat $file | tail; done

Scroll the output and make sure all the text is ASCII.

Dev: vsXXX: check md5sum

This section is only relevant if your assembly has chain/net files to another organism.

Note: there may be multiple organisms that your assembly has alignment files to, check them all.

Change your directory to:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX

then run the following command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

REPEAT this process for subdirectories:

  • reciprocalBest
  • axtRBestNet

Dev: vsXXX: check README

/usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX
  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

Dev: vsXXX: corruption

Change your directory to:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX

Run the following in each directory and check the output:

for file in *; do zcat $file | head; zcat $file | tail; done

Scroll the output and make sure all the text is ASCII.

REPEAT this process for subdirectories:

  • reciprocalBest
  • axtRBestNet

Dev: for :queryDb:vsYourDb: check README

/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb

Check the README.txt files for any other organisms that your assembly has alignments (chain/net/liftover/etc) to:

  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

REPEAT this process for subdirectory:

  • reciprocalBest (this README also covers the subdir axtRBestNet).

Dev: for :queryDb:vsYourDb: check md5sum

Note: there may be multiple organisms that your assembly has alignment files to, check them all.

Change your directory to:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb

then run the following command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

REPEAT this process for subdirectories:

  • reciprocalBest
  • axtRBestNet

Dev: for :queryDb:vsYourDb: check corruption

/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb

Just do a zcat for otherOrg -> yourDb:

zcat $file | head

REPEAT this process for subdirectories:

  • reciprocalBest
  • axtRBestNet

Dev: md5sum check with "2bitCompare $db"

The .2bit files contain the new assembly sequence in a compact, binary format. The .2bit files are located at:

  • /scratch/$db (on the blat server)
  • /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips/ (on hgwdev)
  • /gbdb/$db/ (on hgwdev)
  • /gbdb/$db/ (on hgwbeta)

Check to make sure that the .2bit files are identical by running the 2bitCompare script. This matters particularly if the assembly has been part of a multiz track without a Browser: the file may already exist on beta and the RR and may not have been masked.

Below is some sample output:

hgwdev> 2bitCompare allMis1

  Checking md5sums.  This could take a few minutes.  Please be patient...

        blat4a md5sum: 134e740c05eedadc24de3a96775a25d6 /scratch/allMis1/allMis1.2bit
      download md5sum: 134e740c05eedadc24de3a96775a25d6 /usr/local/apache/htdocs-hgdownload/goldenPath/allMis1/bigZips/allMis1.2bit
   hgwdev gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
  hgwbeta gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit

        blat4a date,size: Jun 19 11:03 569794406
      download date,size: Jul 3 10:55 53
   hgwdev gbdb date,size: Jun 7 13:34 39
  hgwbeta gbdb date,size: Jun 7 13:33 569794406

The first part of the script output lists the md5sums of all four .2bit files. These should be identical.

The second part of the script output lists the timestamps and filesizes.

  • The download and hgwdev gbdb files should be symlinks, as evidenced by a small filesize.
  • The blat and hgwbeta gbdb files should be the actual files, as evidenced by a large filesize.
  • The two symlink filesizes will likely be different, but the filesize of the two actual files should be identical.

If the blat .2bit is not the same as the other .2bit files, ask the pushers to restart the assembly and to pull the newest .2bit file from /gbdb.

Dev: Permissions check: downloads dir

The developer may need to update permissions to the download directory to be at least 664.

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/

ls -lLR *
-rw-rw-r--
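
Reading a long recursive listing for permission problems is error-prone. As a sketch, find can list only the files that are missing any of the 664 bits. The function name list_under_664 is hypothetical; -L makes find follow symlinks, as the -L flag does for ls above.

```shell
# Hypothetical helper: list regular files under a directory that lack any of
# the permission bits in 664 (owner rw, group rw, other r).
list_under_664() {
    find -L "$1" -type f ! -perm -664
}

# Example (not run here):
#   list_under_664 /usr/local/apache/htdocs-hgdownload/goldenPath/manPen1
```

No output means every file is at least 664. Files with extra bits set (e.g., 775) are not reported.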


Dev: Review blastTabs wiki

Review the blastTabs wiki page to determine if your assembly needs a blastTab update.



🔵 Done with DEV steps? Go to Assembly QA Part 2: Track Steps