Assembly QA Part 1 DEV Steps
See also: Releasing an assembly (old steps)
Setup: Create a Google spreadsheet checklist from a template
Steps:
- Open a new Google Spreadsheet.
- Go to the Google spreadsheet template: Assembly Release Checklist
- Copy the template: File > Make a copy
- Give your new spreadsheet a title, like "manPen1 Assembly Release Checklist".
- Move your spreadsheet to a good folder on your Google Drive so that you can easily find it later.
- All set! You can now use your checklist.
Tips:
- Note: This system works best when you create one spreadsheet per assembly.
- See the "README" tab for more info.
- If a wiki section is h4 ("====Wiki Section===="), denoted by surrounding the section with exactly 4 equal signs, then the h4 section will appear as a step in your checklist.
- To add a new step to your checklist - do not add it directly to your spreadsheet. Instead add a new h4 section to the wiki. Just copy an existing h4 and edit it!
- To see your change, toggle the "#" character in your formula. The "#" is not really needed in the formula, and removing it or adding it back in will re-load the page.
Setup: Make a directory in your hive
During this assembly release process, you will be generating a lot of output, and you'll need a place to put everything. The use of the "hive" directory is encouraged as the best location because of ample space.
mkdir /hive/users/userName/assemblies/assemblyName
e.g.: mkdir /hive/users/cath/assemblies/manPen1
Setup: Create an alias to your new dir
When you add an alias to your .bashrc file, you can type that alias on the command line as a shortcut for the associated command. A "shortcut" alias gives you fast access to your hive directory for this assembly.
To do this, follow the steps below:
- In your terminal, connect to hgwdev and type "cd" (go to your home directory).
- Confirm the location of .bashrc. Type "ls -a" in your home directory to list all files, including hidden files whose names begin with ".", and confirm that your .bashrc file is there.
- Open your .bashrc file for editing. If you're using the vi editor, you can type "vi .bashrc" to edit the file. Add an alias by typing in the line below, then save your changes.
alias hive='cd /hive/users/yourUserName/assemblies/yourAssembly'
e.g., alias hive='cd /hive/users/cath/assemblies/manPen1'
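The alias step can also be scripted. The sketch below is only a convenience, not part of the official procedure: the function name add_hive_alias is hypothetical, and it assumes the /hive/users/userName/assemblies/assemblyName layout described above. It appends the alias line only if the rc file does not already define a hive alias.

```shell
# Hypothetical helper: append a "hive" alias to an rc file unless one exists.
# Usage: add_hive_alias <rcFile> <userName> <assemblyName>
add_hive_alias() {
    rc_file=$1
    user_name=$2
    assembly=$3
    alias_line="alias hive='cd /hive/users/$user_name/assemblies/$assembly'"

    # Skip if the file already defines a hive alias.
    if [ -f "$rc_file" ] && grep -q "^alias hive=" "$rc_file"; then
        echo "hive alias already present in $rc_file"
        return 0
    fi
    printf '%s\n' "$alias_line" >> "$rc_file"
    echo "added: $alias_line"
}
```

For example, run add_hive_alias ~/.bashrc cath manPen1, then "source ~/.bashrc" (or open a new terminal) so that the alias takes effect.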
Redmine: Review "Redmine as PushQ" wiki
- As of March 2017, the PushQ has been replaced with Redmine to track and release new assemblies.
- Review the Redmine as the pushQ replacement wiki page.
- Go to Redmine > GB > Issues > Filter: "Ready for QA"
- Find the assembly you will QA/Release
Redmine: Set assignee as yourself
Redmine: Set the engineer as "watcher" if they are not the developer
Redmine: Set Status to Reviewing
Dev: Check minimal browser criteria
Does this assembly have the required tracks?
Visit this page to check that the assembly contains the required tracks to be considered a minimal browser on the RR.
To add explanation: GenBank mRNAs & ESTs (/cluster/data/genbank/data/organism.lst), and how to view/interpret the file.
Dev: Check that BLAT Server is running
To check whether your organism has blat servers set up, run the following command:
hgwdev > copyHgcentral test $db blatServers dev beta
The developer has often already requested that the blat servers be set up for the new assembly. If not, or if entries for your assembly are missing from hgcentraltest.blatServers, please make a note in the Redmine ticket and ask the assembly builder to 1) request the setup of the blat servers and 2) manually add the entries to hgcentraltest.blatServers. Make sure that this assembly is not hosted on the "blatx" BLAT server. That server is less stable and is therefore reserved for assemblies that are not destined for the RR. For more information about where the blat servers for different machines should be hosted, go to Updating blat servers.
You should see results like the following, since this should only be set up on dev so far:
copyHgcentral test manPen1 blatServers dev beta
--------------------------------------------------
--------------------------------------------------
<<< blatServers >>>
hgcentraltest
-------------
manPen1 blat1b 17878 1 0
manPen1 blat1b 17879 0 1
hgcentralbeta
-------------
hgcentral
-------------
*** There are blatServers differences between dev and beta ***
*** The blatServers data on beta and rr is identical ***
Dev: Do a BLAT search: DNA
From BLAT tool on dev:
- Go to your browser and copy some DNA sequence
- Go to BLAT: Home > Tools > Blat
- Paste in sequence
- Change query type to DNA and press submit
- Click on various blat results to make sure they look as expected
- Make a custom track of blat results and then look at them in the browser.
Dev: Do a BLAT search: protein
From BLAT tool on dev:
- Go to your browser and copy some protein (amino acid) sequence
- Go to BLAT: Home > Tools > Blat
- Paste in sequence
- Change query type to "protein" (amino acid) and press submit
- Click on various blat results to make sure they look as expected
- Make a custom track of blat results and then look at them in the browser.
Dev: PCR test
- Go to dev's PCR Tool and test a PCR search for your assembly.
Dev: Compare chrom sizes
- Skip this if your assembly is the first for a species (hosted by UCSC); there will be no chrom sizes to compare to!
- For a new assembly version, compare the chrom sizes from the last assembly to this new assembly version. For some assemblies, chrom names were changed, so be aware of this when comparing. You are not checking annotations on the reference sequence; you are just checking the number of base pairs per chrom/contig and making sure that nothing has changed drastically (i.e., millions of base pairs different). Also look for general differences, such as chrom labels or the number of chroms/contigs.
- Output the chrom sizes for each assembly into its own file, sorted (see the commands below).
- Compare the sorted files.
- There are two ways to compare chromosomes:
- 1. Navigate to http://hgwdev.cse.ucsc.edu/cgi-bin/hgGateway, find your assembly, and click on the "View Sequences" button. Bring up two windows side by side to view both the old and new assemblies and compare the chromosome sizes.
or
- 2. Open a terminal window and run the following commands, where $oldDb and $newDb are the old and new assembly names (e.g., "panTro4" and "panTro5"):
hgwdev > hgsql -Ne "select chrom, size from chromInfo" $oldDb | sort > oldChromSizes
hgwdev > hgsql -Ne "select chrom, size from chromInfo" $newDb | sort > newChromSizes
hgwdev > sdiff -s oldChromSizes newChromSizes
To preview either file, run "head oldChromSizes" (or "head newChromSizes"); we are only concerned with the "chr# ####" (name and size) columns.
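When the sdiff output is long, it can help to list only the chroms whose size changed drastically. The following is a minimal sketch, assuming both files are two-column "chrom<TAB>size" files as produced by the hgsql commands; the function name compare_chrom_sizes and the default threshold of 1,000,000 bases are hypothetical choices for illustration.

```shell
# Hypothetical helper: report chroms whose size changed by more than a threshold,
# plus chroms that appear in only one of the two files.
# Usage: compare_chrom_sizes <oldFile> <newFile> [minDiff, default 1000000]
compare_chrom_sizes() {
    old_file=$1
    new_file=$2
    min_diff=${3:-1000000}
    awk -v min="$min_diff" '
        NR == FNR { old[$1] = $2; next }      # first file: remember old sizes
        {
            seen[$1] = 1
            if (!($1 in old)) { print $1 "\tnew only\t" $2; next }
            d = $2 - old[$1]; if (d < 0) d = -d
            if (d > min) print $1 "\tchanged\t" old[$1] "\t" $2
        }
        END { for (c in old) if (!(c in seen)) print c "\told only\t" old[c] }
    ' "$old_file" "$new_file" | sort
}
```

For example, compare_chrom_sizes oldChromSizes newChromSizes prints one line per chrom that needs a closer look. Note that this sketch assumes the old file is not empty and that chrom names did not change between versions; renamed chroms will show up as "old only" and "new only" pairs.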
Dev: Gateway: Check the tree
On hgGateway, make sure your db appears in the tree.
- Type the first few letters of your assembly name in the search field above the "Represented Species" tree, "m-a-n-P-e..." and the rest should populate.
- Your assembly should now be highlighted in the tree, and the tree position should have moved so that you are now centered on the tree position for your org.
- Hover over the name of your org within the tree, you should see the scientific name.
- Hover over the horizontal branch leading to your org, you should see the genus - family - order.
- Hover over the vertical branch leading to your org, you should see the superorder.
- Go to a different organism on hgGateway. Then scroll down the tree and find your organism. Click on the name of your organism in the tree and you should go to the default assembly for your organism.
Dev: Gateway: Check default position
- Go to gateway page
- Reset all user settings (Home > Genome Browser > Reset All User Settings)
- Press "Go" on hgGateway
- You will be taken to the default position for your assembly.
- Make sure that the resulting area is scientifically interesting and aesthetically pleasing!
- You can edit the default location here: hgcentralbeta.dbDb.defaultPos
Dev: Gateway: Check default tracks
- Each assembly has certain tracks that are hidden or visible by default.
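- You can edit the default tracks here: ~/kent/src/hg/makeDb/trackDb/$db/trackDb.ra (the file may sit under an organism subdirectory, e.g., ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra).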
Below is an example for turning on a default gene track that was off when the developer released the assembly to dev.
Resource: https://genome.ucsc.edu/goldenpath/help/trackDb/trackDbDoc.html
- manPen1 has no gene tracks on by default.
- I want to turn on the augustus track (on by default, pack visibility).
- Looking at ~/kent/src/hg/makeDb/trackDb/$db/trackDb.ra, I see that there is no stanza for the augustus track. It inherits the parent *.ra file's configuration, which makes it hidden.
- I need to override the parent config in the manPen1 .ra file.
Steps:
- go to dev, Genome Browser > Reset All User Settings
- note which track you would like to turn on, see if you want it in 'pack' or 'full' etc.
- vim ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra
- Add something like this:
# Local declaration so that augustus genes is picked up.
track augustusGene override
visibility pack
- cd ~/kent/src/hg/makeDb/trackDb
- make alpha DBS=manPen1
- refresh your dev hgTracks browser and see that your track is now on, using the visibility set in the override (pack, in this case).
- if all looks good, add, commit, push your .ra file.
- make beta DBS=manPen1
- make public DBS=manPen1
- Push request to admins: Make trackDb & friends for manPen1
- Check the rr/euro/asia for your newly visible track.
Dev: Gateway: Organism image check
In your file.list from Redmine, make sure a scientificName.jpg image is listed, and check that it exists on dev.
The image file that appears on the gateway page should reside in the kent source tree in:
~/kent/src/hg/htdocs/images/
and a copy should exist at:
hgwdev > /usr/local/apache/htdocs/images/
Dev: Gateway: Accession ID check
Assemblies/sequences, from various organizations, are submitted to the mother ship GenBank.
Those assemblies might be included in RefSeq if criteria are met.
The QA check should be to go out to NCBI and double check that the accessionID is correct.
- RefSeq assemblies:
- use accession ID: GCF_000002315.4 (e.g., galGal5)
- are delivered with chrMt (if it exists)
- are delivered with NCBI gene predictions
- GenBank assemblies:
- use accession ID: GCA_000001305.2
- are delivered without a chrMt
- do not have gene predictions
For the UCSC Genome Browser, it is preferable to use RefSeq assemblies (in part due to 'more data'). This is a "learn as we go" direction; historically GenBank was preferred.
Helpful article: Nature, 2012 A beginner's guide to eukaryotic genome annotation
Dev: Gateway: Check the NCBI assembly version link
Check that there is an NCBI link to the exact assembly.version: http://www.ncbi.nlm.nih.gov/assembly/organism/
Dev: Verify make doc for all tracks
- The make doc(s) for your assembly describe the browser build.
- Location should be here: ~/kent/src/hg/makeDb/doc/$db/*
You can create a shortcut to this file from your hive dir: e.g.,
ln -s ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt
- e.g., the make doc for manPen1 is in: ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt
- In the Redmine table list (e.g., redmine.manPen1.table.list), tables are listed as:
- hg38.chainManPen1
- hg38.chainManPen1Link
- hg38.netManPen1
- manPen1.augustusGene
- manPen1.author
- manPen1.cds
- manPen1.cell
- manPen1.chainHg38
Let's create a file of just the track names (table names), which we can then use with grep to find them in the make doc.
You can make a clean list of tracks and save this file in your hive dir
From hive: cat redmine.manPen1.table.list | cut -d . -f2 > cleanTableList
- chainManPen1
- chainManPen1Link
- netManPen1
- augustusGene
- author
- cds
- cell
- chainHg38
Some tables might not be listed in the make doc if they are automatically generated, such as:
- nestedRepeats
- rmsk
- simpleRepeat
- tableDescriptions
- genbank tables (xenoRefGene, etc)
- supporting tables
The following tables don't need to be referenced in the make doc:
- chromInfo
- gap
- gc5BaseBw
- gold
- grp
- hgFindSpec
- history
- trackDb
Use your cleanTableList as the grep search string list, and look in the make doc:
grep -f cleanTableList ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt
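The grep above prints the make doc lines that match any table name, but it does not say which tables are never mentioned. One possible way to list those is sketched below; the function name tables_missing_from_makedoc is hypothetical, and the whole-word match (-w) is an assumption about how table names appear in the make doc.

```shell
# Hypothetical helper: list tables from a table list that the make doc never mentions.
# Usage: tables_missing_from_makedoc <cleanTableList> <makeDoc>
tables_missing_from_makedoc() {
    table_list=$1
    make_doc=$2
    while IFS= read -r table; do
        [ -n "$table" ] || continue           # skip blank lines
        # -F: literal string, -w: whole word only, -q: no output, just exit status
        grep -qwF -- "$table" "$make_doc" || printf '%s\n' "$table"
    done < "$table_list"
}
```

For example: tables_missing_from_makedoc cleanTableList ~/kent/src/hg/makeDb/doc/manPen1/initialBuild.txt. Any table it prints should either be in one of the "not listed" groups above or be raised with the developer.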
Dev: Review downloads dir
View the contents of the downloads directory.
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
LiftOver files and vs* directories are for the chain/net tracks; and the multiz*way, phastCons*way and phyloP*way directories are for conservation tracks.
Note that $db/database dir will be empty except for README.txt. This directory will contain a dump of the database on the RR, but will always remain empty on hgwdev.
Also note that these files:
est.fa.gz mrna.fa.gz refMrna.fa.gz xenoMrna.fa.gz est.fa.gz.md5 mrna.fa.gz.md5 refMrna.fa.gz.md5 xenoMrna.fa.gz.md5
will not be present on hgwdev. They are generated automatically and rsync'ed to hgdownload after an assembly is added to hgwbeta.dbs and "make etc-update-server" is run in the kent/src/hg/makeDb/genbank/ directory on hgwbeta.
Dev: Run dbCheck
Run the following command to check that all MySQL tables are in good repair:
hgwdev > sudo dbCheck.sh $db
Dev: Alignment files are to valid assemblies
In Redmine for your assembly, the engineer should have provided a path to redmine.$db.file.list. E.g., /hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list
From hive, copy the file list to your assembly dir:
/hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list
Take a look at the alignment "To" and "From" files, and make sure they are to valid assemblies on the RR.
- LiftOver Files
- A liftOver file is a chain file, it is a subset of all chains used in creating the net file.
- hg38ToAnoCar2.over.chain.gz: These files are required for the liftOver utility.
- The file names reflect the assembly conversion data they contain, in the format <db1>To<Db2>.over.chain.gz. For example, a file named hg38ToAnoCar2.over.chain.gz contains the liftOver data needed to convert hg38 coordinates to the anoCar2 assembly.
- Chain Files
- Chain files contain all possible chains generated by lastz, before a subset of "best" chains are filtered into the liftOver file.
- hg38.anoCar2.all.chain.gz: chained blastz alignments.
- The chain format is described on the chain help page.
- Net Files
- hg38.anoCar2.net.gz: "net" file.
- This file describes rearrangements between the species and the best lizard match to any part of the human genome. The net format is described on the net help page.
- Axt Files
- hg38.anoCar2.net.axt.gz: chained and netted alignments.
- i.e. the best chains in the Human genome, with gaps in the best chains filled in by next-best chains where possible. The axt format is described in the axt help page.
Dev: liftOver exists: old-to-new, new-to-old
Skip this if your assembly is the first version for the organism. Otherwise, check that the previous assembly version has a liftOver file to the new version, and that the new version has a reciprocal file back to the previous version:
/gbdb/[your Db]/liftOver/[your Db]To[the older version of your org].over.chain.gz
/gbdb/[the older version of your org]/liftOver/[the older version of your org]To[your Db].over.chain.gz
Dev: liftOver exists: other orgs
Your assembly will probably also have liftOver files to/from other major orgs, such as the newer human and mouse assemblies. Check that liftOver files exist in BOTH directories,
/gbdb/[your database]/liftOver/
/gbdb/[some other org database]/liftOver
For example, if your assembly is manPen1, see what liftOver files are there. These should also match what is in your filelist from Redmine.
★ /gbdb/manPen1/liftOver ls
manPen1ToHg38.over.chain.gz
manPen1ToMm10.over.chain.gz
Note that there are liftOver files to TWO other orgs, human and mouse. If this assembly was not the first, it should also have liftOver files to the previous assembly version.
Let's go look at liftOver files for hg38:
★ /gbdb/hg38/liftOver ls | grep ManPen
hg38ToManPen1.over.chain.gz
and then we'll check mm10:
★ /gbdb/mm10/liftOver ls | grep ManPen
mm10ToManPen1.over.chain.gz
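The same both-directions check can be scripted. This is a sketch under the naming convention described earlier (<db1>To<Db2>.over.chain.gz, with the first letter of the second database capitalized); the function names lift_over_name and check_lift_over_pair are hypothetical, and the optional third argument exists only so that the check can be pointed at a directory other than /gbdb.

```shell
# Hypothetical helper: build the expected liftOver file name for a pair of dbs.
# lift_over_name manPen1 hg38  ->  manPen1ToHg38.over.chain.gz
lift_over_name() {
    from_db=$1
    to_db=$2
    first=$(printf '%s' "$to_db" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$to_db" | cut -c2-)
    printf '%sTo%s%s.over.chain.gz\n' "$from_db" "$first" "$rest"
}

# Hypothetical helper: check that liftOver files exist in BOTH directions.
# Usage: check_lift_over_pair <dbA> <dbB> [gbdbRoot, default /gbdb]
check_lift_over_pair() {
    db_a=$1
    db_b=$2
    gbdb_root=${3:-/gbdb}
    status=0
    for path in \
        "$gbdb_root/$db_a/liftOver/$(lift_over_name "$db_a" "$db_b")" \
        "$gbdb_root/$db_b/liftOver/$(lift_over_name "$db_b" "$db_a")"
    do
        if [ -e "$path" ]; then
            echo "OK      $path"
        else
            echo "MISSING $path"
            status=1
        fi
    done
    return $status
}
```

For example, check_lift_over_pair manPen1 hg38 and check_lift_over_pair manPen1 mm10 cover the two pairs shown above.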
Dev: Check Tools: LiftOver
- Go to dev's LiftOver Tool and test lifts to & from other assembly versions and other organisms that you have liftOver files for.
Dev: Make temp in home dir for md5sum checks
Action item: make a dir named "temp" in your home dir.
Review the following section, which is a guide to verify that the download files exist and are not corrupt in the following directory:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
Before we begin, create a junk folder in your home directory:
mkdir ~/temp
We will be using the md5sum program to generate MD5 hashes and verify the integrity of the files, since any change to a file will cause its MD5 hash to change. The MD5 hashes for each file were generated and stored in the md5sum.txt file.
An easy way to compare the MD5 hashes of each file is to do a diff. This can be easily automated by running the following commands.
The first command is to run md5sum for all files in your current directory (these will be listed in the steps below), sort them, and then redirect the output to a file.
md5sum * | sort > ~/temp/filename_1
The second command sorts the md5sum.txt file and redirects the output to a different file.
sort md5sum.txt > ~/temp/filename_2
The final command compares the two files created and displays the lines that differ between the two files.
diff ~/temp/filename_1 ~/temp/filename_2
Note that the md5sum.txt file obviously does not contain md5sum.txt and it was created before there was a README.txt file, so your diff will show md5sum.txt and README.txt in the results. If everything is ok, those should be the only results. In the vs* directories, the XXXX.net.axt.gz file will show up as axtNet/XXXX.net.axt.gz.
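The three commands can be wrapped in one function that also leaves md5sum.txt and README.txt out of the comparison, so that a clean directory produces no output at all. This is only a sketch: the function name check_md5_dir is hypothetical, it uses a throwaway directory from mktemp instead of ~/temp, and it assumes md5sum.txt was written by md5sum in its default output format.

```shell
# Hypothetical helper: compare computed MD5 hashes with md5sum.txt in one directory.
# Prints the diff (nothing when everything matches) and returns diff's exit status.
# Usage: check_md5_dir <directory>
check_md5_dir() (
    cd "$1" || exit 2
    tmp_dir=$(mktemp -d) || exit 2
    # Hash regular files only, skipping the two files md5sum.txt never lists.
    for f in *; do
        [ -f "$f" ] || continue
        case $f in md5sum.txt|README.txt) continue ;; esac
        md5sum "$f"
    done | sort > "$tmp_dir/computed"
    sort md5sum.txt > "$tmp_dir/expected"
    diff "$tmp_dir/computed" "$tmp_dir/expected"
    status=$?
    rm -rf "$tmp_dir"
    exit $status
)
```

For example: check_md5_dir /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips. Because the function body runs in a subshell, your current directory is unchanged afterwards. In the vs* directories the axtNet/XXXX.net.axt.gz entry will still appear in the diff, as noted above.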
Dev: bigZips: check md5sum
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
Dev: bigZips: check README
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: bigZips: check for corruption
Change your working directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
Run the following in the directory:
for file in *.gz; do zcat "$file" | head; zcat "$file" | tail; done
Scroll the output and make sure all the text is ASCII.
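Scrolling through the output by eye can be backed up with an automated pass. The sketch below is one possible approach, not the established procedure: the function name check_gz_dir is hypothetical, it tests each .gz file with "gzip -t", and it counts non-ASCII bytes across the whole decompressed file rather than only the head and tail.

```shell
# Hypothetical helper: test each .gz file in a directory for gzip integrity and
# for non-ASCII bytes in the decompressed content.
# Usage: check_gz_dir <directory>
check_gz_dir() {
    status=0
    for f in "$1"/*.gz; do
        [ -f "$f" ] || continue
        if ! gzip -t "$f" 2>/dev/null; then
            echo "CORRUPT   $f"
            status=1
            continue
        fi
        # Delete every ASCII byte (octal 000-177); anything left is non-ASCII.
        non_ascii=$(gzip -dc "$f" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' ')
        if [ "$non_ascii" -gt 0 ]; then
            echo "NON-ASCII $f ($non_ascii bytes)"
            status=1
        fi
    done
    return $status
}
```

For example: check_gz_dir /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips. No output means every .gz file decompressed cleanly and contained only ASCII bytes. Large files take a while, since each one is decompressed in full.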
Dev: database: check README
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/database
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: liftOver: check md5sum
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
then run this command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
Dev: liftOver: check README
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: liftOver: corruption
Run the following in each directory and check the output:
for file in *.gz; do zcat "$file" | head; zcat "$file" | tail; done
Scroll the output and make sure all the text is ASCII.
Dev: vsXXX: check md5sum
This section is only relevant if your assembly has chain/net files to another organism.
Note: there may be multiple organisms that your assembly has alignment files to, check them all.
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
REPEAT this process for subdirectories:
- reciprocalBest
- axtRBestNet
Dev: vsXXX: check README
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: vsXXX: corruption
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX
Run the following in each directory and check the output:
for file in *.gz; do zcat "$file" | head; zcat "$file" | tail; done
Scroll the output and make sure all the text is ASCII.
REPEAT this process for subdirectories:
- reciprocalBest
- axtRBestNet
Dev: for :queryDb:vsYourDb: check README
/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
Check the README.txt files for any other organisms that your assembly has alignments (chain/net/liftover/etc) to:
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
REPEAT this process for subdirectory:
- reciprocalBest (this README also covers the subdirectory axtRBestNet).
Dev: for :queryDb:vsYourDb: check md5sum
Note: there may be multiple organisms that your assembly has alignment files to, check them all.
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
REPEAT this process for subdirectories:
- reciprocalBest
- axtRBestNet
Dev: for :queryDb:vsYourDb: check corruption
/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
Just do a zcat for otherOrg -> yourDb:
zcat $file | head
REPEAT this process for subdirectories:
- reciprocalBest
- axtRBestNet
Dev: md5sum check with "2bitCompare $db"
The .2bit files contain the new assembly sequence in a compact, binary format. The .2bit files are located at:
- /scratch/$db (on the blat server)
- /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips/ (on hgwdev)
- /gbdb/$db/ (on hgwdev)
- /gbdb/$db/ (on hgwbeta)
Check that the .2bit files are identical by running the 2bitCompare script. This is particularly important if the assembly has been part of a multiz track without a Browser: the file may already exist on beta and the RR and may not have been masked.
Below is some sample output:
hgwdev> 2bitCompare allMis1
Checking md5sums. This could take a few minutes. Please be patient...
blat4a md5sum: 134e740c05eedadc24de3a96775a25d6 /scratch/allMis1/allMis1.2bit
download md5sum: 134e740c05eedadc24de3a96775a25d6 /usr/local/apache/htdocs-hgdownload/goldenPath/allMis1/bigZips/allMis1.2bit
hgwdev gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
hgwbeta gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
blat4a date,size: Jun 19 11:03 569794406
download date,size: Jul 3 10:55 53
hgwdev gbdb date,size: Jun 7 13:34 39
hgwbeta gbdb date,size: Jun 7 13:33 569794406
The first part of the script output lists the md5sums of all four .2bit files. These should be identical.
The second part of the script output lists the timestamps and filesizes.
- The download and hgwdev gbdb files should be symlinks, as evidenced by a small filesize.
- The blat and hgwbeta gbdb files should be the actual files, as evidenced by a large filesize.
- The two symlink filesizes will likely be different, but the filesize of the two actual files should be identical.
If the blat .2bit is not the same as the other .2bit files, ask the pushers to restart the assembly and to pull the newest .2bit file from /gbdb.
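Comparing four 32-character hashes by eye is error-prone, so the md5sum lines can be checked mechanically. The sketch below assumes the 2bitCompare output keeps the format of the sample above, with each hash as the second-to-last field on a line containing "md5sum:"; the function name md5sums_all_match is hypothetical.

```shell
# Hypothetical helper: read 2bitCompare output on stdin and count the distinct
# hashes on the "md5sum:" lines; exactly one distinct value means all match.
# Usage: 2bitCompare $db | md5sums_all_match
md5sums_all_match() {
    distinct=$(awk '/md5sum:/ { print $(NF - 1) }' | sort -u | wc -l | tr -d ' ')
    if [ "$distinct" -eq 1 ]; then
        echo "all .2bit md5sums match"
    else
        echo "found $distinct distinct md5sums; investigate"
        return 1
    fi
}
```

This only covers the first part of the output. The date and size lines still need a manual look to confirm which files are symlinks and which are the actual files.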
Dev: Permissions check: downloads dir
The developer may need to update permissions to the download directory to be at least 664.
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
ls -lLR *
Each file should show permissions of at least: -rw-rw-r--
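Instead of reading the full recursive listing, find can print only the files that fall short of 664. This is a minimal sketch; the function name find_under_664 is hypothetical, and -L is used so that symlinks are followed in the same way as "ls -L".

```shell
# Hypothetical helper: list files under a directory that are missing any of the
# 664 permission bits (rw-rw-r--). "-perm -664" matches files with all of those
# bits set, so the negation prints the files that need attention.
# Usage: find_under_664 <directory>
find_under_664() {
    find -L "$1" -type f ! -perm -664
}
```

For example: find_under_664 /usr/local/apache/htdocs-hgdownload/goldenPath/$db/. No output means every file is at least 664.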
Dev: Review blastTabs wiki
Review the blastTabs wiki page to determine if your assembly needs a blastTab update.
🔵 Done with DEV steps? Go to Assembly QA Part 2: Track Steps