Assembly QA Part 1 DEV Steps: Difference between revisions
Revision as of 22:44, 27 March 2017
This page is currently a draft in progress. For now, use Releasing an assembly instead.
Dev 1.0. Getting started in hive
During this assembly release process, you will be generating a lot of output, and you'll need a place to put everything. The use of the "hive" directory is encouraged as the best location because of ample space.
Dev 1.1. Make a directory in your hive
mkdir /hive/users/userName/assemblies/assemblyName
e.g.: mkdir /hive/users/cath/assemblies/manPen1
Dev 1.2. Optional: Create an alias to your new dir
When you add an alias to your .bashrc file, you can type that alias on the command line as a shortcut for the associated command. A "shortcut" alias gives you fast access to your hive directory for this assembly.
To do this, follow the steps below:
- In your terminal, connect to hgwdev and type "cd" (go to your home directory).
- Confirm the location of .bashrc. Type "ls -a" in your home directory to list all files, including hidden files whose names begin with ".". Your .bashrc file should appear in the listing.
- Open your .bashrc file for editing. If you're using the vi editor, you can type "vi .bashrc" to edit the file. Add an alias by typing in the line below, then save your changes.
alias hive='cd /hive/users/yourUserName/assemblies/yourAssembly'
e.g., alias hive='cd /hive/users/cath/assemblies/manPen1'
Dev 2.0: Getting started in Redmine
- As of March 2017, the PushQ has been replaced with Redmine to track and release new assemblies.
- Review the Redmine as the pushQ replacement wiki page.
- Go to Redmine > GB > Issues > Filter: "Ready for QA"
- Find the assembly you will QA/Release
Dev 2.1. Redmine: Set 'assignee' as yourself
Dev 2.2. Redmine: Set the engineer as 'watcher' (if not developer)
Dev 2.3. Redmine: Set Status to 'Reviewing'
Dev 3.0: Check "minimal browser" criteria
Under construction: Jairo
Does this assembly have the required tracks?
Visit this page to check that the assembly contains the required tracks to be considered a minimal browser on the RR.
To add explanation: GenBank mRNAs & ESTs (/cluster/data/genbank/data/organism.lst), and how to view/interpret the file.
Dev 4.0: BLAT Servers
To check whether your organism has BLAT servers set up, run the following command:
hgwdev > copyHgcentral test $db blatServers dev beta
The developer has often already requested that the blat servers be set up for the new assembly. If not, and/or if entries for your assembly are missing from hgcentraltest.blatServers, please make a note in the Redmine ticket and ask the assembly builder to 1) request the setup of the blat servers and to 2) manually add the entries to hgcentraltest.blatServers. Make sure that this assembly is not hosted on "blatx" BLAT server. That server is not as stable and therefore is for assemblies that are not destined for the RR. For more information about where the blat servers for different machines should be hosted, go to Updating blat servers.
Dev 5.0. Check validity of alignment files
This step ensures that all associated alignment (chain/net/liftOver) files are to other VALID assemblies on the RR.
- Search for Chain/Net on the file list noted on the Redmine page for your assembly and note the file names.
- Next, go to /gbdb and see what liftOver files exist for your assembly. For example,
cd /gbdb/manPen1/liftOver
or, from the location /gbdb on hgwdev:
ls -d */liftOver/*hg38*
If your assembly has chain/net/liftOver to/from an assembly that is *not* on the RR (and not in the pushQ as an upcoming new assembly), you do not need to QA them or push them to the RR. Drop the relevant row(s) from your sub-pushQ by going to the track entry, clicking lock and then clicking the delete button.
Below is a description of alignment files, for example, human (hg38) and lizard (anoCar2).
Hiram has suggested reading the following script to learn more about the file types:
less /hive/data/genomes/hg38/bed/lastzAnoCar2.2015-02-05/axtChain/netChains.csh
- LiftOver Files
  - A liftOver file is a chain file; it is a subset of all chains used in creating the net file.
  - hg38ToAnoCar2.over.chain.gz: These files are required for the liftOver utility.
    - The file names reflect the assembly conversion data contained within, in the format <db1>To<Db2>.over.chain.gz. For example, a file named hg38ToAnoCar2.over.chain.gz contains the liftOver data needed to convert hg38 coordinates to the anoCar2 assembly.
- Chain Files
  - Chain files contain all possible chains generated by lastz, before a subset of "best" chains is filtered into the liftOver file.
  - hg38.anoCar2.all.chain.gz: chained blastz alignments.
    - The chain format is described on the chain help page.
- Net Files
  - hg38.anoCar2.net.gz: "net" file.
    - This file describes rearrangements between the species and the best Lizard match to any part of the Human genome. The net format is described on the net help page.
- Axt Files
  - hg38.anoCar2.net.axt.gz: chained and netted alignments.
    - i.e. the best chains in the Human genome, with gaps in the best chains filled in by next-best chains where possible. The axt format is described in the axt help page.
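When checking that every liftOver file points to a valid assembly, it can help to pull the target database name out of each file name. The following is a minimal sketch, not an existing tool: liftOverTarget is a hypothetical helper name, and it assumes the source database name itself does not contain the string "To".

```shell
# Hypothetical helper (not part of any existing tool set): print the target
# database encoded in a liftOver file name of the form
# <db1>To<Db2>.over.chain.gz.
# Assumption: the source database name (<db1>) does not contain "To".
liftOverTarget() {
    name=$(basename "$1" .over.chain.gz)      # e.g. hg38ToAnoCar2
    target=${name#*To}                         # e.g. AnoCar2
    # Database names start with a lowercase letter, so undo the
    # capitalization that the file naming convention applies.
    first=$(printf '%s' "$target" | cut -c1 | tr '[:upper:]' '[:lower:]')
    rest=$(printf '%s' "$target" | cut -c2-)
    printf '%s%s\n' "$first" "$rest"           # e.g. anoCar2
}
```

For example, running liftOverTarget on each file in /gbdb/$db/liftOver would give a list of partner assemblies to look up on the RR.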
Dev 6.0: Compare Chrom Sizes
- Ignore this if assembly is the first for a species.
- For a new assembly version, compare the chrom sizes from the last assembly to this new assembly version. For some assemblies the chrom names were changed, so be aware of this when comparing. You are not checking annotations on the reference sequence; you are just checking the number of base pairs per chrom/contig and making sure that nothing has changed drastically (i.e., millions of base pairs different). Also look for general differences, such as chrom labels or the number of chroms/contigs.
- Output chrom sizes into two files, sort each file by using the command below
- Compare the sorted files
- There are two ways to compare chromosomes:
1. Navigate to http://hgwdev.cse.ucsc.edu/cgi-bin/hgGateway, find your assembly and click on the "View Sequences" button. Bring up two windows side by side to view both the old and new assemblies. Now you can compare the chromosome sizes.
or
2. Open up a terminal window and input the following commands, replacing $oldDb and $newDb with the assembly names (e.g., "panTro4" and "panTro5"):
hgwdev > hgsql -Ne "select chrom, size from chromInfo" $oldDb > oldChromSizes
hgwdev > hgsql -Ne "select chrom, size from chromInfo" $newDb > newChromSizes
hgwdev > sdiff -s oldChromSizes newChromSizes
You may want to use "cat oldChromSizes | head" to clean up the output in both the old and new chromSizes files; we are only concerned with the "chr# ####" labels.
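For assemblies with many contigs, the sdiff output can be long. The sketch below shows one way to reduce it to the cases this step cares about: large size changes and chroms present in only one assembly. compareChromSizes is a hypothetical helper name, the 1,000,000 bp default threshold is an assumption based on the "millions of base pairs" guideline above, and both input files are assumed to be non-empty two-column "chrom size" listings like the ones hgsql writes.

```shell
# Hypothetical helper: report chroms whose size changed by more than a
# threshold, plus chroms that appear in only one of the two files.
# Usage: compareChromSizes oldChromSizes newChromSizes [thresholdInBases]
# Assumption: both files are non-empty, two columns (chrom, size).
compareChromSizes() {
    old=$1
    new=$2
    threshold=${3:-1000000}    # assumed default: one million bases
    awk -v t="$threshold" '
        NR == FNR { oldSize[$1] = $2; next }   # first file: remember old sizes
        {
            seen[$1] = 1
            if (!($1 in oldSize)) { print "onlyInNew", $1, $2; next }
            d = $2 - oldSize[$1]
            if (d < 0) d = -d                  # absolute difference
            if (d > t) print "sizeChanged", $1, oldSize[$1], $2
        }
        END {
            for (c in oldSize)
                if (!(c in seen)) print "onlyInOld", c, oldSize[c]
        }
    ' "$old" "$new"
}
```

Because the comparison is keyed on chrom name, the two files do not need to be sorted first. If chrom names were changed between assemblies, every chrom will show up as onlyInOld or onlyInNew, which is itself a useful signal.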
Dev 7.0: Gateway Page Checks
Dev 8.0. Check default position
Dev 9.0. Alphabetical menu order
Dev 10.0. Organism image check
Dev 11.0. Accession ID check
Assemblies/sequences, from various organizations, are submitted to the mother ship GenBank.
Those assemblies might be included in RefSeq if criteria are met.
The QA check should be to go out to NCBI and double check that the accessionID is correct.
- RefSeq assemblies:
- use accession ID: GCF_000002315.4 (e.g., galGal5)
- are delivered with chrMt (if it exists)
- are delivered with NCBI gene predictions
- GenBank assemblies:
- use accession ID: GCA_000001305.2
- delivered without a chrMt.
- do not have gene predictions.
For the UCSC Genome Browser, it is preferable to use RefSeq assemblies (in part due to 'more data'). This is a "learn as we go" direction; historically GenBank was preferred.
Helpful article: Nature, 2012 A beginner's guide to eukaryotic genome annotation
Dev 12.0: md5sum Checks
We need to verify that the download files exist and are not corrupt in the following directory:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
Before we begin, create a junk folder in your home directory:
mkdir ~/temp
We will use a program called md5sum to generate MD5 hashes and verify the integrity of the files, since any change to a file causes its MD5 hash to change. The MD5 hashes for each file were generated and stored in the md5sum.txt file.
An easy way to compare the MD5 hashes of each file is to do a diff. This can be easily automated by running the following commands.
The first command is to run md5sum for all files in your current directory, sort them, and then redirect the output to a file.
md5sum * | sort > ~/temp/filename_1
The second command sorts the md5sum.txt file and redirects the output to a different file.
sort md5sum.txt > ~/temp/filename_2
The final command compares the two files created and displays the lines that differ between the two files.
diff ~/temp/filename_1 ~/temp/filename_2
Note that the md5sum.txt file obviously does not contain md5sum.txt and it was created before there was a README.txt file, so your diff will show md5sum.txt and README.txt in the results. If everything is ok, those should be the only results. In the vs* directories, the XXXX.net.axt.gz file will show up as axtNet/XXXX.net.axt.gz.
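Since md5sum.txt and README.txt are always expected in the diff, the comparison can be wrapped so that only unexpected differences are printed. This is a minimal sketch under that assumption; checkMd5 is a hypothetical helper name, and it uses a throwaway directory from mktemp rather than ~/temp.

```shell
# Hypothetical helper: run the md5sum comparison in the current directory
# and hide the two lines that are expected to differ (md5sum.txt and
# README.txt). Prints any remaining differences and returns non-zero.
checkMd5() {
    tmpDir=$(mktemp -d)
    md5sum * | sort > "$tmpDir/computed"      # hashes of the files as they are now
    sort md5sum.txt > "$tmpDir/recorded"      # hashes recorded by the developer
    unexpected=$(diff "$tmpDir/computed" "$tmpDir/recorded" \
        | grep '^[<>]' \
        | grep -v -e '[ *]md5sum\.txt$' -e '[ *]README\.txt$' || true)
    rm -rf "$tmpDir"
    if [ -n "$unexpected" ]; then
        printf '%s\n' "$unexpected"
        return 1
    fi
    echo "md5sum check OK"
}
```

Lines starting with "<" come from the freshly computed hashes and lines starting with ">" come from md5sum.txt. In the vs* directories the axtNet/ path difference described above will still be reported, as will the error md5sum prints for any subdirectory, so those need to be checked by eye.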
Dev 12.1. Check md5sum for /bigZips
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
Dev 12.2. Check md5sum for /liftOver
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
then run this command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
Dev 12.3. Check md5sum for /vsXXX
This section is only relevant if your assembly has chain/net files to another organism.
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
Dev 13.0: README.txt Checks
Check that we have READMEs at top level, and for bigZips, chromosomes, liftOvers and comparatives (multiz, phastCons, vsXXX). Verify that the information in the READMEs is correct. Note that some of the files mentioned in the README are generated by the Genbank process, so they won't be present yet.
The genbank process will build the upstream* files on hgdownload if they don't exist, or are more than seven days old, with whatever genePred table is defined in etc/genbank.conf (e.g. hg16.upstreamGeneTbl = refGene ). Make sure that the README for the upstream* files reflects the genePred table listed for this assembly.
Dev 13.1. Check README for /bigZips
Dev 13.2. Check README for /database
Dev 13.3. Check README for /liftOver
Dev 13.4. Check README for /vsXXX
Dev 14.0: corruption Checks
Check that the files in each of the bigZips, database, liftOver, and vsXXX directories don't contain weird (non-ASCII) characters.
Dev 14.1. Check /bigZips for corruption
Change your working directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
Run the following in the directory:
for file in *; do zcat "$file" | head; zcat "$file" | tail; done
Scroll the output and make sure all the text is ASCII. If there are any issues alert the developer.
Dev 14.2. database: corruption
Run the following in each directory and check the output:
for file in *; do zcat "$file" | head; zcat "$file" | tail; done
Scroll the output and make sure all the text is ASCII. If there are any issues alert the developer.
Dev 14.3. liftOver: corruption
Run the following in each directory and check the output:
for file in *; do zcat "$file" | head; zcat "$file" | tail; done
Scroll the output and make sure all the text is ASCII. If there are any issues alert the developer.
Dev 14.4. vsXXX: corruption
Run the following in each directory and check the output:
for file in *; do zcat "$file" | head; zcat "$file" | tail; done
Scroll the output and make sure all the text is ASCII. If there are any issues alert the developer.
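Scrolling through the output by eye is easy to get wrong in directories with many files. The sketch below is one possible way to automate the same check; it is not an established QA script. checkGzAscii is a hypothetical helper name, it only looks at files ending in .gz, and it inspects the same first and last ten lines that the loop above prints. It also runs gzip -t to confirm that each file is a complete gzip stream.

```shell
# Hypothetical helper: for every .gz file in the current directory, test
# the gzip stream and look for non-ASCII bytes in the first and last ten
# lines. Returns non-zero if any file is flagged.
checkGzAscii() {
    rc=0
    for file in *.gz; do
        [ -e "$file" ] || continue            # directory has no .gz files
        if ! gzip -t "$file" 2>/dev/null; then
            echo "BAD GZIP: $file"
            rc=1
            continue
        fi
        # gzip -dc is equivalent to zcat. In the C locale the bracket
        # expression matches any byte that is neither printable ASCII
        # nor whitespace.
        if { gzip -dc "$file" | head; gzip -dc "$file" | tail; } \
                | LC_ALL=C grep -q '[^ -~[:space:]]'; then
            echo "NON-ASCII: $file"
            rc=1
        fi
    done
    return $rc
}
```

Any file that is flagged should still be inspected with the zcat loop, and the developer alerted if the problem is real.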
Dev 15.0: liftOver files exist?
Dev 15.1. liftOver: new-to-old
Dev 15.2. liftOver: old-to-new
Dev 16.0: downloads permissions check
Dev 17.0: Do Track QA for all relevant tracks
- Follow the New Track Checklist on the wiki.
- NOTE TO SELF - Explain which tracks don't need checking, this is confusing for new employees.
🔵 Done with DEV steps? Go to Assembly QA Part 2: Track Steps