Assembly QA Part 2 Track Steps

From Genecats

These steps were revised in 2017, but you can also see the old steps: Releasing an assembly (old steps)




Tracks: The BIG QA Script: qaGbTracks

Recommended usage of qaGbTracks:

 cd /hive/data/genomes/$db/pushQ
 cat redmine.$db.table.list | grep "$db"> my_table_list.txt
 cat my_table_list.txt | grep -v "author\|cds\|cell\|description\|development\|gbCdnaInfo\|gbExtFile\|gbLoaded\|gbMiscDiff\|gbSeq\|gbWarn\|geneName\|imageClone\|keyword\|library\|mrnaClone\|organism\|productName\|refLink\|refSeqStatus\|refSeqSummary\|sex\|source\|tissue" > final_table_list.txt
 qaGbTracks -f final_table_list.txt $db qaGbTracks_output.all

The following command checks the summary table for errors:

 grep YES qaGbTracks_output.all.summary

You can ignore errors that involve checking label lengths (cytoBandIdeo). These do not cause functional errors.

qaGbTracks runs the following checks:

 checks for underscores in table names
 checks for the existence of table descriptions
 checks shortLabel and longLabel length
 positionalTblCheck
 checkTableCoords
 genePredCheck
 pslCheck
 featureBits
 (a version of) countPerChrom
  • Example with a single table (gc5Base): qaGbTracks papHam1 gc5Base output.gc5
  • Example with file list of tables: qaGbTracks -f table.List susScr11 qaGbTracks.all
  • Create a list like so: cat redmine.$db.table.list | grep "$db" | cut -f 2 -d . > my_table_list.txt
  • You can remove tables that were moved to the hgFixed database. A list of them can be found at the following help page
  • cat $db.table.list | grep -v "author\|cds\|cell\|description\|development\|gbCdnaInfo\|gbExtFile\|gbLoaded\|gbMiscDiff\|gbSeq\|gbWarn\|geneName\|imageClone\|keyword\|library\|mrnaClone\|organism\|productName\|refLink\|refSeqStatus\|refSeqSummary\|sex\|source\|tissue" > final_table_list.txt


  1. Run qaGbTracks on your assembly's tables
  2. Check your output in the following steps:

Tracks: qaGbTracks: Table name underscores

  • Check to make sure that none of the table names have underscores (_), except:
    • older tables that have underscores (all_est and all_mrna) -- these are OK
    • products of the trackDb make command (trackDb_qateam or hgFindSpec_dschmelt)
  • Otherwise, check that split tables (tables that start with chr) DO NOT have more than one underscore in their name.
  • Run the two queries below and verify that the only returned results follow the above rules:
mysql> show tables like "%\_%";
mysql> show tables like "%\_%\_%";
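
The underscore rules above can also be applied to a plain list of table names. The following is a minimal sketch, not part of qaGbTracks: flag_underscore_tables is a hypothetical helper, and the table names in the example are made up.

```shell
# Hypothetical helper: reads one table name per line on stdin and prints
# the names that break the underscore rules described above.
flag_underscore_tables() {
    awk '
        $0 == "all_est" || $0 == "all_mrna" { next }    # older tables, OK
        /^(trackDb|hgFindSpec)_/ { next }               # trackDb make products, OK
        /^chr/ { if (gsub(/_/, "_") > 1) print; next }  # split tables: at most one underscore
        /_/ { print }                                   # any other underscore is flagged
    '
}

# Example with made-up table names; on dev the list could instead come
# from the output of the "show tables" queries.
printf '%s\n' all_est refGene chr1_gap chr1_foo_bar my_table trackDb_qateam |
    flag_underscore_tables
```

Only chr1_foo_bar and my_table are printed here; the other names fall under one of the allowed cases.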

Tracks: qaGbTracks: Track label lengths

The track shortLabel must be 17 characters or less.
The shortLabel is visible in the main hgTracks display if you turn the track to dense.
If it is 17 characters or less, it won't be cut off in this part of the display.
The track longLabel must be 80 characters or less.
The longLabel is visible in hgTracks, and it can also be copied from the configuration page (hgTrackUi).
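
As an illustration of the two limits, here is a small sketch that flags labels that are too long. It assumes tab-separated input with the table name, shortLabel, and longLabel in columns 1 to 3 (for example, from a query against trackDb); check_label_lengths is a hypothetical helper, not an existing script.

```shell
# Hypothetical helper: reads "tableName<TAB>shortLabel<TAB>longLabel" lines
# and reports every label that exceeds its limit, with the actual length.
check_label_lengths() {
    awk -F '\t' '
        length($2) > 17 { printf "%s\tshortLabel\t%d\n", $1, length($2) }
        length($3) > 80 { printf "%s\tlongLabel\t%d\n", $1, length($3) }
    '
}

# Made-up rows: the first passes, the second has a 22-character shortLabel.
printf 'refGene\tRefSeq Genes\tRefSeq Genes\n' | check_label_lengths
printf 'demo\tThis label is too long\tA short long label\n' | check_label_lengths
```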

Tracks: qaGbTracks: positionalTblCheck

positionalTblCheck - Checks to see that the positional table is ordered by chrom and chromStart. A positional table being out of order can cause a huge slowdown in display speed. If the table passes there will be no output. If the table does not pass there will be an error message like:

table hg18.snp129 not sorted starting at row 4867: chr1:387005

Alert the track sponsor if there is an error. They may determine that the items are sorted well enough to be released. (As long as the items are almost all in order, performance will not be affected.) Also note that GenBank tables are not expected to be in order after updates have run. (Almost every table will need this check.)
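
To make the ordering requirement concrete, the sketch below reports the first out-of-order row in tab-separated chrom/chromStart rows. It is only an illustration under the assumption that "ordered" means each chrom forms one contiguous run with non-decreasing chromStart; it is not the implementation of positionalTblCheck, and first_unsorted_row is a hypothetical name.

```shell
# Hypothetical helper: reads "chrom<TAB>chromStart" rows in table order and
# prints the first row that is out of order (nothing if the rows are sorted).
first_unsorted_row() {
    awk -F '\t' '
        $1 != chrom {
            # A chrom that reappears after another chrom is out of order.
            if ($1 in seen) { printf "row %d: %s:%s\n", NR, $1, $2; exit }
            seen[$1] = 1; chrom = $1; last = $2 + 0; next
        }
        # Within one chrom, chromStart must not decrease.
        $2 + 0 < last { printf "row %d: %s:%s\n", NR, $1, $2; exit }
        { last = $2 + 0 }
    '
}

# Made-up rows: the third row starts before the second.
printf 'chr1\t100\nchr1\t200\nchr1\t50\nchr2\t10\n' | first_unsorted_row
```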

Tracks: qaGbTracks: checkTableCoords

checkTableCoords - Checks that the genomic coordinates in positional tables are legal (e.g., coordinates are not off the end of a chromosome). If the table passes there will be no output. (Almost every table will need this check, unless it has no chromosome positions in it.)
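
The kind of problem this check looks for can be sketched as follows. This is an illustration only, assuming a chrom sizes file and tab-separated chrom/chromStart/chromEnd rows; check_coords is a hypothetical helper and does not reproduce everything checkTableCoords does.

```shell
# Hypothetical helper.
#   $1: chrom sizes file  (chrom<TAB>size)
#   $2: positional rows   (chrom<TAB>chromStart<TAB>chromEnd)
check_coords() {
    awk -F '\t' '
        FILENAME == ARGV[1]   { size[$1] = $2; next }              # load chrom sizes
        !($1 in size)         { print "unknown chrom: " $0; next }
        $2 + 0 > $3 + 0       { print "start > end: " $0; next }
        $3 + 0 > size[$1] + 0 { print "past chrom end: " $0 }
    ' "$1" "$2"
}

# Made-up data: the second row runs off the end of chr1, the third is inverted.
tmp=$(mktemp -d)
printf 'chr1\t1000\n' > "$tmp/sizes"
printf 'chr1\t0\t500\nchr1\t900\t1200\nchr1\t300\t200\n' > "$tmp/rows"
check_coords "$tmp/sizes" "$tmp/rows"
rm -r "$tmp"
```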


Tracks: qaGbTracks: genePredCheck

genePredCheck - Checks that tables in the genePred format are valid. (Tables of this type are usually in the Genes and Gene Prediction group. Usually only the primary table will need this check. Some examples: knownGene, ensGene, refGene.)

Tracks: qaGbTracks: pslCheck

pslCheck - Checks that psl tables are valid. (PSL tables show up in nearly any track where alignments are used. Some examples: mrna, est, refSeqAli.)

Tracks: qaGbTracks: FeatureBits and Gaps

Run featureBits, or use the runBits.csh script to run featureBits. runBits.csh checks for coverage and overlap with gap, and also checks for unbridged gaps.

 runBits.csh $db $table

For example, to verify that the gold and gap tables together cover the entire genome, run:

 featureBits -countGaps -or $db gold gap

The result should be 100%.

Example:

featureBits -countGaps -or manPen1 gold gap
2204741241 bases of 2204741241 (100.000%) in intersection

You can run each table separately to see the coverage of each table on the genome:

featureBits -countGaps manPen1 gold
1999066070 bases of 2204741241 (90.671%) in intersection

featureBits -countGaps manPen1 gap
205675171 bases of 2204741241 (9.329%) in intersection
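
The numbers in the manPen1 example can be cross-checked with a little arithmetic. The sketch below parses the featureBits summary lines shown above (first field: bases covered, fourth field: genome size); bases_covered and genome_size are hypothetical helper names. If the separate gold and gap counts add up to the genome size, and the combined -or run reports 100%, then the two tables do not overlap.

```shell
# Field 1 of a featureBits summary line is the covered base count,
# field 4 is the genome size.
bases_covered() { awk '{ print $1 }'; }
genome_size()   { awk '{ print $4 }'; }

gold=$(echo '1999066070 bases of 2204741241 (90.671%) in intersection' | bases_covered)
gap=$(echo '205675171 bases of 2204741241 (9.329%) in intersection' | bases_covered)
total=$(echo '1999066070 bases of 2204741241 (90.671%) in intersection' | genome_size)

if [ $((gold + gap)) -eq "$total" ]; then
    echo "gold + gap = $total bases: consistent with no overlap"
else
    echo "gold + gap = $((gold + gap)) bases, genome = $total: check for overlap or missing coverage"
fi
```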


Alert the track sponsor if there are suspicious unbridged gaps. If the previous assembly also has this track, compare featureBits between the current assembly and the previous assembly; if there are big differences between the old and new tracks, alert the track sponsor.

If there is a similar track that this one can be compared to, use either "featureBits -enrichment" or getYield.csh to compare the tracks. Alert the track sponsor if the difference seems unreasonable.

getYield.csh can be used to see how well a new track captures the footprint of existing tracks, such as refGene or xenoRefGene, e.g.,

 getYield.csh hg19 ensGene refGene

output includes:

 yield       = 93.8% (intersection / refGene)
 enrichment  = 24.7x ((intersection / ensGene) / (refGene / genome))

This shows that 93.8% of refGene is present in ensGene and that, compared to the refGene footprint on the genome, ensGene is 24.7x enriched for refGene. (Enrichment is the amount of table1 that covers table2 vs. the amount of table1 that covers the genome. It's how much denser table1 is in table2 than it is genome-wide.)
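
The two formulas in the getYield.csh output can be worked through with small numbers. The base counts below are made up for illustration (they are not the hg19 values), and yield_and_enrichment is a hypothetical helper, not part of getYield.csh.

```shell
# Hypothetical helper. Arguments are base counts:
#   $1 intersection of table1 and table2
#   $2 table1 (e.g. ensGene)   $3 table2 (e.g. refGene)   $4 genome
yield_and_enrichment() {
    awk -v i="$1" -v t1="$2" -v t2="$3" -v g="$4" 'BEGIN {
        printf "yield = %.1f%%\n", 100 * i / t2               # intersection / table2
        printf "enrichment = %.1fx\n", (i / t1) / (t2 / g)    # (intersection / table1) / (table2 / genome)
    }'
}

# Made-up counts: 60 shared bases, table1 = 100, table2 = 80, genome = 1000.
yield_and_enrichment 60 100 80 1000
```

With these counts the yield is 60/80 = 75.0% and the enrichment is (60/100) / (80/1000) = 7.5x.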

"featureBits -enrichment" does almost the exact same thing as getYield.csh, except the coverage amount is the number of bases in the intersection of the two tables divided by the first table instead of the second table.

Tracks: qaGbTracks: Chromosome coverage: countPerChrom


Check the count of items on each chromosome. Bigger chromosomes (the biggest is usually chr1) should have more items. Look for chromosomes that have suspiciously few or no items on them. Note: this script must be run on dev:

 countPerChrom.csh $db $table

or as a histogram:

countPerChrom.csh $db $table histogram
Example: 
countPerChrom.csh hg19 refGene histogram

To view two tables' counts side by side, make your terminal window relatively wide and use the pr command:

pr -mt -w 120 <(countPerChrom.csh $db $table1 histogram) <(countPerChrom.csh $db $table2 histogram)

For example, to compare the counts of the crisprRanges and flyBaseGene tables on drosophila:

$ pr -mt -w 155 <(countPerChrom.csh dm6 flyBaseGene histogram) <(countPerChrom.csh dm6 crisprRanges histogram)
									      
  M									      	M
  X xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx				      	X xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  Y									      	Y
 2L xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx				       2L xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

If a "regular" chrom (not random or haplotype) has no data, a note like this should appear in place of the track drop-down on hgTracks:

 [No data-chr21]

(This is controlled by the "chromosomes" setting in trackDb.ra. The chromosomes line in trackDb.ra specifies the chromosomes that DO have data. This line is the same as the "restrictList" field in the trackDb table.)

OR use these alternative ways to see chromosome coverage:

  • Import the table into Genome Graphs.
  • Select the table in the Table Browser, hit the "describe table schema" button, and click the "values" link for the chrom field.
    • If the table is very large, there may not be an "info" column to display the "values" link.
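
If the table's rows are available as a tab-separated file, chromosomes with no items can also be listed offline. This is a minimal sketch under that assumption; chroms_without_items is a hypothetical helper, and the file contents in the example are made up.

```shell
# Hypothetical helper.
#   $1: rows of the table, chrom in column 1 (tab-separated)
#   $2: list of chromosome names, one per line
# Prints every listed chromosome that has no rows in the table.
chroms_without_items() {
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # chroms that have items
        !($1 in seen)                                # listed chroms with none
    ' "$1" "$2"
}

# Made-up data: chr21 is in the chrom list but has no items.
tmp=$(mktemp -d)
printf 'chr1\t100\t200\nchr2\t50\t80\n' > "$tmp/rows"
printf 'chr1\nchr2\nchr21\n' > "$tmp/chroms"
chroms_without_items "$tmp/rows" "$tmp/chroms"
rm -r "$tmp"
```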

Tracks: qaGbTracks: Table Descriptions: view table schema

  1. Go to your assembly on dev
  2. Click on your track short name in hgTracks (taking you to the track description page).
  3. Click the "view table schema" button
  4. Make sure there is a description column present with descriptions of the table fields.
  5. If a track has more than one table, be sure to check for table descriptions on each of them.
  6. The description column uses the 'tableDescriptions' table to display this information.

The descriptions in the 'tableDescriptions' table are built from autoSql (or .as files), and are added to that table via the script buildTableDescriptions.pl (kent/src/test/buildTableDescriptions.pl). This script first looks for .as files that match a database table by name or trackDb type. If it doesn't find a matching .as that way, it can match .as with table by comparing the set of fields in a .as versus the set of fields in the database table. This means that if one .as has the same fields as several database tables, the .as name doesn't matter -- the script will still match them up. The 'tableDescriptions' table is built nightly on hgwdev and hgwbeta, and must be pushed to the RR if it contains descriptions for a new type of table (things like psl will already be out there). A cron job sends a push request email to the admins to push the tableDescriptions tables to the RR once a week.

Having as few .as files as possible is a good thing because duplicated content is harder to maintain, and hg/lib is already enormous so it's good to reduce the number of new files. To help reduce the number of files added to hg/lib, try to do the following when QA-ing a new track:

  • Check tables that are similar to your track's tables to see if there are any where all of the fields are identical to yours
  • Check kent/src/hg/lib/ for any .as files that have the same name as yours, or may have been created for your track's tables
  • If there are new .as files for your track's tables that are unnecessary, notify the track sponsor that these should be removed

More information on the tableDescriptions table and autoSql can be found here:


Be sure to check the following: If there is another table like the one you are reviewing that has a different schema, be sure that the track type is also different (i.e., don't use the same track type name for tables with two different schemas). For example, see these two tables in the hg19 database:

mysql> select tableName, type from trackDb where tableName like "wgEncodeRegTfbsClusteredV%";
+----------------------------+--------------+
| tableName                  | type         |
+----------------------------+--------------+
| wgEncodeRegTfbsClusteredV3 | factorSource |
| wgEncodeRegTfbsClusteredV2 | factorSource |
+----------------------------+--------------+
2 rows in set (0.04 sec)

And note that although they both use type = factorSource, the schemas are different. This is not OK.

mysql> desc wgEncodeRegTfbsClusteredV3;
+------------+----------------------+------+-----+---------+-------+
| Field      | Type                 | Null | Key | Default | Extra |
+------------+----------------------+------+-----+---------+-------+
| bin        | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom      | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd   | int(10) unsigned     | NO   |     | NULL    |       |
| name       | varchar(255)         | NO   | MUL | NULL    |       |
| score      | int(10) unsigned     | NO   |     | NULL    |       |
| expCount   | int(10) unsigned     | NO   |     | NULL    |       |
| expNums    | longblob             | NO   |     | NULL    |       |
| expScores  | longblob             | NO   |     | NULL    |       |
+------------+----------------------+------+-----+---------+-------+
9 rows in set (0.00 sec)

mysql> desc wgEncodeRegTfbsClusteredV2;
+-------------+----------------------+------+-----+---------+-------+
| Field       | Type                 | Null | Key | Default | Extra |
+-------------+----------------------+------+-----+---------+-------+
| bin         | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom       | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart  | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd    | int(10) unsigned     | NO   |     | NULL    |       |
| name        | varchar(255)         | NO   | MUL | NULL    |       |
| score       | int(10) unsigned     | NO   |     | NULL    |       |
| strand      | char(1)              | NO   |     | NULL    |       |
| thickStart  | int(10) unsigned     | NO   |     | NULL    |       |
| thickEnd    | int(10) unsigned     | NO   |     | NULL    |       |
| reserved    | int(10) unsigned     | NO   |     | NULL    |       |
| blockCount  | int(10) unsigned     | NO   |     | NULL    |       |
| blockSizes  | longblob             | NO   |     | NULL    |       |
| chromStarts | longblob             | NO   |     | NULL    |       |
| expCount    | int(10) unsigned     | NO   |     | NULL    |       |
| expIds      | longblob             | NO   |     | NULL    |       |
| expScores   | longblob             | NO   |     | NULL    |       |
+-------------+----------------------+------+-----+---------+-------+
16 rows in set (0.00 sec)
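
A comparison like the one above can be sketched for many tables at once. The example assumes you have already collected, for each table, its trackDb type and a comma-separated field list (for instance from the queries shown above); types_with_mixed_schemas is a hypothetical helper.

```shell
# Hypothetical helper: reads "tableName<TAB>type<TAB>fieldList" lines and
# prints each type that is used by tables with different field lists.
types_with_mixed_schemas() {
    awk -F '\t' '
        !($2 in schema) { schema[$2] = $3; next }    # first schema seen for this type
        schema[$2] != $3 && !($2 in reported) { print $2; reported[$2] = 1 }
    '
}

# Field lists taken from the two desc outputs above.
printf '%s\t%s\t%s\n' \
    wgEncodeRegTfbsClusteredV3 factorSource \
    bin,chrom,chromStart,chromEnd,name,score,expCount,expNums,expScores \
    wgEncodeRegTfbsClusteredV2 factorSource \
    bin,chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd,reserved,blockCount,blockSizes,chromStarts,expCount,expIds,expScores |
    types_with_mixed_schemas
```

For this input the sketch prints factorSource, the type shared by the two differing schemas.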

Tracks: joinerCheck

Recommended Usage

 cd /cluster/home/$userName/kent/src/hg/makeDb/schema
   #Get in directory
 joinerCheck -keys -database=$db all.joiner 2>&1 | tee /hive/users/$userName/$db/all.joiner.output 
   #Run program, pipe output to file in Assembly directory and stdout

The document kent/src/hg/makeDb/schema/joiner.doc describes what all.joiner is. In a nutshell, all.joiner is a file that describes joinable fields in the UCSC Genome Databases. With each sandbox, alpha, or beta build (not make alpha DBS=), all.joiner uses its definitions of relationships, through its identifiers, to link tables. You can see the results in the Table Browser when you click the describe schema button and see the "Connected Tables and Joining Fields" section for tables that have all.joiner definitions. The tool you want to focus on is joinerCheck, which is used to check that the rules in all.joiner are being followed.

Look for your table names in src/hg/makeDb/schema/all.joiner and find the identifiers associated with those tables. Then, for each identifier, run joinerCheck like so:

 joinerCheck -keys -identifier=$identifier -database=$db all.joiner

Note that the above is run from all.joiner's home directory, kent/src/hg/makeDb/schema/; otherwise, the location of all.joiner needs to be spelled out on the command line. If there are errors or the table is not mentioned in all.joiner, notify the track sponsor. An entry in the tablesIgnored section of all.joiner is sufficient if there are no table relationships to check.

Be aware of this problem with joinerCheck. Basically, if you get output that looks like this:

 Checking keys on database hg18

that is NOT followed by lines like this:

 anoCar1.blastHg18KG.qName - hits 45332 of 45332 ok

then the rule didn't really run, and you need to remove the -database parameter.
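
Whether any rule actually ran can be checked mechanically on saved joinerCheck output. This is a sketch that relies only on the "hits N of M" line format shown above; joiner_rules_ran is a hypothetical helper.

```shell
# Hypothetical helper: reads joinerCheck output on stdin and succeeds
# (exit status 0) only if at least one "hits N of M" line is present.
joiner_rules_ran() {
    grep -Eq ' - hits [0-9]+ of [0-9]+ '
}

# Output with only the header line: no rule ran.
printf 'Checking keys on database hg18\n' |
    joiner_rules_ran || echo "no rule ran: retry without -database"

# Output that includes a result line: at least one rule ran.
printf 'Checking keys on database hg18\n anoCar1.blastHg18KG.qName - hits 45332 of 45332 ok\n' |
    joiner_rules_ran && echo "at least one rule ran"
```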

The other option is to run joinerCheck with the verbose option set to 2, for example:

 joinerCheck -keys -identifier=$identifier -database=$db -verbose=2 all.joiner

Unfortunately, the -verbose=2 option outputs all of the identifiers, so you must search through the list to find your identifier. After you find your identifier in the list, you should hopefully see the output as described above.

You can also run joinerCheck with the -times flag:

 joinerCheck -times -database=$db all.joiner

Look for any errors that are relevant to your track.

The runJoiner.csh script is a shortcut for all of the above, but beware that wildcards in tablesIgnored sections are not recognized, and if the problem above occurs, then you need to run joinerCheck directly.

Tracks: Comparison to similar track

Compare your track to a similar track. Look to see that the features in your track are more or less in the same position as similar tracks. Look for a lot of items that have a chromStart or chromEnd that is one position different from existing tracks. This could indicate an off-by-one error in the new track. Also use getYield.csh and/or featureBits to compare intersection of similar tracks (discussed above).

Tracks: Checking track search

If the track item names are relatively unique, check to see if search works by pasting an item name in the position/search box. For example, to check if search is enabled for the "Common SNPs" track, choose a SNP from the track, say "rs17885219", paste it in the box, and hit "jump". If search is enabled, you will either be taken directly to the position of the item, and it will be highlighted in the display, or you will get a list of all of the tracks that contain the item (and your track should be included). One way to find out if your track should be searchable is to use the assembly $db and table name $table in the following command:

hgsql -Ne "select searchName,searchTable,searchMethod,termRegex from hgFindSpec where searchTable like '%$table%';" $db;

If when searching you get an error, or if you get a list of tracks but your track isn't included in the list, search is not enabled.

If the track item names are relatively similar (e.g., the items in RNA Genes, and TFBS) we don't want to enable search, as it would return too many matches. If search isn't enabled and you think it should be, make that request to the track sponsor.

Finally, if search is enabled, make sure that all of the item names in the track can be found. To do this, look for your assembly and table in the output of the checkHgFindSpec cron job, which is run against all assemblies:

http://genecats.cse.ucsc.edu/qa/test-results/checkHgFindSpec/hgwdevOutput

This output file is created every day. To check:

curl -I http://genecats.cse.ucsc.edu/qa/test-results/checkHgFindSpec/hgwdevOutput
HTTP/1.1 200 OK
Date: Wed, 01 Nov 2017 22:32:28 GMT

If your assembly and table appear in this list, there is a problem with searches for some identifiers in the track. You'll see an error message like this one:

 Error: mm9.jaxQtl.name value "Idd21.1" doesn't match termRegex "^[a-z0-9-]+$" for search jaxQtl

This means that a search for the item named "Idd21.1" in the jaxQtl (MGI QTL) track on mm9 (mouse) doesn't work, because the name doesn't match the termRegex for that search. Alert the track sponsor.
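
The mismatch in this error message can be reproduced with grep. The sketch below assumes the termRegex can be treated as a POSIX extended regular expression; names_failing_regex is a hypothetical helper, and "idd21" is a made-up name that does match.

```shell
# Hypothetical helper: $1 is the termRegex; item names are read from stdin,
# one per line. Prints the names that do NOT match the regex.
names_failing_regex() {
    LC_ALL=C grep -Ev -- "$1"
}

# "Idd21.1" fails because of the uppercase "I" and the "."; "idd21" passes.
printf '%s\n' Idd21.1 idd21 | names_failing_regex '^[a-z0-9-]+$'
```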

You can also check the entire database yourself (run checkHgFindSpec by itself to see the usage statement).

checkHgFindSpec $db -checkTermRegex

Tracks: Track description


Read the track description and edit for clarity, spelling, and grammar. Be sure our conventions are followed.

Ensure references are in the correct format and in alphabetical order (by first author listed). Links to journal articles should go directly to the journal rather than PubMed if the journal article is open access (i.e., doesn't require a subscription). For articles that are not open access, links can go either to the journal or to PubMed, and they should go to the abstract, not the full text. To make life easier, use the getTrackReferences program: feed it one or more PMIDs and it outputs the HTML text. Example usage to get three references: getTrackReferences 24972169 26780180 26322839

Ensure quotes, ampersands, and less-than and greater-than signs are represented with their HTML entity names.
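
For reference, the entity names in question are &amp;amp;, &amp;lt;, &amp;gt;, and &amp;quot;. The sketch below shows the substitution on plain text; html_escape is a hypothetical helper, and it should only be applied to text that has not been escaped already, since it would otherwise escape the ampersands of existing entities a second time.

```shell
# Hypothetical helper: replaces &, <, > and " with their HTML entity names.
# The ampersand is handled first so the entities added later are left alone.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

echo 'Smith & Jones: p < 0.05 for "set A"' | html_escape
```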

Make sure that any email addresses given on the details page have been through Hiram's sanitizer (encodeEmail.pl). It turns the address into an encrypted HREF "mailto:" address that makes it harder for spammers to use.


Tracks: All details: 1 data point

Choose a representative data point for the track. Check all details for this data point, including all links. Make sure information from the table is displaying correctly (e.g., if a color is used in the table, make sure that color appears for the item.)

For links that are hard-coded to a particular server, there are some tricks that are used to make them testable on hgwdev and hgwbeta. See the Static_Page_JS_Protocol page for more details.

One method of obtaining an item's data to check is to click the "View table schema" link and use the Sample Rows as entry coordinates to navigate to and review the data.

Another option is to use a MySQL query to pull information from the table for an item that you want to learn more about. Here is an example query followed by a general template:

hgsql -Ne 'select * from gold where frag like "%AMGL01129756.1%";' oviAri3
hgsql -Ne 'select * from (tableName -displayed in schema) where (fieldName -find appropriate field from schema) like "%(yourSearchTerm -from tableName.fieldName)%";' $db

Tracks: Performance and Display

For tracks displayed by default, the full chromosome view (chr1) should display within 20 seconds. For tracks which are not displayed by default, the full chromosome view should display within a minute. You can use the 'measureTiming' cart variable to get accurate load times. To enable this timing option, add &measureTiming=0 to the end of your hgTracks URL. If you did it correctly, your URL should look like this: http://hgwdev.gi.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr21%3A33296501-33297001&measureTiming=0. The total time it took to load the page will be next to the label 'Overall total time:' somewhere on the page. You can deactivate this timing option by either doing a cartReset or adding 'measureTiming=' (with no value specified) to your hgTracks URL.

Set a track that is located directly below your track in the display to full display mode. Then make sure that, when your track is also in full display mode, the items in the track below it are still mapping correctly. Sometimes there can be an off-by-one error which is caused by your track. If this is happening, you should not push your track. This is only likely to be an issue with new track types.


Command-line alternative: first, create a list of the tables for $db:

cat redmine.$db.table.list | tr -t "." "\t" | cut -f 2 > $db.tables.txt

Run:

for table in $(cat $db.tables.txt); do echo $table >> filter.log ; /usr/local/apache/cgi-bin-USER/hgTracks  db=$db hideTracks=1 $table=full position=chr_position > /dev/null 2>> filter.log ; done; grep -v trackLog filter.log > perform.log 
 Example:

 for table in $(cat neoSch1.tables.txt); do echo $table >> filter.log ; /usr/local/apache/cgi-bin-gperez2/hgTracks  db=neoSch1 hideTracks=1 $table=full position=NW_018734349v1%3A1%2D84771923 > /dev/null 2>> filter.log ; done; grep -v trackLog filter.log > perform.log

Check CGI_TIME in the perform.log.

  • Note: not all tracks can be checked with this method; oligoMatch and cutters are examples.


Hit the Reverse button and ensure your track displays correctly.

In early 2015, we added a feature that displays the exon number on mouseover/hover. This is on by default for a few different track types. Be sure that if your track is displaying the exon number on mouseover, that it makes sense for your track to be displaying it. If it doesn't make sense for your track, then add exonNumbers off to the trackDb stanza for your track.

Tracks: Track Settings: hgTrackUi

Ensure that track settings work as expected.

You can adjust the track settings in one of two ways: by clicking on the track name or the mini-button to the left of the track (in hgTracks), or by right-clicking the track and selecting "Configure <track name>". When using the right-click menu to adjust the track's settings, you can immediately view your changes by using the "Apply" button.

Tracks: Check track data in Table Browser

Check the track with the Table Browser tools (e.g., custom track output, intersection, etc.): hgTables

Tracks: Check track data in Data Integrator

Check the track in the Data Integrator: hgIntegrator

Tracks: Check track data in VAI

If the data are relevant, check the track in the Variant Annotation Integrator: hgVai

Tracks: GenBank: Review GenBank QA steps

Follow the steps to see which GenBank tables your assembly has.

To show GenBank tables:

 hgsql -N -e 'SHOW TABLES' oreNil2 | egrep -f /cluster/data/genbank/etc/genbank.tbls

http://genomewiki.ucsc.edu/genecats/index.php/Genbank_updates

For example, xenLae2 has the following GenBank tables:

  1. all_est
  2. all_mrna
  3. estOrientInfo
  4. gbLoaded
  5. intronEst
  6. mrnaOrientInfo
  7. refFlat
  8. refGene
  9. refSeqAli
  10. xenoRefFlat
  11. xenoRefGene
  12. xenoRefSeqAli

Tracks: Check that your assembly is listed in hgwdev.dbs

The new assembly should already be listed in hgwdev.dbs in the DEV source tree at ~/kent/src/hg/makeDb/genbank/etc/hgwdev.dbs

cat ~/kent/src/hg/makeDb/genbank/etc/hgwdev.dbs | grep $db

If your assembly is missing from hgwdev.dbs , add it alphabetically. If a previous assembly is there, put the previous assembly in the lower list alphabetically.

Tracks: Review Chain Nets QA


Follow the steps to complete Chain/Nets QA for all chain/net tracks for your assembly. Remember that you may also have chain/net tracks to QA on the other organism's assembly that has been aligned to your assembly. If your assembly is "cow" and you have chain/nets to human, you may also need to go to the human assembly and QA the chain/net tracks on human (to cow).

Tracks: Chain Nets QA complete

Tracks: Check if tracks need other type specific QA

Certain track types, such as SNP or Conservation tracks, have additional QA steps that are specific to that track type.

Here's a list of track type-specific QA wiki pages:

If the track you're QA-ing is one of these track types, look over the wiki page and ensure you've carried out these additional QA steps.

Tracks: Type specific QA completed

Checklist task for recording progress->completion of type-specific QA needed on any tracks.

Tracks: Note about big vcf track type

All big*/vcf track types used to have a single-entry table that simply contained a pointer to a file in /gbdb/$db. These tracks no longer have such tables, but rather are now dependent upon a bigDataUrl in the trackDb.ra entry. The problem here is that trackDb is most likely going to be pushed to the RR long before the track is actually ready. This is fine as long as the gbdb files don't get pushed to hgnfs1. Since hgwbeta and the RR both use hgnfs1, however, once the gbdb files get pushed to hgnfs1, the track is immediately visible on both hgwbeta and the RR.

To prevent this from happening, all new big*/vcf tracks should have an alpha,beta release tag in the respective trackDb.ra entry. This way, when the gbdb file push to hgnfs1 goes through, the track will show up on hgwbeta, but not the RR. Once it is verified that the track looks good on hgwbeta, the release tag can be removed and a push of trackDb and friends can be requested.

More information about release tags can be found here:
http://genomewiki.ucsc.edu/index.php/ThreeStateTrackDb#Updated_trackDb_release_process


Done with TRACK steps? Go to Assembly QA Part 3: BETA Steps