All.joiner problem

From Genecats
Jump to navigationJump to search

As of 2017 this problem still exists, but there is a way to check your identifier quickly. Please see Fix section for a way around this.


Background

There is a problem with all.joiner rules and the -database parameter. This is taken from an email from Ann in 2007:

Bob and I just ran into something sneaky with running joinerCheck or runJoiner.csh. There are some rules that can be tricky and we are not exactly sure why. What we do know is the kind of output you can be on the lookout for to know that you're dealing with a "tricky" rule. The output sort of makes it look like the rule has run (with no problems), but it really hasn't run at all.

Example of rules that aren't really running from joinerCheck:

[hgwdev:~/joiner> joinerCheck -keys -database=fr2 -identifier=blastHg18KGPep04Id all.joiner
Checking keys on database fr2

from runJoiner.csh:

[hgwdev:~/joiner> runJoiner.csh fr2 blastHg18KG
found identifiers:

blastHg18KGRef04Id
blastHg18KGPep04Id

Checking keys on database fr2
Checking keys on database fr2


If the rule had really run there would be some output saying something like X hits out of Y ok.

The way to force these rules to run is to leave off the "database" parameter (you can only do this when using the joinerCheck (runJoiner.csh can't handle that)). This will run the rule for all databases that are listed -- just pick yours out of the output.

[hgwdev:~/joiner> joinerCheck -keys -identifier=blastHg18KGPep04Id all.joiner Checking keys on database (null)
 anoCar1.blastHg18KG.qName - hits 45332 of 45332 ok
 equCab1.blastHg18KG.qName - hits 61628 of 61628 ok
 ornAna1.blastHg18KG.qName - hits 35761 of 35761 ok
 oryLat1.blastHg18KG.qName - hits 38111 of 38111 ok
 bosTau2.blastHg18KG.qName - hits 56832 of 56832 ok
 strPur1.blastHg18KG.qName - hits 30907 of 30907 ok
 mm8.blastHg18KG.qName - hits 70754 of 70754 ok
 fr2.blastHg18KG.qName - hits 40510 of 40510 ok 
 dm2.blastHg18KG.qName - hits 22239 of 22239 ok
 felCat3.blastHg18KG.qName - hits 37715 of 37715 ok
 canFam2.blastHg18KG.qName - hits 62741 of 62741 ok
 danRer4.blastHg18KG.qName - hits 44692 of 44692 ok
 xenTro2.blastHg18KG.qName - hits 40291 of 40291 ok
 galGal3.blastHg18KG.qName - hits 35851 of 35851 ok
 monDom4.blastHg18KG.qName - hits 63264 of 63264 ok
 hg18.blastKGRef04.acc - hits 36727 of 36727 ok
 hg18.knownGene.name - hits 0 of 56722
Error: 56722 of 56722 elements of hg18.knownGene.name are not in key blastKGPep04.name line 1995 of all.joiner
Example miss: UC001AAA.1

runJoiner.csh will also do this for all the identifiers:

[hgwdev]$ runJoiner.csh all blastHg18KG |& grep fr2
fr2.blastHg18KG.qName - hits 40510 of 40510 ok
fr2.blastHg18KG.qName - hits 40510 of 40510 ok

This happens due to the way joinerCheck -keys works and is not a bug. When you have a rule like the following:

identifier blastHg18KGPep04Id
"hg18 peptides 04"
     hg18.blastKGPep04.name
     hg18.blastKGRef04.acc full
     hg18.knownGeneOld2.name  minCheck=0.95
     monDom4.blastHg18KG.qName
     galGal3.blastHg18KG.qName
     xenTro2.blastHg18KG.qName
     danRer4.blastHg18KG.qName

When joinerCheck -keys is run, it tries to check all of the rules are met. However, if the -database=abcDef1 parameter is supplied, joinerCheck only checks rules for which abcDef1 is the primary database. In the blastHg18KGPep04Id example above, the primary database is hg18, and the primary key is the name column of the blastKGPep04 table. If -database=fr2 is supplied instead, joinerCheck won't check the table because fr2 is not the primary database for the rule. For some rules, this means leaving off the database parameter and checking only the identifier parameter will work, however for rules where large tables are to be checked, a better option is needed.

Fix

A way around this is to just use MySQL directly to query the table. Here is an example for checking the metaGbCdnas identifier, which often comes up during new assembly QA:

identifier metaGbCdnas
"Connect metaGbd gbCdnaInfo with mRNA/EST"
    hgFixed.gbCdnaInfo.acc
    $metaGbd.all_mrna.qName
    $metaGbd.all_est.qName
    $metaGbd.refGene.name
    $metaGbd.xenoRefGene.name
    hgFixed.gbMiscDiff.acc
    $metaGbd.refSeqAli.qName
    $metaGbd.xenoRefSeqAli.qName
    $metaGbd.xenoMrna.qName
    $metaGbd.xenoEst.qName
    hgFixed.gbWarn.acc

If you were QAing the ci2 assembly, and it had a xenoRefGene track, you would run into this bug if you ran the following runJoiner command:

$ runJoiner.csh ci2 xenoRefGene noTimes
$ runJoiner.csh ci2 xenoRefGene noTimes

found identifiers:

gbCdnas
metaGbCdnas
xenoRefSeqId

Checking keys on database ci2
Checking keys on database ci2
Checking keys on database ci2
 ci2.xenoRefFlat.name - hits 190784 of 190784 (100.000%) ok
 ci2.xenoRefSeqAli.qName - hits 190784 of 190784 (100.000%) ok

Instead, you can check the rule by directly querying MySQL:

$ hgsql -e "select count(name) from ci2.xenoRefGene inner join hgFixed.gbCdnaInfo on xenoRefGene.name = gbCdnaInfo.acc"
+-------------+
| count(name) |
+-------------+
|      190784 |
+-------------+

Then all you have to do is verify that the returned count matches the count from ci2.xenoRefGene, and you have full coverage:

$ hgsql -e "select count(name) from ci2.xenoRefGene"
+-------------+
| count(name) |
+-------------+
|      190784 |
+-------------+

This is way faster than running joinerCheck on just the identifier and picking your database out of the output.

If your rule had a minCheck, like the following vegaPseudoGeneZfishName rule, then the join would only return the matching number, you just need to verify that it is within the minCheck rules:

identifier vegaPseudoGeneZfishName typeOf=vegaPseudoZfishName dependency
"Link together Zebrafish VEGA predicted gene structure additional information"
    $danRer.vegaPseudoGene.name
    $danRer.vegaInfoZfish.transcriptId minCheck=0.004
    $danRer.vegaToCloneId.transcriptId minCheck=0.004

$ hgsql -e "select count(vegaPseudoGene.name) from danRer5.vegaPseudoGene inner join danRer5.vegaInfoZfish on vegaPseudoGene.name = vegaInfoZfish.transcriptId"
+----------------------------+
| count(vegaPseudoGene.name) |
+----------------------------+
|                         68 |
+----------------------------+
$ hgsql -e "select count(transcriptId) from vegaInfoZfish" danRer5
+---------------------+
| count(transcriptId) |
+---------------------+
|               14001 |
+---------------------+
$ calc 68 / 14001 \* 100
68 / 14001 * 100 = 0.485680