BlastTabs

From Genecats

When to update the blastTab tables

Update the blastTabs anytime the reference gene track for any of the following species is updated/changed: human, mouse, rat, zebrafish, fly, C. elegans, yeast. The reference gene track for human and mouse is UCSC Genes, for rat it is RGD Genes (as of Dec 2010), for zebrafish it is Ensembl Genes, for fly it is FlyBase Genes, for C. elegans it is WormBase Genes, and for yeast it is SGD Genes.

What are the blastTab tables

There are two types of blastTabs: one type is the "self"BlastTabs, which are used by the gene sorter (hgNear) for showing how closely genes are related to each other within one species in protein space. These blastTabs are given the prefix of the name of the gene set. Thus for human and mouse these blastTabs are called "knownBlastTab", for rat it is called "rgdBlastTab" (as of Dec 2010), for fly it is called "flyBaseBlastTab", for C. elegans it is called "sangerBlastTab" and for yeast it is called "sgdBlastTab". It is important that these blastTabs retain their names exactly, as the CGIs are looking for these particular names. There is no selfBlastTab for zebrafish because there is no gene sorter for this species.

The other type is the "xx"BlastTabs, which are used by the "known genes"-like description page (hgGene) for showing closely related proteins in other species (click "Other Species" at the top of the hgGene page), and by the other-species columns in the gene sorter (hgNear). Each xxBlastTab is named with the first letter of the genus and the first letter of the species of the organism it points to (except for human, which is "hg"). For example, the xxBlastTab in the fly database that relates fly genes to mouse genes is called mmBlastTab (note that the first letter is *always* lowercase). There are no xxBlastTabs in the zebrafish database, as zebrafish does not have a "known genes"-like description page for Ensembl Genes. Thus, while other organisms have drBlastTabs, there are no corresponding xxBlastTabs in the zebrafish database.
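As a rough illustration of the xxBlastTab naming rule, the sketch below derives a table name from a genus and species. The function name xx_blast_tab_name is hypothetical and is not part of the kent source tree; it only restates the convention described above.

```shell
# Hypothetical helper (not in the kent tree): build an xxBlastTab name from
# a genus and species, following the naming convention described above.
xx_blast_tab_name() {
  xbt_genus="$1"
  xbt_species="$2"
  # Human is the documented exception: its prefix is "hg".
  if [ "$xbt_genus" = "Homo" ] && [ "$xbt_species" = "sapiens" ]; then
    echo "hgBlastTab"
    return 0
  fi
  # First letter of the genus plus first letter of the species, lowercased.
  xbt_prefix=$(printf '%.1s%.1s' "$xbt_genus" "$xbt_species" | tr '[:upper:]' '[:lower:]')
  echo "${xbt_prefix}BlastTab"
}

xx_blast_tab_name Mus musculus              # mmBlastTab
xx_blast_tab_name Danio rerio               # drBlastTab
xx_blast_tab_name Saccharomyces cerevisiae  # scBlastTab
```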

Both types of blastTabs are made by taking the protein sequences from the reference gene track and using BLAST to find homologs either within the same species or to other species.

Important possible CGI changes

The developer may have had to make changes to certain files, which are part of the CGI build and thus will need to be pushed out with the new tables. There are two reasons the developer may have needed to make a change to these files:

  • A new updated assembly has been released with a new reference gene track.
  • An update to the URL links on hgGene or hgNear has been made.

Changes to hgGene are kept in src/hg/hgGene/hgGeneData/ in the various otherOrgs.ra files; changes to hgNear are kept in src/hg/near/hgNear/hgNearData in the various columnDb.ra files (other files in these directories control other aspects of the gene sorter and the "known genes"-like description page). These files have a trackDb.ra-like hierarchy and allow the developer to override the assembly and the URL to outside organizations. It is important that the data in hgGene and hgNear match.

Note that we used to override the URL for zebrafish so that it would point to older versions of Ensembl Genes (so we wouldn't have to rebuild the blastTabs every time there was an Ensembl Genes update). We no longer do that: now every time Ensembl Genes is updated for zebrafish, the blastTabs are also updated.

Also note that at the time of writing, the data for hgNear pointed at the most recent assembly for each species. This means that the links for the gene sorter are broken for older assemblies. Current progress on fixing this issue can be found here: http://redmine.soe.ucsc.edu/issues/247.

Let's say, for example, that the danRer7 Ensembl track is being updated and you are not certain which assemblies' blastTab tables are affected. The following grep from the kent/src/hg/hgGene/hgGeneData directory will make it much easier to determine:

[steve@hgwdev hgGeneData]$ grep -irw danRer7 *
C_elegans/ce6/otherOrgs.ra:db danRer7
D_melanogaster/dm3/otherOrgs.ra:db danRer7
Human/hg19/otherOrgs.ra:db danRer7
Mouse/mm10/otherOrgs.ra:db danRer7
Mouse/mm9/otherOrgs.ra:db danRer7
Rat/rn4/otherOrgs.ra:db danRer7
S_cerevisiae/sacCer3/otherOrgs.ra:db danRer7
S_cerevisiae/sacCer2/otherOrgs.ra:db danRer7

Once again, there are no files for zebrafish for either hgNear or hgGene since zebrafish does not have a gene sorter or a "known genes"-like description page for Ensembl Genes.
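To turn grep output like the listing above into a plain list of affected assemblies, something along these lines may help. The function name affected_assemblies is hypothetical, and the sketch assumes the Organism/assembly/otherOrgs.ra path layout shown in that output; the sample lines are copied from it.

```shell
# Hypothetical helper: read grep output on stdin and print each affected
# assembly once. Assumes lines of the form
#   Organism/assembly/otherOrgs.ra:db danRer7
affected_assemblies() {
  # The assembly name is the second "/"-separated field of each line.
  awk -F/ 'NF >= 3 { print $2 }' | sort -u
}

# Sample input copied from the grep output above; in practice, pipe the
# real grep output into the function instead.
affected_assemblies <<'EOF'
Human/hg19/otherOrgs.ra:db danRer7
Mouse/mm10/otherOrgs.ra:db danRer7
Mouse/mm9/otherOrgs.ra:db danRer7
Rat/rn4/otherOrgs.ra:db danRer7
EOF
```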

Use git to check whether these files have changed. If they have, you will need to coordinate the push of the blastTabs with the CGI push.

How to QA blastTabs

In addition to all normal QA processes, also:

  • Check to see if there was a significant jump in the row count from the old tables to the new tables. This may be an indication that filtering of spurious homologs was not done (see note below). If there was a sudden drop, this may mean that the version that is out on the public site is incorrect and did not filter spurious homologs - ask the developer to check.

Here are some bash loops for checking this:

# Row counts of each xxBlastTab in hg19, on dev and on beta
for prefix in mm rn dr dm ce sc
do
  echo "dev ${prefix}BlastTab:"
  hgsql -h hgwdev -Ne "select count(*) from ${prefix}BlastTab" hg19
  echo "beta ${prefix}BlastTab:"
  hgsql -h hgwbeta -Ne "select count(*) from ${prefix}BlastTab" hg19
done

# Row counts of hgBlastTab in each of the other databases, on dev and on beta
for db in mm9 rn4 dm3 ce6 sacCer3
do
  echo "dev $db hgBlastTab:"
  hgsql -h hgwdev -Ne "select count(*) from hgBlastTab" "$db"
  echo "beta $db hgBlastTab:"
  hgsql -h hgwbeta -Ne "select count(*) from hgBlastTab" "$db"
done

  • Check that all blastTabs are named correctly, that zebrafish doesn't have any blastTabs and that no blastTabs were unnecessarily updated (run updateTimes.csh).
  • Ask the developer to make sure that the most recent version of BLAST possible was used (at least 2.2.11), as old versions of the program gave odd results.
  • Verify that at least one line of each xxBlastTab table is being displayed properly by checking the gene details page.
    • Note that it is not possible to make a round trip for a gene from species to species, since these tables are generated independently.
  • Verify at least one line of each selfBlastTab table (using the gene sorter) by doing the following:
    • Check that the ID column = 100% and the BLASTP E-Value = 0 for the gene itself.
    • Check that the list of genes that sort nearby by the ID column in the gene sorter matches those listed in the table.
    • Note that:
      • The values in the ID column and the BLASTP E-Value column may not match the table exactly, as this information is calculated by the CGIs. They should, however, be relatively close.
      • If you click on a Gene Sorter from the hgGene page, the gene you chose may not be displayed if it is marked as a variant of a canonical gene (see knownIsoforms to see gene clusters)
  • Independently check the results from the blastTabs by taking the sequence from a gene and blatting it to another species. See if it matches the results in the xxBlastTab. Note that it may not always work, as the blastTabs are generated using BLAST rather than BLAT.
  • Makedoc: information on how the blastTabs were created may be kept in multiple places. Check both the native and the other species' makedocs. There may also be information in the UCSC Genes makedoc and/or makeDb/doc/blastTab.txt.
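For the row-count check in the first item of this list, a small helper can make jumps and drops easier to spot. This is only a sketch: the function name row_count_change is hypothetical, and the default 10% threshold is an arbitrary assumption, not a documented QA cutoff.

```shell
# Hypothetical helper: print the percent change between an old and a new row
# count and flag it when it exceeds a threshold in either direction.
# usage: row_count_change OLD NEW [LIMIT_PERCENT]
# The default limit of 10 is an arbitrary assumption, not a QA standard.
row_count_change() {
  rc_old="$1"
  rc_new="$2"
  rc_limit="${3:-10}"
  awk -v old="$rc_old" -v new="$rc_new" -v limit="$rc_limit" 'BEGIN {
    if (old == 0) { print "CHECK: old table is empty"; exit }
    pct = (new - old) * 100 / old
    flag = (pct > limit || pct < -limit) ? "CHECK" : "ok"
    printf "%s: %d -> %d (%+.1f%%)\n", flag, old, new, pct
  }'
}

row_count_change 50000 50400   # small change, reported as ok
row_count_change 50000 81000   # large jump, reported as CHECK
row_count_change 81000 50000   # large drop, reported as CHECK
```

The two counts would come from the hgsql queries against hgwbeta (old) and hgwdev (new) shown earlier; the numbers used here are made up for illustration.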

Joiner keys:

 joinerCheck -keys identifier=flyBase2004IdDm -database=dm3 all.joiner
 joinerCheck -keys identifier=wormBaseId -database=ce6 all.joiner
 joinerCheck -keys identifier=rgdGene2Id -database=rn4 all.joiner
 joinerCheck -keys identifier=knownGeneId -database=mm9 all.joiner
 joinerCheck -keys identifier=knownGeneId -database=hg19 all.joiner
 joinerCheck -keys identifier=sgdCodingId -database=sacCer3 all.joiner

Note there is no joinerCheck for zebrafish as there are no xxBlastTabs or selfBlastTabs for this organism.

Don't forget that runJoiner.csh will get all relevant identifiers for you:

 hgwdev> runJoiner.csh hg19 rnBlastTab

gives

 found identifiers:
 
 ensemblTranscriptId
 sgdCodingId
 flyBase2004IdDm
 bdgpTranscriptId
 wormBaseId
 knownGeneId ...

Very general overview on how blastTabs are made:

The blastTabs are built using the script doHgNearBlastp.pl. By default, it creates the selfBlastTabs and xxBlastTabs reciprocally, which means it is very easy to make blastTabs that are not needed. For instance, when making the newest blastTabs for zebrafish, it is easy to make xxBlastTabs or selfBlastTabs in the zebrafish database by accident. There are three options that help prevent the creation of extra, unnecessary blastTabs and help prevent overwriting existing blastTabs that do not need to be rebuilt:

-targetOnly => builds only the xxBlastTabs in the target database (e.g. if rn4 is the target and mm9 and dm3 are the queries, it would build rn4.mmBlastTab and rn4.dmBlastTab).

-queryOnly => builds only the xxBlastTabs in the query database(s) (e.g. if rn4 is the target and mm9 and dm3 are the queries, it would build mm9.rnBlastTab and dm3.rnBlastTab). Very useful for Ensembl Gene updates to zebrafish.

-noSelf => suppress the writing of selfBlastTabs.
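To preview which xxBlastTabs a given combination of target and queries would produce, the option descriptions above can be restated as a small sketch. This does not call or reproduce doHgNearBlastp.pl; the function names db_prefix and xx_tables and the mode keywords are hypothetical, and selfBlastTabs (the -noSelf case) are deliberately not modeled.

```shell
# Hypothetical illustration only: restates the -targetOnly / -queryOnly
# descriptions above. It does not run doHgNearBlastp.pl.

# Map an assembly name to its xxBlastTab prefix (assemblies named on this page).
db_prefix() {
  case "$1" in
    hg*)     echo hg ;;
    mm*)     echo mm ;;
    rn*)     echo rn ;;
    danRer*) echo dr ;;
    dm*)     echo dm ;;
    ce*)     echo ce ;;
    sacCer*) echo sc ;;
    *)       echo "unknown assembly: $1" >&2; return 1 ;;
  esac
}

# usage: xx_tables MODE TARGET QUERY...
# MODE is "targetOnly", "queryOnly", or "both" (the reciprocal default).
xx_tables() {
  xt_mode="$1"
  xt_target="$2"
  shift 2
  for xt_query in "$@"; do
    # Tables in the target database, one per query species.
    if [ "$xt_mode" != "queryOnly" ]; then
      echo "${xt_target}.$(db_prefix "$xt_query")BlastTab"
    fi
    # Tables in each query database, pointing back at the target species.
    if [ "$xt_mode" != "targetOnly" ]; then
      echo "${xt_query}.$(db_prefix "$xt_target")BlastTab"
    fi
  done
}

xx_tables targetOnly rn4 mm9 dm3   # rn4.mmBlastTab, rn4.dmBlastTab
xx_tables queryOnly  rn4 mm9 dm3   # mm9.rnBlastTab, dm3.rnBlastTab
```

Under the same assumptions, `xx_tables queryOnly danRer7 hg19 mm9` lists only hg19.drBlastTab and mm9.drBlastTab, with nothing in the zebrafish database, which is the situation wanted for a zebrafish Ensembl Genes update.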

Also note that reciprocal-best filtering needs to be done when making blastTabs between species that are *not* closely related (i.e. anything other than human <-> mouse, human <-> rat, mouse <-> rat, or any of the selfBlastTabs; -bestOnly option? blastRecipBest script?). When the species are closely related (human, rat and mouse), the developer should instead use syntenic filtering to filter out spurious homologs (run synBlastp.csh on the results from doHgNearBlastp.pl?).

More information on how the blastTabs were made previously can be found in makeDb/doc/blastTab.txt.

Other random notes

The tfBlastTabs were originally built when Jim was making UCSC Genes and didn't want to overwrite the current xxBlastTabs (tf=temporary file). This was not immediately clear and so developers started generating them and/or leaving them even after they were no longer needed. Now that we have separate databases for hgwdev, hgwbeta and the RR, we don't need them and there is no reason to generate them.