Utilities for comparative genomics

From genomewiki
Revision as of 15:47, 23 January 2008 by Tomemerald (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

Populating a feature stack is the central task in comparative genomics. Feature stacks collect all information available from all available species on a suitable topic. Ideally the topic, for example evolutionary history of a coding exon, has width in base pairs a fraction of characteristic width of available experimental data. That is, a coding gene will have single exons wholly contained within individual trace reads and multiple exons within gapless GenBank wgs contigs. On the other hand, a segmental duplication of 10 genes, say the one on chr20q/chr8, is very unsuited to a feature stack today because only a couple mammals have an adequate assembly over the intrinsic span of the feature.

Thus it is timely to construct feature stacks for coding exons and the like but premature to consider the comparative genomics of longer features. Consequently all research today that can fully exploit the power of comparative genomics is restricted to computable feature stacks. In other words, if someone can precompute all possible contemporary feature stacks, in effect they have written all possible research papers. This makes them the masters of the comparative genomics universe.

How many feature stacks are there, how difficult is to compute, store, and query them, and what associated precomputed products go with them? Suppose there are 180,000 coding exon and 3 non-coding features per 20,000 gene for 240,000 features of width 500 bp and needed depth of 50 species. That would fit handily within an excel spreadsheet. So the number, storage, and query are non-issues.

Here's an example of a feature stack and the implicit paper that practically writes itself:

Comparative Genomics of DRY motifs in exon 3 of RGR Opsins:
1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 human
1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 macaque
1 RWPYGSGGCQAHGFQGFTTALASICGSAAIAWGRYHHYCT 1 lemur
1 RWPHGSEGCQVHGFQGFATALASICGSAAVAWGRYHHYCT 1 mouse
1 RWPYGSDGCQAHGFQGFATALASICGSAAIAWGRYHHYCT 1 rabbit
1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 horse
1 RWPYGPDGCQAHGFQGFATALASICSSAALAWGRYHHYCT 1 dog
1 RWPYGSGGCQAHGFQGFAAALASICGSAAVAWGRYHHYCT 1 bat
1 RWPYGSDGCQAHGFQGFVTALASICSCAAIAWERYHHYCT 1 elephant
1 HWPYGSGGCQAHGFQGFTVALASICSCAAIAWERYHHYCT 1 tenrec
1 RWPHGSDSCQAHSFQGFATALASISSSAAIAWERYRHHCT 1 sloth
1 RWPYGSGGCQAHGFQGFVTALASISSSAAIAWERCHRHCI 1 armadillo
1 HWPYGAEGCRLHGFQGFATALASISLSAAIGWDRYLRHCS 1 platypus
1 YWPYGSDGCQIHGFHGFLTALTSISSAAAVAWDRHHQYCT 1 lizard
1 YWPYGSEGCQIHGFQGFLTALASISSSAAVAWDRYHHYCT 1 chicken
1 YWPYGSEGCQIHGFQGFVAALSSIGSCAAIAWDRYHQYCT 1 frog
1 YWPYGSDGCQTHGFQGFVTALASIHFIAAIAWDRYHQYCT 1 stickleback
1 YWPYGSDGCQTHGFQGFVTALASIHFVAAIAWDRYHQYCT 1 fugu
1 YWPYGSEGCQTHGFHGFLTALASIHFIAAIAWDRYHQYCT 1 medaka
1 YWPYGSEGCQTHGFHGFLMALASINACAAIAWDRYHQNCS 1 elephantshark
1 EWPFGSIGCQLDAFIGMAPTFISIAGAALVAKDKYYRICK 1 tunicate 


Now half the data needed for a full feature stack is not available at any genome browser but rather at GenBank etc. Whether that data is retrieved by automated pipeline or by one-off queries, there's a universal roadblock in that blast output is terribly formatted for comparative genomics purposes:

  • the blast algorithm is unaware of splice donors and acceptors and ignores user parsing
  • blocks are provided in score order instead of natural query exon order
  • blocks have faux match extensions at the ends (exon dribble) complicating clean ortholog extraction
  • blocks can contain multiple exons when introns are short leading to nonsense in tblastn
  • line width of 60 is way too short, causing string searches to fail because of carriage returns
  • gaps necessitate dashes causing string searches to fail
  • the match genus and species are inconveniently provided

Fixing Blast output in order to populate a comparative genomics feature stack: No utility web tool exists anywhere on the internet that can repair blast output. Consequently one would make a great addition. The needed algorithm would largely carry over to any pipeline program to populate feature stacks genomewide. The example below shows what needs to happen, as limited to one species. In this case, a single tblastn query to GenBank wgs adds 19 mammals to the feature stack.

Here is the original query, 3 exons of bovine rhodpsin marked up for reading frame

>RHO1_bosTau Bos taurus (cow)  
2 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSR 2
1 YIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKE 0
0 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ 0
Here is the raw output match to marsupial

>gb|AAFR03021222.1| Monodelphis domestica cont3.021221, whole genome shotgun sequence Length=94985

 Score =  149 bits (377),  Expect = 9e-34, Method: Compositional matrix adjust.
 Identities = 71/74 (95%), Positives = 74/74 (100%), Gaps = 0/74 (0%)  Frame = +1

Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTS  178
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSS  21318

Query  179    AVYNPVIYIMMNKQ  192
              +VYNPVIYIMMNKQ
Sbjct  21319  SVYNPVIYIMMNKQ  21360 

 Score =  117 bits (294),  Expect = 5e-24, Method: Compositional matrix adjust.
 Identities = 54/59 (91%), Positives = 57/59 (96%), Gaps = 0/59 (0%) Frame = +3

Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824

 Score =  107 bits (268),  Expect = 4e-21, Method: Compositional matrix adjust.
 Identities = 51/59 (86%), Positives = 52/59 (88%), Gaps = 0/59 (0%) rame = +2

Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE 
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722
Stage 1: order the output blocks according to input exon order and fix line width:

Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824

Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE 
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722

Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ  192
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ  21360
Stage 2: Find the correct input exons
Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824
Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE 
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722
Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ  192
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ  21360
Stage 3: Transfer the match to marsupial
Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824
Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE 
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722
Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ  192
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ  21360

Export the desired output with modified header

>RHO1_monDom Monodelphis domesticus (opossum)
2 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSR 2
1 YIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKE 0
0 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ 0
fill in the respective feature stacks with marsupial exons (only first is shown):
>RHO1_homSap GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSR
>RHO1_bosTau GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSR
>RHO1_monDom GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSR
>RHO1_ornAna GEIALWSLVVLAIERYIVVCKPMSNFRFGENHAIMGVAFTWIMALACALPPLVGWSR
>RHO1_galGal GEIALWSLVVLAVERYVVVCKPMSNFRFGENHAIMGVAFSWIMAMACAAPPLFGWSR
>RHO1_anoCar GEMGLWSLVVLAVERYVVICKPMSNFRFGETHALIGVSCTWIMALACAGPPLLGWSR
>RHO1_xenTro GEMALWSLVVLAIERYVVVCKPMANFRFGENHAIMGVVFTWIMALSCAAPPLFGWSR
>RHO1_neoFor GIIALWCLVVLAIERYIVVCKPISNFRFGENHAIMGVVFTWIMALACAGPPLFGWSR
>RHO1_latCha GQVALWALVVLAIERYVVVCKPMSNFRFGENHAIMGVIFTWIMALSCAVPPLFGWSR
>RHO1_takRub GEIALWSLVVLAVERYIVVCKPMTNFRFGEKHAIAGLVFTWIMALTCATPPLLGWSR
>RHO1_leuEri GEVGLWCLVVLAIERYMVVCKPMANFRFGSQHAIIGVVFTWIMALSCAGPPLVGWSR
>RHO1_calMil GEIGLWSLVVLAIERYVVVCKPMSNFRFGTNHAIMGVAFTWVMALACAVPPLMGWSR
>RHO1_petMar DEMSLWSLVVLAIERYIVICKPMGNFRFGSTHAYMGVAFTWFMALSCAAPPLVGWSR