Known genes III

From genomewiki
Revision as of 17:04, 18 January 2007 by Hiram (talk | contribs) (→‎Plans)

Goals and History

The UCSC Known Genes track tries to gather together information from many sources into a nonredundant unified view of all genes for which there is solid evidence. The Known Genes track has been through a number of iterations:

  • Known Genes 0 - Mapping of RefSeq mRNAs to the genome
  • Known Genes 1 - UniProt driven. Find the RNA corresponding to each protein and map it to the genome. Add in DNA-based RefSeqs.
  • Known Genes 2 - Similar to Known Genes 1, but weeding out mappings that yield bad proteins because of insertions, deletions, truncations, etc.

With Known Genes 3 we want to restore many of the mappings thrown out in Known Genes 2, fixing them when possible, and marking them as uncertain where a fix is not possible. We also want to design a process that will include noncoding genes, such as Xist and Har1, in the known genes set.

Plans

Here is a possible process for building Known Genes 3, taken from our grant application.

  1. Align all the RNAs in GenBank against the genome with BLAT and high-stringency filters. ESTs will not be included in this starting set. Certain mRNA libraries may be excluded as well.
  2. Cluster the alignments that overlap.
  3. For each exon in the alignment, come to a consensus on the exon boundaries based on all of the RNA alignments. This consensus will allow for alternative 5’ and 3’ ends of the exons if there are clean alignments with good splice sites.
  4. Pick a representative RNA for each splicing variant. When there is a choice of representatives, pick the one that is longest and most similar to the genome.
  5. For any base in an RNA that differs from the aligned DNA base in the reference genome, determine whether that RNA base is more likely to be (a) a common allelic variant, (b) a post-transcriptional modification, or (c) a rare variant or an artifact in the RNA sequence or its alignment to the genome. This determination is made by examining all cDNA alignments (including ESTs) to this gene and to very similar paralogs, consulting dbSNP and other special information sources, and determining common haplotypes for the region in question. In case (c), either fix the alignment or replace the questionable RNA base with the corresponding base from the most similar common haplotype. Prefer the reference genome unless its haplotype is extremely rare for this region or there is extremely strong evidence of an error in the reference genome (which would trigger a different action). A record of the original base value and the reason for the correction is kept.
  6. Add genes from RefSeq and perhaps other trusted sources that are known purely at the DNA level.
  7. Map UniProt proteins to the corrected RNAs to determine the coding regions, if any. Use bestOrf, pseudogene and evolutionary analysis on the RNAs to determine additional protein coding genes and find suspected errors in UniProt. Report any suspected errors to SwissProt staff.
  8. Separate the resulting gene models into gold/silver/bronze sets as discussed above.
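
Steps 2 and 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual pipeline: the Alignment record, its field names, and the functions cluster_overlapping and pick_representative are hypothetical. Overlap is judged on the whole aligned span, on the same chromosome and strand, rather than exon by exon. "Longest and most similar to the genome" is read as "most aligned bases, with identity as the tie-breaker", which is one possible interpretation. The sketch also assumes that the alignments passed to pick_representative already belong to a single splicing variant.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    """Hypothetical record for one RNA-to-genome alignment."""
    name: str           # RNA accession
    chrom: str
    strand: str         # "+" or "-"
    start: int          # 0-based, half-open genomic span
    end: int
    aligned_bases: int  # number of RNA bases placed on the genome
    identity: float     # fraction of aligned bases matching the genome


def cluster_overlapping(alignments: List[Alignment]) -> List[List[Alignment]]:
    """Group alignments whose spans overlap on the same chrom and strand."""
    clusters: List[List[Alignment]] = []
    current: List[Alignment] = []
    current_end = -1
    ordered = sorted(alignments, key=lambda a: (a.chrom, a.strand, a.start))
    for aln in ordered:
        same_locus = bool(current) and (
            aln.chrom == current[0].chrom
            and aln.strand == current[0].strand
            and aln.start < current_end
        )
        if same_locus:
            current.append(aln)
            current_end = max(current_end, aln.end)
        else:
            if current:
                clusters.append(current)
            current = [aln]
            current_end = aln.end
    if current:
        clusters.append(current)
    return clusters


def pick_representative(variant: List[Alignment]) -> Alignment:
    """Pick the longest alignment, breaking ties by identity (an assumption)."""
    return max(variant, key=lambda a: (a.aligned_bases, a.identity))


if __name__ == "__main__":
    # Made-up alignments, for illustration only.
    demo = [
        Alignment("rnaA", "chr1", "+", 100, 500, 400, 0.99),
        Alignment("rnaB", "chr1", "+", 450, 900, 450, 0.98),
        Alignment("rnaC", "chr1", "+", 2000, 2500, 500, 1.00),
    ]
    for cluster in cluster_overlapping(demo):
        print([a.name for a in cluster], "->", pick_representative(cluster).name)
```

A real implementation would cluster on exon overlap and group alignments by intron structure before choosing representatives, since a single cluster can contain several splicing variants.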