Known genes III
Goals and History
The UCSC Known Genes track tries to gather together information from many sources into a nonredundant unified view of all genes for which there is solid evidence. The Known Genes track has been through a number of iterations:
- Known Genes 0 - Mapping of RefSeq mRNAs to the genome
- Known Genes 1 - UniProt driven. Find RNA corresponding to protien. Map that. Add in DNA based RefSeqs
- Known Genes 2 - Similar to Known Genes 1, but weeding out mappings that produce, after mapping, bad proteins due to insertions, deletions, and truncations, etc. See also online known genes methods description.
With Known Genes 3 we want to restore many of the mappings thrown out in Known Genes 2, fixing them when possible, and marking them as uncertain where a fix is not possible. We also want to design a process that will include noncoding genes, such as Xist and Har1, in the known genes set.
- Broader coverage than KG II
- include some of genes thrown out by KG II
- non-"NM_..." RefSeqs?
- should NP_... proteins from NCBI be included as initial candidates?
- include non-coding genes
- not sure about this, down stream software changes may be extensive (Fan)
- other sources besides Ensembl non-coding genes?
- More accurate gene model than KG II
- would gene-check or something similar still be used?
- and err-flags raised be kept and available to users?
- 3 or more quality classes
- stable UCSC KG gene ID
- nightly updates (?)
- probably KG IV?
Some Historical Discussions
From: Brian Raney [mailto:firstname.lastname@example.org] Sent: Tuesday, November 15, 2005 8:11 AM To: Fan Hsu; Adam Siepel Cc: Browser Staff; Mark Diekhans Subject: Re: question about knownGenes gene symbols Hey Fan et al. I know that my saying this has become tired, but I think KGII is too restrictive since it doesn't include many genes that are known experimentally. Just because the gene model isn't perfect on hg17, doesn't mean the gene doesn't exist. My feeling is that all the old KG genes, plus all of RefSeq should be in KG with the CCDS stuff as annotation. I don't see what the advantage is to not including genes merely because they don't have a start/stop codon. Do we really need to be "better refSeq than refSeq"? If researchers want only genes with no errors, then they can restrict their search to those. brian ----- Original Message ----- From: "Fan Hsu" <email@example.com> To: "Adam Siepel" <firstname.lastname@example.org> Cc: "Browser Staff" <Browseremail@example.com>; "Mark Diekhans" <firstname.lastname@example.org> Sent: Monday, November 14, 2005 5:55 PM Subject: RE: question about knownGenes gene symbols > Hi Adam, > > Yes, most of RefSeqs are in KG now. > > And yes, we use gene-check (used by CCDS) to validate KGs. > The RefSeqs (and other KG candidates) that have one > of the following problems are excluded: > > 1. ORF stop codon > 2. bad frame > 3. no start codon > 4. no stop codon > 5. a few misc criteria, e.g. CDS splicing > > Often at the same position of a rejected RefSeq, > there is a valid GenBank mRNA that became the KG. > > For hg17, about 10% of RefSeqs are rejected. > Compared to RefSeq, UCSC KG has better coverage > and better quality. > > My slogan for UCSC Known Genes is: > > Better RefSeq than RefSeq. :) > > The last two slides of > > http://www.soe.ucsc.edu/~fanhsu/UCSC_Known_Gene_II.pdf > > give you some quantative measure of our latest KG II > process compared to RefSeq, Ensmebl, etc. > > Fan.
From: Adam Siepel [mailto:email@example.com] Sent: Tuesday, November 15, 2005 8:41 AM To: Brian Raney Cc: Fan Hsu; Browser Staff; Mark Diekhans Subject: Re: question about knownGenes gene symbols A few of the checks probably are a little restrictive for some purposes, e.g., requirement of start and stop. One possibility would be to keep the guys in that failed some of the less restrictive checks but to flag them. They could be displayed in a different color and if necessary excluded in certain queries. I know it's a pain maintaining different sets but in my experience most of the time with gene sets you want one of two things: - a "tight" set, which is reasonably comprehensive but also gives you high confidence that each gene is correct. (Useful for training gene predictors, testing gene prediction sensitivity, estimating models of neutral evolution or dN/dS rates) - a "loose" set, which is as comprehensive as possible. (Useful for testing gene prediction specificity, checking whether or not "novel" elements [ultras, new genes, Katie's elements, Jakob's RNA structures, transfrags, etc.] are in or nearby known genes) I would argue for including all of RefSeq in the loose set. It would be very useful to have a version of KG that was a proper superset of RefSeq. Two cents. --A
Here is a lengthy mote I had on KG. Two main points: o relaxing gene-check criteria for KG II can capture a small number of missed RefSeqs, but increase total number of KGs substantially. o a more significant problem is that UniProt often does not have corresponding proteins for RefSeq, at least for mouse. Fan. -----Original Message----- From: Fan Hsu [mailto:firstname.lastname@example.org] Sent: Tuesday, November 15, 2005 12:11 PM To: Jim Kent; Brian Raney Cc: Adam Siepel; Browser Staff; Mark Diekhans Subject: RE: question about knownGenes gene symbols Hi Jim, Brian, and everyone: For mm7, I actually did a complete round of KG build with a relaxed gene-check criteria that does not filter out noStop and noStart candidates. By doing this, I got to retain 143 more RefSeq Genes as valid KG candidates. But this more relaxed criteria resulted more than 10,000 final KG genes, because many GenBank mRNAs also survived this more relaxed criteria. I looked at this set and found many duplicates (very similar but with slight different CDS structures, which is probably due to that I adopted tblastn for protein-GenBank mRNA alignement that gave me slightly better CDS). I do not have exact number of our GenBank mRNA KGs that replace those 143 lost RefSeq genes, my guess is probably more than half. And of those 143 RefSeqs, only 12 of them are "Reviewed" and 4 of them are "Validated". From my conversation with Donna at NCBI, the "Provisional" RefSeqs are pretty much whatever got submitted, so I think they do not offer any advantage over GenBank mRNAs. This means that the pay back of the more relaxed criteria is minimum. So I was faced with the decision whether to increase the KG coverage of RefSeq a tiny bit but also introduce a lot of redundant GenBank mRNA based KGs. As Mark said, I am damned either way. :-) I decided to let go of those RefSeqs. Personally, I feel that the current KG II process is a good balance between coverage (quantity) and quality. For mm7, we have 19K RefSeqs and 31K KGs. As everyone probably can remember, the old KG tracks used to have a lot of duplicates stacking up at one position. BTW, a bigger problem exists outside the gene-check criteria. The current mm7 KG set does left out about 3,000 RefSeqs because there are no corresponding UniProt proteins for them and our current definition of a KG is that it must have a representative mRNA and a representative protein. To include those as part of KG will entail some major overhaul of the the KG II build process (e.g. include protein sequences from NCBI in addition to UniProt) and significant updates to our application software layer (if we do not require a KG always has a protein behind it). I doubt that we want to put mm7 KG on hold and wait until we finish the major over haul. I could retract and start over with the more relaxed criteria to let those noStop and noStart RefSeqs in (and possibly do not allow those noStop and noStart GenBank mRNAs in, which would be kind of discriminatory to me). If we decide to do so, I will have to re-do all down stream KG aliases, pathways, GS tables, etc. This probably would push me back less than 1 week. The bottom line is how we want to position KG? (And how much time we want to allocate to KG build improvements.) If higher coverage is the main goal and we are willing to tolerate much more questionable/troublesome/redundant entries, then I can retool the KG II process. I prefer not to substantially lower the quality of current KG II. My suggestion is to keep what I already built for mm7 KG II (I am almost completely done with mm7 KG/GS/PB). I would like to build a second "XXX KG" track for mm7 to include genes with a more relaxed criteria (without extensive downstream tables and functions, e.g. KG details page), for users who prefer to see a broader set with less confidence and supporting evidence. I had many discussions with Mark and Bob on how we should call this 2nd class KG track. Anyone came up with the best (and accepted) name will win a free lunch from me. Since hg18 is not here yet, we may have room for further discussions of what would be the best for hg18 KG. Fan. -----Original Message----- From: Jim Kent [mailto:email@example.com] Sent: Tuesday, November 15, 2005 8:46 AM To: Brian Raney Cc: Fan Hsu; Adam Siepel; Browser Staff; Mark Diekhans Subject: Re: question about knownGenes gene symbols I agree with Brian here.
Kate brought back from GMOD workshop
- Batch query on gene names (not just accessions).
- Customized reports (subset of KG details page ?)
- Query "grid" and/or common query "templates"
Here is a possible process for building Known Genes 3, taken from our grant app.
- Align all the RNAs in GenBank against the genome with BLAT and high-stringency filters. ESTs will not be included in this starting set. Certain mRNA libraries may be excluded as well.
- Most of this will fall pretty easily out of the GenBank automated builds. I think all that's required here is a little program that takes as input a list of bad libraries (which we'll hand curate), and a list of bad accessions (also hand curated) and creates as output a list of bad accessions. -jk
- I already have a process worked out to remove suspect libraries and accessions. It is being used for the C-list paper and I was hoping to make it part of the next revision of the genbank code. I was just going to discard them, however if we can get someone to fix the long-standing bugs with the search results on accession that didn't align, we could provide information about why these were dropped in the browser (not really sure this is worth it) -markd
- I was also going to implement Robert's best genome match filtering algorithm as part of the mRNA and EST alignment pipeline. Hopeful this will help weed out a lot of problematic alignments to recent duplications -markd
- filtering alignments against pseudogene annotations could be useful. Robert's match filter helps a lot and prevents discarding transcribed pseudogenes. AFAIK, NCBI only puts out pseudogene annotations when they put out a new genome build. I have program to parse these out of their genome annotation file. -markd
- Cluster the alignments that overlap.
- For this we could use clusterRna. Alternatively we might use something based on ExonWalk for this and the next few steps. -jk
- There are some adjacent loci with overlapping UTR, so clutering based on overlapping CDS tends to produce better results. This would require moving clustering after CDS assignment. Delaying clutering and splice form select has the advantage of making it easier for expensive, per-transcript analysis to be run on the cluster. -markd
- For each exon in the alignment, come to a consensus on the exon boundaries based on all of the RNA alignments. This consensus will allow for alternative 5’ and 3’ ends of the exons if there are clean alignments with good splice sites.
- Definitely it's worth spending a day or two reading ExonWalk here, since it does pretty much exactly this -jk
- BLAT still has problems with splice sites when the cDNA has mismatches near splice site. I am not sure how how often this is a problem, although I will try to get some stats on it as part of the analysis I am currently doing. It might be possible to improve these alignments using some post processing or by attempting to realign problem cDNAs that fail splice site checks. I would like to have this part of the genbank pipeline -markd
- Pick a representative RNA for each splicing variant. When there is a choice of representatives, pick the one that is longest and most similar to the genome.
- I'd be inclined to fold this step into the exon-walker-like program too, since the pick criteria are likely to be pretty straightforward, and presumably the exon-walker already has most of the information needed to make the pick loaded already. -jk
- To pick splice variants, we need to know which ones are subject to NMD, which requires knowing the CDS. So splice isoform selection will need to be moved later in the process. -markd
- Quaility metrics on the source of the mRNA should be a consideration in the picking process. The full-length MGC mRNAs tend to be very good and a lot of work has gone into their CDS annotation. -markd
- For any base in an RNA that differs from the aligned DNA base in the reference genome, determine if that RNA base is more likely to be (a) a common allelic variant, (b) a post-transcriptional modification or (c) a rare variant or artifact in the RNA sequence or its alignment to the genome. This determination is made by examining all cDNA alignments (including ESTs) to this gene and to very similar paralogs, consulting dbSNP and other special information sources, and determining common haplotypes for the region in question. In case (c), either fix the alignment or replace the questionable RNA base with the corresponding base from the most similar common haplotype, preferring the reference genome if its haplotype is not an extremely rare one for this region and there is not extremely strong evidence of an error in the reference genome (that would trigger a different action). A record of the original base value and reason for the correction is kept.
- Hmm, it's worth considering doing this as a separate program from the walker, just because it involves database access and inputs that the walker doesn't already have. -jk
- Add genes from RefSeq and perhaps other trusted sources that are known purely at the DNA level.
- Perhaps these actually should be folded into the process sooner, generating fake .psl files for them? This way if there ended up being RNA records as well as DNA records, they could be properly merged. -jk
- I would very much like to include these in the standard pipeline as well, so creating `virtual' mRNAs and aligning them in the normal genbank pipeline would be a straight-forward approach. The RefSeq DNA annotations are part of a genome release; I don't think they available incrementally. The are also various DNA annotations available in Genbank. This is going to be a more difficult thing to sort out. I have already added a hack to grab ORFeome synthetic clones and treat them as mRNAs. They are entered as DNA since they are produced by DNA synthesis. Many of these are complete CDSs except for not including the stop codon. -markd
- Map UniProt proteins to the corrected RNAs to determine the coding regions, if any. Use bestOrf, pseudogene and evolutionary analysis on the RNAs to determine additional protein coding genes and find suspected errors in UniProt. Report any suspected errors to SwissProt staff.
- We could use blat for the protein/RNA mapping I think. Probably better to do it here rather than blatting proteins against the genome and looking for overlaps between the mrna and protein alignments, since blatting protein vs. mRNA is an easier job. I'm not sure what are the best tools to apply here for the psuedogene and evolutionary analysis. -jk
- I found blat to not do as good a job as tblastn for aligning mRNAs to proteins. The blat alignments were sometimes significantly shorter than those produced by tblastn, even with close matches. The kgProtMap table is created by using pslMap to map the protein to mRNA alignments to the genome. These alignments have the advantage over straight protein to genome alignments of handling spliced codons. -markd
- trimming vector sequences before ORF identification would be good -markd
- Is bestorf still the state of the art as far as ORF finders go? A lit search would be worth while. -markd
- For full-length MGC mRNAs, use the annotated CDS unless there is good evidence that it is incorrect, in which case we can feed this information back to MGC. -markd
- protein family homology is also a very good test from validity of a CDS -markd
- we need to be careful of the `TrEMBL' tail chase. Since TrEMBL is create by running an ORF finder on an mRNA, it's very easy for a mRNA-TrEMBL protein pair to be circularly defined. -markd
- Separate the resulting gene models into gold/silver/bronze sets as discussed above.
- Here the gold set is basically CCDS, the silver ones that look good according to some criteria I'd like to leave in Mark's hands, basically a nice full length protein with no weird splicing. The bronze will be everything else including the noncoding. -jk
- Up till now, KGs are represented by actual mRNAs and proteins. It seems the above processing may come up with our own gene structures that differ from any of current major gene (RefSeq) and/or protein (UniProt) DB entries. From Terry's earlier paper, it seems that the base genome is mostly more trust worthy than individual mRNAs. But as we enter personal genomes era, this may change. Assuming KG III ends up being better than RefSeq, is the community/industry ready for yet another standard for genes? (Fan)
- Yes, I think we'll want to be generating our own IDs, and mapping back to previous IDs when possible in new builds. (Jim)
- The RefSeq/CCDS approach of having a stable id with a version works very well. The CCDS approach is closer to known genes, since it is an annotation on the genome, while RefSeq genes are transcripts where the version number only changes when the sequence changes.
- It would be very helpful to start tracking version numbers along with accessions for references to genbank/refseq. Plans are to move to a single database, shared by all genomes, for genbank/refseq data As genbank continues to grow, this may become essential. All references into this database will need to supply a version number. Defining an API to genbank data that hides the details should make this easier.
Lessons learned from KG II
- For each representative protein, allow only one representaive mRNA. Likewise, for each representative mRNA, allow only one representative protein. KG II has the possibility of 1 to M on both directions, which is causing grief on hgGene and PB.
- How realistic is this? How often can a clear decision be made about a pairing? Is this just going to be making that loses information? Maybe this is really a problem with hgGene and PB??
- Log the reason(s)/potential problems for lower quality types (bronze and possibly silver too) to enable users to understand and also make their own judgements. Currently user questions often arise regarding why a RefSeq or some favorite gene do not show up in KG.
- Better processing of the secondary UniProt IDs, there was an odd case involving a secondary UniProt accession that caused a problem.
- Some individual problems/cases reported on KG II could be used to test KG III.
- While mapping mRNAs to proteins, add the logic to flag/process (and possibly filter out) short partial matches. Consider another special logic to look at protein description and if the word "fragment" is found, eliminate it or classify it as lower quality.