VariantAnnotationTool
Revision as of 19:55, 13 February 2012
Introduction
This is a design document for a proposed variant annotation tool, Feature #6152 in redmine.
Numerous existing tools combine variant calls and gene annotations into predicted functional effects of SNPs. For example, an A/G variant at hg19 chr1:15687059-15687059 (rs4661330) is coding-non-synonymous: having an A makes the 919th codon of canonical FHAD1 code for E (GAA), while having a G makes it code for peptide G (GGA). For alt-splice variants, the same variant falls in the 172nd, 107th or 66th codon. Another variant might be intergenic, another may fall in the UTR or splice site of a gene, another variant may be coding but synonymous, etc.
Our tool will of course produce protein-coding effect predictions such as those, but since we have such a rich annotation database, we can relate a variant to many types of data beyond protein-coding gene annotations.
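To make the rs4661330 example concrete, here is a minimal Python sketch of how a single-base substitution in a codon could be classified. It is only an illustration, not the planned implementation: the function name `coding_effect` is hypothetical, and of the effect labels only "coding-non-synon" appears elsewhere in this document; "coding-synon" and "nonsense" are assumed here. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_effect(ref_codon, offset, alt_base):
    """Classify a substitution at 0-based `offset` within `ref_codon`.

    Returns (effect, ref_amino_acid, alt_amino_acid). The effect labels are
    illustrative assumptions, and stop-loss changes are not treated specially.
    """
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "coding-synon"
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "coding-non-synon"
    return effect, ref_aa, alt_aa


# The FHAD1 example from the text: GAA (E) versus GGA (G).
example = coding_effect("GAA", 1, "G")
```

A real implementation would also have to handle strand, exon boundaries and the alternative-splice transcripts mentioned above, where the same genomic position maps to different codon numbers.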
Use cases
MLQ requests
MLQ #3582 "Ideally I'd like to submit a list of variations (in BED format), the reference genome (hg18/19) and get type of effect back"
MLQ #5242 note 17: user has (chr, coord, strand); wants ref. base, known SNP data if any, refseq codon if any
MLQ #6294: "I have a list of genomic positions for putative mutations that I would like to convert into mutant mRNA and peptide sequences. Could you recommend an automated way to do this?"
Novel variant calls from sequencing experiment
The user gets a bunch of short reads and uses a commonly available NGS pipeline to align short reads to the genome and report discovered variants. The pipeline produces a big VCF file of genomic positions and observed variant alleles.
Now they're wondering which variants are the interesting ones. They want to upload the file that their pipeline spit out, and get back some clue about which variants might have a functional effect.
List of rs IDs
The user reads a paper that lists ~20 SNPs associated with some trait, and wants to know more about them: coding? conserved? etc.
User gets BAM file from seq facility
The user is a PI in a lab and the seq data come to her in a BAM file. She would like to see the read depth as a custom track and also the amino acid diffs for the non-reference alleles in a custom track. It could be that we simply show any DNA base that does not match (even if it appears in only one of ten reads), and not try to make a judgment on how likely it is to be a real diff.
Command line mode
Standalone binary takes a config file in the .ra format on the command line. The config.ra file contains stanzas describing inputs (file names, database tables etc.; filters; fields to include in output) and outputs (format, filename, options).
Implementation plan
Most of the code will be in library modules. We will develop a command-line tool first, with web UI design in parallel, and then will implement the web UI. This allows more time for web UI design, and a command-line interface makes it easier to develop automated tests for the lib modules. Also, power-users will probably ask for a command-line tool.
Initial prototyping is underway in a shared branch in hgwdev's central repository. To check out your own local branch that tracks the shared branch, do this:
# Update your local git data -- git pull does this too:
git fetch
# Make your own local annoGrator branch:
git checkout --track -b annoGrator origin/annoGrator
# See what has been done since this branch was started:
git log --stat 010ad06e..
# Return to your main branch:
git checkout master
Since your local branch was created with "--track", git push and git pull will use the shared branch origin/annoGrator instead of the main shared branch origin/master.
High-level interfaces: annoGrator.h
A new library module, annoGrator.[ch], will perform the core functions, with details encapsulated in data type-specific objects:
- building up a query object (annoGratorQuery) from specified inputs (subclasses of annoStreamer and annoGrator). Input objects each have their own filters (annoFilterSpec) and output options (annoOutputOption) with public getter and setter methods.
- executing the query: annoGratorQueryExecute()
- writing output: subclasses of annoFormatter
Each input data type will have an associated object (subclass of annoStreamer) that handles the following:
- returning its capabilities (filters, output fields) and current settings
- updating its settings
- getting the next item (annoRow), sorted by position
The primary input (variant calls) needs only the appropriate annoStreamer object, but the other inputs need to know how to combine primary input items with their own contents. So each successive input will have an annoGrator object, which contains an annoStreamer and a method to integrate its contents with a given item from the primary input, including filtering, returning a list of annoRows.
Output will be written by a subclass of annoFormatter; for a given primary input item, it collects annoRows from all annoGrators, and then produces combined output in some format, according to whatever options have been selected.
Of course, all of the really interesting code will be in the details -- for example, the annoGrator subclass instantiated for a genePred (gpGrator?) will predict functional effects on gene models, and the annoFormatter subclass that writes an HTML summary while creating a bigBed+autoSql file for download or custom track instantiation will have all sorts of work to do.
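The division of labor described in this section can be sketched roughly as follows. This is a toy Python analogue, not the annoGrator.[ch] interface itself: the class and method names merely echo the names in the text, the coordinates and gene names are made up, positions are assumed to be half-open (BED-style), and the overlap test is a naive linear scan rather than something that exploits the sorted order.

```python
from dataclasses import dataclass


@dataclass
class AnnoRow:
    chrom: str
    start: int   # 0-based, half-open coordinates (an assumption of this sketch)
    end: int
    data: tuple  # remaining fields, as strings


class AnnoStreamer:
    """Yields AnnoRows sorted by position, after applying its own filter."""

    def __init__(self, rows, row_filter=None):
        self._rows = sorted(rows, key=lambda r: (r.chrom, r.start, r.end))
        self._filter = row_filter or (lambda row: True)

    def __iter__(self):
        return (r for r in self._rows if self._filter(r))


class AnnoGrator:
    """Wraps a streamer and integrates its rows with one primary item."""

    def __init__(self, streamer):
        self.streamer = streamer

    def integrate(self, primary):
        # Naive scan over all rows; returns those overlapping the primary item.
        return [r for r in self.streamer
                if r.chrom == primary.chrom
                and r.start < primary.end and r.end > primary.start]


class TabSepFormatter:
    """Collects one tab-separated output line per primary item."""

    def __init__(self):
        self.lines = []

    def format_one(self, primary, rows_per_source):
        fields = [primary.chrom, str(primary.start), str(primary.end), *primary.data]
        for rows in rows_per_source:
            # One column per annotation source; "." when nothing overlaps.
            fields.append(",".join(r.data[0] for r in rows) or ".")
        self.lines.append("\t".join(fields))


def execute_query(primary_streamer, grators, formatter):
    """For each primary item, gather rows from every grator and format them."""
    for primary in primary_streamer:
        formatter.format_one(primary, [g.integrate(primary) for g in grators])
    return formatter.lines


# Toy data: two variants and two gene annotations with made-up coordinates.
variants = AnnoStreamer([AnnoRow("chr1", 900, 901, ("C/T",)),
                         AnnoRow("chr1", 100, 101, ("A/G",))])
genes = AnnoGrator(AnnoStreamer([AnnoRow("chr1", 50, 200, ("geneA",)),
                                 AnnoRow("chr1", 500, 600, ("geneB",))]))
lines = execute_query(variants, [genes], TabSepFormatter())
```

In this sketch the primary input is a bare streamer while each secondary source is wrapped in a grator, mirroring the distinction drawn above; a gene-aware grator would replace the plain overlap test with effect prediction.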
Who's going to do what?
Angie and Brian will divvy up the work -- TBD.
Features
Like many existing tools, we will report the variants' effects on genes (splice-3, coding-non-synon etc.).
UCSC's major enhancements will be
- the incorporation of the many types of data in our database
- presentation of the results: not just loads of data, but also links to browser views
Input
The primary input will be variant calls: fundamentally, genomic position plus observed alleles.
- pgSnp
- VCF
- Other formats, e.g. outputs of popular NGS pipelines?
- 23andMe? :)
- maybe eventually BAM and pileup; but there are established, sophisticated variant callers for BAM, so it is better to take variant calls from those tools.
Other inputs will be annotations to relate to the variant calls. These annotations may be stored in database tables, bigData files, flat files; they might be found in trackDb, custom tracks, or hubs. For TCGA and the cancer group, we need to support the Generic Annotation Format (GAF).
Types of Variation
The architecture must support all known forms of variation with respect to a reference assembly. Initially, the implementation will support single nucleotide variants only. We will work our way up through multi-nucleotide variants, small indels, and ultimately large-scale rearrangements.
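As a rough illustration of the smaller variant classes listed here, the Python sketch below sorts a reference/alternate allele pair into a type by comparing allele lengths. The function name and the type labels are assumptions made for this example; "-" is treated as an empty allele, complex substitutions of unequal length are lumped in with insertions and deletions, and large-scale rearrangements cannot be represented this way at all.

```python
def variant_type(ref, alt):
    """Classify a variant from its reference and alternate allele strings.

    A lone "-" is treated as an empty allele (an assumption of this sketch).
    """
    ref = "" if ref == "-" else ref.upper()
    alt = "" if alt == "-" else alt.upper()
    if ref == alt:
        return "no-variation"
    if len(ref) == len(alt):
        # Same length: a single-nucleotide or multi-nucleotide substitution.
        return "SNV" if len(ref) == 1 else "MNV"
    # Different lengths: net gain or loss of bases relative to the reference.
    return "insertion" if len(alt) > len(ref) else "deletion"


types = [variant_type("A", "G"), variant_type("AC", "GT"),
         variant_type("-", "T"), variant_type("ACG", "A")]
```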
Output
- tab-separated file with all/selected fields (with BED+ as an option)
- bigBed with embedded autoSql to define extra columns
- VCF with added INFO column tags
- custom track in GB/TB
- display should show any AA diff from ref.
- probably should show two alleles if the input is heterozygous SNPs - can be simply two BED boxes.
- coloring of diffs in amino-acid space, anyway, can use "different codons" as we do with mRNAs now. Should show amino acids downstream of a frameshift in yellow all the way to any in-frame stop. If user is looking at a window on the gene that does not include the actual variant, she is going to want to know that the protein is messed up.
- intermediate level to summarize, sort / rank, and filter findings
- ?highlight in multiple alignment?
- ?ancestral polarization?
- ?binding motif disruption?
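For the "VCF with added INFO column tags" option in the list above, a minimal Python sketch of appending one key=value tag to a VCF data line might look like this. The function name and the tag name `EFFECT` are hypothetical and the record is a toy example; a valid VCF would also need a matching ##INFO line in the header declaring any new tag.

```python
def add_info_tag(vcf_line, key, value):
    """Return `vcf_line` with key=value appended to its INFO (8th) column."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]
    entry = f"{key}={value}"
    # "." marks an empty INFO column; otherwise entries are ';'-separated.
    fields[7] = entry if info in ("", ".") else f"{info};{entry}"
    return "\t".join(fields)


# Toy record: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
record = "chr1\t101\t.\tA\tG\t.\tPASS\tDP=10"
annotated = add_info_tag(record, "EFFECT", "coding-non-synon")
```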
Interface
Web UI
This is still wide open -- everyone's input would be very much appreciated!
Main page
Form:
- paste/upload variants
- select annotation sources
- any track including custom tracks?
- select output format/presentation
- custom track in [Genome Browser | Table Browser]
- summary with filters
- mutant sequence (genomic, mRNA, protein)
- go!
followed by brief how-to and link to more detailed doc.
Summary/Filters
Stats: #variants, #variants intersecting each annotation source (further broken down for protein-coding genes)
Form:
- Select annotation source
- Filters
- For protein-coding gene annotations: coding-non-synon, etc.
- For wiggle tracks: min/max threshold
- ...?
Command line
Since this tool is so highly configurable, instead of cramming so many options into command line arguments, the tool will read a configuration file (possibly stdin) in .ra format. It might look something like this:
primarySource ct_myVars
sourceType customTrack
dataType pgSnp
filterSpecs filter1,filter2
outFields allButBin

filter filter1
alleleCount == 2

filter filter2
alleleFreq noMatch 0,0

source snp135Common
sourceType dbTrack
dataType bed 6 +
outFields chrom,chromStart,chromEnd,name,strand,observed,exceptions,alleleFreqCount,alleles,alleleNs

source pubMatches
sourceType dbTrack
dataType bed12
outFields name

outputFormat tabSep
fileName ./myAnnotatedVars.txt.gz
Then the tool might be invoked like this:
annoGrate hg19 config.ra
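To show how such a config might be consumed, here is a small Python sketch of a .ra reader. It assumes the usual .ra conventions (stanzas separated by blank lines, one "tag value" pair per line split at the first whitespace, lines starting with # as comments). The function name `parse_ra` is hypothetical, and the sample is a shortened version of the config above with the stanza boundaries assumed.

```python
SAMPLE_CONFIG = """\
# Shortened version of the example config; stanza layout is assumed.
primarySource ct_myVars
sourceType customTrack
dataType pgSnp

source snp135Common
sourceType dbTrack
dataType bed 6 +

outputFormat tabSep
fileName ./myAnnotatedVars.txt.gz
"""


def parse_ra(text):
    """Parse .ra-style text into a list of stanzas (dicts of tag -> value)."""
    stanzas = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment line
        if not line:
            # A blank line closes the current stanza, if there is one.
            if current:
                stanzas.append(current)
                current = {}
            continue
        parts = line.split(None, 1)  # tag, then the rest of the line as value
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        stanzas.append(current)
    return stanzas


stanzas = parse_ra(SAMPLE_CONFIG)
```

Because Python dicts keep insertion order, the first tag of each stanza (primarySource, source, outputFormat, ...) can be used to decide what kind of stanza it is.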
Name
Not Variant Effect Predictor or other names already in use.
- Variantizor?
- Predictorator?
- Diffmeister?
- Differizerator?
- Variant Annotator
- Varannosaurus Rex
- Varannozilla
- Global Variant Annotator
- Integrated Variant Annotator (except rings of Broad IGV)
- Variant Annotation Integrator
- Variant Annotation Tool (except VAT is not a good acronym)
- VarAnnoGrator
Links to Similar Tools
http://uswest.ensembl.org/info/docs/variation/vep/index.html
http://snpeff.sourceforge.net/faq.html
http://www.ncbi.nlm.nih.gov/variation/tools/reporter