Introduction

This is a design document for a proposed variant annotation tool, Feature #6152 in Redmine.

Numerous existing tools combine variant calls and gene annotations into predicted functional effects of SNPs. For example, an A/G variant at hg19 chr1:15687059-15687059 (rs4661330) is coding-non-synonymous: having an A makes the 919th codon of canonical FHAD1 code for glutamic acid, E (GAA), while having a G makes it code for glycine, G (GGA). In alternatively spliced isoforms of the gene, the same variant falls in the 172nd, 107th or 66th codon. Another variant might be intergenic, another may fall in the UTR or a splice site of a gene, another may be coding but synonymous, etc.
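
The codon arithmetic in the rs4661330 example can be sketched in a few lines of C. This is only an illustration of the idea, not the tool's design: the names translateCodon, classifySnv and enum codingEffect are hypothetical, the codon is assumed to be given on the transcript's coding strand in upper case, and only the standard genetic code is considered.

```c
#include <assert.h>
#include <string.h>

/* Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with T=0, C=1, A=2, G=3. */
static const char geneticCode[] =
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG";

/* Map a base to its index in the table, or -1 if it is not T, C, A or G. */
static int baseIndex(char base)
{
    switch (base) {
    case 'T': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
    }
}

/* Translate one upper-case DNA codon; returns 'X' if a base is unrecognized. */
char translateCodon(const char *codon)
{
    assert(codon != NULL && strlen(codon) >= 3);
    int i1 = baseIndex(codon[0]);
    int i2 = baseIndex(codon[1]);
    int i3 = baseIndex(codon[2]);
    if (i1 < 0 || i2 < 0 || i3 < 0)
        return 'X';
    return geneticCode[16 * i1 + 4 * i2 + i3];
}

/* Hypothetical effect categories; a real tool would need more (stop-loss etc.). */
enum codingEffect { effectSynonymous, effectNonSynonymous, effectNonsense };

/* Classify a single-base substitution at codonOffset (0..2) within refCodon. */
enum codingEffect classifySnv(const char *refCodon, int codonOffset, char altBase)
{
    assert(refCodon != NULL && strlen(refCodon) >= 3);
    assert(codonOffset >= 0 && codonOffset < 3);
    char altCodon[4];
    memcpy(altCodon, refCodon, 3);
    altCodon[3] = '\0';
    altCodon[codonOffset] = altBase;
    char refAa = translateCodon(refCodon);
    char altAa = translateCodon(altCodon);
    if (altAa == refAa)
        return effectSynonymous;
    if (altAa == '*')
        return effectNonsense;   /* substitution creates a stop codon */
    return effectNonSynonymous;
}
```

With these definitions, translateCodon("GAA") gives 'E' and translateCodon("GGA") gives 'G', so classifySnv("GAA", 1, 'G') reports a non-synonymous change, matching the FHAD1 example.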

Our tool will of course produce protein-coding effect predictions such as those, but since we have such a rich annotation database, we can relate a variant to many types of data beyond protein-coding gene annotations.

Use cases

MLQ requests

MLQ #3582 "Ideally I'd like to submit a list of variations (in BED format), the reference genome (hg18/19) and get type of effect back"

MLQ #5242 note 17: user has (chr, coord, strand); wants the reference base, known SNP data if any, and the RefSeq codon if any

MLQ #6294: "I have a list of genomic positions for putative mutations that I would like to convert into mutant mRNA and peptide sequences. Could you recommend an automated way to do this?"

Novel variant calls from sequencing experiment

The user gets a bunch of short reads and uses a commonly available NGS pipeline to align short reads to the genome and report discovered variants. The pipeline produces a big VCF file of genomic positions and observed variant alleles.

Now they're wondering which variants are the interesting ones. They want to upload the file that their pipeline spit out, and get back some clue about which variants might have a functional effect.

List of rs IDs

The user reads a paper that lists ~20 SNPs associated with some trait, and wants to know more about them: coding? conserved? etc.

User gets BAM file from seq facility

The user is a PI in a lab and the seq data come to her in a BAM file. She would like to see the read depth as a custom track and also the amino acid diffs for the non-reference alleles in a custom track. We could simply show any DNA base that does not match the reference (even if it appears in only one of ten reads), and not try to make a judgement on how likely it is to be a real diff.

Command line mode

Standalone binary takes a config file in the .ra format on the command line. The config.ra file contains stanzas describing inputs (file names, database tables etc.; filters; fields to include in output) and outputs (format, filename, options).

Implementation plan

Most of the code will be in library modules. We will develop a command-line tool first, with web UI design in parallel, and then will implement the web UI. This allows more time for web UI design, and a command-line interface makes it easier to develop automated tests for the lib modules. Also, power-users will probably ask for a command-line tool.

Shared git branch

Initial prototyping is underway in a shared branch in hgwdev's central repository. To check out your own local branch that tracks the shared branch, do this:

# Update your local git data -- git pull does this too:
git fetch

# Make your own local annoGrator branch:
git checkout --track -b annoGrator origin/annoGrator

# See what has been done since this branch was started:
git log --stat 010ad06e..

# Return to your main branch:
git checkout master

Since your local branch was created with "--track", git pull from within your local branch will use the shared branch origin/annoGrator instead of the main shared branch origin/master. But watch out for git push! When push.default is "matching" (the default before Git 2.0), a bare git push still wants to push both your local annoGrator branch and your master branch. So to be safe, do a git push like this:

git push origin annoGrator

High-level interfaces: anno*.h

Several new library modules, anno*.[ch], will perform the core functions:

annoRow

The basic unit of data interchange between modules: a genomic position plus an array of strings that correspond to columns defined in the data source's autoSql description.
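
To make the description concrete, here is a minimal sketch of what such a row could look like in C. Everything in it is an assumption made for illustration: the struct name annoRowSketch, its field names, the 0-based half-open coordinates and the helper functions are hypothetical and are not taken from the actual anno*.h headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout for illustration; not the project's actual struct. */
struct annoRowSketch {
    char *chrom;       /* sequence name, e.g. "chr1" */
    unsigned start;    /* 0-based start (half-open coordinates assumed) */
    unsigned end;      /* end, exclusive */
    int columnCount;   /* number of entries in columns */
    char **columns;    /* one string per autoSql-defined column */
};

/* Allocate or abort: keeps the sketch free of error-handling clutter. */
static void *mustAlloc(size_t size)
{
    void *mem = malloc(size > 0 ? size : 1);
    if (mem == NULL)
        abort();
    return mem;
}

/* Return a heap-allocated copy of s. */
static char *copyString(const char *s)
{
    size_t size = strlen(s) + 1;
    return memcpy(mustAlloc(size), s, size);
}

/* Build a row that owns copies of the position and of every column string. */
struct annoRowSketch *annoRowSketchNew(const char *chrom, unsigned start, unsigned end,
                                       const char *const *columns, int columnCount)
{
    assert(chrom != NULL && start <= end && columnCount >= 0);
    struct annoRowSketch *row = mustAlloc(sizeof *row);
    row->chrom = copyString(chrom);
    row->start = start;
    row->end = end;
    row->columnCount = columnCount;
    row->columns = mustAlloc((size_t)columnCount * sizeof *row->columns);
    for (int i = 0; i < columnCount; i++)
        row->columns[i] = copyString(columns[i]);
    return row;
}

/* Free a row and everything it owns; NULL is accepted. */
void annoRowSketchFree(struct annoRowSketch *row)
{
    if (row == NULL)
        return;
    for (int i = 0; i < row->columnCount; i++)
        free(row->columns[i]);
    free(row->columns);
    free(row->chrom);
    free(row);
}
```

Keeping the columns as strings means a module can pass rows along without knowing the source's schema; only code that filters or formats a particular column needs to consult the autoSql definition to interpret it.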

annoColumn

For communication with UI: a column's autoSql definition, and a boolean flag for whether this column should appear in the output (e.g. as a column in tab-separated output, or an attribute in VCF output).

annoFilter

For communication with UI: a specification of a filter on a column's data values: the column's autoSql definition, filtering operation, and current values (threshold, range or wildcard pattern). Also has a flag for "right join" (a la SQL) behavior: i.e. if this filter fails on a secondary table, do we filter out the primary table's variant too, or simply ignore this secondary table row?
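
The interaction between a filter and the "right join" flag can be sketched as follows. This is a simplified illustration under stated assumptions: the names annoFilterSketch, filterOp and filterOutcome are hypothetical, only numeric thresholds and ranges are handled (wildcard patterns are left out), and the column value is assumed to be already parsed into a double.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical filtering operations; wildcard matching is omitted here. */
enum filterOp { filterNone, filterMin, filterMax, filterRange };

/* Hypothetical filter spec: operation, current values, and the join flag. */
struct annoFilterSketch {
    enum filterOp op;
    double min;        /* used by filterMin and filterRange */
    double max;        /* used by filterMax and filterRange */
    bool rightJoin;    /* if true, a failing secondary row also drops the primary row */
};

/* Does a column value satisfy the filter? */
bool filterPasses(const struct annoFilterSketch *filter, double value)
{
    assert(filter != NULL);
    switch (filter->op) {
    case filterMin:   return value >= filter->min;
    case filterMax:   return value <= filter->max;
    case filterRange: return value >= filter->min && value <= filter->max;
    case filterNone:
    default:          return true;
    }
}

/* What should happen to a secondary-table row, and to the primary row it overlaps? */
enum filterOutcome { keepRow, skipSecondaryRow, dropPrimaryRow };

enum filterOutcome filterSecondaryRow(const struct annoFilterSketch *filter, double value)
{
    if (filterPasses(filter, value))
        return keepRow;
    return filter->rightJoin ? dropPrimaryRow : skipSecondaryRow;
}
```

For example, with a minimum threshold of 0.5 a secondary row scoring 0.1 is simply skipped when rightJoin is false, but causes the overlapping variant to be filtered out when rightJoin is true.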

annoStreamer

Interface to data source: get autoSql description, get/set filters and columns, get next row of data. Subclasses of this handle details such as whether the data come from a db table, file, etc.

annoGrator

Integrates each row of data from the primary source with data from its own internal source (an annoStreamer), returning zero or more rows of data that overlap the primary source's row. Same external interface as annoStreamer, except nextRow() is replaced by integrate(). Subclasses of this can add integrated data columns, for example a module that integrates variants and genePreds also outputs predicted functional effects.
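
The positional join at the heart of the base class can be sketched like this. It is a simplified illustration, not the planned implementation: struct region and integrateByPosition are hypothetical names, coordinates are assumed to be 0-based half-open, and the secondary data are scanned linearly from an in-memory array, whereas a real integrator would stream position-sorted rows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical genomic interval, half-open: [start, end). */
struct region {
    const char *chrom;
    unsigned start;
    unsigned end;
};

/* Two regions overlap if they are on the same sequence and share at least one base.
 * Zero-length regions (start == end) never overlap under this test. */
bool regionsOverlap(const struct region *a, const struct region *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a->chrom, b->chrom) == 0
        && a->start < b->end
        && b->start < a->end;
}

/* Store pointers to the secondary regions that overlap primary in out
 * (up to maxOut of them) and return how many were stored. */
size_t integrateByPosition(const struct region *primary,
                           const struct region *secondary, size_t secondaryCount,
                           const struct region **out, size_t maxOut)
{
    size_t found = 0;
    for (size_t i = 0; i < secondaryCount && found < maxOut; i++) {
        if (regionsOverlap(primary, &secondary[i]))
            out[found++] = &secondary[i];
    }
    return found;
}
```

A variant inside two overlapping gene annotations yields two rows, and an intergenic variant yields zero, which is the "zero or more rows" behavior described above.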

annoFormatterOption

This is how configuration parameters are passed to annoFormatter: it is an optionSpec plus a value.

annoFormatter

Subclasses of this write output, as tab-separated text, bigBed, custom track, HTML summary etc.

annoGratorQuery

A complete description of a query, constructed from a primary source (annoStreamer), 0 or more annoGrators, and 1 or more annoFormatters. Call annoGratorQueryNew(), annoGratorQuerySetRegion(), annoGratorQueryExecute() and annoGratorQueryFree().

Input modules

annoStreamer subclasses, in rough order of importance/implementation:

annoStreamDb
annoStreamTabFile
annoStreamTabix
annoStreamBigBed
annoStreamWig (including bedGraph, bigWig)
annoStreamBam

Integrators

annoGrator and its subclasses; each contains a streamer plus integration method.

annoGrator (base class: just intersect by position and keep fields intact)
annoGrateGenePredVariant (predict functional changes!)

Output formatters

annoOutTab
annoOutVcf
annoOutBigBed
annoOutCustomTrack
annoOutSummary
annoOutRanking

Who's going to do what?

Angie and Brian will divvy up the work. For starters, Angie is working on basic annoStreamers, annoFormatters and the generic (data-agnostic position-joiner) annoGrator. Brian is working on functional annotation given a variant {genomic position, observed alleles}, genePred and reference transcript sequence. Remaining modules TBD.

Features

Like many existing tools, we will report the variants' effects on genes (splice-3, coding-non-synon etc.).

UCSC's major enhancements will be:

the incorporation of the many types of data in our database
presentation of the results: not just loads of data, but also links to browser views

Input

The primary input will be variant calls: fundamentally, genomic position plus observed alleles.

pgSnp
VCF
Other formats, e.g. outputs of popular NGS pipelines?
- 23andMe? :)
- maybe eventually BAM and pileup; but there are established, sophisticated variant callers for BAM, so it is better to take variant calls from those tools.

Other inputs will be annotations to relate to the variant calls. These annotations may be stored in database tables, bigData files or flat files; they might be found in trackDb, custom tracks or hubs. For TCGA and the cancer group, we need to support the Generic Annotation Format (GAF).

Types of Variation

The architecture must support all known forms of variation with respect to a reference assembly. Initially, the implementation will support single nucleotide variants only. We will work our way up through multi-nucleotide variants, small indels, and ultimately large-scale rearrangements.
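
One way to stage that rollout is to classify each variant by the lengths of its reference and observed alleles and reject the types that are not implemented yet. The sketch below is only an illustration: the names are hypothetical, alleles are assumed to be plain base strings with an empty string for an absent allele (neither VCF anchor bases nor the pgSnp "-" convention is handled), and large-scale rearrangements are not represented at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical categories for small variants only. */
enum variantType { variantSnv, variantMnv, variantInsertion, variantDeletion };

/* Classify by allele lengths. Substitutions that change length are reported
 * by their net effect (longer alt: insertion, shorter alt: deletion). */
enum variantType classifyAlleles(const char *ref, const char *alt)
{
    assert(ref != NULL && alt != NULL);
    size_t refLen = strlen(ref);
    size_t altLen = strlen(alt);
    assert(refLen > 0 || altLen > 0);
    if (refLen == altLen)
        return refLen == 1 ? variantSnv : variantMnv;
    return altLen > refLen ? variantInsertion : variantDeletion;
}

/* The initial implementation handles single nucleotide variants only. */
bool supportedInitially(enum variantType type)
{
    return type == variantSnv;
}
```

As support grows, only supportedInitially would need to change; the classification itself stays the same.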

Output

tab-separated file with all/selected fields (with BED+ as an option)
bigBed with embedded autoSql to define extra columns
VCF with added INFO column tags
custom track in Genome Browser / Table Browser
- display should show any AA diff from ref.
- should probably show both alleles if the input contains heterozygous SNPs; these can simply be two BED boxes.
- coloring of diffs in amino-acid space can use "different codons" as we do with mRNAs now. Should show amino acids downstream of a frameshift in yellow all the way to the first in-frame stop. If the user is looking at a window on the gene that does not include the actual variant, she is going to want to know that the protein is messed up.
intermediate level to summarize, sort / rank, and filter findings
?highlight in multiple alignment?
?ancestral polarization?
?binding motif disruption?
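
The frameshift display rule in the custom track item above (color everything downstream until an in-frame stop) needs the position of the first in-frame stop codon in the mutant coding sequence. A minimal sketch, assuming the mutant sequence is available as an upper-case DNA string on the coding strand; the function name firstInFrameStop is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if the three bases at codon are TAA, TAG or TGA. */
static bool isStopCodon(const char *codon)
{
    return strncmp(codon, "TAA", 3) == 0
        || strncmp(codon, "TAG", 3) == 0
        || strncmp(codon, "TGA", 3) == 0;
}

/* Scan codon by codon from frameStart (an offset that is in frame in the
 * mutant sequence) and return the offset of the first stop codon, or -1
 * if the sequence ends before any in-frame stop is found. */
long firstInFrameStop(const char *seq, size_t frameStart)
{
    assert(seq != NULL);
    size_t len = strlen(seq);
    for (size_t i = frameStart; i + 3 <= len; i += 3) {
        if (isStopCodon(seq + i))
            return (long)i;
    }
    return -1;
}
```

For the toy sequence ATGGCTGAATAA the in-frame stop is at offset 9. Inserting a T after the start codon gives ATGTGCTGAATAA, where scanning from offset 3 finds a premature TGA at offset 6, so bases 3 through 8 would be the span to highlight.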

Interface

Web UI

This is still wide open -- everyone's input would be very much appreciated!

Main page

Form:

paste/upload variants
select annotation sources
- any track including custom tracks?
select output format/presentation
- custom track in [Genome Browser | Table Browser]
- summary with filters
- mutant sequence (genomic, mRNA, protein)
go!

followed by brief how-to and link to more detailed doc.

Summary/Filters

Stats: #variants, #variants intersecting each annotation source (further broken down for protein-coding genes)