VariantAnnotationTool: Difference between revisions

From Genecats

Revision as of 22:03, 19 January 2012

Introduction

This is a design document for a proposed variant annotation tool, Feature #6152 in redmine.

Numerous existing tools combine variant calls and gene annotations into predicted functional effects of SNPs. For example, an A/G variant at hg19 chr1:15687059-15687059 (rs4661330) is coding-non-synonymous: having an A makes the 919th codon of canonical FHAD1 code for E (glutamate, GAA), while having a G makes it code for G (glycine, GGA). In alternatively spliced isoforms of the gene, the same variant falls in the 172nd, 107th or 66th codon. Other variants might be intergenic, fall in the UTR or a splice site of a gene, be coding but synonymous, etc.
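The codon-level reasoning in the FHAD1 example can be sketched in a few lines of Python. This is only an illustration, not the proposed implementation: the function name and the labels "coding-synon" and "nonsense" are assumptions made for the sketch (only "coding-non-synon" appears in this document), and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (ref_aa, alt_aa, label) for a substitution within one codon.

    The labels are hypothetical; stop-loss and start-loss changes are not
    distinguished and fall under "coding-non-synon".
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        label = "coding-synon"
    elif alt_aa == "*":
        label = "nonsense"
    else:
        label = "coding-non-synon"
    return ref_aa, alt_aa, label


# The rs4661330 example: GAA (E) versus GGA (G).
example = classify_codon_change("GAA", "GGA")
print(example)
```

A real tool would first have to map the genomic position onto each transcript (strand, exon structure, reading frame) to find the affected codon; that step is omitted here.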

Our tool will of course produce protein-coding effect predictions such as those, but since we have such a rich annotation database, we can relate a variant to many types of data beyond protein-coding gene annotations.

Use cases

MLQ requests

MLQ #3582 "Ideally I'd like to submit a list of variations (in BED format), the reference genome (hg18/19) and get type of effect back"

MLQ #5242 note 17: user has (chr, coord, strand); wants the reference base, known SNP data if any, and the RefSeq codon if any

MLQ #6294: "I have a list of genomic positions for putative mutations that I would like to convert into mutant mRNA and peptide sequences. Could you recommend an automated way to do this?"
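Since one of these requests mentions BED input, it may be worth spelling out the coordinate conversion involved: BED uses 0-based starts and exclusive ends, while browser positions such as chr1:15687059-15687059 are 1-based and inclusive. The following is a minimal sketch; the function names are hypothetical and only the first three BED fields are read.

```python
def parse_bed_line(line):
    """Parse the first three BED fields into a 1-based, inclusive position."""
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # BED starts are 0-based and ends are exclusive, so only the start shifts.
    return chrom, start + 1, end


def format_position(chrom, start, end):
    """Format a 1-based, inclusive range the way the browser displays it."""
    return f"{chrom}:{start}-{end}"


# A single-base variant occupies one BED base: [15687058, 15687059).
bed_line = "chr1\t15687058\t15687059\trs4661330"
position = format_position(*parse_bed_line(bed_line))
print(position)
```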

Novel variant calls from sequencing experiment

The user gets a bunch of short reads and uses a commonly available NGS pipeline to align short reads to the genome and report discovered variants. The pipeline produces a big file of genomic positions and observed variant alleles.

Now they're wondering which variants are the interesting ones. They want to upload the file that their pipeline spit out, and get back some clue about which variants might have a functional effect.

List of rs IDs

The user reads a paper that lists ~20 SNPs associated with some trait, and wants to know more about them: coding? conserved? etc.

Command line mode

Standalone binary takes input and output file names on the command line. Does the CGI call this, or is most of the code in libraries?
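One possible answer to that question is to keep the annotation logic in a library routine and make the command-line program a thin wrapper around it, so that the CGI could call the same routine. The sketch below illustrates that layout only; `annotate_variants` and `main` are hypothetical names, and the "annotation" is a placeholder that appends a dummy column.

```python
import argparse
import io


def annotate_variants(in_handle, out_handle):
    """Placeholder library routine shared by the CLI and (potentially) the CGI.

    Copies each non-blank, non-comment line to the output with a dummy
    effect column appended, and returns the number of variants seen.
    """
    count = 0
    for line in in_handle:
        if not line.strip() or line.startswith("#"):
            continue
        out_handle.write(line.rstrip("\n") + "\tunknown\n")
        count += 1
    return count


def main(argv=None):
    """Thin command-line wrapper: input and output file names as arguments."""
    parser = argparse.ArgumentParser(description="Annotate variants (sketch).")
    parser.add_argument("input", help="variant file to read")
    parser.add_argument("output", help="annotated file to write")
    args = parser.parse_args(argv)
    with open(args.input) as in_handle, open(args.output, "w") as out_handle:
        annotate_variants(in_handle, out_handle)
    return 0


# The library routine works on any file-like object, not just files on disk.
demo_out = io.StringIO()
demo_in = io.StringIO("#chrom\tstart\tend\nchr1\t15687058\t15687059\n\nchr2\t99\t100\n")
demo_count = annotate_variants(demo_in, demo_out)
print(demo_count)
```

A script would call `main()` at startup; a CGI could pass its own handles straight to `annotate_variants`.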

Implementation plan: who's going to do what?

Features

Like many existing tools, we will report the variants' effects on genes (splice-3, coding-non-synon etc.).

UCSC's major enhancements will be:

  1. the incorporation of the many types of data in our database
  2. presentation of the results: not just loads of data, but also links to browser views

Input

  • pgSnp
  • VCF
  • Other formats, e.g. outputs of popular NGS pipelines?
    • 23andMe?  :)

Do we support indels and multiple-base polymorphisms?
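If indels and multiple-base polymorphisms are supported, VCF input already carries enough information to tell them apart, because each record lists REF and ALT alleles explicitly. The sketch below is a simplified illustration: the function names and type labels are assumptions, symbolic alleles such as `<DEL>` are not handled, and unequal-length alleles are classified by length alone.

```python
def classify_alleles(ref, alt):
    """Label a REF/ALT pair by comparing allele lengths (simplified)."""
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_vcf_line(line):
    """Split one VCF data line into one record per ALT allele.

    Uses the fixed leading columns CHROM, POS, ID, REF, ALT; POS is 1-based.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alts = (
        fields[0], int(fields[1]), fields[2], fields[3], fields[4]
    )
    return [
        (chrom, pos, var_id, ref, alt, classify_alleles(ref, alt))
        for alt in alts.split(",")
    ]


# Made-up record with two ALT alleles, for illustration only.
records = parse_vcf_line("chr1\t100\t.\tA\tG,AT\t.\tPASS\t.")
for record in records:
    print(record)
```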

Output

  • tab-separated file (format?)
  • custom track in Genome Browser / Table Browser
  • intermediate level to summarize and filter findings

Interface

CGI

Main page

Form:

  • paste/upload variants
  • select annotation sources
    • any track including custom tracks?
  • select output format/presentation
    • custom track in [Genome Browser | Table Browser]
    • summary with filters
    • mutant sequence (genomic, mRNA, protein)
  • go!

followed by brief how-to and link to more detailed doc.

Summary/Filters

Stats: #variants, #variants intersecting each annotation source (further broken down for protein-coding genes)
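The per-source counts could be computed along the following lines. This is a rough sketch under stated assumptions: variants and annotation features are represented as (chrom, start, end) tuples with 0-based, half-open coordinates, the source names are made up, and the overlap test is a plain scan rather than the indexed lookup a real implementation would need.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def summarize(variants, sources):
    """Return (total variants, {source name: variants hitting that source})."""
    per_source = {}
    for name, features in sources.items():
        per_source[name] = sum(
            1
            for chrom, start, end in variants
            if any(
                f_chrom == chrom and overlaps(start, end, f_start, f_end)
                for f_chrom, f_start, f_end in features
            )
        )
    return len(variants), per_source


# Made-up data: three variants and two hypothetical annotation sources.
variants = [("chr1", 100, 101), ("chr1", 500, 501), ("chr2", 100, 101)]
sources = {
    "genes": [("chr1", 50, 200)],
    "conserved": [("chr1", 90, 110), ("chr2", 0, 150)],
}
total, per_source = summarize(variants, sources)
print(total, per_source)
```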

Form:

  • Select annotation source
  • Filters
    • For protein-coding gene annotations: coding-non-synon, etc.
    • For wiggle tracks: min/max threshold
    • ...?
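The two filter types listed above could be combined in a single predicate, roughly as follows. The record layout (a dict with "effect" and "score" keys) and the function name are assumptions made for this sketch; a record with no score fails any threshold that is set.

```python
def passes_filters(record, effects=None, min_score=None, max_score=None):
    """Return True if an annotated variant satisfies every filter that is set."""
    if effects is not None and record.get("effect") not in effects:
        return False
    score = record.get("score")
    if min_score is not None and (score is None or score < min_score):
        return False
    if max_score is not None and (score is None or score > max_score):
        return False
    return True


# Made-up annotated variants: gene effect plus a wiggle-track score.
annotated = [
    {"name": "v1", "effect": "coding-non-synon", "score": 0.9},
    {"name": "v2", "effect": "intergenic", "score": 0.95},
    {"name": "v3", "effect": "coding-non-synon", "score": 0.2},
    {"name": "v4", "effect": "coding-non-synon"},
]
kept = [
    r["name"]
    for r in annotated
    if passes_filters(r, effects={"coding-non-synon"}, min_score=0.5)
]
print(kept)
```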

Command line