This page expands on the UCSC Genome Browser poster presented by Brooke Rhead at Biology of Genomes 2013 [1]. The poster presents a first look at two new UCSC Genome Browser features for assessing variation. Both features will be released to the public website in the coming months.

Poster: New variation resources at the UCSC Genome Browser

Files: .pptx, PDF
Abstract: .txt

Variant Annotation Integrator

See the development version. This link has sample data from our Personal Genome SNP format page and the sample data from the poster already loaded as custom tracks. (Both tracks are in pgSnp format.)

Overview

To help researchers annotate and prioritize thousands of variant calls from sequencing projects, we are developing the Variant Annotation Integrator (VAI) and anticipate a first public release by the end of June 2013. Several existing tools can annotate variant calls with predicted functional effects on protein-coding genes and regulatory regions, for example Ensembl's Variant Effect Predictor (VEP). However, these tools are usually restricted to one or two sources of gene annotations and a limited set of additional annotation sources. The VAI will offer much broader choices from the full UCSC database and user-provided custom tracks.

The first release of the VAI will include a simple user interface for selecting variants to annotate as well as the most commonly used annotation sources: protein-coding genes, regulatory regions, predictions from tools such as SIFT and PolyPhen2 provided by the Database of Non-Synonymous Functional Predictions (dbNSFP), and already-discovered variants from dbSNP. The simple user interface will also provide several options for filtering variants based on annotations. A link to an advanced user interface will enable sophisticated users to add annotation sources from the full database.

Usage tips

The Variant Annotation Integrator (VAI) currently requires a custom track in VCF or pgSnp format as input. Eventually the tool will let you paste in input data instead of uploading a custom track, and it will also accept a list of rsIDs.

Right now there are no navigation links from other parts of the Genome Browser to the VAI. If you navigate away from it in the demo version and need to get back to it, change your URL to http://hgwdev-demo1.soe.ucsc.edu/cgi-bin/hgVai.

Select Genome Assembly: This section contains the standard Genome Browser controls for selecting an assembly. Setting the region to anything other than "genome" restricts the output to the specified position. VCF-format tracks loaded in track hubs are also supported as input.

Select Variants: Select your VCF or pgSnp custom track.

Select Genes: Choose the gene set you would like to use to calculate variant effects. Any gene set hosted by the UCSC Genome Browser can be used.

Select More Annotations (optional): Here you can choose to add annotations from dbNSFP, to include the dbSNP identifier for your variant if it exists, or to include annotations from a selection of UCSC Genome Browser tracks. The more advanced future release will allow you to include annotations from nearly any UCSC Genome Browser track, including other custom tracks.

Define Filters: Choose from a selection of pre-defined filters. The later version will allow filtering on any column in the data sources chosen.

Select Output Format: With no extra annotations selected, the output looks similar to Ensembl's VEP, with variants, genes, and predicted functional effects (missense, intron, etc.) listed. Adding optional annotations results in additional columns in the output. The header lines include a timestamp and the database used. Eventually VCF will be supported as an output format (with additional tag=value pairs in the INFO column), as well as a tab-separated format in which users can select the columns to include.

The "More options..." button: This will eventually go to the more advanced user interface.
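Because the VAI takes a pgSnp or VCF custom track as input, a small example of building pgSnp text may be useful. The sketch below assumes the seven-column layout described on the Personal Genome SNP format page (chrom, chromStart, chromEnd, alleles joined by "/", allele count, per-allele frequencies, per-allele scores). The function name, track name, and variant values are hypothetical and chosen only for illustration.

```python
def pgsnp_line(chrom, start, end, alleles, freqs, scores):
    """Format one pgSnp record (assumed layout: 0-based start, half-open end)."""
    if not (len(alleles) == len(freqs) == len(scores)):
        raise ValueError("alleles, freqs and scores must have the same length")
    fields = [
        chrom,
        str(start),
        str(end),
        "/".join(alleles),                 # e.g. "T/G"
        str(len(alleles)),                 # number of alleles listed
        ",".join(str(f) for f in freqs),   # one value per allele
        ",".join(str(s) for s in scores),  # one value per allele
    ]
    return "\t".join(fields)


# Hypothetical track line and record; paste the result into the custom track form.
header = 'track type=pgSnp visibility=3 db=hg19 name="demo" description="hypothetical example"'
record = pgsnp_line("chr21", 31812007, 31812008, ["T", "G"], [21, 70], [90, 70])
custom_track = header + "\n" + record + "\n"
print(custom_track)
```

The resulting text can be loaded as a custom track and then chosen under "Select Variants".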

Common Gene Haplotype Alleles

See the development version. Click on any protein-coding gene in the UCSC Genes track and scroll to the Common Gene Haplotype Alleles section. (The feature is currently implemented only on GRCh37/hg19 protein-coding genes.)

Phase 1 of the 1000 Genomes Project included 1,092 individual genomes. For each protein-coding gene in the UCSC Genes track, variant data from the 2,184 (per autosome) phased chromosomes have been distilled into distinct haplotype alleles, or distinct sets of variants found on at least one of the 1000 Genomes subject chromosomes.
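The idea of distilling phased chromosomes into distinct haplotype alleles can be sketched as follows. This is a toy illustration rather than the Genome Browser's implementation: it assumes each phased chromosome has already been reduced to the set of non-reference variants it carries within one gene, and the variant IDs and function name are hypothetical.

```python
from collections import Counter


def distill_haplotypes(chromosomes):
    """Count distinct haplotype alleles among phased chromosomes.

    chromosomes: iterable of iterables, each listing the non-reference
    variant IDs carried by one phased chromosome within a gene.
    Returns a Counter keyed by a sorted tuple of variant IDs.
    """
    return Counter(tuple(sorted(set(variants))) for variants in chromosomes)


# Six toy chromosomes (the real data set has 2,184 per autosome).
chromosomes = [["rsA"], ["rsA", "rsB"], [], ["rsB", "rsA"], ["rsA"], []]
haplotypes = distill_haplotypes(chromosomes)
total = len(chromosomes)

for variants, n in haplotypes.most_common():
    # An empty variant set corresponds to the all-reference haplotype.
    label = "+".join(variants) if variants else "reference"
    print(f"{label}\tN={n} of {total}")
```

Sorting the variants before building the key means that two chromosomes carrying the same variants in a different order are counted as the same haplotype allele.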

Usage tips

By default, only non-synonymous, common variants are displayed. Common variants occur in at least 1% of 1000 Genomes subject chromosomes. Including all variants in the display will generate the list of all haplotypes found in 1000 Genomes participants, though many of these haplotypes may have no protein-coding effect. Note that haplotype and homozygous frequency calculations depend upon which variants are included.

By default, only common haplotype alleles are displayed. Common haplotypes occur in at least 1% of 1000 Genomes subject chromosomes.

There may be no "reference haplotype" (consisting entirely of reference variants) represented in the 1000 Genomes data. If there is, it will be marked as "reference" in the table of haplotypes.

When the full sequence is displayed, columns with variants are highlighted by green vertical lines. The effects of variants are highlighted by bolded red letters. Synonymous changes are only evident when DNA bases are displayed. Each haplotype allele sequence is generated from GRCh37/hg19 reference DNA, with 1000 Genomes Project DNA variants spliced in, then translated into amino acids.

All columns are sortable. Sorting on a variant while the full sequence is displayed will highlight that variant with a vertical blue line.

Hovering your mouse over numbers in the "haplotype frequency" and "homozygous frequency" columns will show you the actual count of alleles (e.g., "N=370 of 2184").

Hovering your mouse over some buttons displays hints.

Clicking on non-reference variants in the summary section takes you to the corresponding track details pages of the 1000G Ph1 Vars track.

Clicking the "Display population" button will show the distribution of each haplotype allele among major population groups. You can optionally display the distribution of each allele among the groups defined by the 1000 Genomes Project.

By default, scoring is hidden. Three types of scores are provided to help users find haplotype alleles that occur more or less frequently than expected or that have unusual distributions in populations. See definitions below.
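The 1% "common" threshold and the hover counts mentioned in the tips above amount to simple arithmetic on chromosome counts. The sketch below assumes 2,184 phased chromosomes (the autosomal total; other chromosomes would have a different denominator), and the function names are hypothetical.

```python
def haplotype_frequency(count, total_chromosomes=2184):
    """Fraction of subject chromosomes carrying a haplotype or variant."""
    return count / total_chromosomes


def is_common(count, total_chromosomes=2184, threshold=0.01):
    """True if the haplotype or variant occurs in at least 1% of chromosomes."""
    return haplotype_frequency(count, total_chromosomes) >= threshold


def hover_label(count, total_chromosomes=2184):
    """Text in the style of the hover hint, e.g. 'N=370 of 2184'."""
    return f"N={count} of {total_chromosomes}"


print(hover_label(370), f"{haplotype_frequency(370):.1%}")
print(is_common(22), is_common(21))
```

With 2,184 chromosomes, 1% corresponds to 21.84, so under these assumptions a haplotype must appear on at least 22 chromosomes to count as common.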

Scoring definitions

Hap score: The haplotype score is the normalized (-log10) probability of finding exactly N subject chromosomes with this haplotype, given the proportions of individual variants. The score is normalized by dividing by the total number of variants. Normalization allows comparing the scores between genes with many variants and those with few. The score will be positive if the haplotype is more frequent than expected by chance and negative if less frequent.

Hom score: The homozygous score is the (-log10) probability of finding exactly N individuals with this haplotype on both chromosomes, given the actual frequency of the haplotype in subject chromosomes. The score will be positive if the haplotype is found homozygous in more individuals than expected and negative if it is found in fewer. Negative values might suggest that the haplotype is deleterious when homozygous.

Pop score (only visible when population distributions are displayed): The population skew score is the variance between population groups divided by N, the number of occurrences of the haplotype. The most frequently occurring haplotypes will potentially have larger scores, but if N is small, a skew in population distribution is not unexpected.
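One possible reading of the three score definitions above is sketched here. It is not the Genome Browser's actual code and rests on several assumptions that the definitions do not spell out: a binomial model for "exactly N", independence between variants when computing the expected haplotype frequency, Hardy-Weinberg proportions (frequency squared) for the expected homozygous fraction, a sign assigned by comparing the observed count with the expected count, and a population variance taken over raw per-group counts. All function names are hypothetical.

```python
import math
import statistics


def log10_binom_pmf(k, n, p):
    """log10 of the binomial probability of exactly k successes in n trials."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    ln = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
          + k * math.log(p) + (n - k) * math.log1p(-p))
    return ln / math.log(10)


def signed_surprise(observed, trials, expected_p):
    """-log10 P(exactly observed), signed + if observed >= expected count (assumed)."""
    surprise = -log10_binom_pmf(observed, trials, expected_p)
    return surprise if observed >= expected_p * trials else -surprise


def hap_score(n_chromosomes, allele_freqs, total_chromosomes=2184):
    """allele_freqs: frequency of the allele this haplotype carries at each variant."""
    if not allele_freqs:
        raise ValueError("at least one variant is required")
    expected_p = math.prod(allele_freqs)  # independence assumption
    raw = signed_surprise(n_chromosomes, total_chromosomes, expected_p)
    return raw / len(allele_freqs)        # normalize by number of variants


def hom_score(n_homozygous, hap_freq, total_individuals=1092):
    """Expected homozygous fraction is hap_freq squared (Hardy-Weinberg assumption)."""
    return signed_surprise(n_homozygous, total_individuals, hap_freq ** 2)


def pop_score(group_counts):
    """Variance of per-group haplotype counts divided by N, the total count."""
    n = sum(group_counts)
    if n == 0:
        raise ValueError("haplotype does not occur in any group")
    return statistics.pvariance(group_counts) / n


print(hap_score(370, [0.2, 0.3]))
print(hom_score(100, 0.17))
print(pop_score([40, 0, 0, 0]))
```

Working in log space with `math.lgamma` avoids the underflow that a direct binomial probability would hit for counts out of 2,184 chromosomes.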

How to get help

Search for answers in our mailing list archives: http://genome.ucsc.edu/contacts.html
Email a new question to our actively monitored list: genome@soe.ucsc.edu
OpenHelix's free training materials: http://www.openhelix.com/downloads/ucsc/ucsc_home.shtml
