New variation resources at the UCSC Genome Browser

Brooke Rhead, Angie S. Hinrichs, Timothy R. Dreszer, Brian J. Raney, Robert M. Kuhn, Ann S. Zweig, Donna Karolchik, W. James Kent

Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA, 95064

The UCSC Genome Browser (http://genome.ucsc.edu) is an integrated tool set for visualizing and analyzing both publicly available and user-generated genomic sequence annotations for a variety of organisms. The Genome Browser offers a number of variation and phenotype annotation tracks based on data from sources such as dbSNP, the 1000 Genomes Project, OMIM, DECIPHER, COSMIC, and the GWAS Catalog. Additionally, users can upload their own variant annotations. However, the proliferation of genomic sequence data from many individuals has prompted the development of new tools for assessing variation in the Genome Browser.

Several existing tools use gene annotations to predict the functional impact of newly discovered variants. We are developing an interactive tool, the Variant Annotation Integrator, which can add not only gene-based functional predictions to uploaded variants, but also data from almost any annotation track in the Genome Browser. In addition to producing output that combines information from many tracks, the new tool will support filtering of variants using data from other tracks and enhanced display of results in the Genome Browser. A command-line version of the tool will allow offline processing of data sets that are too large for web queries.
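The idea of combining uploaded variants with data from several annotation tracks, and then filtering on one of them, can be illustrated with a minimal sketch. This is not the Variant Annotation Integrator itself: the function names `annotate` and `keep_if_overlaps`, the track names, and the toy coordinates are all hypothetical, and intervals are assumed to be BED-style (0-based, half-open).

```python
def annotate(variants, tracks):
    """Attach overlapping items from every track to each variant.

    variants: list of (chrom, pos) with a 0-based position (assumed layout).
    tracks:   dict of track name -> list of (chrom, start, end, label),
              with BED-style 0-based, half-open intervals (assumed layout).
    """
    results = []
    for chrom, pos in variants:
        hits = {}
        for track_name, items in tracks.items():
            # A variant overlaps an item when start <= pos < end on the same chromosome.
            hits[track_name] = [
                label
                for item_chrom, start, end, label in items
                if item_chrom == chrom and start <= pos < end
            ]
        results.append({"chrom": chrom, "pos": pos, "annotations": hits})
    return results


def keep_if_overlaps(annotated, track_name):
    """Keep only variants that overlap at least one item in the named track."""
    return [row for row in annotated if row["annotations"].get(track_name)]


# Toy data, invented for illustration only.
variants = [("chr1", 100), ("chr1", 500), ("chr2", 100)]
tracks = {
    "genes": [("chr1", 50, 150, "GENE_A")],
    "conserved": [("chr1", 90, 110, "elem1"), ("chr2", 0, 200, "elem2")],
}

annotated = annotate(variants, tracks)
in_genes = keep_if_overlaps(annotated, "genes")
```

A linear scan over every track item is adequate for a toy example; data sets of realistic size would call for an indexed interval lookup.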

We also intend to facilitate the interpretation of genomic sequence data from the 1000 Genomes Project through organization and display of the data at the gene level. For each gene in a gene set, variants from the ~2,000 phased chromosomes will be distilled into distinct haplotype alleles. For each haplotype allele, we will display its frequency in the 1000 Genomes population and whether it occurs homozygously in that population. Unexpected frequencies of occurrence may then be used to identify alleles that have undergone positive or negative selection. Predicted protein sequence for common gene haplotype alleles will also be displayed, allowing differences between alleles to be used to identify important structural and active-site variability that gives rise to human phenotypic diversity.
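As a rough illustration of distilling phased chromosomes into distinct haplotype alleles, the sketch below counts each distinct haplotype across a gene's variant sites, computes its frequency, and records whether any individual carries it on both chromosomes. The function name `summarize_haplotypes`, the input layout, and the toy samples are assumptions made for this example, not the planned implementation.

```python
from collections import Counter


def summarize_haplotypes(phased):
    """Summarize distinct haplotype alleles for one gene.

    phased: dict of sample ID -> (hap_a, hap_b), where each haplotype is a
            tuple of alleles at the gene's variant sites, listed in the same
            site order for every sample (assumed layout).
    """
    counts = Counter()
    homozygous = set()
    for hap_a, hap_b in phased.values():
        counts[hap_a] += 1
        counts[hap_b] += 1
        if hap_a == hap_b:
            # The same haplotype on both chromosomes of one individual.
            homozygous.add(hap_a)
    total = sum(counts.values())  # number of chromosomes examined
    return {
        hap: {
            "count": n,
            "frequency": n / total,
            "homozygous": hap in homozygous,
        }
        for hap, n in counts.most_common()
    }


# Toy data: three individuals, two variant sites, invented for illustration.
toy = {
    "S1": (("A", "C"), ("A", "C")),
    "S2": (("A", "C"), ("G", "C")),
    "S3": (("G", "T"), ("A", "C")),
}

summary = summarize_haplotypes(toy)
```

Comparing the observed number of homozygous carriers with the number expected from the haplotype frequency alone is one way such a summary could flag unexpected frequencies of occurrence.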