GI2013

Revision as of 13:56, 1 November 2013 by Brianraney (talk | contribs)

Assembly Data Hubs support viewing any sequence on the UCSC Genome Browser.

Poster: File:GI2013.pdf

Brian J Raney¹, Ngan Nguyen¹, Tim R Dreszer¹, Galt P Barber¹, Hiram Clawson¹, Pauline A Fujita¹, Donna Karolchik¹, Ann S Zweig¹, Benedict Paten¹, William J Kent¹

¹University of California Santa Cruz, CBSE, Santa Cruz, CA, 95064

Assembly Data Hubs allow anyone to use the UCSC Genome Browser to view their own sequences with associated annotation, without the requirement that UCSC support a browser on that sequence. An Assembly Data Hub is a set of Internet-accessible data files that define the reference sequence to be used for a browser instance, as well as all the data files that define the annotation for that sequence. User sequences can be as complex as whole genome assemblies, or just a few scaffolds from a re-sequencing project. The end user maintains control over the sequence and the annotations, which can be updated at any time.
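As a rough illustration of what such a set of files might look like, the sketch below renders three small text files: one describing the hub, one pointing at the reference sequence, and one describing a single annotation track. The stanza layout and field names follow our reading of the general track hub help page and should be checked against it; every label, path, and e-mail address here is hypothetical.

```python
def stanza(fields):
    """Render one stanza as 'key value' lines, in insertion order."""
    return "\n".join(f"{key} {value}" for key, value in fields.items()) + "\n"


# Top-level hub description (all values are hypothetical placeholders).
hub_txt = stanza({
    "hub": "myAssemblyHub",
    "shortLabel": "My assembly hub",
    "longLabel": "Example assembly hub for a few scaffolds",
    "genomesFile": "genomes.txt",
    "email": "user@example.org",
})

# One genome entry; twoBitPath is assumed to name the reference sequence file.
genomes_txt = stanza({
    "genome": "myScaffolds",
    "trackDb": "myScaffolds/trackDb.txt",
    "twoBitPath": "myScaffolds/myScaffolds.2bit",
    "organism": "Example organism",
    "defaultPos": "scaffold1:1-10000",
})

# One annotation track backed by an indexed binary file.
track_db_txt = stanza({
    "track": "gcPercent",
    "bigDataUrl": "gcPercent.bw",
    "shortLabel": "GC percent",
    "longLabel": "GC content in fixed windows",
    "type": "bigWig",
})

print(hub_txt)
print(genomes_txt)
print(track_db_txt)
```

Writing these three strings to Internet-accessible locations, next to the sequence and annotation files they name, would give the hub owner full control over later updates.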

Assembly Data Hubs extend the Track Data Hubs feature, which allows user-level annotation on existing reference sequences. A Track Data Hub consists of text files containing metadata and indexed files containing the sequence annotations, all of which are Internet-accessible. The files need not reside on the same system and can be distributed widely. Track hub annotations are stored as compressed, indexed binary files in BigBed, BigWig, BAM, HAL, or VCF/tabix format. When a hub track is displayed in the Genome Browser, only the data needed to draw the current genomic region is transmitted to UCSC, rather than the entire file. All transmitted data is cached on a UCSC server, so subsequent access is local to UCSC servers.
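The idea of transferring only the bytes needed for the current view can be sketched as follows. The abstract does not say how the partial transfer is implemented; this sketch assumes byte-range access of the kind an HTTP `Range` header provides, and uses a local temporary file as a stand-in for a remote indexed file. The function names are hypothetical.

```python
import os
import tempfile


def read_slice(path, offset, length):
    """Return `length` bytes starting at `offset` without reading the rest of the file."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


def range_header(offset, length):
    """Build an HTTP Range header asking for the same byte span of a remote file."""
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Demo: a 256-byte temporary file stands in for a large remote annotation file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)))
demo_path = tmp.name

chunk = read_slice(demo_path, 10, 4)    # only 4 bytes are read
header = range_header(10, 4)            # the equivalent remote request
os.remove(demo_path)

print(list(chunk), header)
```

In an indexed format, the index is what tells the reader which offsets and lengths correspond to a genomic region, so the file as a whole never has to be downloaded.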

As an example of how easily Assembly Data Hubs can be created and how useful they are, we present the Cactus Alignment Pipeline, which we use to create an E. coli Reference Assembly: 57 Escherichia coli and 9 Shigella complete genomes are aligned and a consensus reference is created. An assembly hub is then created that supports viewing all input sequences and the consensus sequence on the UCSC Genome Browser. Each sequence is annotated by an automated pipeline that maps gene annotation from well-annotated genomes to the other genomes and also generates useful annotations such as GC content and mappability.
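To make the GC content annotation concrete, here is a minimal sketch that computes percent GC in fixed, non-overlapping windows and formats the result as bedGraph-style lines. It is not the authors' pipeline; the window size, function names, and sequence name are assumptions chosen for illustration.

```python
def gc_percent_windows(sequence, window=5):
    """Yield (start, end, gc_percent) for consecutive non-overlapping windows."""
    sequence = sequence.upper()
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        # Count only called bases so that runs of N do not dilute the percentage.
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            continue  # skip windows with no called bases
        gc = chunk.count("G") + chunk.count("C")
        yield start, start + len(chunk), 100.0 * gc / called


def to_bedgraph(name, sequence, window=5):
    """Format the windows as tab-separated lines: name, start, end, value."""
    return [
        f"{name}\t{start}\t{end}\t{pct:.1f}"
        for start, end, pct in gc_percent_windows(sequence, window)
    ]


# Hypothetical ten-base scaffold, two windows of five bases each.
for line in to_bedgraph("scaffold1", "GGCCAATTAA"):
    print(line)
```

A text file of such lines is the kind of intermediate that could then be converted into an indexed BigWig file for use as a hub track.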

The BigWig, BigBed, and Genome Browser software is implemented in C and supported on Linux. The source code is freely available for noncommercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip

Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/.

BAM and VCF/tabix utilities are available from http://samtools.sourceforge.net/ and http://vcftools.sourceforge.net/.

The UCSC Genome Browser is publicly accessible at http://genome.ucsc.edu.

General information about setting up a track data hub is found at http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html.