GI2012TrackHubsPoster

Revision as of 21:44, 27 August 2012 by Brianraney (talk | contribs)

This page contains links related to the UCSC Genome Browser poster presented by Brian Raney at Genome Informatics 2012 [1].

Remote Data Track Storage for Viewing on the UCSC Genome Browser

Introduction

Track Data Hubs are useful for projects that generate large numbers of genome-wide data sets. For smaller data sets, the Custom Track mechanism for displaying data in the Genome Browser is often easier. However, when a project has more than half a dozen wiggle plots or other tracks to display, a Data Hub allows the tracks to be organized into composite (grouped) tracks. This makes it possible to show data for a large collection of tissues and experimental conditions in an elegant way, similar to how the ENCODE data is displayed at UCSC.

Because the data files remain on the remote server, there is no need to transfer large data sets across the Internet. Labs and individual users format their data sets using one of the browser's binary data types (bigBed, bigWig [3], or Binary Alignment Map (BAM) [4]), make the files available on an Internet-connected computer (the Track Data Hub), then register the data collection with UCSC. The Genome Browser's Data Hub Portal lists registered data collections from around the world that are of interest to all types of scientists. Genome Browser users select data sets to display on the Portal page, then access the data hub tracks in the same way that they currently access the native annotation tracks: through the track controls. If the data tracks from a remote site are not available for some reason, the Genome Browser shows an informative error message in the space where the track would normally appear.

This distributed data model allows Genome Browser users to view data sets from scientists worldwide using the familiar Genome Browser interface. Support for Track Data Hubs is in development, and a prototype version of a Data Hub for the Epigenomics Roadmap Project is available on http://genome-preview.ucsc.edu.

Formatting your data

To share data with hundreds of thousands of Genome Browser users, the data files must be in one of the following formats: bigBed, bigWig, or BAM (and soon, VCF [5] indexed by tabix [6]). All data files must be placed on a web-accessible server (HTTP or FTP). A few supporting files should also be in place, for example:

myHub/ - directory for the hub as a whole

 hub.txt - text file containing a short description of the hub
 genomes.txt - text file containing a list of genome assemblies
 hg19/ - directory for a recent human assembly
   trackDb.txt - text file containing track display details,
                 including names, colors, data types, etc.
   dnase.html - HTML file describing a DNase track to users
   dnaseSignal.bw - bigWig wiggle plot of DNase signal
   dnaseReads.bam - BAM file of DNase reads
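
The layout above can be sketched as a small Python script that writes the three text files. This is only an illustration: the setting names used here (hub, shortLabel, longLabel, genomesFile, email, genome, trackDb, track, bigDataUrl, type) are not spelled out on this page and should be checked against the current Track Data Hub documentation, and all labels and the email address are hypothetical placeholders.

```python
import tempfile
from pathlib import Path

# Assumption: setting names follow the Track Data Hub conventions;
# every label and the email address below is a made-up placeholder.
HUB_TXT = """\
hub myHub
shortLabel My Hub
longLabel Example track data hub with DNase tracks
genomesFile genomes.txt
email someone@example.org
"""

GENOMES_TXT = """\
genome hg19
trackDb hg19/trackDb.txt
"""

# bigDataUrl values are given relative to trackDb.txt, so the data
# files are expected to sit next to it in the hg19/ directory.
TRACKDB_TXT = """\
track dnaseSignal
bigDataUrl dnaseSignal.bw
shortLabel DNase Signal
longLabel Wiggle plot of DNase signal
type bigWig

track dnaseReads
bigDataUrl dnaseReads.bam
shortLabel DNase Reads
longLabel BAM file of DNase reads
type bam
"""


def write_hub_skeleton(root):
    """Create the hub directory tree and return the files written."""
    root = Path(root)
    (root / "hg19").mkdir(parents=True, exist_ok=True)
    files = {
        root / "hub.txt": HUB_TXT,
        root / "genomes.txt": GENOMES_TXT,
        root / "hg19" / "trackDb.txt": TRACKDB_TXT,
    }
    for path, text in files.items():
        path.write_text(text)
    return sorted(p.relative_to(root).as_posix() for p in files)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in write_hub_skeleton(Path(tmp) / "myHub"):
            print(name)
```

The data files themselves (dnaseSignal.bw, dnaseReads.bam) and the dnase.html description are not generated here; they would be copied into hg19/ alongside trackDb.txt.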


Data Hubs can be added to the Data Hub Portal page by contacting the Genome Browser mailing list at genome@soe.ucsc.edu.

Parallel downloading, remote access, and caching make for good performance

The technology behind the data hubs, and its optimization, is what enables good performance in the browser. Threads allow hub tracks to be downloaded in parallel, avoiding the large serial Internet latency that otherwise tends to increase both with distance to the remote hub and with the number of visible hub tracks. Once the data has been fetched, the performance of the cache becomes very important. Optimizations include a read-ahead buffer for the cached data that speeds up reading by a factor of 30. Browser users typically visit a few favorite genes, and only the small pieces of the large remote data file that cover those regions are fetched. The cache consists of a sparse data file and a bitmap file that keeps track of which blocks have been cached. Unix file systems can store such files sparsely, so the unwritten parts of the file take up no disk space, which makes the cache very disk-space efficient. Temporarily inaccessible data hub resources are handled with reasonable timeouts and informative error messages.
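
The caching and threading ideas can be illustrated with a minimal Python sketch. It is not the Genome Browser's implementation: the class and function names, the block size, and the simulated remote file are all assumptions made for the example, and the bitmap is kept in memory rather than in a separate bitmap file.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192  # hypothetical block size, chosen for the example


class SparseBlockCache:
    """Local cache for one remote file: sparse data file plus block bitmap."""

    def __init__(self, path, remote_size, fetch_range):
        self.path = path
        self.remote_size = remote_size
        self.fetch_range = fetch_range  # callable(offset, length) -> bytes
        n_blocks = (remote_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.bitmap = bytearray((n_blocks + 7) // 8)  # one bit per block
        # Size the file without writing data; where sparse files are
        # supported, the unwritten regions use no disk space.
        with open(path, "wb") as f:
            f.truncate(remote_size)

    def _has(self, block):
        return bool(self.bitmap[block // 8] & (1 << (block % 8)))

    def _mark(self, block):
        self.bitmap[block // 8] |= 1 << (block % 8)

    def read(self, offset, length):
        """Return bytes [offset, offset+length), fetching missing blocks."""
        end = min(offset + length, self.remote_size)
        if offset >= end:
            return b""
        first, last = offset // BLOCK_SIZE, (end - 1) // BLOCK_SIZE
        with open(self.path, "r+b") as f:
            for block in range(first, last + 1):
                if not self._has(block):
                    start = block * BLOCK_SIZE
                    size = min(BLOCK_SIZE, self.remote_size - start)
                    f.seek(start)
                    f.write(self.fetch_range(start, size))
                    self._mark(block)
            f.seek(offset)
            return f.read(end - offset)


def make_fake_remote(data):
    """Simulate a remote file; records every (offset, length) request."""
    calls = []

    def fetch_range(offset, length):
        calls.append((offset, length))
        return data[offset:offset + length]

    return fetch_range, calls


def read_tracks_in_parallel(caches, offset, length):
    """Read the same region from several tracks, one thread per track."""
    with ThreadPoolExecutor(max_workers=max(1, len(caches))) as pool:
        return list(pool.map(lambda c: c.read(offset, length), caches))


if __name__ == "__main__":
    data = bytes(range(256)) * 400
    with tempfile.TemporaryDirectory() as tmp:
        fetch, calls = make_fake_remote(data)
        cache = SparseBlockCache(os.path.join(tmp, "a.cache"), len(data), fetch)
        cache.read(10000, 100)   # first visit: one block is fetched
        cache.read(10050, 20)    # second visit: served from the cache
        print("remote requests:", calls)
```

In a real setting, fetch_range would issue a byte-range request to the remote server; the second read above triggers no further request because the bitmap already marks that block as cached.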

Assembly Data Hubs

To address the increasing need for researchers to annotate sequence for which UCSC does not provide an annotation database, we will be introducing Assembly Data Hubs. These will allow researchers to include the underlying reference sequence as well as the data tracks that annotate that sequence. Sequence will be stored in the UCSC twoBit format, and the annotation tracks will be stored in the same manner as for Track Data Hubs.
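
As a rough illustration of why a two-bit encoding is compact, the Python sketch below packs four bases into each byte. The bit assignment (T=00, C=01, A=10, G=11, first base in the most significant bits) is an assumption based on the published twoBit description; the real file format also has a header, a sequence index, and blocks for N runs and masked regions, none of which are shown, and the function names are hypothetical.

```python
# Assumed 2-bit codes: T=00, C=01, A=10, G=11.
BASE_TO_BITS = {"T": 0, "C": 1, "A": 2, "G": 3}
BITS_TO_BASE = "TCAG"


def pack_dna(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        shift = 6 - 2 * (i % 4)  # first base goes in the top two bits
        packed[i // 4] |= BASE_TO_BITS[base] << shift
    return bytes(packed)


def unpack_dna(packed, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(n_bases)
    )


if __name__ == "__main__":
    seq = "ACGTACGTTG"
    packed = pack_dna(seq)
    print(len(seq), "bases stored in", len(packed), "bytes")
    print(unpack_dna(packed, len(seq)) == seq)
```

The base count has to be stored separately, since the final byte may be only partly used.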

How to get help

Other posters about the UCSC Genome Browser

  • Visually integrating genomic data in the UCSC Genome Browser. Hinrichs AS et al. HGV, 2011. genomewiki page, .pptx, PDF
  • UCSC Genome Browser Data Hubs. Zweig AS et al. Biology of Genomes, 2011. PDF
  • Genome-wide ENCODE Data at UCSC. Rosenbloom KR et al. ASHG, 2010. PPT
  • UCSC Genome Browser Tool Suite. Hinrichs AS et al. Genomics of Common Disease, 2008. .ppt, PDF