GI2012TrackHubsPoster
This page contains links related to the UCSC Genome Browser poster presented by Brian Raney at Genome Informatics 2012 [1].
Remote Data Track Storage for Viewing on the UCSC Genome Browser
Introduction
Track Data Hubs are useful for projects that generate large numbers of genome-wide data sets. For smaller data sets, the Custom Track mechanism for displaying data in the Genome Browser is often easier. However, when a project has more than half a dozen wiggle plots or other tracks to display, a Data Hub allows the tracks to be organized into composite (grouped) tracks. This makes it possible to show data for a large collection of tissues and experimental conditions in an elegant way, similar to how the ENCODE data is displayed at UCSC.
Because the data files remain on the remote server, we have eliminated the need to transfer large data sets across the Internet. Labs and individual users format their data sets using one of the browser binary data types (bigBed, bigWig [3], or Binary Alignment Map [4] (BAM)), make the files available on an Internet-connected computer (a.k.a. Track Data Hub), then register the data collection with UCSC. The Genome Browser's Data Hub Portal lists registered data collections from around the world that are of interest to all types of scientists. Genome Browser users select data sets to display on the Portal page, then access the data hub tracks in the same way that they currently access the native annotation tracks: through the track controls. If the data tracks from a remote site are unavailable for any reason, the Genome Browser provides an informative error message in the space where the track would normally appear.
This distributed data model allows Genome Browser users to view data sets from scientists worldwide using the familiar Genome Browser interface.
Formatting your data
To share data with hundreds of thousands of Genome Browser users, the data files must be in one of the following formats: bigBed, bigWig, BAM, or VCF indexed by tabix. All data files must be placed on a web-accessible server (HTTP or FTP). A few supporting files should also be in place. Here is the basic directory structure used for a track data hub:
myHub/ - directory for the hub as a whole
    hub.txt - text file containing a short description of the hub
    genomes.txt - text file containing a list of genome assemblies
    hg19/ - directory for a recent human assembly
        trackDb.txt - text file containing track display details including names, colors, data types, etc.
        dnase.html - HTML file describing a DNase track to users
        dnaseSignal.bw - wiggle plot of DNase Signal
        dnaseReads.bam - BAM file of DNase Reads
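As a rough illustration of what the three text files in this layout might contain, the Python sketch below writes a minimal hub.txt, genomes.txt, and trackDb.txt. The setting names (hub, shortLabel, longLabel, genomesFile, email, genome, trackDb, track, bigDataUrl, type) are standard track hub settings; the labels, the email address, and the helper names `stanza` and `write_hub` are placeholders invented for this example, and the description page dnase.html is not wired in here.

```python
import tempfile
from pathlib import Path


def stanza(settings):
    """Render one settings block as 'key value' lines."""
    return "".join(f"{key} {value}\n" for key, value in settings.items())


def write_hub(root):
    """Write a minimal hub.txt, genomes.txt and hg19/trackDb.txt under root."""
    root = Path(root)
    (root / "hg19").mkdir(parents=True, exist_ok=True)

    # hub.txt: short description of the hub (labels and email are placeholders).
    (root / "hub.txt").write_text(stanza({
        "hub": "myHub",
        "shortLabel": "My Hub",
        "longLabel": "Example hub with DNase signal and reads",
        "genomesFile": "genomes.txt",
        "email": "someone@example.org",
    }))

    # genomes.txt: one stanza per assembly, pointing at that assembly's trackDb.
    (root / "genomes.txt").write_text(stanza({
        "genome": "hg19",
        "trackDb": "hg19/trackDb.txt",
    }))

    # trackDb.txt: one stanza per track, separated by blank lines.
    # bigDataUrl is given relative to trackDb.txt; a BAM file also needs its
    # .bai index next to it on the server.
    tracks = [
        {"track": "dnaseSignal", "bigDataUrl": "dnaseSignal.bw",
         "shortLabel": "DNase Signal", "longLabel": "DNase signal (wiggle plot)",
         "type": "bigWig"},
        {"track": "dnaseReads", "bigDataUrl": "dnaseReads.bam",
         "shortLabel": "DNase Reads", "longLabel": "DNase aligned reads",
         "type": "bam"},
    ]
    (root / "hg19" / "trackDb.txt").write_text(
        "\n".join(stanza(track) for track in tracks))
    return root


# Demo: build the layout in a scratch directory and show the trackDb file.
with tempfile.TemporaryDirectory() as scratch:
    hub_root = write_hub(Path(scratch) / "myHub")
    print((hub_root / "hg19" / "trackDb.txt").read_text())
```

The binary data files (dnaseSignal.bw, dnaseReads.bam) are not created by this sketch; they would be produced separately and copied into the hg19/ directory.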
To request that your data hub be added to the list of public hubs at UCSC, contact the Genome Browser mailing list at genome@soe.ucsc.edu.
Parallel downloading, remote access, and caching make for good performance
The technology behind the data hubs, and its optimization, is what enables good performance in the browser. Threads allow hub tracks to be downloaded in parallel, avoiding an otherwise large serial Internet latency that tends to increase both with distance to the remote hub and with the number of hub tracks visible. Once the data has been fetched, the performance of the cache becomes very important. Optimizations include a read-ahead buffer for the cache data that speeds up reading by a factor of 30. Browser users typically visit a few favorite genes, and only those small pieces of the large remote data file are fetched. The cache contains a sparse data file as well as a bitmap file that keeps track of cached blocks. Unix file systems store such files sparsely, so the unwritten parts of the file take up no disk space. This results in great disk-space efficiency in the cache. Temporarily inaccessible data hub resources are handled with reasonable timeouts and informative error messages.
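The caching idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the Genome Browser's actual implementation: the block size, the class name `SparseBlockCache`, the `fetch_range` callback, and the helper `read_parallel` are all assumptions made for the example. The bitmap is kept in memory here rather than in a separate file, and the read-ahead buffer and timeouts are left out.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192  # assumed block size; the real value is not given in the text


class SparseBlockCache:
    """Local cache of one remote file: a sparse data file plus a block bitmap."""

    def __init__(self, path, remote_size, fetch_range):
        self.path = path
        self.remote_size = remote_size
        self.fetch_range = fetch_range  # hypothetical callable(offset, length) -> bytes
        n_blocks = (remote_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.bitmap = bytearray((n_blocks + 7) // 8)  # one bit per block
        # Size the file without writing data; on file systems that support
        # sparse files, the unwritten regions occupy no disk space.
        with open(path, "wb") as f:
            f.truncate(remote_size)

    def _is_cached(self, block):
        return bool(self.bitmap[block // 8] & (1 << (block % 8)))

    def _mark_cached(self, block):
        self.bitmap[block // 8] |= 1 << (block % 8)

    def read(self, offset, length):
        """Return bytes [offset, offset + length), fetching only missing blocks."""
        length = max(0, min(length, self.remote_size - offset))
        if length == 0:
            return b""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        with open(self.path, "r+b") as f:
            for block in range(first, last + 1):
                if not self._is_cached(block):
                    start = block * BLOCK_SIZE
                    size = min(BLOCK_SIZE, self.remote_size - start)
                    f.seek(start)
                    f.write(self.fetch_range(start, size))
                    self._mark_cached(block)
            f.seek(offset)
            return f.read(length)


def read_parallel(requests):
    """Serve (cache, offset, length) requests in parallel, one thread each.

    Each request should use a different cache object, since a single cache
    is not protected by a lock in this sketch.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        return list(pool.map(lambda r: r[0].read(r[1], r[2]), requests))


# Demo with an in-memory byte string standing in for a remote data file.
remote_data = bytes(range(256)) * 400
fetches = []


def demo_fetch(offset, length):
    fetches.append((offset, length))
    return remote_data[offset:offset + length]


with tempfile.TemporaryDirectory() as scratch:
    demo_cache = SparseBlockCache(
        os.path.join(scratch, "track.cache"), len(remote_data), demo_fetch)
    demo_cache.read(20000, 100)  # fetches the single block containing the range
    demo_cache.read(20050, 10)   # served from the local cache, no new fetch
    print("remote fetches:", fetches)
```

In a real setting `fetch_range` would issue a byte-range request to the remote server, so that viewing a few genes transfers only the blocks that cover them.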
Assembly Data hubs
To address the increasing need for researchers to annotate sequence for which UCSC does not provide an annotation database, we will be introducing Assembly Data hubs. These will allow researchers to include the underlying reference sequence, as well as data tracks that annotate that sequence. Sequence will be stored in the UCSC twoBit format, and the annotation tracks will be stored in the same manner as Track Data Hubs.
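To give a feel for why twoBit is a compact way to store reference sequence, the Python sketch below packs DNA at two bits per base using the base codes of the twoBit format (T=0, C=1, A=2, G=3, first base in the most significant bits of each byte). It is only an illustration of the packing step: the function names are invented for this example, and the file header, sequence index, and the N and mask block lists of a real twoBit file are omitted. Padding the final byte with zero bits is an assumption of this sketch.

```python
# Two-bit code for each base, as used by the twoBit format.
BASE_TO_CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
CODE_TO_BASE = "TCAG"


def pack_dna(seq):
    """Pack an ACGT string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        shift = 6 - 2 * (i % 4)  # first base goes into the top two bits
        packed[i // 4] |= BASE_TO_CODE[base] << shift
    return bytes(packed)


def unpack_dna(packed, n_bases):
    """Recover the first n_bases bases from bytes produced by pack_dna."""
    return "".join(
        CODE_TO_BASE[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(n_bases)
    )


sequence = "GATTACA"
packed = pack_dna(sequence)
print(len(sequence), "bases stored in", len(packed), "bytes")
print(unpack_dna(packed, len(sequence)))
```

Bases other than A, C, G, and T (such as N) raise a KeyError here; the real format records runs of N separately instead of encoding them in the packed data.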
How to get help
- Search for answers in our mailing list archives: http://genome.ucsc.edu/contacts.html
- Email a new question to our actively monitored list: genome@soe.ucsc.edu
- OpenHelix's free training materials: http://www.openhelix.com/downloads/ucsc/ucsc_home.shtml
Other posters about the UCSC Genome Browser
- Visually integrating genomic data in the UCSC Genome Browser. Hinrichs AS et al. HGV, 2011: genomewiki page, .pptx, PDF
- UCSC Genome Browser Data Hubs. Zweig AS et al. Biology of Genomes, 2011: PDF
- Genome-wide ENCODE Data at UCSC. Rosenbloom KR et al. ASHG, 2010: PPT
- UCSC Genome Browser Tool Suite. Hinrichs AS et al. Genomics of Common Disease, 2008: .ppt, PDF