GI2013


Revision as of 20:46, 25 October 2013

Assembly Data Hubs support viewing any sequence on the UCSC Genome Browser.

Brian J Raney1, Ngan Nguyen1, Tim R Dreszer1, Galt P Barber1, Hiram Clawson1, Pauline A Fujita1, Donna Karolchik1, Ann S Zweig1, Benedict Paten1, William J Kent1

1 University of California Santa Cruz, CBSE, Santa Cruz, CA, 95064

Assembly Data Hubs allow anyone to use the UCSC Genome Browser to view their own sequences with associated annotation, without requiring UCSC to support a browser for that sequence. An Assembly Data Hub is a set of Internet-accessible data files that define the reference sequence to be used for a browser instance, together with all the data files that define the annotation for that sequence. User sequences can range from whole genome assemblies to just a few scaffolds from a re-sequencing project. The end user retains control over the sequence and the annotations, both of which can be updated at any time.
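As a rough illustration of what such a "set of Internet-accessible data files" can look like, the sketch below generates the two small text files that point the browser at a hub and at its reference sequence. The setting names (hub, genomesFile, twoBitPath and so on) are taken from the UCSC hub help pages as best understood here and should be checked against them; every value (hub name, assembly name, paths, e-mail address) is a made-up placeholder, not something from this abstract.

```python
def stanza(fields):
    """Render one hub stanza as 'key value' lines, in insertion order."""
    return "\n".join(f"{key} {value}" for key, value in fields.items()) + "\n"


# Top-level hub description. All values are hypothetical placeholders.
hub_txt = stanza({
    "hub": "exampleAssemblyHub",
    "shortLabel": "Example assembly hub",
    "longLabel": "Example assembly hub for a user-supplied sequence",
    "genomesFile": "genomes.txt",
    "email": "user@example.org",
})

# One stanza per assembly. twoBitPath names the reference sequence file
# and trackDb names the file listing the annotation tracks.
genomes_txt = stanza({
    "genome": "exampleAsm1",
    "trackDb": "exampleAsm1/trackDb.txt",
    "twoBitPath": "exampleAsm1/exampleAsm1.2bit",
    "organism": "Example organism",
    "description": "Example assembly",
    "defaultPos": "scaffold1:1-10000",
})

if __name__ == "__main__":
    print(hub_txt)
    print(genomes_txt)
```

Because the hub owner hosts these files, updating the sequence or an annotation amounts to replacing the corresponding file on the owner's server.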

Assembly Data Hubs are an extension of the Track Data Hubs feature, which allows user-level annotation on existing reference sequences. A Track Data Hub consists of text files containing metadata and indexed files containing the sequence annotations, all of which are Internet-accessible. The files need not reside on the same system and can be distributed widely. Track hub annotations are stored as compressed, indexed binary files in BigBed, BigWig, BAM, HAL, or VCF/tabix format. When a hub track is displayed in the Genome Browser, only the data needed to draw the current genomic region is transmitted to UCSC, rather than the entire file. All transmitted data is cached on a UCSC server, so subsequent access to the same data is local to UCSC.
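The partial-transfer idea can be sketched as follows. The abstract does not name the transport mechanism; this example assumes ordinary HTTP byte-range requests, and the URL, offset and length are hypothetical stand-ins for values that would in practice come from the file's index. The request is only constructed, never sent.

```python
import urllib.request


def range_request(url, offset, length):
    """Build (but do not send) a GET request for `length` bytes at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last = offset + length - 1  # HTTP byte ranges are inclusive at both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Hypothetical example: ask for 1024 bytes starting at byte 4096 of a hub file.
req = range_request("http://example.org/hub/exampleAsm1/genes.bb", 4096, 1024)

if __name__ == "__main__":
    print(req.get_header("Range"))  # prints "bytes=4096-5119"
```

A server that honours the header returns only the requested slice, which is what makes it practical to leave large annotation files on the hub owner's machine.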

As an example of the ease of creation and the usefulness of Assembly Data Hubs, we present the Cactus Alignment Pipeline, which we use to create an E. coli Reference Assembly in which 57 Escherichia coli and 9 Shigella complete genomes are aligned and a consensus reference is created. The resulting assembly hub supports viewing of all input sequences and the consensus sequence on the UCSC Genome Browser. Each sequence is annotated by an automated pipeline that maps gene annotation from well-annotated genomes to the other genomes and also generates useful annotations such as GC content and mappability.
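To make the GC content annotation concrete, here is a minimal sketch of a fixed-window GC percentage calculation. It is not the authors' pipeline; the function name, the window size and the choice to count only G and C (ignoring ambiguity codes) are assumptions made for illustration.

```python
def gc_percent_windows(sequence, window):
    """Return (start, gc_percent) for consecutive non-overlapping windows.

    The final window may be shorter than `window`; its percentage is
    computed over the bases it actually contains.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    seq = sequence.upper()
    result = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        result.append((start, 100.0 * gc / len(chunk)))
    return result


if __name__ == "__main__":
    # Toy sequence: one GC-rich window, one AT-only window, one short tail.
    for start, pct in gc_percent_windows("GGCCAATTGC", 4):
        print(start, pct)
```

Values of this kind would typically be converted to a signal format such as BigWig before being served from a hub.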

The BigWig, BigBed and Genome Browser software is implemented in C and supported on Linux. The source code is freely available for noncommercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip.

Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/.

BAM and VCF/tabix utilities are available from http://samtools.sourceforge.net/ and http://vcftools.sourceforge.net/.

The UCSC Genome Browser is publicly accessible at http://genome.ucsc.edu.

General information about setting up a track data hub is found at http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html.