The Assembly Hub function, new in the UCSC Genome Browser as of early 2013, allows you to display your own novel genome sequence in the UCSC Genome Browser.
To display your novel genome sequence, you use a web server at your institution to supply your files to the UCSC Genome Browser (please note that hosting hub files on HTTP tends to work even better than FTP). You then establish a hierarchy of directories and files to host your novel genome sequence. For example:
myHub/ - directory to organize your files on this hub
    hub.txt - primary reference text file to define the hub, refers to:
    genomes.txt - definitions for each genome assembly on this hub
    newOrg1/ - directory of files for this specific genome assembly
        newOrg1.2bit - '2bit' file constructed from your fasta sequence
        description.html - information about this assembly for users
        trackDb.txt - definitions for tracks on this genome assembly
        groups.txt - definitions for track groups on this assembly
        bigWig and bigBed files - data for tracks on this assembly
        external track hub data tracks can be displayed on this assembly
The URL to reference this hub would be: http://yourLab.yourInstitution.edu/myHub/hub.txt
You can view a working example hierarchy of files at: Plants
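The hierarchy described above can be laid out with a few lines of scripting. The following is a minimal Python sketch, not part of the Genome Browser tools: the function name make_hub_skeleton is hypothetical, and the files it creates are empty placeholders that you would replace with your real hub.txt, genomes.txt, .2bit, and track files.

```python
from pathlib import Path


def make_hub_skeleton(root, hub_dir="myHub", assembly="newOrg1"):
    """Create the directories and empty placeholder files for one assembly.

    Hypothetical helper for illustration only. The placeholder files are
    empty and must be replaced with real content before the hub is usable.
    """
    hub = Path(root) / hub_dir
    asm = hub / assembly
    asm.mkdir(parents=True, exist_ok=True)

    files = [
        hub / "hub.txt",              # primary definition of the hub
        hub / "genomes.txt",          # one stanza per genome assembly
        asm / (assembly + ".2bit"),   # sequence, built later with faToTwoBit
        asm / "description.html",     # information about the assembly
        asm / "trackDb.txt",          # track definitions
        asm / "groups.txt",           # track group definitions
    ]
    for path in files:
        path.touch()

    # Return the created paths relative to root, using forward slashes.
    return sorted(p.relative_to(root).as_posix() for p in files)
```

Calling make_hub_skeleton("/path/to/webroot") would create myHub/ and myHub/newOrg1/ under that directory; the web root path here is only an example.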
The initial file hub.txt is the primary URL reference for your assembly hub. The format of the file:
hub hubName
shortLabel genome
longLabel Comment describing this hub contents
genomesFile genomes.txt
email contactEmail@institution.edu
The shortLabel is the name that will appear in the genome pull-down menu at the UCSC gateway page. Example: Plants
The genomesFile is a reference to the next definition file in this chain, which describes the assemblies and tracks available at this hub. Typically genomes.txt is at the same directory level as hub.txt; however, it can also be a relative path reference to a different directory level.
The email address provides users a contact point for queries related to this assembly hub.
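As an illustration of the hub.txt format, the sketch below writes the five settings, one "setting value" pair per line. It is a hypothetical Python helper (render_hub_txt is not a UCSC tool), and the check that every value is non-empty and fits on one line is an assumption about what makes a sensible file, not a documented rule.

```python
def render_hub_txt(hub, short_label, long_label, genomes_file, email):
    """Return the text of a hub.txt file, one 'setting value' pair per line.

    Hypothetical helper for illustration only.
    """
    settings = [
        ("hub", hub),
        ("shortLabel", short_label),
        ("longLabel", long_label),
        ("genomesFile", genomes_file),
        ("email", email),
    ]
    for key, value in settings:
        # Assumption: each value must be present and stay on a single line.
        if not value or "\n" in value:
            raise ValueError("invalid value for " + key)
    return "".join("{} {}\n".format(key, value) for key, value in settings)
```

For example, render_hub_txt("hubName", "Plants", "Comment describing this hub contents", "genomes.txt", "contactEmail@institution.edu") yields text with the same five settings as the format shown above.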
The genomes.txt file provides the references to the genome assemblies and tracks available at this assembly hub. The example file indicates the typical contents:
genome ricCom1
trackDb ricCom1/trackDb.txt
groups ricCom1/groups.txt
description July 2011 Castor bean
twoBitPath ricCom1/ricCom1.2bit
organism Ricinus communis
defaultPos EQ973772:1000000-2000000
orderKey 4800
scientificName Ricinus communis
htmlPath ricCom1/description.html
There can be multiple assembly definitions in this single file. Separate these stanzas with blank lines. The references to other files are relative path references. In this example there is a sub-directory here called ricCom1 which contains the files for this specific assembly.
- The genome name is equivalent to the UCSC database name. The genome browser displays this database name on its title pages.
- The trackDb refers to a file which defines the tracks to place on this genome assembly. The format of this file is described in the Track Hub help reference documentation.
- The groups refers to a file which defines the track groups on this genome browser. Track groups are the sections of related tracks grouped together under the primary genome browser graphics display image.
- The description will be displayed for user information on the gateway page and most title pages of this genome assembly browser. It is the name displayed in the assembly pull-down menu on the browser gateway page.
- The twoBitPath refers to the .2bit file containing the sequence for this assembly. Typically this file is constructed from the original fasta files for the sequence using the kent program faToTwoBit.
- The organism string is displayed along with the description on most title pages in the genome browser. Adjust your names in organism and description until they are appropriate. This example is very close to what the genome browser normally displays. This organism name is the name that appears in the genome pull-down menu on the browser gateway page.
- The defaultPos specifies the default position the genome browser will open when a user first views this assembly. This is usually selected to highlight a popular gene or region of interest in the genome assembly.
- The orderKey is used together with the other genome definitions at this hub to order the entries in the genome pull-down menu.
- The htmlPath refers to an html file that is used on the gateway page to display information about the assembly.
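To make the stanza layout and the relative path references concrete, here is a minimal Python sketch that splits genomes.txt text into stanzas at blank lines and turns the path settings into full URLs. The function names are hypothetical, and the sketch assumes that relative paths resolve against the URL of genomes.txt itself; it is not the browser's own parser.

```python
from urllib.parse import urljoin

# Settings in a genomes.txt stanza whose values are file references.
PATH_KEYS = ("trackDb", "groups", "twoBitPath", "htmlPath")


def parse_stanzas(text):
    """Split text into stanzas (dicts of setting -> value) at blank lines."""
    stanzas, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the stanza being collected, if any.
            if current:
                stanzas.append(current)
                current = {}
            continue
        # The first word is the setting name, the rest of the line its value.
        key, _, value = line.partition(" ")
        current[key] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def resolve_paths(stanza, genomes_url):
    """Resolve the file references of one stanza against the genomes.txt URL.

    Assumption: relative paths are interpreted relative to genomes.txt.
    """
    return {key: urljoin(genomes_url, stanza[key])
            for key in PATH_KEYS if key in stanza}
```

With the ricCom1 stanza above and a genomes.txt served from http://yourLab.yourInstitution.edu/myHub/genomes.txt, resolve_paths maps twoBitPath to http://yourLab.yourInstitution.edu/myHub/ricCom1/ricCom1.2bit.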
The .2bit file is constructed from the fasta sequence for the assembly. The kent source program faToTwoBit is used to construct this file, for example:
faToTwoBit ricCom1.fa ricCom1.2bit
Use twoBitInfo to verify the sequences in this assembly and to create a chrom.sizes file. This file is not used in the hub, but it is useful in later processing to construct the big* files:
twoBitInfo ricCom1.2bit stdout | sort -k2rn > ricCom1.chrom.sizes
The .2bit commands can function with the .2bit file at a URL:
twoBitInfo -udcDir=. http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes
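The chrom.sizes file produced this way is a plain two-column listing of sequence name and length, which makes it easy to use in your own checks. The Python sketch below is one hypothetical use: it reads such a listing, orders it from the longest sequence to the shortest (similar in spirit to the sort step above, although ties may be ordered differently), and tests whether a defaultPos string falls inside a known sequence. The function names are invented for this example, and the treatment of positions as 1-based inclusive ranges is an assumption.

```python
def parse_chrom_sizes(text):
    """Return (name, size) pairs from chrom.sizes text, longest first."""
    sizes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, size = line.split()[:2]
        sizes.append((name, int(size)))
    # Sort by length in descending order.
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes


def position_in_bounds(position, sizes):
    """Check that 'name:start-end' lies inside a sequence listed in sizes.

    Assumption: start and end are 1-based and inclusive.
    """
    name, _, span = position.rpartition(":")
    start, _, end = span.replace(",", "").partition("-")
    lengths = dict(sizes)
    if name not in lengths:
        return False
    return 1 <= int(start) <= int(end) <= lengths[name]
```

A check of this kind can catch a defaultPos that names a sequence missing from the .2bit file, or one that extends past the end of a short scaffold.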
Sequence can be extracted from the .2bit file with the twoBitToFa command, for example:
twoBitToFa -seq=chrCp -udcDir=. http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout > ricCom1.chrCp.fa
The groups.txt file defines the grouping of track controls under the primary genome browser image display. The example referenced here has the usual definitions as found in the UCSC Genome Browser.
Each group is defined, for example the Mapping group:
name map
label Mapping
priority 2
defaultIsClosed 0
- The name is used in the group setting of a track definition in trackDb.txt to assign that track to this group.
- The label is displayed on the genome browser as the title of this group of track controls.
- The priority orders this track group relative to the other track groups.
- The defaultIsClosed determines whether this track group is expanded or closed by default. Valid values are 0 or 1.
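As a sketch of how the four settings fit together, the following Python helper writes a groups.txt file from a list of group definitions. It is hypothetical and makes several assumptions: stanzas are separated by blank lines as in genomes.txt, smaller priority values come first, and writing the stanzas in priority order is only for readability.

```python
def render_groups(groups):
    """Return groups.txt text for a list of group definitions.

    Each group is a dict with the keys name, label, priority and
    defaultIsClosed. Hypothetical helper for illustration only.
    """
    stanzas = []
    # Assumption: smaller priority values are listed first.
    for group in sorted(groups, key=lambda g: g["priority"]):
        if group["defaultIsClosed"] not in (0, 1):
            raise ValueError("defaultIsClosed must be 0 or 1")
        stanzas.append(
            "name {name}\n"
            "label {label}\n"
            "priority {priority}\n"
            "defaultIsClosed {defaultIsClosed}\n".format(**group)
        )
    # Assumption: a blank line separates consecutive stanzas.
    return "\n".join(stanzas)
```

A track then joins the Mapping group by using that group's name, map, as the value of its group setting in trackDb.txt.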
It helps to have a compute cluster available to process the genomes when constructing tracks, although small genomes can be processed on a single multi-core computer. The process for each track is unique. Please see the continuing document, Browser Track Construction, for a discussion of constructing tracks for your assembly hub.