Browser Track Construction
Latest revision as of 21:01, 23 April 2013
General comments
The file system layout in these examples is typical of a UCSC genome browser build. A consistent directory hierarchy lets automated scripts perform many of the build functions, so required files are kept in the same location for every browser build, where the tools can find them.
All browser builds are kept under a single directory hierarchy on a file system that is shared with the cluster computer system. This allows work to take place on a large-memory workhorse system while the same files remain available to cluster runs for that type of processing.
For example, all browser builds can be kept in /data/genomes/, with each build in a directory named after its database:
- /data/genomes/hg19/
- /data/genomes/mm10/
- /data/genomes/ricCom1/
2bit file construction
The usual genome browser construction at UCSC starts with the official released files from the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/
Download the NCBI files into:
- /data/genomes/ricCom1/genbank/
rsync works with the NCBI FTP site:
rsync -a -P rsync://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/plants/Ricinus_communis/JCVI_RCG_1.1/ /data/genomes/ricCom1/genbank/
This is a typical unplaced scaffold assembly with one additional non-nuclear chrCp plastid. There is a UCSC script in the source tree:
unplacedScaffolds.pl
which processes the genbank files into UCSC-style fasta and AGP files. The actual work in the script is simple: it removes the .1 from the accession identifiers in the fasta and AGP files. The bulk of the script merely maintains the naming scheme and file system hierarchy for follow-on tool processing. It creates the directory /data/genomes/ricCom1/ucsc/ and leaves the files ricCom1.ucsc.fa.gz and ricCom1.ucsc.agp there.
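The renaming step described above can be sketched with standard tools. This is a minimal illustration, not the unplacedScaffolds.pl script itself: the accessions, the file names and the renameDemo directory are invented, and it assumes the .1 suffix is stripped from the first word of each fasta header and from the object name in column 1 of the AGP file.

```shell
# Sketch only: the sample records and paths below are hypothetical.
mkdir -p renameDemo
printf '>XX000001.1 hypothetical scaffold\nACGTACGTAC\n' > renameDemo/in.fa
printf 'XX000001.1\t1\t10\t1\tW\tYY000001.1\t1\t10\t+\n' > renameDemo/in.agp

# fasta: drop a trailing ".1" from the first word of each header line
sed -E '/^>/ s/^(>[^ ]+)\.1( |$)/\1\2/' renameDemo/in.fa > renameDemo/out.fa

# AGP: drop a trailing ".1" from the object name (column 1), keep comment lines
awk 'BEGIN { FS = OFS = "\t" }
     /^#/  { print; next }
     { sub(/\.1$/, "", $1); print }' renameDemo/in.agp > renameDemo/out.agp

head -1 renameDemo/out.fa
cut -f1,6 renameDemo/out.agp
```

In this sketch the component identifier in column 6 keeps its version suffix; whether the real script also rewrites that column is not stated here.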
The extra non-nuclear plastid is added manually to the unplaced AGP file, and the chrCp fasta sequence is included with the unplaced fasta file to construct the .2bit file:
cd /data/genomes/ricCom1/ucsc
faToTwoBit ricCom1.ucsc.fa.gz chrCp.fa ../ricCom1.unmasked.2bit
cd /data/genomes/ricCom1
twoBitInfo ricCom1.unmasked.2bit stdout | sort -k2nr > chrom.sizes
The chrom.sizes file is useful for later processing.
Assembly and gap tracks
The assembly and gap tracks are constructed directly from the agp file:
mkdir -p /data/genomes/ricCom1/bed/assemblyTrack
cd /data/genomes/ricCom1/bed/assemblyTrack
# AGP positions are 1-based and BED starts are 0-based, hence $2-1
grep -v "^#" ../../ucsc/ricCom1.ucsc.agp | awk '$5 != "N"' \
  | awk '{printf "%s\t%d\t%d\t%s\t0\t%s\n", $1, $2-1, $3, $6, $9}' \
  | sort -k1,1 -k2,2n > ricCom1.assembly.bed
grep -v "^#" ../../ucsc/ricCom1.ucsc.agp | awk '$5 == "N"' \
  | awk '{printf "%s\t%d\t%d\t%s\n", $1, $2-1, $3, $8}' \
  | sort -k1,1 -k2,2n > ricCom1.gap.bed
bedToBigBed -verbose=0 ricCom1.assembly.bed ../../chrom.sizes ricCom1.assembly.bb
bedToBigBed -verbose=0 ricCom1.gap.bed ../../chrom.sizes ricCom1.gap.bb
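To see what the split of an AGP file into component and gap records produces, the following sketch runs on an invented three-line AGP. The scaffold and contig names and the agpDemo directory are hypothetical. It assumes AGP positions are 1-based and inclusive while BED starts are 0-based, so it subtracts one from the start column.

```shell
# Sketch only: a hypothetical three-line AGP (two components, one gap).
mkdir -p agpDemo
printf 'scafA\t1\t100\t1\tW\tctg1.1\t1\t100\t+\n'                >  agpDemo/demo.agp
printf 'scafA\t101\t150\t2\tN\t50\tscaffold\tyes\tpaired-ends\n' >> agpDemo/demo.agp
printf 'scafA\t151\t300\t3\tW\tctg2.1\t1\t150\t-\n'              >> agpDemo/demo.agp

# components (column 5 is not N): name from column 6, strand from column 9
awk 'BEGIN { FS = OFS = "\t" } $5 != "N" { print $1, $2 - 1, $3, $6, 0, $9 }' \
  agpDemo/demo.agp > agpDemo/assembly.bed

# gaps (column 5 is N): name from column 8 (the linkage field)
awk 'BEGIN { FS = OFS = "\t" } $5 == "N" { print $1, $2 - 1, $3, $8 }' \
  agpDemo/demo.agp > agpDemo/gap.bed

# components plus gaps should tile the whole scaffold (300 bases here)
cat agpDemo/assembly.bed agpDemo/gap.bed | awk '{ n += $3 - $2 } END { print n }'
```

A useful sanity check on real data is the last line: the summed lengths of the assembly and gap records for a sequence should equal its size in chrom.sizes.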
Track hub trackDb.txt entries:
track assembly_
longLabel Assembly
shortLabel Assembly
priority 10
visibility pack
colorByStrand 150,100,30 230,170,40
color 150,100,30
altColor 230,170,40
bigDataUrl bbi/ricCom1.assembly.bb
type bigBed 6
html ../trackDescriptions/assembly
url http://www.ncbi.nlm.nih.gov/nuccore/$$
urlLabel NCBI Nucleotide database
group map

track gap_
longLabel Gap
shortLabel Gap
priority 11
visibility dense
color 0,0,0
bigDataUrl bbi/ricCom1.gap.bb
type bigBed 4
group map
html ../trackDescriptions/gap
GC Percent
The calculation of the GC Percent track does not require masked sequence. You can construct this track directly from the unmasked.2bit file:
mkdir /data/genomes/ricCom1/bed/gc5Base
cd /data/genomes/ricCom1/bed/gc5Base
hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 ricCom1 \
  ../../ricCom1.unmasked.2bit | gzip -c > ricCom1.gc5Base.wigVarStep.gz
wigToBigWig ricCom1.gc5Base.wigVarStep.gz ../../chrom.sizes ricCom1.gc5Base.bw
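As an illustration of what the values in such a track mean, the sketch below computes the GC percentage of consecutive 5-base windows for a tiny invented sequence and writes them in variableStep wiggle form. It is not how hgGcPercent is implemented: the gcDemo directory and the sequence are hypothetical, only a single-sequence fasta is handled, and a trailing partial window and any N bases get no special treatment.

```shell
# Sketch only: hypothetical 15-base sequence, single fasta record.
mkdir -p gcDemo
printf '>demo\nGGCCAATTAAGCGCG\n' > gcDemo/demo.fa

awk -v win=5 '
  /^>/ { name = substr($1, 2); seq = ""; next }   # header: remember the name
  { seq = seq toupper($0) }                        # join sequence lines
  END {
    printf "variableStep chrom=%s span=%d\n", name, win
    # wiggle positions are 1-based; step through full windows only
    for (i = 1; i + win - 1 <= length(seq); i += win) {
      w = substr(seq, i, win)
      gc = gsub(/[GC]/, "", w)                     # count of G and C in window
      printf "%d\t%g\n", i, 100 * gc / win
    }
  }' gcDemo/demo.fa > gcDemo/demo.wig

cat gcDemo/demo.wig
```

For this input the three windows GGCCA, ATTAA and GCGCG give 80, 0 and 100 percent.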
Track hub trackDb.txt file entry:
track gc5Base_
shortLabel GC Percent
longLabel GC Percent in 5-Base Windows
group map
priority 23.5
visibility full
autoScale Off
maxHeightPixels 128:36:16
graphTypeDefault Bar
gridDefault OFF
windowingFunction Mean
color 0,0,0
altColor 128,128,128
viewLimits 30:70
type bigWig 0 100
bigDataUrl bbi/ricCom1.gc5Base.bw
html ../trackDescriptions/gc5Base
Repeat Masking
Repeat masking is necessary for almost all other track construction. UCSC uses Repeat Masker for the repeat masking track, TRF for simple repeats, and the NCBI WindowMasker for the Window Masker tracks.
The procedure is usually:
- break up the genome sequence into manageable pieces
- cluster run to mask each piece
- reassemble the results to original assembly coordinates
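The coordinate bookkeeping behind these three steps can be sketched as follows. This is an assumption-laden illustration, not the UCSC tooling: the 500-base piece size, the splitDemo directory, the "name:start-end" piece naming and the sample hits are all invented, and the masking run itself is replaced by a hand-written result file.

```shell
# Sketch only: hypothetical sizes, piece names and masker hits.
mkdir -p splitDemo
printf 'scafA\t1200\nscafB\t400\n' > splitDemo/chrom.sizes

# 1. break each sequence into pieces of at most 500 bases (BED coordinates)
awk -v size=500 'BEGIN { OFS = "\t" }
  { for (s = 0; s < $2; s += size) {
      e = (s + size > $2) ? $2 : s + size
      print $1, s, e, $1 ":" s "-" e
    } }' splitDemo/chrom.sizes > splitDemo/pieces.bed

# 2. stand-in for the cluster run: hits reported in piece-local coordinates
printf 'scafA:500-1000\t10\t60\nscafB:0-400\t5\t25\n' > splitDemo/piece.hits.bed

# 3. lift the hits back by adding the piece start encoded in the piece name
#    (assumes sequence names contain neither ":" nor "-")
awk 'BEGIN { OFS = "\t" }
  { split($1, p, /[:-]/); print p[1], p[2] + $2, p[2] + $3 }' \
  splitDemo/piece.hits.bed > splitDemo/lifted.bed

cat splitDemo/lifted.bed
```

Encoding the offset in the piece name keeps the reassembly step independent of the order in which cluster jobs finish.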
A choice must be made between the Repeat Masker and Window Masker results for masking the sequence. If the Repeat Masker libraries contain repeats for this species and Repeat Masker masks a good portion of the genome, its result can be used for the masking. If Repeat Masker masks only a small percentage of the genome, UCSC uses Window Masker to mask the sequence instead. In this Ricinus communis example, Repeat Masker found only 3% in repeats, whereas Window Masker found almost 45%. Thus the Window Masker result, plus the TRF repeats of period less than or equal to 12, is used to mask the sequence, resulting in 45.95% of the sequence masked. UCSC uses the larger masking result to avoid problems during pair-wise alignments: when the aligned sequences are not masked enough, the repeats overwhelm the alignment results.
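The comparison described above amounts to measuring what fraction of the genome each result masks. A minimal sketch, assuming each masker's output is available as a BED file of non-overlapping intervals; the maskDemo directory, file names and interval values are invented so that the percentages mirror the 3% and 45% of the example:

```shell
# Sketch only: hypothetical genome of 1000 bases and invented mask intervals.
mkdir -p maskDemo
printf 'scafA\t600\nscafB\t400\n'        > maskDemo/chrom.sizes
printf 'scafA\t100\t130\n'               > maskDemo/rmsk.bed
printf 'scafA\t0\t300\nscafB\t50\t200\n' > maskDemo/windowmasker.bed

genomeSize=$(awk '{ s += $2 } END { print s }' maskDemo/chrom.sizes)

# percent of the genome covered by the intervals in a BED file
pctMasked() {
  awk -v total="$genomeSize" \
    '{ m += $3 - $2 } END { printf "%.2f\n", 100 * m / total }' "$1"
}

rm=$(pctMasked maskDemo/rmsk.bed)
wm=$(pctMasked maskDemo/windowmasker.bed)
echo "Repeat Masker: $rm%  Window Masker: $wm%"

# keep whichever result masks more of the genome
awk -v rm="$rm" -v wm="$wm" \
  'BEGIN { print ((wm + 0 > rm + 0) ? "windowmasker" : "rmsk") }' > maskDemo/choice.txt
cat maskDemo/choice.txt
```

The sketch simply picks the larger value; what counts as "a good portion" of the genome for Repeat Masker remains a judgment call in practice.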
Please note specific details for running each of the masking operations:
- Window Masker
- TRF Simple Repeats
Other tracks
Please note specific details for processing additional tracks:
- Genscan
- CPG Islands