Browser Track Construction: Difference between revisions

Revision as of 23:04, 18 April 2013

General comments

The file system layout in these examples is typical of a UCSC genome browser build. A consistent directory hierarchy lets automated scripts perform many of the build steps, so required files are kept in the same location in every browser build, where the tools expect to find them.

All browser builds are kept under a single directory hierarchy on a file system that is shared with the compute cluster. This allows work to take place on a large-memory workhorse machine, and also allows the same files to be used during cluster runs for processing that needs the cluster.

For example, all browser builds are kept in /data/genomes/, with each build in a directory named after its database:

/data/genomes/hg19/
/data/genomes/mm10/
/data/genomes/ricCom1/
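
As a minimal sketch of why the fixed layout helps automation, a script can derive every build directory from the database name alone. The helper name buildDir and the GENOMES_ROOT variable are hypothetical, introduced only for this illustration; they are not part of the UCSC source tree.

```shell
#!/bin/sh
# Hypothetical helper: map a database name to its build directory.
# GENOMES_ROOT defaults to the /data/genomes root used in the examples here.
GENOMES_ROOT="${GENOMES_ROOT:-/data/genomes}"

buildDir() {
    # $1 = database name, e.g. hg19, mm10, ricCom1
    printf '%s/%s\n' "$GENOMES_ROOT" "$1"
}

# Print the build directory for each example database.
for db in hg19 mm10 ricCom1; do
    buildDir "$db"
done
```

Because the path is a pure function of the database name, follow-on tools need only the database name as an argument.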

2bit file construction

The usual genome browser construction at UCSC starts with the official released files from the NCBI FTP site:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/

Download the NCBI files into:

/data/genomes/ricCom1/genbank/

rsync works with the NCBI FTP site:

rsync -a -P rsync://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/plants/Ricinus_communis/JCVI_RCG_1.1/ /data/genomes/ricCom1/genbank/

This is a typical unplaced scaffold assembly with one additional non-nuclear chrCp plastid. There is a UCSC script in the source tree:

unplacedScaffolds.pl

which processes the GenBank files into UCSC-style fasta and AGP files. The actual work in the script is simple: it removes the .1 from the accession identifiers in the fasta and AGP files. The bulk of the script merely maintains the naming scheme and file system hierarchy that follow-on tools expect.
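
The version-stripping step can be illustrated with standard tools. This is only a sketch of the idea, not the code in unplacedScaffolds.pl: it assumes fasta headers of the form ">ACCESSION.1 description" and tab-separated AGP lines with the accession in columns 1 and 6. The function names and the accessions in the demo are made up.

```shell
#!/bin/sh
# Remove a trailing ".1" version from the accession on fasta header lines.
# Assumes headers look like ">ACCESSION.1 optional description".
stripFaVersion() {
    sed -E '/^>/ s/^(>[^[:space:].]+)\.1([[:space:]]|$)/\1\2/'
}

# Remove a trailing ".1" from the object (column 1) and component
# (column 6) names of tab-separated AGP lines; comment lines pass through.
stripAgpVersion() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         { sub(/\.1$/, "", $1); sub(/\.1$/, "", $6); print }'
}

# Demo on made-up records (the accessions are hypothetical).
printf '>XX000001.1 example scaffold\nACGTACGT\n' | stripFaVersion
printf 'XX000001.1\t1\t8\t1\tW\tYY000001.1\t1\t8\t+\n' | stripAgpVersion
```

Gap lines in an AGP file carry a gap length in column 6, which never ends in ".1", so they pass through with only column 1 changed.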

The extra non-nuclear plastid is added manually to the unplaced AGP file, and the chrCp fasta sequence is combined with the unplaced fasta file to construct the .2bit file:

cd /data/genomes/ricCom1/ucsc
faToTwoBit ricCom1.ucsc.fa.gz chrCp.fa ../ricCom1.unmasked.2bit
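
The manual AGP addition for the plastid amounts to one line covering the whole sequence. The sketch below shows one way such a line could be produced; it is an assumption, not the procedure actually used for ricCom1. The function names, the placeholder accession XX000000, and the component type F are all illustrative choices.

```shell
#!/bin/sh
# Count the bases in a single-sequence fasta file (header lines skipped).
seqLength() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Print a one-component AGP line spanning the whole sequence.
# $1 = object name (e.g. chrCp), $2 = component accession, $3 = length
# Columns: object, start, end, part number, component type,
#          component id, component start, component end, orientation.
agpLine() {
    printf '%s\t1\t%d\t1\tF\t%s\t1\t%d\t+\n' "$1" "$3" "$2" "$3"
}

# Demo with a tiny made-up sequence; XX000000 is a placeholder accession.
demo=$(mktemp)
printf '>chrCp\nACGTACGT\nACGT\n' > "$demo"
agpLine chrCp XX000000 "$(seqLength "$demo")"
rm -f "$demo"
```

The resulting line would be appended to the unplaced AGP file so that the AGP and the .2bit file describe the same set of sequences.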

@@ Line 1: / Line 1: @@
+==General comments==
+The file system layout in the examples here is typical for the UCSC genome browser build.
+The consistency of the file directory hierarchy allows automatic scripts to perform many
+of the functions.  Therefore, we keep required files in the same location for each browser
+build to allow the tools to find what they need.
+All browser builds are kept under one single directory hierarchy which is on a file system
+that is also shared with the cluster computer system.  This allows work to take place
+on a large memory work horse system, and also allows the same files to be used
+during cluster runs for that type of processing.
+For example, all browser builds are kept in '''/data/genomes/''', with each specific
+browser build in a directory named after its database:
+* '''/data/genomes/hg19/'''
+* '''/data/genomes/mm10/'''
+* '''/data/genomes/ricCom1/'''
 ==2bit file construction==
 The usual genome browser construction at UCSC starts with the ''official'' released files from the NCBI FTP site:
-[ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/ ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/]
+ [ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/ ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/]
+Download the NCBI files into:
+* '''/data/genomes/ricCom1/genbank/'''
+rsync works with the NCBI FTP site:
+ rsync -a -P rsync://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/plants/Ricinus_communis/JCVI_RCG_1.1/ /data/genomes/ricCom1/genbank/
+This is a typical unplaced scaffold assembly with one additional non-nuclear '''chrCp''' plastid.
+There is a UCSC script in the source tree:
+ [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/unplacedScaffolds.pl unplacedScaffolds.pl]
+which processes the genbank files into UCSC style fasta and AGP files.  The actual work in the script is simple: it removes the '''.1''' from
+the accession identifiers in the fasta and AGP files.  The bulk of the script merely maintains the naming scheme and
+file system hierarchy for follow-on tool processing.
+The extra non-nuclear plastid is manually added to the unplaced AGP file and the chrCp fasta sequence included with the
+unplaced fasta file to construct the '''.2bit''' file:
+ cd /data/genomes/ricCom1/ucsc
+ faToTwoBit ricCom1.ucsc.fa.gz chrCp.fa ../ricCom1.unmasked.2bit