DoBlastzChainNet.pl: Difference between revisions
(adding genome sequences) |
|||
Line 94: | Line 94: | ||
rsync -avzP \ | rsync -avzP \ | ||
rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.chrom.sizes . | rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.chrom.sizes . | ||
ls -og | |||
-rw-rw-r--. 1 36969050 Aug 28 2014 dm6.2bit | |||
-rw-rw-r--. 1 45055 Aug 28 2014 dm6.chrom.sizes | |||
Genome sequences from the NCBI/Entrez/Genbank system can be found via the | |||
assembly_summary.txt text listing information files, for example '''invertebrate''' genomes: | |||
wget -O /tmp/invertebrate.assembly_summary.txt 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt' | |||
Looking for the Anopheles genome: | |||
<pre> | |||
<nowiki> | |||
grep -w Anopheles /tmp/invertebrate.assembly_summary.txt | |||
GCF_000005575.2 PRJNA163 SAMN02952903 AAAB00000000.1 representative genome 180454 7165 Anopheles gambiae str. PEST strain=PEST latest Chromosome Major Full 2006/10/16 AgamP3 The International Consortium for the Sequencing of Anopheles Genome GCA_000005575.1 different ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3 | |||
</nowiki> | |||
</pre> | |||
Note the Assembly identification from the ftp path '''GCF_000005575.2_AgamP3''', | |||
working with that sequence: | |||
<pre> | |||
<nowiki> | |||
mkdir /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | |||
cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | |||
wget --timestamping 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_genomic.fna.gz' | |||
wget --timestamping \ | |||
'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_assembly_report.txt' | |||
ls -og | |||
total 79908 | |||
-rw-rw-r--. 1 811394 Mar 7 2017 GCF_000005575.2_AgamP3_assembly_report.txt | |||
-rw-rw-r--. 1 81010275 Jun 15 2016 GCF_000005575.2_AgamP3_genomic.fna.gz | |||
</nowiki> | |||
</pre> | |||
The assembly_report.txt file is useful to have for the meta-data information it has about the assembly. | |||
The '''fna.gz''' file needs to be in '''2bit''' format for the processing system, and the '''chrom.sizes''' made from the '''2bit''': | |||
faToTwoBit GCF_000005575.2_AgamP3_genomic.fna.gz GCF_000005575.2_AgamP3.2bit | |||
twoBitInfo GCF_000005575.2_AgamP3.2bit stdout | sort -k2,2nr > GCF_000005575.2_AgamP3.chrom.sizes | |||
ls -og | |||
total 156132 | |||
-rw-rw-r--. 1 77912208 Apr 6 03:48 GCF_000005575.2_AgamP3.2bit | |||
-rw-rw-r--. 1 138303 Apr 6 03:48 GCF_000005575.2_AgamP3.chrom.sizes | |||
-rw-rw-r--. 1 811394 Mar 7 2017 GCF_000005575.2_AgamP3_assembly_report.txt | |||
-rw-rw-r--. 1 81010275 Jun 15 2016 GCF_000005575.2_AgamP3_genomic.fna.gz | |||
[[Category:Cluster FAQ]] | [[Category:Cluster FAQ]] | ||
[[Category:Technical FAQ]] | [[Category:Technical FAQ]] |
Revision as of 03:54, 6 April 2018
Prerequisites
This discussion assumes you are familiar with Unix shell command line programming and scripting. You will be encountering and interacting with csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multiple compute cluster system or equivalent in a cloud computing environment.
Parasol Job Control System
The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.
Install scripts and kent command line utilities
This is a bit of a kludge at this time (April 2018), we are working on a cleaner distribution of these scripts. As was mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts are going to reside in /data/bin/ and /data/scripts/. This is merely a style custom to keep scripts separate from binaries, this is not strictly necessary to keep them separate.
mkdir -p /data/scripts /data/bin chmod 755 /data/scripts /data/bin rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/ git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \ --prefix=kent/ HEAD src/hg/utils/automation \ | tar vxf - -C /data/scripts --strip-components=5 \ --exclude='kent/src/hg/utils/automation/incidentDb' \ --exclude='kent/src/hg/utils/automation/configFiles' \ --exclude='kent/src/hg/utils/automation/ensGene' \ --exclude='kent/src/hg/utils/automation/genbank' \ --exclude='kent/src/hg/utils/automation/lastz_D' \ --exclude='kent/src/hg/utils/automation/openStack'
PATH setup
Add or verify the two directories /data/bin and /data/scripts are added to the shell PATH environment. This can be added simply to the .bashrc file in the your home directory:
echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc
Then, source that file to add that to this current shell:
. $HOME/.bashrc
Verify you see those pathnames on the PATH variable:
echo $PATH /data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
This entire discussion assumes the bash shell is the user's unix shell.
Working directory hierarchy
It is best to organize your work in a directory hierarchy. For example maintain all your genome sequences in:
/data/genomes/ /data/genomes/hg38/ /data/genomes/mm10/ /data/genomes/dm6/ /data/genomes/ce11/ ... etc ...
Where those database directories can have the 2bit files, chrom.sizes, and track construction directories, for example:
/data/genomes/dm6/dm6.2bit /data/genomes/dm6/dm6.chrom.sizes /data/genomes/dm6/trackData/
Such organizations are a personal preference custom. However you do this, keep it consistent to make it easier to use scripts on multiple sequences.
Obtain genome sequences
Genome sequences from the U.C. Santa Cruz Genomics Institute can be obtained directly from the hgdownload server via rsync. For example
mkdir /data/genomes/dm6 cd /data/genomes/dm6 rsync -avzP \ rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.2bit . rsync -avzP \ rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.chrom.sizes . ls -og -rw-rw-r--. 1 36969050 Aug 28 2014 dm6.2bit -rw-rw-r--. 1 45055 Aug 28 2014 dm6.chrom.sizes
Genome sequences from the NCBI/Entrez/Genbank system can be found via the assembly_summary.txt text listing information files, for example invertebrate genomes:
wget -O /tmp/invertebrate.assembly_summary.txt 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt'
Looking for the Anopheles genome:
grep -w Anopheles /tmp/invertebrate.assembly_summary.txt GCF_000005575.2 PRJNA163 SAMN02952903 AAAB00000000.1 representative genome 180454 7165 Anopheles gambiae str. PEST strain=PEST latest Chromosome Major Full 2006/10/16 AgamP3 The International Consortium for the Sequencing of Anopheles Genome GCA_000005575.1 different ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3
Note the Assembly identification from the ftp path GCF_000005575.2_AgamP3, working with that sequence:
mkdir /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 wget --timestamping 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_genomic.fna.gz' wget --timestamping \ 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_assembly_report.txt' ls -og total 79908 -rw-rw-r--. 1 811394 Mar 7 2017 GCF_000005575.2_AgamP3_assembly_report.txt -rw-rw-r--. 1 81010275 Jun 15 2016 GCF_000005575.2_AgamP3_genomic.fna.gz
The assembly_report.txt file is useful to have for the meta-data information it has about the assembly. The fna.gz file needs to be in 2bit format for the processing system, and the chrom.sizes made from the 2bit:
faToTwoBit GCF_000005575.2_AgamP3_genomic.fna.gz GCF_000005575.2_AgamP3.2bit twoBitInfo GCF_000005575.2_AgamP3.2bit stdout | sort -k2,2nr > GCF_000005575.2_AgamP3.chrom.sizes ls -og total 156132 -rw-rw-r--. 1 77912208 Apr 6 03:48 GCF_000005575.2_AgamP3.2bit -rw-rw-r--. 1 138303 Apr 6 03:48 GCF_000005575.2_AgamP3.chrom.sizes -rw-rw-r--. 1 811394 Mar 7 2017 GCF_000005575.2_AgamP3_assembly_report.txt -rw-rw-r--. 1 81010275 Jun 15 2016 GCF_000005575.2_AgamP3_genomic.fna.gz