CSHL 2015 Computational and Comparative Genomics: Difference between revisions
Line 234: | Line 234: | ||
</pre> | </pre> | ||
===construct hub file structure=== | ===construct hub file structure for WEB delivery=== | ||
On a directory where Apache can serve up files, and thus make all this available via | On a directory where Apache can serve up files, and thus make all this available via |
Revision as of 00:01, 3 November 2015
transfer data from student's laptop to CSHL
Transferring data to class: On student's laptop where the data exists, verify enough disk space for this operation: $ cd # cd with no argument will go to HOME $ df -h . # verify disk space available in this directory == on this filesystem Filesystem Size Used Avail Capacity iused ifree %iused Mounted on /dev/disk1 233Gi 226Gi 6.1Gi 98% 59386162 1592652 97% / # looks like 6 Gb free ^^^^^ Go to the directory of data to transfer $ cd oenothera Measure the amount of data to package: $ du -hsc * 2.1G transcriptomes 2.1G total Total data is 2.1 Gb, the tar image compression will help. Generate compressed tar image of this directory: $ tar -cvzf $HOME/toCSHL.tgz ./ tar command arguments: c - create tar file v - verbose, show what is being packaged z - compress (gzip) while making tar image f - file name of tar image to construct ./ - package up everyting in this directory Take a look at the resulting file: $ cd # return to home directory where the result file is $ ls -l *.tgz -rw-rw-r-- 1 hclawson staff 770380941 Oct 29 22:26 toCSHL.tgz It is now only 735 Mb of compressed data: $ du -hsc *.tgz 735M toCSHL.tgz Transfer this file to the workstation at CSHL $ scp -p toCSHL.tgz hclawson@ecg15.cshl.edu:. scp option '-p' means preserve date/time stamps on the file so it will appear identical in the copy. Magic hand-waving here since there are various pathways through the networking here from wifi laptop connections to the class workstations. Talk with Dan for correct connection procedures. Now, on the desktop machines for the class, in the home directory, unpack the tar image here: $ mkdir oenothera $ cd oenothera $ tar xvzf ../toCSHL.tgz $ ls -l total 80 drwxr-xr-x 1 hclawson staff 330 Oct 29 21:47 transcriptomes
survey names in sequences
For this procedure, it would be good to record the name to name translation so that the original names could be used in the assembly track and thus be available to reference from the genome browser view back to this set of short named contigs..
To use the UCSC genome browser to view this work, it is helpful to reduce the very long names in the transcriptome fasta sequence that were constructed by the assembler. A pattern is seen in the names that suggests a substitution algorithm. They all start with: >Locus_<sequenceNumber>_otherBusiness or >NODE_<sequenceNumber>_otherBusiness The <sequenceNumber> identifiers appear to be unique within each fasta sequence, thus, the Locus_ or NODE_ can be replaced with a name related to the transcript, and the _otherBusiness can be discarded. $ awk -F'_' '{print $1}' all.contig.names.txt | sort | uniq -c 1985474 >Locus 1228519 >NODE As a test, constructing fasta with those short names: $ cd ~/oenothera/transcriptomes/assemblies find . -type f | sed -e 's#^./##;' | grep fasta | while read F do B=`basename ${F}` D=`dirname ${F}` id=`echo $D | sed -e 's/-.*//;'` printf "%s %s\n" "${id}" "${B}" 1>&2 sed -e "s#^>Locus_#>${id}.#; s#^>NODE_#>${id}.#; s#_.*##;" ${F} done | gzip -c > $HOME/all.contigs.fa.gz # verify nothing lost (using kent command line programs from ~/bin/) # from the original source $ faSize */*fasta* 1354344256 bases (51622 N's 1354292634 real 1351051824 upper 3240810 lower) in 3213993 sequences in 63 files # to the short name contigs: $ cd $ $ faSize all.contigs.fa.gz 1354344256 bases (51622 N's 1354292634 real 1351051824 upper 3240810 lower) in 3213993 sequences in 1 files # same numbers, nothing lost
construct artifical assemblies of each transcriptome
It would actually be best to order the contigs by size, largest ones first. This would look more convenient in the browser display where all these contigs are going to appear as if they were all on one single chromosome.
It is convenient for alignment processing and genome browser display to concatenate all the contigs from one transcriptome into a single sequence, inserting 100 bases of 'N' between each contig. Using the following perl script on each set of transcriptomes: ================================================= #!/usr/bin/env perl use strict; use warnings; sub usage { printf STDERR "usage: ./fakeAssembly [name] [fastaFile] > name.assembly.fa\n"; } my $argc = scalar(@ARGV); if ($argc != 2) { usage; exit 255; } sub outputGap { for (my $i = 0; $i < 100; ++$i) { printf "N"; } printf "\n"; } my $name = shift; my $fastaFile = shift; my $contigsDone = 0; open (FH, "<$fastaFile") or die "can not read $fastaFile"; while (my $line = <FH>) { chomp $line; if ($line =~ m/^>/) { if (0 == $contigsDone) { printf ">%s\n", $name } else { outputGap; } ++$contigsDone; } else { printf "%s\n", $line; } } close (FH);
Running that perl script on each directory that has a transcriptome fasta file:
#!/bin/bash mkdir $HOME/fakeAssemblies find . -type d | grep -v "^.$" | sed -e 's#^./##;' | while read D do fasta=`ls $D | grep fasta` name=`echo $D | awk -F'-' '{print $1}'` species=`echo $D | cut -d_ -f2-` faName="${name}_${species}" printf "%s %s\n" "${faName}" "$fasta" 1>&2 ./fakeAssembly.pl "${name}_${species}" ${D}/${fasta} | gzip -c > $HOME/fakeAssemblies/${faName}.fa.gz done
construct assembly hub files
Convert all the short-named artificial assemblies of each transcriptome into a single 2bit file:
faToTwoBit fakeAssemblies/*.fa.gz oenothera.2bit
The ideal agp file should have been constructed above when all the contigs were artificially stuffed into single artifical chromosomes to correctly relate these contigs with their short names back to the original long named contigs. In place of the ideal agp, we can onstruct a fake agp file for this assembly, and from that, the gap and assembly bigBed files for those tracks on the browser:
#!/bin/bash export asmName="oenothera" twoBitInfo ${asmName}.2bit stdout | sort -k2nr > ${asmName}.chrom.sizes twoBitToFa ${asmName}.2bit stdout | hgFakeAgp stdin ${asmName}.agp grep -v "^#" ${asmName}.agp | awk '$5 != "N" && $5 != "U"' \ | awk '{printf "%s\t%d\t%d\t%s\t0\t%s\n", $1, $2-1, $3, $6, $9}' \ | sort -k1,1 -k2,2n > ${asmName}.assembly.bed grep -v "^#" ${asmName}.agp | awk '$5 == "N" || $5 == "U"' \ | awk '{printf "%s\t%d\t%d\t%s\n", $1, $2-1, $3, $8}' \ | sort -k1,1 -k2,2n > ${asmName}.gap.bed bedToBigBed -extraIndex=name -verbose=0 \ ${asmName}.assembly.bed ${asmName}.chrom.sizes ${asmName}.assembly.bb bedToBigBed -verbose=0 ${asmName}.gap.bed ${asmName}.chrom.sizes \ ${asmName}.gap.bb
A simple track to construct is the gc5Base track to show GC content in the genome browser. Construct the bigWig file:
export asmName="oenothera" hgGcpercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 test \ ${asmName}.2bit | gzip -c > ${asmName}.wigVarStep.gz wigToBigWig "${asmName}.wigVarStep.gz" "${asmName}.chrom.sizes" \ "${asmName}.gc5Base.bw" rm -f ${asmName}.gc5Base.wigVarStep.gz
construct hub file structure for WEB delivery
On a directory where Apache can serve up files, and thus make all this available via URLs to the internet to become an assembly hub:
mkdir myPublicHubs cd myPublicHubs # construct file "hub.txt" printf "%s\n" \ "hub cshl2015 shortLabel Oenothera longLabel CSHL Computational and Comparative Genomics genomesFile genomes.txt email hiram@soe.ucsc.edu descriptionUrl Oenothera.html" > hub.txt # construct "genomes.txt" file as referred to from the "hub.txt" file, # complete with blat server setups: printf "%s\n" \ "genome oenothera trackDb oenothera/trackDb.txt groups groups.txt description Oct. 2015 Oenothera transcriptomes twoBitPath oenothera/oenothera.2bit organism Oenothera defaultPos OEOA_oakesiana_661:25293016-25295016 orderKey 4700 scientificName Oenothera transcriptomes htmlPath oenothera/description.html transBlat localhost 76877 blat localhost 76879" > genomes.txt # construct "Oenothera.html" file, this is the ''gateway'' page contents in # genome browser. It should fully explain what this browser is showing, # where the data came from, references to appropriate data sources and # papers about this assembly. This example is too brief to satisify # these requirements: printf "%s\n" \ "<HEAD><TITLE>Oenothera transcriptome hub</TITLE> <BODY> <P> Transcriptomes in one artificial assembly for the plant species <a href="https://en.wikipedia.org/wiki/Oenothera" target=_blank>Oenothera</A> </P> <BODY></HTML>" > Oenothera.html
the big files directory for sequence and tracks
In a directory under your myHubs where you constructed your hub.txt file which leads to references to these big files for the browser to display
mkdir myHubs/oenothera cd myHubs/oenothera
Copy to this directory the files you constructed above:
$ ls -ogrt -rw-r--r-- 1 449894575 Oct 30 08:55 oenothera.2bit -rw-r--r-- 1 200283371 Oct 30 12:19 oenothera.assembly.bb -rw-r--r-- 1 23959217 Oct 30 12:20 oenothera.gap.bb -rw-r--r-- 1 881705751 Oct 30 12:56 oenothera.gc5Base.bw -rw-rw-r-- 1 159 Nov 2 14:33 description.html -rw-rw-r-- 1 1014 Nov 2 14:33 trackDb.txt