CSHL 2015 Computational and Comparative Genomics

From genomewiki
Jump to navigationJump to search

transfer data from student's laptop to CSHL

Transferring data to class:

On student's laptop where the data exists, verify enough disk space
for this operation:

$ cd        # cd with no argument will go to HOME
$ df -h .   # verify disk space available in this directory == on this filesystem
Filesystem   Size   Used  Avail Capacity  iused   ifree %iused  Mounted on
/dev/disk1  233Gi  226Gi  6.1Gi    98% 59386162 1592652   97%   /

# looks like 6 Gb free    ^^^^^

Go to the directory of data to transfer
$ cd oenothera

Measure the amount of data to package:

$ du -hsc *

2.1G    transcriptomes
2.1G    total

Total data is 2.1 Gb, the tar image compression will help.
Generate compressed tar image of this directory:

$ tar -cvzf $HOME/toCSHL.tgz ./

tar command arguments:
   c - create tar file
   v - verbose, show what is being packaged
   z - compress (gzip) while making tar image
   f - file name of tar image to construct
   ./ - package up everyting in this directory

Take a look at the resulting file:

$ cd      # return to home directory where the result file is
$ ls -l *.tgz
-rw-rw-r--  1 hclawson  staff  770380941 Oct 29 22:26 toCSHL.tgz

It is now only 735 Mb of compressed data:

$ du -hsc *.tgz
735M    toCSHL.tgz

Transfer this file to the workstation at CSHL

$ scp -p toCSHL.tgz hclawson@ecg15.cshl.edu:.

scp option '-p' means preserve date/time stamps on the file so it will
appear identical in the copy.

Magic hand-waving here since there are various pathways through the
networking here from wifi laptop connections to the class workstations.
Talk with Dan for correct connection procedures.

Now, on the desktop machines for the class, in the home directory,
unpack the tar image here:
$ mkdir oenothera
$ cd oenothera
$ tar xvzf ../toCSHL.tgz
$ ls -l
total 80
drwxr-xr-x  1 hclawson  staff   330 Oct 29 21:47 transcriptomes

survey names in sequences

kent command line programs are used during the construction of these files, you can obtain these set of commands from a variety of locations: userApps/README

For this procedure, it would be good to record the name to name translation so that the original names could be used in the assembly track and thus be available to reference from the genome browser view back to this set of short named contigs..

To use the UCSC genome browser to view this work, it is helpful to reduce the very long names in the transcriptome fasta sequence
that were constructed by the assembler. A pattern is seen in the names that suggests a substitution algorithm. They all start with:
>Locus_<sequenceNumber>_otherBusiness
or
>NODE_<sequenceNumber>_otherBusiness
The <sequenceNumber> identifiers appear to be unique within each fasta sequence, thus, the Locus_ or NODE_ can be replaced with
a name related to the transcript, and the _otherBusiness can be discarded.
$ awk -F'_' '{print $1}' all.contig.names.txt | sort | uniq -c
1985474 >Locus
1228519 >NODE
As a test, constructing fasta with those short names:
$ cd ~/oenothera/transcriptomes/assemblies
find . -type f | sed -e 's#^./##;' | grep fasta | while read F
do
  B=`basename ${F}`
  D=`dirname ${F}`
  id=`echo $D | sed -e 's/-.*//;'`
  printf "%s %s\n" "${id}" "${B}" 1>&2
  sed -e "s#^>Locus_#>${id}.#; s#^>NODE_#>${id}.#; s#_.*##;" ${F}
done | gzip -c > $HOME/all.contigs.fa.gz

# verify nothing lost (using kent command line programs from ~/bin/)
# from the original source

$ faSize */*fasta*
1354344256 bases (51622 N's 1354292634 real 1351051824 upper 3240810 lower) in 3213993 sequences in 63 files

# to the short name contigs:

$ cd
$ $ faSize all.contigs.fa.gz
1354344256 bases (51622 N's 1354292634 real 1351051824 upper 3240810 lower) in 3213993 sequences in 1 files

# same numbers, nothing lost

construct artifical assemblies of each transcriptome

It would actually be best to order the contigs by size, largest ones first. This would look more convenient in the browser display where all these contigs are going to appear as if they were all on one single chromosome.

It is convenient for alignment processing and genome browser display to
concatenate all the contigs from one transcriptome into a single sequence,
inserting 100 bases of 'N' between each contig.  Using the following perl script
on each set of transcriptomes:

=================================================
#!/usr/bin/env perl

use strict;
use warnings;

sub usage {
  printf STDERR "usage: ./fakeAssembly [name] [fastaFile] > name.assembly.fa\n";
}

my $argc = scalar(@ARGV);
if ($argc != 2) {
  usage;
  exit 255;
}

sub outputGap {
  for (my $i = 0; $i < 100; ++$i) {
    printf "N";
  }
  printf "\n";
}

my $name = shift;
my $fastaFile = shift;

my $contigsDone = 0;
open (FH, "<$fastaFile") or die "can not read $fastaFile";
while (my $line = <FH>) {
  chomp $line;
  if ($line =~ m/^>/) {
    if (0 == $contigsDone) {
       printf ">%s\n", $name
    } else {
      outputGap;
    }
    ++$contigsDone;
  } else {
    printf "%s\n", $line;
  }
}
close (FH);

Running that perl script on each directory that has a transcriptome fasta file:

#!/bin/bash

mkdir $HOME/fakeAssemblies

find . -type d | grep -v "^.$" | sed -e 's#^./##;'  | while read D
do
  fasta=`ls $D | grep fasta`
  name=`echo $D | awk -F'-' '{print $1}'`
  species=`echo $D | cut -d_ -f2-`
  faName="${name}_${species}"
  printf "%s %s\n" "${faName}" "$fasta" 1>&2
  ./fakeAssembly.pl "${name}_${species}" ${D}/${fasta} | gzip -c > $HOME/fakeAssemblies/${faName}.fa.gz
done

construct assembly hub files

Convert all the short-named artificial assemblies of each transcriptome into a single 2bit file:

faToTwoBit fakeAssemblies/*.fa.gz oenothera.2bit

The ideal agp file should have been constructed above when all the contigs were artificially stuffed into single artifical chromosomes to correctly relate these contigs with their short names back to the original long named contigs. In place of the ideal agp, we can onstruct a fake agp file for this assembly, and from that, the gap and assembly bigBed files for those tracks on the browser:

#!/bin/bash

export asmName="oenothera"

twoBitInfo ${asmName}.2bit stdout | sort -k2nr > ${asmName}.chrom.sizes

twoBitToFa ${asmName}.2bit stdout | hgFakeAgp stdin ${asmName}.agp

grep -v "^#" ${asmName}.agp | awk '$5 != "N" && $5 != "U"' \
   | awk '{printf "%s\t%d\t%d\t%s\t0\t%s\n", $1, $2-1, $3, $6, $9}' \
      | sort -k1,1 -k2,2n > ${asmName}.assembly.bed

grep -v "^#" ${asmName}.agp | awk '$5 == "N" || $5 == "U"' \
   | awk '{printf "%s\t%d\t%d\t%s\n", $1, $2-1, $3, $8}' \
      | sort -k1,1 -k2,2n > ${asmName}.gap.bed

bedToBigBed -extraIndex=name -verbose=0 \
    ${asmName}.assembly.bed ${asmName}.chrom.sizes ${asmName}.assembly.bb

bedToBigBed -verbose=0 ${asmName}.gap.bed ${asmName}.chrom.sizes \
      ${asmName}.gap.bb

A simple track to construct is the gc5Base track to show GC content in the genome browser. Construct the bigWig file:

export asmName="oenothera"

hgGcpercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 test \
  ${asmName}.2bit | gzip -c > ${asmName}.wigVarStep.gz

wigToBigWig "${asmName}.wigVarStep.gz" "${asmName}.chrom.sizes" \
         "${asmName}.gc5Base.bw"

rm -f ${asmName}.gc5Base.wigVarStep.gz

construct hub file structure for WEB delivery

On a directory where Apache can serve up files, and thus make all this available via URLs to the internet to become an assembly hub:

mkdir myPublicHubs
cd myPublicHubs

# construct file "hub.txt"
printf "%s\n" \
"hub cshl2015
shortLabel Oenothera
longLabel CSHL Computational and Comparative Genomics
genomesFile genomes.txt
email hiram@soe.ucsc.edu
descriptionUrl Oenothera.html" > hub.txt

# construct "genomes.txt" file as referred to from the "hub.txt" file,
# complete with blat server setups:
printf "%s\n" \
"genome oenothera
trackDb oenothera/trackDb.txt
groups groups.txt
description Oct. 2015 Oenothera transcriptomes
twoBitPath oenothera/oenothera.2bit
organism Oenothera
defaultPos OEOA_oakesiana_661:25293016-25295016
orderKey 4700
scientificName Oenothera transcriptomes
htmlPath oenothera/description.html
transBlat localhost 76877
blat localhost 76879" > genomes.txt

# construct "Oenothera.html" file.  This file is referenced from the hub.txt file
# This identifies this track hub contents,
# it should contain at least a reference to the author of this hub for
# users to have a contact.  This file is shown from the track hub management
# page on the genome browser: http://genome.ucsc.edu/cgi-bin/hgHubConnect
# This example is too brief to satisfy these requirements:
printf "%s\n" \
"<HEAD><TITLE>Oenothera transcriptome hub</TITLE>
<BODY>
<P>
Transcriptomes in one artificial assembly for the plant species
<a href="https://en.wikipedia.org/wiki/Oenothera" target=_blank>Oenothera</A>
</P>
<BODY></HTML>" > Oenothera.html

The groups.txt file can be a copy from existing assembly hubs. It merely defines the categories below the genome browser graphic to arrange tracks into similar top functions. An example: groups.txt file.

populate the big files directory for sequence and tracks

In a directory under your myHubs where you constructed your hub.txt file which leads to references to these big files for the browser to display

mkdir myHubs/oenothera
cd myHubs/oenothera

Copy to this directory the files you constructed above:

$ ls -ogrt
-rw-r--r-- 1 449894575 Oct 30 08:55 oenothera.2bit
-rw-r--r-- 1      1936 Oct 30 12:18 oentothera.chrom.sizes
-rw-r--r-- 1 200283371 Oct 30 12:19 oenothera.assembly.bb
-rw-r--r-- 1  23959217 Oct 30 12:20 oenothera.gap.bb
-rw-r--r-- 1 881705751 Oct 30 12:56 oenothera.gc5Base.bw
-rw-rw-r-- 1       159 Nov  2 14:33 description.html
-rw-rw-r-- 1      1014 Nov  2 14:33 trackDb.txt

The description.html file describes this assembly. It should contain references to data sources for this assembly, pointers to NCBI resources that explain the assembly and its origin and its official accession identifiers in the NCBI assembly

It only needs to be a fragment of HTML code since it is shown within a container page. This example is not sufficient to satisfy these requirements, you can use template contents for this from any of the genome browser description pages for the hosted assemblies:

 printf "%s\n" \
"<p>
CSHL Computational and Comparative Genomics class project
</p>
<p>
This 'assembly' contains transcriptions from sub-species of the plant Oenothera
</p>" > description.html

The trackDb.txt file defines the tracks to be seen in this genome browser:

track assembly
longLabel Assembly 
shortLabel Assembly 
priority 10
visibility pack
colorByStrand 150,100,30 230,170,40
color 150,100,30
altColor 230,170,40
bigDataUrl oenothera.assembly.bb
type bigBed 6
group map
html assembly

track gap
longLabel Gap 
shortLabel Gap 
priority 11
visibility dense
color 0,0,0 
bigDataUrl oenothera.gap.bb
type bigBed 4
group map
html gap

track gc5Base
shortLabel GC Percent
longLabel GC Percent in 5-Base Windows
group map
priority 23.5
visibility full
autoScale Off
maxHeightPixels 128:36:16
graphTypeDefault Bar
gridDefault OFF
windowingFunction Mean
color 0,0,0
altColor 128,128,128
viewLimits 30:70
type bigWig 0 100
bigDataUrl oenothera.gc5Base.bw
html ../trackDescriptions/gc5Base

start gfServer

The specifications in the genomes.txt file mention a gfServer for this assembly hub. To start the gfServer processes for this assembly hub, in a directory somewhere else, this doesn't have to be in your hub hierarchy, but it will need a copy of the 2bit file in the same directory.

export asmName="oenothera"
gfServer -log=${asmName}.gfServer.trans.log -ipLog -canStop start \
   localhost 76877 -trans -mask ${asmName}.2bit &
gfServer -log=${asmName}.gfServer.log -ipLog -canStop start \
   localhost 76879 -stepSize=5 ${asmName}.2bit &

map arabidopsis genes to this transcript assembly

Since there is a gfServer running for this assembly hub, you can use it to map protein or DNA sequences to these transcripts. This is not the fastest method that can be used for this, but it will prove that your gfServer is operating correctly.

Do this work back in your working directory where you were building the bigBed/bigWig files for assembly, gap and gc5Base tracks.

Fetch the GCA_000001735.1_TAIR10_protein.faa.gz protein sequence file from NCBI:

 wget --timestamping \
 ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Arabidopsis_thaliana/all_assembly_versions/GCA_000001735.1_TAIR10/GCA_000001735.1_TAIR10_protein.faa.gz

Map this protein sequence .faa file with gfClient to your gfServer, warning this takes about nine hours. There are much faster means for this procedure using blat locally:

export faa="GCA_000001735.1_TAIR10_protein.faa.gz"
gfClient -dots=1000 -t=dnax -q=prot -minIdentity=50 genome-test.gi.ucsc.edu \
   76877 . ${faa} tair10.gfClient.psl

Filter those results for higher quality alignments:

pslCheck -pass=passed.pslCheck.tair10.gfClient.psl -prot tair10.gfClient.psl 2> /dev/null
pslFilter passed.pslCheck.tair10.gfClient.psl tair10.filter.psl
pslToBed tair10.filter.psl stdout | sort -k1,1 -k2,2n > tair10Genes.bed
bedToBigBed -extraIndex=name tair10Genes.bed oentothera.chrom.sizes tair10Genes.bb

Place the resulting tair10.bb file in your WEB directory with the other big files:

-rw-r--r-- 1  24703538 Nov  1 13:50 tair10Genes.bb

and add this section to your trackDb.txt file:

track tair10Genes
shortLabel TAIR10 genes
longLabel Arabidopsis TAIR10 genes mapped by blat to these transcriptomes
group genes
priority 40
visibility pack
type bigBed 12 .
colorByStrand 0,0,128 128,0,0
color 0,0,128
altColor 128,0,0
bigDataUrl tair10Genes.bb
searchIndex name

survey gene density

Since these are other species genes mapping to these transcripts, there will be a pattern of overlapping matches due to paralog/ortholog/homolog relationships. Let's survey the mapping density to see the overlaps of matches in the same locations:

bedItemOverlapCount -chromSize=oenothera.chrom.sizes test tair10Genes.bed > mapping.overlaps.bedGraph

Column 4 of that file is a count of number of overlaps in each individual transcript contig, overall stats can be found by ave:

ave -col=4 mapping.overlaps.bedGraph
Q1 1.000000
median 2.000000
Q3 4.000000
average 3.739669
min 1.000000
max 80.000000
count 323911
total 1211320.000000
standard deviation 5.154169

Some protein sequences have overlapping matches to a count of 80