CSHL 2015 Computational and Comparative Genomics

From genomewiki

Revision as of 18:09, 2 November 2015

Class Project

transfer data from student's laptop to CSHL

Transferring data to class:

On the student's laptop, where the data resides, verify that there is
enough disk space for this operation:

$ cd        # cd with no argument will go to HOME
$ df -h .   # verify disk space available in this directory == on this filesystem
Filesystem   Size   Used  Avail Capacity  iused   ifree %iused  Mounted on
/dev/disk1  233Gi  226Gi  6.1Gi    98% 59386162 1592652   97%   /

# looks like 6 GB free    ^^^^^

Go to the directory holding the data to transfer:
$ cd oentothera

Measure the amount of data to package:

$ du -hsc *
8.0K    commandos linux
8.0K    oenothera r.txt
8.0K    oenothera r.txt~
8.0K    script oneothera.txt
8.0K    script oneothera.txt~
2.1G    transcriptomes
2.1G    total
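
The two measurements above (free space from df, data size from du) can be compared in one step. This is only a sketch: the helper name enough_space is hypothetical, and it tests against the uncompressed size, which is a conservative bound because a gzip-compressed tar image is normally smaller.

```shell
# Hypothetical helper (not part of the original notes): succeed when the
# filesystem holding DEST has at least as many free KB as SRC occupies.
enough_space() {
  src="$1"
  dest="$2"
  # du -sk: total size of SRC in 1024-byte blocks
  need=$(du -sk "$src" | awk '{print $1}')
  # df -Pk: POSIX one-line-per-filesystem output; column 4 is available KB
  avail=$(df -Pk "$dest" | awk 'NR==2 {print $4}')
  [ "$avail" -ge "$need" ]
}

# Run from inside the data directory; the tar image is written to HOME.
if enough_space . "${HOME:-.}"; then
  echo "enough space for an uncompressed copy"
else
  echo "not enough space" 1>&2
fi
```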

Total data is 2.1 GB; compression of the tar image will reduce that.
Generate a compressed tar image of this directory:

$ tar -cvzf $HOME/toCSHL.tgz ./

tar command arguments:
   c - create tar file
   v - verbose, show what is being packaged
   z - compress (gzip) while making tar image
   f - file name of tar image to construct
   ./ - package up everything in this directory
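
Before transferring, the image can be listed without unpacking it by swapping 'c' (create) for 't' (list). The sketch below builds a small throwaway archive so it can run anywhere; the temporary directory and file contents are invented for illustration, and for the real image the equivalent command would be tar -tzf $HOME/toCSHL.tgz.

```shell
# Build a tiny example archive in a temporary directory (illustrative data only)
demo=$(mktemp -d)
mkdir "$demo/data"
echo "example" > "$demo/data/oenothera r.txt"
( cd "$demo/data" && tar -czf "$demo/demo.tgz" ./ )

# t - list the contents instead of creating; z and f as before
listing=$(tar -tzf "$demo/demo.tgz")
echo "$listing"

rm -rf "$demo"
```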

Take a look at the resulting file:

$ cd      # return to home directory where the result file is
$ ls -l *.tgz
-rw-rw-r--  1 hclawson  staff  770380941 Oct 29 22:26 toCSHL.tgz

It is now only 735 MB of compressed data:

$ du -hsc *.tgz
735M    toCSHL.tgz

Transfer this file to the workstation at CSHL

$ scp -p toCSHL.tgz hclawson@ecg15.cshl.edu:.

The scp option '-p' preserves the modification time, access time and
file mode of the original, so the copy carries the same time stamps.

The connection details are glossed over here, since there are several
network paths from laptop wifi connections to the class workstations.
Talk with Dan for the correct connection procedure.
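
Matching time stamps do not prove that the bytes arrived intact. One way to check is to compare checksums on both ends. This sketch uses the POSIX cksum utility; the helper names file_sum and same_file are hypothetical, and a local cp -p stands in for the scp transfer.

```shell
# Hypothetical helpers: print the CRC and byte count of a file, and
# compare two files by those values.
file_sum() {
  cksum "$1" | awk '{print $1, $2}'
}
same_file() {
  [ "$(file_sum "$1")" = "$(file_sum "$2")" ]
}

# Local demonstration; cp -p plays the role of scp -p here.
demo=$(mktemp -d)
echo "example data" > "$demo/original.tgz"
cp -p "$demo/original.tgz" "$demo/copy.tgz"
if same_file "$demo/original.tgz" "$demo/copy.tgz"; then
  echo "checksums match"
fi
rm -rf "$demo"

# For the real transfer, run cksum toCSHL.tgz on the laptop and again on
# the workstation, then compare the two output lines by eye.
```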

Now, on the desktop machines for the class, in the home directory,
unpack the tar image here:
$ mkdir oentothera
$ cd oentothera
$ tar xvzf ../toCSHL.tgz
$ ls -l
total 80
-rw-r--r--  1 hclawson  staff  1090 Oct 29 21:46 commandos linux
-rw-r--r--  1 hclawson  staff  4763 Oct 29 21:46 oenothera r.txt
-rw-r--r--  1 hclawson  staff  4698 Oct 29 21:46 oenothera r.txt~
-rw-r--r--  1 hclawson  staff  2887 Oct 29 21:46 script oneothera.txt
-rw-r--r--  1 hclawson  staff  1027 Oct 29 21:46 script oneothera.txt~
drwxr-xr-x  1 hclawson  staff   330 Oct 29 21:47 transcriptomes

survey names in sequences

To view this work in the UCSC genome browser, it helps to shorten the very long sequence names that the assembler constructed in the transcriptome fasta files. The names follow a pattern that suggests a substitution scheme. They all start with:
>Locus_<sequenceNumber>_otherBusiness
or
>NODE_<sequenceNumber>_otherBusiness
The <sequenceNumber> identifiers appear to be unique within each fasta file, so the Locus_ or NODE_ prefix can be replaced with a name related to the transcript, and the _otherBusiness suffix can be discarded.
$ awk -F'_' '{print $1}' all.contig.names.txt | sort | uniq -c
1985474 >Locus
1228519 >NODE
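
The notes do not record how all.contig.names.txt was made. One plausible way, an assumption rather than the author's recorded command, is to collect the header lines with grep. The sketch below does that on an invented three-sequence fasta file and also checks that the sequence numbers within that file are unique.

```shell
# Toy input; names and sequences are invented for illustration
demo=$(mktemp -d)
cat > "$demo/a.fasta" <<'EOF'
>Locus_1_otherBusiness
ACGT
>Locus_2_otherBusiness
GGCC
>NODE_3_otherBusiness
TTAA
EOF

# Assumed way to build the names file: keep only the header lines
grep -h '^>' "$demo"/*.fasta > "$demo/all.contig.names.txt"

# Same survey as above: count names by their first '_' separated field
prefix_counts=$(awk -F'_' '{print $1}' "$demo/all.contig.names.txt" | sort | uniq -c)
echo "$prefix_counts"

# Sequence numbers (second field) repeated within one file; empty if unique
dup_numbers=$(grep '^>' "$demo/a.fasta" | awk -F'_' '{print $2}' | sort | uniq -d)

rm -rf "$demo"
```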
As a test, construct a fasta file with those short names:
$ cd ~/oentothera/transcriptomes/assemblies
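find . -type f | sed -e 's#^\./##' | grep fasta | while read -r F
do
  B=$(basename "${F}")    # file name, only reported on stderr as progress
  D=$(dirname "${F}")     # directory name, which carries the transcript id
  id=$(echo "${D}" | sed -e 's/-.*//')    # keep the text before the first '-'
  printf "%s %s\n" "${id}" "${B}" 1>&2
  # turn >Locus_N_... and >NODE_N_... into >id.N, dropping the rest of the name
  sed -e "s#^>Locus_#>${id}.#; s#^>NODE_#>${id}.#; s#_.*##;" "${F}"
done | gzip -c > $HOME/all.contigs.fa.gz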
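
To see what the sed substitutions in the loop do, they can be applied to a couple of header lines by hand. In this sketch the id value sampleA and the sequences are invented; in the loop the id comes from the directory name.

```shell
id="sampleA"    # hypothetical id, standing in for the directory-derived value

# Two example records with the name patterns described above
short=$(printf '>Locus_12_otherBusiness\nACGT\n>NODE_7_otherBusiness\nGGCC\n' |
  sed -e "s#^>Locus_#>${id}.#; s#^>NODE_#>${id}.#; s#_.*##;")
echo "$short"
```

The header lines become >sampleA.12 and >sampleA.7, while the sequence lines pass through unchanged because they contain no underscore.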

# verify nothing lost (using kent command line programs from ~/bin/)
# from the original source

$ faSize */*fasta*
1354344256 bases (51622 N's 1354292634 real 1351051824 upper 3240810 lower) in 3213993 sequences in 63 files

# to the short name contigs:

$ cd
$ faSize all.contigs.fa.gz
1354344256 bases (51622 N's 1354292634 real 1351051824 upper 3240810 lower) in 3213993 sequences in 1 files

# same numbers, nothing lost
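
Where faSize is not installed, a rough stand-in using standard tools can count sequences and bases, and can also report any short names that ended up duplicated. The function names here are hypothetical, and this sketch does not reproduce the N, upper and lower case breakdown that faSize prints.

```shell
# Hypothetical helpers operating on a gzip-compressed fasta file
count_seqs() {
  # number of header lines
  gzip -dc "$1" | grep -c '^>'
}
count_bases() {
  # characters on the sequence lines, newlines excluded
  gzip -dc "$1" | grep -v '^>' | tr -d '\n' | wc -c | tr -d ' '
}
dup_names() {
  # header lines that occur more than once; no output means all names are unique
  gzip -dc "$1" | grep '^>' | sort | uniq -d
}

# Example use on the result file from above:
#   count_seqs  $HOME/all.contigs.fa.gz
#   count_bases $HOME/all.contigs.fa.gz
#   dup_names   $HOME/all.contigs.fa.gz
```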