Making a hub for a cell browser: Difference between revisions

Revision as of 18:45, 23 June 2022

This page explains how to use the makeCbHub script to build a track hub from bigWig, bigBed, and other big* files provided by a submitter.

File organization

The script assumes a certain directory structure when it creates the trackDb stanzas.

In your dataset directory, create a ‘hub’ directory where all of the hub-related files will live. In that hub directory, you will then create a directory for each composite/parent track (directory names should be all lowercase):

cb_dataset_dir/
    |--> hub/
        |--> track_set_a/
            |--> track_A1.bw
            |--> track_A2.bw
            |--> etc…
        |--> track_set_b/
            |--> track_B1.bw
            |--> etc…
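
Since the directory names are expected to be all lowercase, a quick check before running the script can save a rerun. The following is only a sketch: it builds a throwaway directory with two made-up composite directories (track_set_a and Track_Set_B are placeholders) and reports any name that contains uppercase letters.

```shell
# Sketch only: hubDir is a throwaway stand-in for cb_dataset_dir/hub,
# and both directory names below are made up for illustration.
hubDir=$(mktemp -d)
mkdir -p "$hubDir/track_set_a" "$hubDir/Track_Set_B"

badDirs=""
for d in "$hubDir"/*/; do
    name=$(basename "$d")
    # Lowercase the name and compare it with the original.
    lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    if [ "$name" != "$lower" ]; then
        echo "not lowercase: $name (rename to $lower)"
        badDirs="$badDirs $name"
    fi
done
```

In a real dataset, point hubDir at the existing hub directory and drop the mkdir line.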

Dividing the individual tracks into composite/parent tracks will vary from dataset to dataset. For example, in the collection mouse-brain-cutandtag, individual tracks were divided into a composite track for each dataset in the collection (e.g. h3k27ac, h3k27me3, h3k27me3-cell-lines, h3k36me3, h3k4me3, olig2, rad21), as the authors requested. In neuro-degen-atac, individual tracks were grouped according to their corresponding metadata field (e.g. broad-celltypes, clusters, neuronal-celltypes, neuronal-clusters). If you’re not sure how to group the tracks, ask Max and/or the contributors.

Finally, it’s best to symlink to the track files in the orig directory rather than copy them, which avoids duplicating large numbers of files. See human-enhancer-atlas/hub and fetal-chromatin-atlas/hub as examples, where the bigWigs alone were 212 GB and 138 GB, respectively. (/hive has a ton of storage, but it's good not to waste space.)
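
A minimal sketch of the symlink approach, assuming the submitter's files sit in an orig directory next to hub (the layout, directory names, and file names here are placeholders, and empty files stand in for real bigWigs):

```shell
# Sketch only: datasetDir is a throwaway stand-in for cb_dataset_dir.
datasetDir=$(mktemp -d)
mkdir -p "$datasetDir/orig" "$datasetDir/hub/track_set_a"
# Empty placeholder files standing in for the submitter's bigWigs.
touch "$datasetDir/orig/track_A1.bw" "$datasetDir/orig/track_A2.bw"

# Link each bigWig into the composite's directory instead of copying it.
# The relative target is resolved from hub/track_set_a/, hence ../../orig/.
for f in "$datasetDir"/orig/*.bw; do
    name=$(basename "$f")
    ln -sf "../../orig/$name" "$datasetDir/hub/track_set_a/$name"
done
ls -l "$datasetDir/hub/track_set_a"
```

Relative targets keep the links valid if the whole dataset directory is moved; absolute targets work too if that is not a concern.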

Running the script

For makeCbHub, at the very least, all you need is a directory of big* files. This is the required argument fileDir.

For example, use the commands below to generate the trackDb stanzas for a single composite track in the mouse-brain-cutandtag dataset:

cd /hive/data/inside/cells/datasets/mouse-brain-cutandtag/hub
makeCbHub olig2

Output:
track olig2
compositeTrack on
shortLabel olig2
longLabel olig2
visibility dense
autoScale group
type bigWig

     track olig2_cluster_non_oligo
     parent olig2 on
     shortLabel cluster_non_oligo
     longLabel cluster_non_oligo
     type bigWig 0.000000 2358.294189
     autoScale group
     bigDataUrl olig2/cluster_non_oligo.bw
     visibility dense

     track olig2_cluster_oligo
     parent olig2 on
     shortLabel cluster_oligo
     longLabel cluster_oligo
     type bigWig 0.000000 280.645325
     autoScale group
     bigDataUrl olig2/cluster_oligo.bw
     visibility dense


As you can see, it works, though the result wouldn't be particularly pretty to look at in the Genome Browser. The labels are not very human-friendly, and both tracks will be drawn in the same default color, black.
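
Before loading the hub, it can help to confirm that every bigDataUrl in the stanzas points at a file that exists. The sketch below assumes the stanzas were saved to a file named trackDb.txt in the hub directory (the file name and the temporary directory are assumptions for this demo) and that the relative bigDataUrl paths resolve from that file's location. One file is left out on purpose so the check has something to report.

```shell
# Sketch only: a throwaway hub directory with one of the two bigWigs missing.
hubDir=$(mktemp -d)
mkdir -p "$hubDir/olig2"
touch "$hubDir/olig2/cluster_oligo.bw"   # cluster_non_oligo.bw omitted on purpose

# Abbreviated copy of the stanzas shown above.
cat > "$hubDir/trackDb.txt" <<'EOF'
track olig2
compositeTrack on
type bigWig

     track olig2_cluster_non_oligo
     parent olig2 on
     bigDataUrl olig2/cluster_non_oligo.bw

     track olig2_cluster_oligo
     parent olig2 on
     bigDataUrl olig2/cluster_oligo.bw
EOF

# Pull out each bigDataUrl value and test it relative to the hub directory.
missing=0
for url in $(awk '$1 == "bigDataUrl" { print $2 }' "$hubDir/trackDb.txt"); do
    if [ ! -e "$hubDir/$url" ]; then
        echo "missing: $url"
        missing=$((missing + 1))
    fi
done
echo "$missing missing file(s)"
```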

Options to customize output

The six optional arguments for makeCbHub allow you greater control over what’s put into these trackDb stanzas, including shortLabels, longLabels, and colors.

Composite track labels

The option -d/--datasetList serves two purposes:

  1. If fileDir contains multiple directories, you can specify which of those you want to build trackDb stanzas for.
  2. By default, the directory names under fileDir are used as the labels for the composite/parent tracks in the trackDb. However, these are required to be all lowercase (e.g. h3k27ac, bw, or clusters). This option lets you specify the casing used for the short/long labels.

makeCbHub -d "Rad21 Olig2" bw/

track rad21
compositeTrack on
shortLabel Rad21
longLabel Rad21
visibility dense
autoScale group
type bigWig
...

track olig2
compositeTrack on
shortLabel Olig2
longLabel Olig2
visibility dense
autoScale group
type bigWig
...

This command assumes that in bw/ (fileDir), there are two directories: rad21 and olig2, but it will use Rad21 and Olig2 as the shortLabel/longLabel for those composites in the trackDb.

Individual track labels

The options -s/--shortLabel and -l/--longLabel allow you to control the short and long labels of the individual tracks in the composites. By default the script uses the file names as the labels, which, depending on how the files are named, can be pretty messy:

    ...
    track bw_P21208_1004_OPC_Ctr_RND1_peaks
    parent bw on
    shortLabel P21208_1004_OPC_Ctr_RND1_peaks
    longLabel P21208_1004_OPC_Ctr_RND1_peaks
    type bigWig 0.000000 160.000000
    autoScale group
    bigDataUrl bw/P21208_1004_OPC_Ctr_RND1_peaks.bw
    visibility dense
    ...

However, if we rebuild the trackDb with these options:

cd /hive/data/inside/cells/datasets/olg-eae-ms/eae-multiomics/hub
makeCbHub -s shortLabels.tsv -l acronyms.sorted.tsv bw/

Output:

    ...
    track bw_P21208_1004_OPC_Ctr_RND1_peaks
    parent bw on
    shortLabel OPC_Ctr
    longLabel OPC_Ctr - Control oligodendrocyte precursor cells
    type bigWig 0.000000 160.000000
    autoScale group
    bigDataUrl bw/P21208_1004_OPC_Ctr_RND1_peaks.bw
    visibility dense
    ...

The shortLabels file contains two columns.

  1. file name
  2. short label

Here's the line from the shortLabels.tsv used in the example above:

P21208_1004_OPC_Ctr_RND1_peaks.bw       OPC_Ctr
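
When file names follow a regular pattern, the shortLabels file can be drafted with a small loop rather than typed by hand. This is only a sketch: the two sed patterns are guessed from the single file name shown above and will need adjusting for other naming schemes, and the output goes to a temporary file rather than a real shortLabels.tsv.

```shell
# Sketch only: labelFile is a temporary stand-in for shortLabels.tsv.
labelFile=$(mktemp)

for name in P21208_1004_OPC_Ctr_RND1_peaks.bw; do
    # Assumed pattern: strip the leading sample IDs and the trailing
    # round/peaks suffix, leaving the cell type and condition.
    label=$(printf '%s\n' "$name" \
        | sed -E 's/^P[0-9]+_[0-9]+_//; s/_RND[0-9]+_peaks\.bw$//')
    printf '%s\t%s\n' "$name" "$label"
done > "$labelFile"

cat "$labelFile"
```

In practice the loop would run over the files in the composite's directory, and the result should be reviewed by eye before it is passed to -s.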

The longLabels file follows the same format as the 'acronymFile' setting that can be used in the cellbrowser.conf (and will most likely be the same file). The two columns are:

  1. short label
  2. long label

Here's the line from the acronyms.sorted.tsv used in the example above:

OPC_Ctr Control oligodendrocyte precursor cells

Note: the shortLabel/longLabel files can be csv or tsv format and their file names need to end with csv or tsv (e.g. shortLabels.tsv).
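
Because the long labels are looked up by short label, it is worth checking that every short label has a matching row in the long-label file. The sketch below uses small made-up copies of the two files in a temporary directory. The second shortLabels row deliberately has no long label, so the check has something to report.

```shell
# Sketch only: tiny stand-ins for shortLabels.tsv and acronyms.sorted.tsv.
workDir=$(mktemp -d)
printf 'P21208_1004_OPC_Ctr_RND1_peaks.bw\tOPC_Ctr\n' > "$workDir/shortLabels.tsv"
printf 'P21208_1005_MOL12_EAE_RND2_peaks.bw\tMOL12_EAE\n' >> "$workDir/shortLabels.tsv"
printf 'OPC_Ctr\tControl oligodendrocyte precursor cells\n' > "$workDir/acronyms.sorted.tsv"

unmatched=$(awk -F '\t' '
    NR == FNR { known[$1] = 1; next }   # first file: remember each short label
    !($2 in known) { print $2 }         # second file: short labels with no long label
' "$workDir/acronyms.sorted.tsv" "$workDir/shortLabels.tsv")

echo "short labels without a long label: ${unmatched:-none}"
```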

Colors

Finally, the -c/--color option allows you to color each of the tracks. It is equivalent to the color file that can be used with the cell browser, meaning that column 1 is the short label and column 2 is the color, typically a hex code, though an RGB tuple is also acceptable:
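
For reference, the trackDb color setting itself takes comma-separated R,G,B values, so it can be handy to see what a hex code corresponds to. The helper below is a hypothetical illustration, not part of makeCbHub, and the hex value is made up. Whether the script does this conversion for you is not stated here.

```shell
# Hypothetical helper: convert a hex code such as #1F77B4 to R,G,B.
hex_to_rgb() {
    hex=${1#\#}                              # drop a leading '#', if present
    r=$(printf '%s' "$hex" | cut -c1-2)
    g=$(printf '%s' "$hex" | cut -c3-4)
    b=$(printf '%s' "$hex" | cut -c5-6)
    # printf reads the 0x-prefixed strings as hexadecimal numbers.
    printf '%d,%d,%d\n' "0x$r" "0x$g" "0x$b"
}

hex_to_rgb '#1F77B4'    # prints 31,119,180
```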

Combine all of these settings together to get: