Managing cellbrowser.conf tag values for multiple datasets
Revision as of 16:19, 20 July 2022
The cellbrowser.conf file is made up of a series of tag/value pairs. For example, in the line "body_parts=['brain']", "body_parts" is the tag and "['brain']" is the value. This page covers how to manage (add, update, or query) these tags across some or all of the datasets we host.
Adding/updating tags
Most of the tag/value pairs point the cell browser to the various input files (e.g. ). However, some of them are needed to control the filter options, and include:
body_parts
diseases
projects
organisms
sources
life_stages
domains
There may be more added in the future. For new datasets, you should be configuring these settings as you add the datasets. If new tags are added, it may be necessary to add and backfill these settings in the cellbrowser.conf files of hundreds of datasets. This type of mass update can be managed using the script addTags.
The input for addTags is an N-column, tab-separated file. The first column is the dataset name and the following columns are the values you want added or updated. A header line is required, and each column needs an entry in the header line that tells the script which tag the values belong to. Here are the first few lines of the file that was used to add the diseases, projects, and organisms tags when they were introduced:
dataset	body_parts	organisms	projects	diseases
adultPancreas	pancreas	Human (H. sapiens)	CIRM	Healthy
aging-brain	brain	Mouse (M. musculus)		Healthy
aging-human-skin	skin	Human (H. sapiens)		Healthy
mouse-esophagus	esophagus	Mouse (M. musculus)		Healthy
tabula-muris-senis	all	Mouse (M. musculus)	Tabula Muris Consortium	Healthy
tabulamuris	all	Mouse (M. musculus)	Tabula Muris Consortium	Healthy
tabula-sapiens	all	Human (H. sapiens)	Tabula Muris Consortium	Healthy
gtex8	all	Human (H. sapiens)	GTEx	Healthy
adult-brain-vasc	brain	Human (H. sapiens)		Healthy
As you can see, the labels in the header are the names of the tags that the values in each column belong to. For example, when addTags is run, it will add an 'organisms' tag line to the cellbrowser.conf for the aging-brain dataset and set the value to ["Mouse (M. musculus)"], meaning the final line would be: organisms=["Mouse (M. musculus)"].
If you have multiple values for a tag, separate items by a comma.
dental-cells	teeth	Human (H. sapiens), Mouse (M. musculus)		Healthy
lepto-metastasis	brain, spinal cord	Human (H. sapiens)		Leptomeningeal Melanoma
stanford-czb-hlca	lung	Human (H. sapiens)		Lung Cancer, Healthy Control
teichmann-asthma	lung	Human (H. sapiens)		Asthma, Healthy Control
gut-cell-atlas	gut, colon, ileum, duojejunum	Human (H. sapiens)	Human Cell Atlas, hca	Crohn's Disease, Healthy Control, Healthy
lifespan-nasal-atlas	respiratory system, nasal, lung	Human (H. sapiens)		Influenza, Healthy Control
The script will translate the sets of comma-separated values to the format needed in the cellbrowser.conf
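As a rough illustration of that translation (not the actual addTags code), the sketch below reads rows in the input layout described above and builds the corresponding cellbrowser.conf lines. The helper names `to_conf_line` and `conf_lines` are hypothetical, and the use of `json.dumps` to produce the double-quoted list is an assumption chosen to match the organisms example shown earlier.

```python
import csv
import io
import json


def to_conf_line(tag, cell):
    """Turn one comma-separated cell into a cellbrowser.conf style line."""
    values = [v.strip() for v in cell.split(",") if v.strip()]
    return f"{tag}={json.dumps(values, ensure_ascii=False)}"


def conf_lines(tsv_text):
    """Map each dataset name to the conf lines implied by its row."""
    result = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        name = row.pop("dataset")
        # Empty cells mean "no value for this tag", so they are skipped.
        result[name] = [to_conf_line(tag, cell) for tag, cell in row.items() if cell]
    return result


# Two rows in the addTags input layout (tab-separated, header required).
sample = (
    "dataset\tbody_parts\torganisms\n"
    "aging-brain\tbrain\tMouse (M. musculus)\n"
    "lepto-metastasis\tbrain, spinal cord\tHuman (H. sapiens)\n"
)

for dataset, lines in conf_lines(sample).items():
    print(dataset, lines)
```

Splitting on commas means a value that itself contains a comma cannot be represented in this sketch.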
If a tag in the header already exists in the cellbrowser.conf for a listed dataset, then nothing will be changed. If you want the values in the file to replace those currently in the cellbrowser.conf, then you will need to add the -u/--update option when running addTags.
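The skip-unless-update behavior can be pictured with the minimal sketch below. It is an assumption about the logic, not the real addTags implementation: `apply_tag` is a hypothetical helper that works on the text of a cellbrowser.conf and only handles tags whose value fits on one line.

```python
import re


def apply_tag(conf_text, tag, value_line, update=False):
    """Add value_line for tag; replace an existing line only if update is True."""
    pattern = re.compile(rf"^{re.escape(tag)}\s*=.*$", re.MULTILINE)
    if pattern.search(conf_text):
        if not update:
            # Tag already present and no update requested: leave the text untouched.
            return conf_text
        # A function replacement avoids backslash processing in value_line.
        return pattern.sub(lambda match: value_line, conf_text, count=1)
    # Tag not present yet: append it as a new line.
    separator = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
    return conf_text + separator + value_line + "\n"


conf = 'name="gbm"\nbody_parts=["brain"]\n'
new_line = 'body_parts=["brain", "cortex"]'

print(apply_tag(conf, "body_parts", new_line) == conf)   # existing tag is kept
print(apply_tag(conf, "body_parts", new_line, update=True))
```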
Here’s an example that you can run for 5 datasets:
Getting tag values
You can get the values for a single tag, or for several tags passed as one quoted, space-separated argument, for a list of datasets using getTagVals. In the input file, the first column should be a list of dataset names. Any subsequent columns in the file are not interpreted by the script and are carried over unchanged to the output file. There may be several reasons you might want to get
Here's an example you can run to see the body_parts values for 10 datasets:
cd /hive/data/inside/cells/exampleDatasets
getTagVals rr.datasets.10.tsv body_parts
Which outputs:
cortex-dev	body_parts
xena	oral cavity, placenta, decidua, blood, cord blood, retina, spleen, lung, esophagus, pancreas, brain, cotex, hippocampus
adultPancreas	pancreas
autism	brain
lifespan-nasal-atlas	respiratory system, nasal, lung
lung-airway	lung
mouse-hsc	blood, bone marrow
dental-cells	teeth
adult-testis	testis
gbm	brain
As noted above, any columns in the input file past the dataset list will be carried over to the output. For example, if this is the starting file:
dataset	diseases
shalek-alexandria-project	Granuloma, Healthy Control, HIV-1, Chronic Rhinosinusitis, Type 2 Inflammatory Disease
allen-celltypes	Healthy
adultPancreas	Healthy
gbm	Glioblastoma
quake-gbm	Glioblastoma, Healthy Control
organoids	Healthy
chi-10x-mouse-cardiomyocytes	Healthy
cortex-atac	Healthy
organoidreportcard	Healthy
And you run this command:
cd /hive/data/inside/cells/exampleDatasets
getTagVals rr.datasets.disease.10.tsv "body_parts organisms"
You can see that the extra columns are just carried over to the output:
dataset	diseases	body_parts	organisms
shalek-alexandria-project	Granuloma, Healthy Control, HIV-1, Chronic Rhinosinusitis, Type 2 Inflammatory Disease		Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)
allen-celltypes	Healthy	brain	Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)
adultPancreas	Healthy	pancreas	Human (H. sapiens)
gbm	Glioblastoma	brain	Human (H. sapiens)
quake-gbm	Glioblastoma, Healthy Control	brain	Human (H. sapiens)
organoids	Healthy	organoid, brain	Human (H. sapiens), Rhesus macaque (M. mulatta), Chimp (P. troglodytes)
chi-10x-mouse-cardiomyocytes	Healthy	heart	Mouse (M. musculus)
cortex-atac	Healthy	brain	Human (H. sapiens)
organoidreportcard	Healthy	organoid, brain	Human (H. sapiens)
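The carry-over behavior shown above can be sketched as follows. This is only an illustration of the column handling: `append_tag_columns` is a hypothetical helper, and the `lookup` dictionary stands in for whatever getTagVals actually reads from each dataset's cellbrowser.conf.

```python
def append_tag_columns(rows, tags, lookup):
    """Append one column per requested tag, keeping existing columns as they are.

    rows   -- list of rows (lists of strings), header row first
    tags   -- tag names to append, in output order
    lookup -- {dataset name: {tag: [values]}}, a stand-in for the conf files
    """
    header, *body = rows
    output = [header + list(tags)]
    for row in body:
        known = lookup.get(row[0], {})
        # A tag missing for a dataset yields an empty cell.
        output.append(row + [", ".join(known.get(tag, [])) for tag in tags])
    return output


rows = [
    ["dataset", "diseases"],
    ["gbm", "Glioblastoma"],
    ["organoids", "Healthy"],
]
lookup = {
    "gbm": {"body_parts": ["brain"], "organisms": ["Human (H. sapiens)"]},
    "organoids": {"body_parts": ["organoid", "brain"]},
}

for line in append_tag_columns(rows, ["body_parts", "organisms"], lookup):
    print("\t".join(line))
```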
Getting a list of datasets on the RR
Both of these scripts require a list of datasets as input. We maintain a list of datasets on the RR in the directory /hive/data/inside/cells in the file rr.datasets.txt. This file looks like this:
2018-09-10	cortex-dev	Cortex development	4261
2019-10-07	xena/zeisel2015	Zeisel '15 Mouse cortex & hippocampus	3005
2019-10-07	xena/darmanis2015	Darmanis'15 Brain	290
2019-10-07	xena/head-neck	Head and Neck Cancer	3608
2019-10-07	xena/hca-cerebral-organoids	HCA Cerebral Organoids	35513
2019-10-07	xena/hca-fetal-maternal	HCA Fetal Maternal	6273
2019-10-07	xena/hca-pancreas	HCA Human Pancreas	581
2021-05-17	xena/hca-hematopoietic-profiling	HCA Hematopoetic Profiling	47953
2019-10-07	xena/hca-tissue-stability	HCA Tissue Stability	14704
2019-10-07	xena/hca-immune-census	HCA Census of Immune Cells	714342
Column 1 is the release date, column 2 is the short name, column 3 is the short label, and column 4 is the cell count.
It can't be used directly as input, but with a short bash pipeline you can quickly turn it into a usable list:
grep -v "/" ../rr.datasets.txt | cut -f2 -d$'\t' > my_dataset_list.tsv
This removes the entries with '/' in their short name, since those are subdatasets in a collection and we don't need to worry about tags in those datasets (at least not as of 7/2022).
The resulting file will look like:
head my_dataset_list.tsv
cortex-dev
xena
adultPancreas
autism
shalek-alexandria-project
lifespan-nasal-atlas
lung-airway
mouse-hsc
dental-cells
adult-testis
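If you would rather do the filtering in Python, the sketch below performs roughly the same step as the grep/cut pipeline. The function name `top_level_datasets` is hypothetical. One deliberate difference: it checks only the short-name column for '/', whereas `grep -v "/"` drops a line that contains a slash anywhere, including in the label.

```python
def top_level_datasets(lines):
    """Return short names (column 2) of entries that are not subdatasets."""
    names = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        short_name = fields[1]
        if "/" in short_name:
            continue  # subdataset inside a collection, e.g. xena/zeisel2015
        names.append(short_name)
    return names


# Three lines in the rr.datasets.txt layout (tab-separated, no header).
sample = [
    "2018-09-10\tcortex-dev\tCortex development\t4261\n",
    "2019-10-07\txena/zeisel2015\tZeisel '15 Mouse cortex & hippocampus\t3005\n",
    "2019-10-07\txena/hca-pancreas\tHCA Human Pancreas\t581\n",
]

print(top_level_datasets(sample))
```

To run it on the real file, pass an open file handle, for example `top_level_datasets(open("../rr.datasets.txt"))`, and write the returned names out one per line.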
You can then use this as a starting point for an input file to addTags and getTagVals.