Managing cellbrowser.conf tag values for multiple datasets

From Genecats

A cellbrowser.conf file is made up of a series of tag/value pairs. For example, in the line "body_parts=['brain']", "body_parts" is the tag and "['brain']" is the value. This page covers how to manage (add, update, and query) these tags across some or all of the datasets we host.

Adding/updating tags

Most of the tag/value pairs point a cell browser to its input files (e.g. exprMatrix or meta). Some, however, control the filter options. These include:

  • body_parts
  • diseases
  • projects
  • organisms
  • sources
  • life_stages
  • domains

More may be added in the future. For new datasets, configure these settings as you add the datasets. If new tags are introduced, it may be necessary to add and backfill them in the cellbrowser.conf files of hundreds of datasets. This type of mass update can be managed using the addTags script.

The input for addTags is an N-column, tab-separated file. The first column is the dataset name and the following columns hold the values you want added or updated. A header line is required, and each column needs an entry in the header that tells the script which tag its values belong to. Here are the first few lines of the file that was used to add the diseases, projects, and organisms tags when they were introduced:

dataset            body_parts organisms           projects                diseases
adultPancreas      pancreas   Human (H. sapiens)  CIRM                    Healthy
aging-brain        brain      Mouse (M. musculus)                         Healthy
aging-human-skin   skin       Human (H. sapiens)                          Healthy
mouse-esophagus    esophagus  Mouse (M. musculus)                         Healthy
tabula-muris-senis all        Mouse (M. musculus) Tabula Muris Consortium Healthy
tabulamuris        all        Mouse (M. musculus) Tabula Muris Consortium Healthy
tabula-sapiens     all        Human (H. sapiens)  Tabula Muris Consortium Healthy
gtex8              all        Human (H. sapiens)  GTEx                    Healthy
adult-brain-vasc   brain      Human (H. sapiens)                          Healthy

As you can see, the labels in the header are the names of the tags that the values in each column belong to. For example, when addTags is run on this file, it adds an 'organisms' tag line to the cellbrowser.conf for the aging-brain dataset and sets the value to ["Mouse (M. musculus)"], so the final line is: organisms=["Mouse (M. musculus)"].
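To preview what a given input file asks for before running addTags, the header-to-tag mapping described above can be sketched in shell. This is only an illustration, not the addTags implementation: the function name preview_tags and the sample file are made up here, and skipping empty cells is an assumption about how blank columns are treated.

```shell
# preview_tags (hypothetical helper): list dataset / tag / value triples
# from an addTags-style TSV, using the header row to name each column.
preview_tags() {
    awk -F'\t' '
        NR == 1 { for (i = 2; i <= NF; i++) tag[i] = $i; next }  # header row
        {
            for (i = 2; i <= NF; i++)
                if ($i != "") print $1 "\t" tag[i] "\t" $i        # assumed: blanks skipped
        }' "$1"
}

# Sample input mirroring the aging-brain row above (empty projects column).
sample_tsv=$(mktemp)
printf 'dataset\tbody_parts\torganisms\tprojects\tdiseases\n' > "$sample_tsv"
printf 'aging-brain\tbrain\tMouse (M. musculus)\t\tHealthy\n' >> "$sample_tsv"

preview_tags "$sample_tsv"
```

For the sample row this prints three tab-separated lines, one each for body_parts, organisms, and diseases, and nothing for the empty projects cell.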

If you have multiple values for a tag, separate the items with commas:

dataset              body_parts                      organisms                               projects              diseases
dental-cells         teeth                           Human (H. sapiens), Mouse (M. musculus)                       Healthy
lepto-metastasis     brain, spinal cord              Human (H. sapiens)                                            Leptomeningeal Melanoma
stanford-czb-hlca    lung                            Human (H. sapiens)                                            Lung Cancer, Healthy Control
teichmann-asthma     lung                            Human (H. sapiens)                                            Asthma, Healthy Control
gut-cell-atlas       gut, colon, ileum, duojejunum   Human (H. sapiens)                      Human Cell Atlas, hca Crohn's Disease, Healthy Control, Healthy
lifespan-nasal-atlas respiratory system, nasal, lung Human (H. sapiens)                                            Influenza, Healthy Control

The script will translate each set of comma-separated values into the list format needed in the cellbrowser.conf.
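As a rough illustration of that translation, the sketch below turns one comma-separated cell into the bracketed, double-quoted list form shown earlier for the organisms tag. The helper name to_conf_list is hypothetical, and the exact quoting and spacing that addTags writes for multi-item lists is an assumption.

```shell
# to_conf_list (hypothetical helper): "brain, spinal cord" -> ["brain", "spinal cord"]
to_conf_list() {
    printf '%s\n' "$1" | awk -F',' '{
        out = ""
        for (i = 1; i <= NF; i++) {
            item = $i
            gsub(/^[ \t]+|[ \t]+$/, "", item)   # trim surrounding whitespace
            if (item == "") continue            # ignore empty items
            sep = (out == "") ? "" : ", "
            out = out sep "\"" item "\""
        }
        print "[" out "]"
    }'
}

to_conf_list "brain, spinal cord"
# prints: ["brain", "spinal cord"]
```

Splitting on the comma alone, then trimming, means values that contain spaces or parentheses, such as Human (H. sapiens), stay intact as single items.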

If a tag in the header already exists in the cellbrowser.conf for a listed dataset, then nothing will be changed. If you want the values in the file to replace those currently in the cellbrowser.conf, you will need to add the -u/--update option when running addTags.
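The add-versus-update rule can be sketched on a throwaway file as follows. This is a minimal model of the behavior described above, not the addTags code: set_tag and its "update" argument are invented for the illustration, and appending the new line at the end of the file is an assumption.

```shell
# set_tag (hypothetical helper): write TAG=VALUE into a conf file.
# An existing TAG line is left alone unless the 4th argument is "update".
set_tag() {
    conf_file=$1 tag_name=$2 tag_value=$3 mode=${4:-add}
    if grep -q "^${tag_name}[[:space:]]*=" "$conf_file"; then
        [ "$mode" = "update" ] || return 0
        # Drop the old line; grep -v exits 1 if no lines remain, which is fine.
        grep -v "^${tag_name}[[:space:]]*=" "$conf_file" > "$conf_file.tmp" || true
        mv "$conf_file.tmp" "$conf_file"
    fi
    printf '%s=%s\n' "$tag_name" "$tag_value" >> "$conf_file"
}

# Demo on a temporary file, not a real cellbrowser.conf.
demo_conf=$(mktemp)
printf 'body_parts=["brain"]\n' > "$demo_conf"
set_tag "$demo_conf" organisms '["Mouse (M. musculus)"]'   # new tag: added
set_tag "$demo_conf" body_parts '["cortex"]'               # exists: unchanged
set_tag "$demo_conf" body_parts '["cortex"]' update        # exists: replaced
cat "$demo_conf"
```

After the three calls the file holds one organisms line and one body_parts line, with body_parts changed only by the call that asked for an update.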

Here’s an example that you can run for 5 datasets:


Getting tag values

You can get the values for a tag for a list of datasets using getTagVals (to query several tags at once, pass them as one quoted, space-separated argument, as in the second example below). In the input file, the first column should be a list of dataset names. Any subsequent columns in the file are not used by the script and are carried over unchanged to the output file. There may be several reasons you might want to get

Here's an example you can run to see the body_parts values for 10 datasets:

cd /hive/data/inside/cells/exampleDatasets
getTagVals rr.datasets.10.tsv body_parts

Which outputs:

cortex-dev           body_parts
xena                 oral cavity, placenta, decidua, blood, cord blood, retina, spleen, lung, esophagus, pancreas, brain, cotex, hippocampus
adultPancreas        pancreas
autism               brain
lifespan-nasal-atlas respiratory system, nasal, lung
lung-airway          lung
mouse-hsc            blood, bone marrow
dental-cells         teeth
adult-testis         testis
gbm                  brain


As noted above, any columns in the input file after the dataset column will be carried over to the output. For example, if this is the starting file:

dataset                      diseases
shalek-alexandria-project    Granuloma, Healthy Control, HIV-1, Chronic Rhinosinusitis, Type 2 Inflammatory Disease
allen-celltypes              Healthy
adultPancreas                Healthy
gbm                          Glioblastoma
quake-gbm                    Glioblastoma, Healthy Control
organoids                    Healthy
chi-10x-mouse-cardiomyocytes Healthy
cortex-atac                  Healthy
organoidreportcard           Healthy

And you run this command:

cd /hive/data/inside/cells/exampleDatasets
getTagVals rr.datasets.disease.10.tsv "body_parts organisms"

You can see that the extra columns are just carried over to the output:

dataset                      diseases                                                                               body_parts      organisms
shalek-alexandria-project    Granuloma, Healthy Control, HIV-1, Chronic Rhinosinusitis, Type 2 Inflammatory Disease                 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)
allen-celltypes              Healthy                                                                                brain           Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)
adultPancreas                Healthy                                                                                pancreas        Human (H. sapiens)
gbm                          Glioblastoma                                                                           brain           Human (H. sapiens)
quake-gbm                    Glioblastoma, Healthy Control                                                          brain           Human (H. sapiens)
organoids                    Healthy                                                                                organoid, brain Human (H. sapiens), Rhesus macaque (M. mulatta), Chimp (P. troglodytes)
chi-10x-mouse-cardiomyocytes Healthy                                                                                heart           Mouse (M. musculus)
cortex-atac                  Healthy                                                                                brain           Human (H. sapiens)
organoidreportcard           Healthy                                                                                organoid, brain Human (H. sapiens)

Getting a list of datasets on the RR

Both of these scripts require a list of datasets as input. We maintain a list of datasets on the RR in the directory /hive/data/inside/cells in the file rr.datasets.txt. This file looks like this:

2018-09-10 cortex-dev                       Cortex development                        4261
2019-10-07 xena/zeisel2015                  Zeisel '15 Mouse cortex & hippocampus     3005
2019-10-07 xena/darmanis2015                Darmanis'15 Brain                         290
2019-10-07 xena/head-neck                   Head and Neck Cancer                      3608
2019-10-07 xena/hca-cerebral-organoids      HCA Cerebral Organoids                    35513
2019-10-07 xena/hca-fetal-maternal          HCA Fetal Maternal                        6273
2019-10-07 xena/hca-pancreas                HCA Human Pancreas                        581
2021-05-17 xena/hca-hematopoietic-profiling HCA Hematopoetic Profiling                47953
2019-10-07 xena/hca-tissue-stability        HCA Tissue Stability                      14704
2019-10-07 xena/hca-immune-census           HCA Census of Immune Cells                714342

Column 1 is the release date, column 2 is the short name, column 3 is the short label, and column 4 is the cell count.

This file can't be used directly as input, but with a short shell command you can quickly turn it into a usable list:

grep -v "/" ../rr.datasets.txt | cut -f2 -d$'\t' > my_dataset_list.tsv

This removes the entries with '/' in their short name, as those are subdatasets in a collection and we don't need to worry about tags in those datasets (at least not as of 7/2022).
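One caveat: grep -v "/" drops any line that contains a slash anywhere, including in the label column. If that ever matters, a variant that tests only column 2 could look like the sketch below. The helper name list_top_level is hypothetical, the sample rows are partly made up for the demonstration, and the file is assumed to be tab-separated as the cut command above implies.

```shell
# list_top_level (hypothetical helper): print the short name (column 2) of
# every row whose short name has no "/", ignoring slashes in other columns.
list_top_level() {
    awk -F'\t' 'index($2, "/") == 0 { print $2 }' "$1"
}

rr_sample=$(mktemp)
printf '2018-09-10\tcortex-dev\tCortex development\t4261\n' > "$rr_sample"
printf '2019-10-07\txena/zeisel2015\tZeisel mouse cortex\t3005\n' >> "$rr_sample"
# Made-up row: the slash is in the label, not in the short name.
printf '2020-01-01\tdemo-set\tTumor/normal demo\t100\n' >> "$rr_sample"

list_top_level "$rr_sample"
```

Here cortex-dev and demo-set are printed, while the xena/zeisel2015 subdataset is skipped.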

The resulting file will look like:

head my_dataset_list.tsv

cortex-dev
xena
adultPancreas
autism
shalek-alexandria-project
lifespan-nasal-atlas
lung-airway
mouse-hsc
dental-cells
adult-testis


You can then use this as a starting point for an input file for addTags and getTagVals.
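Since addTags requires a header line, one way to turn the dataset list into a starting template is sketched below. The function name make_input_template and the output file name are hypothetical. The value columns are left for you to fill in, so each row initially holds only the dataset name.

```shell
# make_input_template (hypothetical helper): print a header naming the tag
# columns, followed by one row per dataset from the list file.
make_input_template() {
    dataset_list=$1
    shift
    {
        printf 'dataset'
        for tag_name in "$@"; do
            printf '\t%s' "$tag_name"
        done
        printf '\n'
        cat "$dataset_list"
    }
}

# Demo with a two-line list; in practice this would be my_dataset_list.tsv.
names_file=$(mktemp)
printf 'cortex-dev\nadultPancreas\n' > "$names_file"
make_input_template "$names_file" body_parts diseases > my_tags_template.tsv
head -n 1 my_tags_template.tsv
```

The first line of the resulting file is the tab-separated header dataset, body_parts, diseases, and the dataset names follow on the remaining lines.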