Finding archive datasets for the Cell Browser

From Genecats

Wrangling data from archives into the Cell Browser is less common than it used to be. This page covers how to find new datasets that should be relatively easy to import, should the need arise.

GEO

GEO contains a massive number of datasets; however, with basic searches alone it can be difficult to find those that should be relatively easy to import into the Cell Browser. Using the advanced search option, you can create a search like this:

((((RDS[Supplementary Files]) OR Loom[Supplementary Files]) OR h5ad[Supplementary Files]) OR Rdata[Supplementary Files]) OR Robj[Supplementary Files]

This shows any datasets that have RDS, Loom, h5ad, Rdata, or Robj files in the 'Supplementary Files' section.

Here's a URL that combines that supplementary file filter with the word 'single-cell' to find single-cell datasets on GEO with those file types: https://www.ncbi.nlm.nih.gov/gds/?term=single-cell+((((RDS%5BSupplementary+Files%5D)+OR+Loom%5BSupplementary+Files%5D)+OR+h5ad%5BSupplementary+Files%5D)+OR+Rdata%5BSupplementary+Files%5D)+OR+Robj%5BSupplementary+Files%5D
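If you want to vary the keyword or the list of file types, a small script can rebuild the same kind of URL. The following is only a sketch: the function names build_term and build_url are made up for this page, and the nesting of parentheses simply mimics what the advanced search builder produced above.

```python
from urllib.parse import quote_plus

GEO_SEARCH = "https://www.ncbi.nlm.nih.gov/gds/?term="
FIELD = "[Supplementary Files]"


def build_term(file_types, keyword=None):
    """Build a left-nested OR query like the one the advanced search produces."""
    if not file_types:
        raise ValueError("need at least one file type")
    clauses = [f"{ft}{FIELD}" for ft in file_types]
    term = f"({clauses[0]})"
    # Wrap every clause except the last in another pair of parentheses.
    for clause in clauses[1:-1]:
        term = f"({term} OR {clause})"
    if len(clauses) > 1:
        term = f"{term} OR {clauses[-1]}"
    return f"{keyword} {term}" if keyword else term


def build_url(file_types, keyword=None):
    """URL-encode the query; parentheses are left as-is to match the URL above."""
    return GEO_SEARCH + quote_plus(build_term(file_types, keyword), safe="()")


print(build_url(["RDS", "Loom", "h5ad", "Rdata", "Robj"], keyword="single-cell"))
```

Swapping the keyword (for example, for a tissue or species name) or dropping file types from the list gives a narrower search without having to go back through the advanced search form.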

It still takes work to go through that list and find interesting datasets that would make sense for the Cell Browser, but it can be a useful starting point. You can download files to something like /hive/users/$USER/cb/temp and explore them with h5adMetaInfo or rdsMetaInfo (or Python or R) to see if they contain all the elements needed to make a cell browser (matrix, cell type annotations, coordinates). Not all RDS/Rdata/Robj files are Seurat files, so be sure to double-check those.
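If you go the Python route for an h5ad file, the check for the three elements might look something like the sketch below. It assumes the third-party anndata package is installed, and the name hints used to recognize annotation columns and coordinates are guesses rather than any standard, so adjust them to what you actually see in the file. The function names are made up for this page.

```python
# Assumed naming hints; real files vary a lot, so treat these as a starting point.
LABEL_HINTS = ("cell_type", "celltype", "cluster", "louvain", "leiden", "annotation")
COORD_HINTS = ("umap", "tsne")


def missing_elements(n_cells, n_genes, obs_columns, obsm_keys):
    """Return which Cell Browser ingredients appear to be missing."""
    missing = []
    if n_cells == 0 or n_genes == 0:
        missing.append("matrix")
    if not any(hint in col.lower() for col in obs_columns for hint in LABEL_HINTS):
        missing.append("cell type annotations")
    if not any(hint in key.lower() for key in obsm_keys for hint in COORD_HINTS):
        missing.append("coordinates")
    return missing


def check_h5ad(path):
    """Open an h5ad file in backed (read-only) mode and report what looks missing."""
    import anndata  # third-party; imported here so missing_elements works without it

    adata = anndata.read_h5ad(path, backed="r")
    return missing_elements(
        adata.n_obs, adata.n_vars, list(adata.obs.columns), list(adata.obsm.keys())
    )
```

An empty list from check_h5ad("/hive/users/$USER/cb/temp/some_file.h5ad") (with the path filled in) suggests the file is worth a closer look; anything else tells you what to hunt for before committing to the import.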

HCA DCP

The HCA DCP can be another great source of datasets to import into the Cell Browser, though it contains many datasets curated and imported from GEO, so you should pay attention to the GEO accessions associated with a project in the DCP before going through the work to import it.

Start by going to https://data.humancellatlas.org/explore/projects, then under 'File' select 'File Types' such as h5ad, rds, and rdata, along with any variations in capitalization of those. Click a project title to see the project details. To check whether the project is already in the Cell Browser, look for the Cell Browser icon under the "Analysis Portals" section, or search for a GSE or other accession across all the desc.conf files:

cd /hive/data/inside/cells/datasets ; find . -name desc.conf | xargs grep your_accession_here  
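If you would rather do this check from Python (for example, to loop over several accessions from one project), a rough equivalent is sketched below. The function name find_accession is made up for this page. Unlike the grep above, it matches case-insensitively; like the grep, it is a plain substring match, so a short accession can also match longer ones that start with it.

```python
from pathlib import Path


def find_accession(datasets_dir, accession):
    """Return (desc.conf path, matching line) pairs that mention the accession."""
    hits = []
    needle = accession.lower()
    for conf in sorted(Path(datasets_dir).rglob("desc.conf")):
        # errors="replace" keeps a stray non-UTF-8 byte from aborting the scan.
        for line in conf.read_text(errors="replace").splitlines():
            if needle in line.lower():
                hits.append((str(conf), line.strip()))
    return hits
```

For example, find_accession("/hive/data/inside/cells/datasets", "your_accession_here") returns an empty list when no desc.conf mentions the accession.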

Many of the datasets in the HCA DCP don't have the files necessary to import them into the Cell Browser, but those filter settings should narrow the list down to the ones that do. You can download files to something like /hive/users/$USER/cb/temp and explore them with h5adMetaInfo or rdsMetaInfo (or Python or R) to see if they contain all the elements needed to make a cell browser (matrix, cell type annotations, coordinates). Not all RDS/Rdata/Robj files are Seurat files, so be sure to double-check those.