This page provides guidelines for wrangling a dataset into the Cell Browser, whether it comes from an archive (e.g. GEO) or was submitted to us by an external collaborator (aka live wrangling). This list is not comprehensive: some necessary steps may not be covered here, and you can skip steps that aren’t relevant to your current dataset.

For a collection, most steps apply to each dataset in that collection; however, a few (e.g. ‘Respond to submitters’) apply only to the collection as a whole.

Respond to submitters

This step only applies to ‘live wrangled’ datasets. Researchers will email us at cells@ucsc.edu requesting that we host their data. When you respond, cover the following points (unless they’ve already addressed them):

Let them know we can host it
Point them to the submission guidelines
Ask if this is for a publication (so you can gauge their timeline)
Ask if they want the dataset hidden

It’s best to respond to these emails within 24-48 hours of receiving them.

Example emails/responses:

Zhiwei Li, NN dataset

Their email:

Dear Sir or Madam，

I have a single cell dataset of mouse lung in allergic asthma, and the relevant paper is accepted to be published in Allergy,

I have set a UCSC cell browser in my local computer, and I want share the single cell data to the the website http://cells.ucsc.edu for public access, please tell me how to do it. Thank you.

Best wishes,

Dr. Zhiwei Li

Our response:

Hello, Zhiwei.

We would be happy to host your data on the UCSC Cell Browser. Please take a look at our submission guidelines: https://cellbrowser.readthedocs.io/en/master/submission.html. Let us know if you have any questions.

Thank you!

Angela Ting, NN dataset

Their email:

To whom this may concern,

We are preparing to resubmit our manuscript containing normal human ureter single-cell data to Developmental Cell (https://www.biorxiv.org/content/10.1101/2021.12.22.473889v1). The raw data and expression matrix have already been accepted by GEO, but we would like to deposit this data with UCSC cell browser to enable convenient access/utilization by the broader scientific community.

Please advise.

Our response:

Hi, Angela.

We'd be happy to host your data on the Cell Browser. Please review this page for more information about submitting data: https://cellbrowser.readthedocs.io/en/master/submission.html. After you've prepared everything for submission, feel free to share the required files and we can get started on the import. Let us know if you have any questions about the process!

Thanks!

Make a directory

Make a directory with the dataset short name in /hive/data/inside/cells/datasets. The submitters should have supplied you with one since it’s mentioned on the submission guidelines page. If not, you can ask them if they had a short name in mind and share the short name requirements with them. You will most likely have to adjust their suggested name.

If it’s a dataset you’re wrangling from the archives, you will have to think of a short name that captures the main idea of the dataset while adhering to our requirements.

Short name requirements:

4 words or fewer
All lowercase
Separate words with “-”
Aim for 20 characters or fewer
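
The requirements above can be checked mechanically before you settle on a name. The following is a minimal sketch, not an existing Cell Browser tool: check_short_name is a hypothetical helper, and it assumes digits are acceptable alongside lowercase letters (as in covid19-brain). Because the 20-character limit is a target rather than a hard rule, the sketch only warns about it.

```shell
# Hypothetical helper: check a candidate short name against the rules above.
# Assumption: digits are allowed in addition to lowercase letters.
check_short_name() {
    name="$1"
    case "$name" in
        # Reject empty names, any character outside a-z/0-9/'-',
        # and leading, trailing, or doubled hyphens.
        ""|*[!abcdefghijklmnopqrstuvwxyz0123456789-]*|-*|*-|*--*)
            echo "invalid: use lowercase letters or digits, with words separated by '-'"
            return 1
            ;;
    esac
    # Number of words = number of hyphens + 1.
    hyphens=$(printf '%s' "$name" | tr -cd '-')
    if [ $(( ${#hyphens} + 1 )) -gt 4 ]; then
        echo "invalid: more than 4 words"
        return 1
    fi
    # The length limit is a guideline, so warn instead of failing.
    if [ "${#name}" -gt 20 ]; then
        echo "ok, but longer than 20 characters"
        return 0
    fi
    echo "ok"
}
```

For example, check_short_name mouse-dev-brain prints "ok", while a name containing uppercase letters or more than four words is reported as invalid.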

Some common shortenings/contractions we use:

dev for developing
org for organoids
vasc for vascular

Some examples of good short names:

tabula-sapiens
mouse-dev-brain
mouse-gastrulation
hgap
covid19-brain

(You may notice that there are quite a few datasets that don’t seem to follow these guidelines. These were created before we established these rules and are ‘grandfathered’ in. You can’t change a short name once it’s been published to the main site.)

Download files

Within the directory made in the last step, make an ‘orig’ directory and place the ‘original’ files there. The files in orig should remain (mostly) unchanged from those you downloaded.
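
As a rough sketch, the dataset directory and its orig subdirectory can be created in one step. make_dataset_dir is a hypothetical helper named here for illustration only; the datasets path in the example call is the one given in the ‘Make a directory’ step, and <dname> stands for your short name.

```shell
# Hypothetical helper: create <root>/<short-name>/orig and print its path.
make_dataset_dir() {
    root="$1"
    short_name="$2"
    # -p creates the dataset directory and orig together and
    # does not complain if either already exists.
    mkdir -p "$root/$short_name/orig" && echo "$root/$short_name/orig"
}

# Example call (not run here):
#   make_dataset_dir /hive/data/inside/cells/datasets <dname>
```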

The easiest way of downloading a file is via aria2c:

aria2c https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133345/suppl/GSE133345_Annotations_of_all_1231_embryonic_cells_updated_0620.txt.gz
aria2c -o TS_germ_line.h5ad.gz 'https://figshare.com/ndownloader/files/34702051'

In this second example, the -o option lets us specify a name for the final file, rather than relying on the default of naming the file after the last part of the URL ('34702051' in this case).

If you have multiple files, place all of the URLs into a single file and use the ‘-i’ option:

aria2c -i my_files.lst
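
For illustration, here is one way such a list file could be put together, reusing the two URLs from the earlier examples. write_url_list is a hypothetical helper, not part of aria2c. The list holds one URL per line; an indented out= line directly beneath a URL sets the file name for that download, playing the same role as -o on the command line.

```shell
# Hypothetical helper: write a URL list for 'aria2c -i' to the given path.
write_url_list() {
    # One URL per line. An option line must start with whitespace and
    # applies to the URL directly above it.
    cat > "$1" <<'EOF'
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133345/suppl/GSE133345_Annotations_of_all_1231_embryonic_cells_updated_0620.txt.gz
https://figshare.com/ndownloader/files/34702051
  out=TS_germ_line.h5ad.gz
EOF
}

# Then, from the orig directory (not run here):
#   write_url_list my_files.lst
#   aria2c -i my_files.lst
```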

The utility rclone is another option for downloading files, though it does take some effort to set up. See our internal instructions. Once you have it set up, it is fairly easy to use.

If all else fails, you may need to download files to your computer and then upload those to hgwdev using scp:

scp <files> <uname>@hgwdev.gi.ucsc.edu:/hive/data/inside/cells/datasets/<dname>/orig

(If you do need to go this route, it’s probably best to do this while on the UCSC network to save your own bandwidth.)
