Advanced Cell Browser Topics: Difference between revisions

From Genecats

Revision as of 15:09, 17 June 2022

Generating coordinates using cbScanpy

Rarely, you will be wrangling a dataset where you need to generate the layout coordinates. This is most easily done with cbScanpy.

These steps and configurations can also be used when importing a bulk RNA-seq dataset, such as GTEx v8 or Treehouse Cancer Compendium. For bulk RNA-seq, you have to turn off all of the filtering steps since the expression values and number of genes expressed are significantly higher than in single-cell RNA-seq. Normal filtering done by scanpy would basically toss out all of the samples.

Setting up your scanpy.conf

Create one with the default values by running:

cbScanpy --init

We recommend turning off most of the cell filtering steps as we assume that the authors/submitters have already done the appropriate filtering and the default settings for these filters can be overzealous (e.g. removing 75% or more of the cells in some cases). Make the following changes to the scanpy.conf:

doTrimCells=False
doFilterMito=False
doFilterGenes=False

Context-dependent changes

Some changes only make sense depending on the particulars of your dataset.

Are the values in your matrix already normalized/logged? If the values include decimals and the max value is low (e.g. 6.0-10.0), then they probably are. In that case, set:

doExp=True
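The rule of thumb above can be sketched as a quick check. This is only a heuristic, not part of cbScanpy: the function name and the default cutoff of 10.0 (taken from the upper end of the example range above) are assumptions, and the sample values are made up.

```python
def looks_log_transformed(values, max_cutoff=10.0):
    """Heuristic: the values include decimals and the maximum is low.

    max_cutoff is an assumed threshold; adjust it for your data.
    """
    floats = [float(v) for v in values]
    if not floats:
        raise ValueError("no values to inspect")
    has_decimals = any(not v.is_integer() for v in floats)
    return has_decimals and max(floats) <= max_cutoff


# Hypothetical samples of matrix values
logged_sample = [0.0, 1.58, 3.32, 6.91]
count_sample = [0, 3, 250, 1200]
print(looks_log_transformed(logged_sample))
print(looks_log_transformed(count_sample))
```

Running this on a sample of values from the matrix is usually enough; there is no need to load the whole file.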

Does your dataset have more than 20,000 cells? Only run UMAP:

doLayouts=["umap"]
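If you are not sure how many cells the dataset has, a small helper can count them from the matrix header. This sketch assumes a tab-separated expression matrix (optionally gzipped) with one column per cell and the gene identifier in the first column; both function names are hypothetical.

```python
import gzip


def count_cells(matrix_path):
    """Count cells as the number of header columns minus the gene-ID column."""
    opener = gzip.open if str(matrix_path).endswith(".gz") else open
    with opener(matrix_path, "rt") as handle:
        header = handle.readline().rstrip("\r\n")
    return len(header.split("\t")) - 1


def suggested_layouts(n_cells, cutoff=20000):
    """Return ["umap"] above the cutoff, or None to keep the configured layouts."""
    return ["umap"] if n_cells > cutoff else None


print(suggested_layouts(25000))
```

Only the first line of the file is read, so this stays fast even for large matrices.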

Running cbScanpy

Once you have your scanpy.conf set up, it’s time to actually run cbScanpy:

cbScanpy -e orig/<expr_mat_file> -m orig/<meta_file> -o . -n <short_name> --skipMatrix --inCluster=<field_name>

If your scanpy.conf is not in the directory where you’re running cbScanpy, you’ll need to specify its location with the ‘-c’ option. We're skipping the matrix export as the input matrix is often already in the proper format.
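If you run this for several datasets, it can help to assemble the command in a script. The sketch below only builds the argument list from the options shown above; the function name, file names, and field name are hypothetical placeholders.

```python
import shlex


def build_cbscanpy_command(expr_matrix, meta_file, short_name, cluster_field,
                           out_dir=".", conf_path=None):
    """Build the cbScanpy argument list using the options described above."""
    cmd = [
        "cbScanpy",
        "-e", expr_matrix,
        "-m", meta_file,
        "-o", out_dir,
        "-n", short_name,
        "--skipMatrix",
        f"--inCluster={cluster_field}",
    ]
    # Only pass -c when scanpy.conf lives somewhere other than the current directory
    if conf_path is not None:
        cmd += ["-c", conf_path]
    return cmd


# Hypothetical file and field names
cmd = build_cbscanpy_command("orig/exprMatrix.tsv.gz", "orig/meta.tsv",
                             "my-dataset", "cluster",
                             conf_path="../scanpy.conf")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to actually launch the run.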

After that completes, run cbBuild and check out the results in the Cell Browser. Hopefully things separate out into relatively distinct clusters. If not, you can try adjusting the settings in scanpy.conf and trying again or asking Max or the submitters/authors for input.

Setting up rclone

Installation

You can install this using conda in a new environment or one of your existing ones (e.g. scanpyenv). Here we’ll set it up in a separate environment.

Conda create

Conda activate

Conda install

Get it working with…

Google Drive

Box

Link to full list?

Downloading a file

Other gotchas for cb work?

   File has to be in your drive/box/whatever.
   Can't remember: is there a way to download a public file?

Wrangling a bulk RNA dataset

Renaming a dataset

Note: a dataset’s shortname should (almost) never be changed after being pushed to the main site. People bookmark things and URLs make their way into publications and we want to try our hardest not to break those.

These steps allow you to change a dataset’s shortname without having to go through the often lengthy process of rebuilding the dataset from scratch.

First, rename the directory in datasets:

cd /hive/data/inside/cells/datasets/ 

mv {old_name} {new_name}

Then, rename the directory in htdocs-cells:

cd /usr/local/apache/htdocs-cells/

mv {old_name} {new_name}

Best to do this soon after you rename the other dir so that the same mv command isn’t too far back in your history.

Finally, rebuild the dataset:

cd - # Note that this will take you back to the last directory you were in

cbBuild -o alpha

This is necessary because the old name is still present in various dataset.json files, so rebuilding will replace the old names with the new ones.
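The two mv steps can also be scripted so both directories are checked before either one is renamed. This is a minimal sketch, not an existing tool: the function name is hypothetical, and the default roots are the two paths used above. You still need to run cbBuild afterwards.

```python
from pathlib import Path

DATASETS_DIR = Path("/hive/data/inside/cells/datasets")
HTDOCS_DIR = Path("/usr/local/apache/htdocs-cells")


def rename_dataset_dirs(old_name, new_name, roots=(DATASETS_DIR, HTDOCS_DIR)):
    """Rename old_name to new_name under each root directory."""
    roots = [Path(root) for root in roots]
    # Validate every root first so a failure does not leave only one dir renamed
    for root in roots:
        if not (root / old_name).is_dir():
            raise FileNotFoundError(f"{root / old_name} does not exist")
        if (root / new_name).exists():
            raise FileExistsError(f"{root / new_name} already exists")
    renamed = []
    for root in roots:
        (root / old_name).rename(root / new_name)
        renamed.append(root / new_name)
    return renamed
```

Passing different roots makes it possible to try the function on scratch directories before touching the real ones.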