Generating coordinates using cbScanpy

From Genecats
Occasionally, you need to generate layout coordinates for a dataset you are wrangling. This is most easily done with cbScanpy, which takes an expression matrix and some metadata and generates t-SNE, UMAP, and other layout coordinates.

These steps and configurations can also be used when importing a bulk RNA-seq dataset, such as GTEx v8 or Treehouse Cancer Compendium. For bulk RNA-seq, you have to turn off all of the filtering steps since the expression values and number of genes expressed are significantly higher than in single-cell RNA-seq. Normal filtering done by scanpy would basically toss out all of the samples.

Setting up your scanpy.conf

Create one with the default values by running:

cbScanpy --init

We recommend turning off most of the cell filtering steps as we assume that the authors/submitters have already done the appropriate filtering and the default settings for these filters can be overzealous (e.g. removing 75% or more of the cells in some cases). Make the following changes to the scanpy.conf:

doTrimCells=False
doFilterMito=False
doFilterGenes=False

Context-dependent changes

Some changes only make sense depending on the particulars of your dataset.

Are the values in your matrix already normalized/logged? If the values include decimals and the maximum value is low (e.g. 6.0-10.0), they probably are. If so, set:

doExp=True
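The check described above can be scripted rather than done by eye. The following is a minimal sketch, not part of cbScanpy: the function name summarize_values is made up for illustration, and it assumes a tab-separated matrix with one header row of cell IDs and gene identifiers in the first column.

```python
def summarize_values(lines):
    """Return (has_decimals, max_value) for a tab-separated expression matrix.

    Assumes (not guaranteed for every dataset): the first row is a header
    of cell IDs and the first column of every other row is a gene ID.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header row of cell IDs
    has_decimals = False
    max_value = 0.0
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        for field in fields[1:]:  # skip the gene ID column
            value = float(field)
            if not value.is_integer():
                has_decimals = True
            if value > max_value:
                max_value = value
    return has_decimals, max_value
```

To use it on a gzipped matrix, open the file in text mode (for example with gzip.open(path, "rt")) and pass the file handle in. Decimals combined with a low maximum suggest the matrix is already logged; whole numbers with a large maximum suggest raw counts.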

Does your dataset have more than 20,000 cells? Only run UMAP:

doLayouts=["umap"]
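If you are unsure how many cells the dataset has, the header of the expression matrix usually tells you. This is a rough sketch under the assumption that the matrix is tab-separated with one column per cell plus a leading gene ID column; count_cells and needs_umap_only are hypothetical helper names, not cbScanpy functions.

```python
def count_cells(header_line):
    """Count cells from the header row of a tab-separated matrix.

    Assumes the first column is the gene identifier, so it is not counted.
    """
    return len(header_line.rstrip("\n").split("\t")) - 1


def needs_umap_only(n_cells, threshold=20000):
    """Apply the rule of thumb from this page: more than 20,000 cells."""
    return n_cells > threshold
```

Read only the first line of the matrix file and pass it to count_cells; there is no need to load the whole matrix for this.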

Running cbScanpy

Once you have your scanpy.conf set up, it’s time to actually run cbScanpy:

cbScanpy -e orig/<expr_mat_file> -m orig/<meta_file> -o . -n <short_name> --skipMatrix --inCluster=<field_name>

If your scanpy.conf is not in the directory where you’re running cbScanpy, you’ll need to specify its location with the ‘-c’ option. We're skipping the matrix export because the input matrix is often already in the proper format.
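If you run this for many datasets, it can help to assemble the command in a small script. The sketch below only builds the argument list using the options shown on this page; build_cbscanpy_args is a hypothetical helper, and all file names passed to it are placeholders you supply.

```python
def build_cbscanpy_args(expr_file, meta_file, short_name, cluster_field,
                        out_dir=".", conf_file=None):
    """Build the cbScanpy command line described on this page as a list."""
    args = [
        "cbScanpy",
        "-e", expr_file,
        "-m", meta_file,
        "-o", out_dir,
        "-n", short_name,
        "--skipMatrix",
        "--inCluster=" + cluster_field,
    ]
    # Only add -c when scanpy.conf lives outside the working directory.
    if conf_file is not None:
        args += ["-c", conf_file]
    return args
```

The resulting list can be passed to subprocess.run(args, check=True) from the Python standard library, which raises an error if cbScanpy exits with a failure status.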

After that completes, run cbBuild and check out the results in the Cell Browser. Hopefully things separate out into relatively distinct clusters. If not, you can try adjusting the settings in scanpy.conf and trying again or asking Max or the submitters/authors for input.