Advanced Cell Browser Topics: Difference between revisions

(Removed bulk section + changes to renaming section.)
No edit summary
 
(3 intermediate revisions by the same user not shown)
==Generating coordinates using cbScanpy==
Content of this page has moved to other, individual pages:


Occasionally, you will wrangle a dataset for which you need to generate the layout coordinates yourself. This is most easily done with cbScanpy.
[[Generating_coordinates_using_cbScanpy]]


These steps and configurations can also be used when importing a bulk RNA-seq dataset, such as [https://gtex8.cells.ucsc.edu GTEx v8] or [https://treehouse.cells.ucsc.edu Treehouse Cancer Compendium]. For bulk RNA-seq, you have to turn off all of the filtering steps, since the expression values and the number of genes expressed are significantly higher than in single-cell RNA-seq. The filtering that scanpy normally does would discard nearly all of the samples.
[[Setting_up_rclone_for_the_Cell_Browser]]


===Setting up your scanpy.conf===
[[Renaming_a_Cell_Browser_dataset]]
Create one with the default values by running:
cbScanpy --init
 
We recommend turning off most of the cell filtering steps. We assume that the authors/submitters have already done the appropriate filtering, and the default settings for these filters can be overzealous (e.g. removing 75% or more of the cells in some cases). Make the following changes to scanpy.conf:
 
doTrimCells=False
doFilterMito=False
doFilterGenes=False
 
====Context-dependent changes====
Some changes only make sense depending on the particulars of your dataset.
 
Are the values in your matrix already normalized/logged? If the values include decimals and the maximum value is low (e.g. 6.0-10.0), they probably are. In that case, set:
doExp=True
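
The rule of thumb above (decimals present, low maximum) can be sketched as a quick check. This is only an illustration: the function name <code>looks_log_normalized</code> and the cutoff of 10.0 are assumptions taken from the example range above, not part of cbScanpy.

```python
def looks_log_normalized(values, max_threshold=10.0):
    """Heuristic only: True if some value has a fractional part
    and the largest value is at or below max_threshold.

    max_threshold=10.0 is an assumed cutoff based on the
    "6.0-10.0" rule of thumb, not a cbScanpy setting.
    """
    values = [float(v) for v in values]
    if not values:
        return False
    has_decimals = any(not v.is_integer() for v in values)
    return has_decimals and max(values) <= max_threshold


# Raw counts are integers with a large maximum.
print(looks_log_normalized([0, 3, 250]))
# Logged values have decimals and a small maximum.
print(looks_log_normalized([0.0, 1.58, 6.2]))
```

Running this on a sample of the matrix values is usually enough; the result is a hint, and looking at the values yourself is still worthwhile.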
 
Does your dataset have more than 20,000 cells? Only run UMAP:
doLayouts=["umap"]
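
A minimal sketch of the cell-count check, assuming the expression matrix is a tab-separated file with genes as rows, one column per cell, and a leading gene-ID column. The helper names <code>count_cells</code> and <code>layout_override</code> are hypothetical and not part of the Cell Browser tools.

```python
def count_cells(header_line, sep="\t"):
    """Count cells from the matrix header line.

    Assumes one column per cell plus a leading gene-ID column.
    """
    n_columns = len(header_line.rstrip("\n").split(sep))
    return max(n_columns - 1, 0)


def layout_override(n_cells, threshold=20000):
    """Return the scanpy.conf line for large datasets, else None."""
    if n_cells > threshold:
        return 'doLayouts=["umap"]'
    return None


header = "gene\tcell1\tcell2\tcell3\n"
print(count_cells(header))
print(layout_override(25000))
```

A return value of <code>None</code> means the dataset is at or below the threshold and the layout setting can stay as it is.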
 
===Running cbScanpy===
Once you have your scanpy.conf set up, it’s time to actually run cbScanpy:
 
cbScanpy -e orig/<expr_mat_file> -m orig/<meta_file> -o . -n <short_name> --skipMatrix --inCluster=<field_name>
 
If your scanpy.conf is not in the directory where you’re running cbScanpy, you’ll need to specify its location with the ‘-c’ option. We're skipping the matrix export (--skipMatrix) as the input matrix is often already in the proper format.
 
After that completes, run cbBuild and check out the results in the Cell Browser. Hopefully things separate out into relatively distinct clusters. If not, you can try adjusting the settings in scanpy.conf and trying again or asking Max or the submitters/authors for input.
 
==Setting up rclone==
 
rclone is 'rsync for cloud services' and essentially allows you to download files from cloud storage providers via the command line.
 
===Installation===
You should install rclone using conda in a new environment:
 
conda create -n rclone python=3.9
conda activate rclone
conda install -c conda-forge mamba
mamba install -c conda-forge rclone
 
===Setup===
Once you have rclone installed, you will have to go through some steps to get it to work with Box, Google Drive, Dropbox, and any other online services submitters might use. We recommend setting up new profiles (or ‘remotes’, as rclone calls them) only as needed.
 
Here are links on setting up rclone with two of the most common cloud storage providers that Cell Browser submitters use:
* [https://rclone.org/box/ Box]
* [https://rclone.org/drive/ Google Drive]
 
In both cases, you’ll essentially keep the default settings for everything except for this step:
 
Use auto config?
* Say Y if not sure
* Say N if you are working on a remote or headless machine
 
At this step, say ‘No’, as hgwdev is a remote machine. You will be given a URL to go to on your personal computer. To finish this step of the setup, you will also need rclone installed on your personal computer. If you’re on a Mac, it’s easiest to do so with [https://brew.sh/ homebrew]: once homebrew is installed, run brew install rclone. Executables for Windows are available at https://rclone.org/downloads/.
 
===Downloading a file===
To download a file using rclone, it has to be in your own Drive/Box/whatever. You can’t just provide rclone with a link to a public file (though there may be a way to do that).
 
The command to download a file from Drive/Box/whatever is:
rclone copy remote:path/to/file.name /path/to/destination
 
Where ‘remote’ is the name you gave that particular remote/profile during setup (e.g. work_gdrive) and ‘path/to/file.name’ is the path to the file you are downloading (e.g. cb_files/file.h5ad). The path will be relative to the top-level directory of that storage provider. rclone copy takes the source first and the destination second, so the last argument is the local directory the file is copied into.
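
As a sketch of how the pieces fit together, the snippet below assembles the download command from a remote name, a remote path, and a local destination directory. rclone copy takes the source first and the destination second. The helper <code>rclone_download_cmd</code> and the <code>orig</code> destination directory are illustrative assumptions; the snippet only prints the command and does not run rclone.

```python
import shlex


def rclone_download_cmd(remote, remote_path, dest_dir="."):
    """Build the argument list for downloading one file with rclone.

    The source is "remote:path", the destination is a local directory.
    """
    return ["rclone", "copy", f"{remote}:{remote_path}", dest_dir]


# 'work_gdrive' and 'cb_files/file.h5ad' are the example names used above;
# 'orig' is an assumed local destination directory.
cmd = rclone_download_cmd("work_gdrive", "cb_files/file.h5ad", "orig")
print(shlex.join(cmd))  # rclone copy work_gdrive:cb_files/file.h5ad orig
```

To actually run the download, the list could be passed to <code>subprocess.run(cmd, check=True)</code> on a machine where rclone and the remote are set up.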
 
==Renaming a dataset==
 
These steps allow you to change a dataset’s short name without having to go through the often lengthy process of rebuilding the dataset from scratch. They assume that you are carrying out the steps on hgwdev to change the short name for a dataset on cells-test.
 
'''Note''': a dataset’s short name should pretty much never be changed after being pushed to the main site. People bookmark things and URLs make their way into publications and we want to try our hardest not to break those.
 
First, rename the directory in datasets:
 
cd /hive/data/inside/cells/datasets/
mv {old_name} {new_name}
 
Then, rename the directory in htdocs-cells:
 
cd /usr/local/apache/htdocs-cells/
mv {old_name} {new_name}
 
It's best to do this soon after you rename the other directory, so that the same mv command isn’t too far back in your shell history.
 
Finally, rebuild the dataset:
cd /usr/local/apache/htdocs-cells/{new_name}
cbBuild -o alpha
 
This is necessary because the old name is still present in various dataset.json files, so rebuilding will ensure that all instances of the old name are replaced with the new one. If it's a collection, you'll need to rebuild each of the child datasets.
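
After rebuilding, a quick way to confirm that no dataset.json file still mentions the old name is to search the renamed directory. This is a minimal sketch: the helper <code>files_mentioning</code> is hypothetical, and it assumes you pass the dataset's directory under htdocs-cells as the root.

```python
from pathlib import Path


def files_mentioning(root, old_name, pattern="dataset.json"):
    """Return the dataset.json files under root that still contain old_name.

    Plain substring match: if the old name is contained in the new name
    (e.g. 'foo' vs. 'foo-v2'), every file will be reported.
    """
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        if old_name in text:
            hits.append(path)
    return hits
```

An empty list means the rebuild replaced every instance. For a collection, any child dataset that still appears in the list needs to be rebuilt.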

Latest revision as of 18:22, 3 August 2022