Cell Browser best practices: Difference between revisions

From Genecats
(→‎Finding a paper associated with a bioRxiv pub: adjusting formatting of section)
 
(18 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== '''Best Practices''' ==
==Best Practices==


'''Formatting configuration files'''
===Formatting configuration files===


*Typically you should keep a maximum of 80-120 characters per line; in Vim you can select a paragraph in visual mode and press <code>gq</code> (or use <code>gqap</code> in normal mode) to reflow it into multiple ~80 character lines
*Typically you should keep a maximum of 80-120 characters per line; in Vim you can select a paragraph in visual mode and press <code>gq</code> (or use <code>gqap</code> in normal mode) to reflow it into multiple ~80 character lines
*For special characters, please refer to HTML character encoding: https://ascii.cl/htmlcodes.htm
*For special characters, please refer to HTML character encoding: https://ascii.cl/htmlcodes.htm


==cellbrowser.conf==


== '''cellbrowser.conf''' ==
===Putting things into cellbrowser-confs repo===


Put things into cellbrowser-confs repo [[Commit cellbrowser/desc.conf files]](http://genomewiki.ucsc.edu/genecats/index.php/Wrangling_process#Commit_cellbrowser.2Fdesc.conf_files)
From inside a dataset directory:


<pre>git add dataset-name
<pre>git add desc.conf cellbrowser.conf
git commit -m "message"
git commit -m "message"
git push</pre>
git push</pre>


'''Naming datasets'''
Only do this for public datasets. If this is a collection, commit the files for each dataset in the collection. For additional help you can refer to [[Wrangling_process#Commit_cellbrowser.2Fdesc.conf_files | Commit cellbrowser/desc.conf files]].


Dataset names should be all lowercase, four words or fewer, fewer than 20 characters, and with words separated by hyphens.
===Naming datasets===
The names need to be lowercase because the Cell Browser (website) code converts all names to lowercase.
 
Dataset names should be:
 
*all lowercase
*four words or fewer
*fewer than 20 characters, with words separated by hyphens

The names need to be lowercase because the Cell Browser website code converts all names to lowercase.
There are only a few exceptions for early datasets (e.g. [https://cells-test.gi.ucsc.edu/?ds=adultPancreas  adultPancreas]).
There are only a few exceptions for early datasets (e.g. [https://cells-test.gi.ucsc.edu/?ds=adultPancreas  adultPancreas]).


'''Layout Coordinates'''
===Layout Coordinates===


Capitalize <code>"UMAP"</code> and <code>"tSNE"</code>.
*Capitalize <code>"UMAP"</code> and <code>"tSNE"</code>.
*Remove extra layout coordinates (e.g. PCA or Harmony): the cbImportTools export every available layout, but only the first two coordinates of each. The CB can only handle two coordinates, so these layouts often look like a clump of cells.


Remove extra layout coordinates (e.g. PCA or Harmony): the cbImportTools export every available layout, but only the first two coordinates of each. The CB can only handle two coordinates, so these layouts often look like a clump of cells.
The following two images are examples of PCA plots. For reference, the first image is from the "lung-airway" dataset and the second image is from the "hoc" dataset.


The following two images are examples of PCA plots. For reference, the first image is from the "lung-airway" dataset and the second image is from the "hoc" dataset.
[[File:lungairway_pca.png | x400px]]
[[File:lungairway_pca.png]]
[[File:hoc_pca.png | x500px]]
[[File:hoc_pca.png]]


'''Finding a paper associated with a bioRxiv pub'''
===Finding a paper associated with a bioRxiv pub===


Sometimes you will have to go back and edit the paper citation for a dataset.  
Sometimes you will have to go back and edit the paper citation for a dataset.  


1. Get the bioRxiv URL for your dataset, e.g. https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1
2. Copy this bit of the URL: 10.1101/2020.06.30.174391
3. Feed it to the bioRxiv API: https://api.biorxiv.org/details/biorxiv/. Here's an example command that calls the bioRxiv API via <code>curl</code> and then uses <code>jq</code> to extract just the key 'published' from the JSON response:
<pre>
<pre>
In /hive/data/inside/cells/datasets run
curl https://api.biorxiv.org/details/biorxiv/10.1101/2020.06.30.174391 2>/dev/null | jq '.[] | .[] | .published'
</pre>
4. If it's published, it'll have a DOI (e.g. 10.1038/s41593-021-00872-y); otherwise it'll just say 'NA'.
5. Paste the DOI into https://www.doi.org/ and you'll be taken directly to the paper.


====Finding all datasets with bioRxiv pubs====
If you want to go through and find all datasets with bioRxiv pubs and update them:
<pre>
cd /hive/data/inside/cells/datasets
find . -name desc.conf | xargs grep "biorxiv" |grep -v "Strange\|\#"  
find . -name desc.conf | xargs grep "biorxiv" |grep -v "Strange\|\#"  
 
</pre>
Should get results like:
Which should get you results like:
<pre>
./cbl-dev/desc.conf:biorxiv_url = "https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1 Aldinger et al. 2020. bioRxiv."  
./cbl-dev/desc.conf:biorxiv_url = "https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1 Aldinger et al. 2020. bioRxiv."  
Copy this bit of the URL: 10.1101/2020.06.30.174391
And feed it to the bioRxiv API: curl https://api.biorxiv.org/details/biorxiv/10.1101/2020.06.30.174391
In the response, you should see the word "published" and if it's published, it'll have a doi otherwise it'll just say NA.
</pre>
</pre>


This is referenced from the Cells Redmine [https://redmine.soe.ucsc.edu/issues/27316 To Do #27316].
You can then extract the necessary pieces from the URLs and use them with the steps above.


You could also paste the DOI into [https://www.doi.org/ doi.org].
===Providing the Unit for datasets===


'''Providing the Unit for datasets'''
<code>unit=""</code>


<code>unit=""</code>
Provide the unit of the values used in the expression matrix. Typical values: "read count/UMI", "log of read count/UMI", "TPM", "log of TPM", "CPM", "FPKM", "RPKM".


The unit of the values in the expression matrix. You can ask the author if needed or search the Seurat data slot "normalized". Typical values: "read count/UMI", "log of read count/UMI", "TPM", "log of TPM", "CPM", "FPKM", "RPKM".
For Seurat objects, the 'counts' slot typically holds 'UMI count' for 10x data or 'read count' for Smart-seq2 or similar assays. The 'data' slot is the log-normalized version of the counts slot. This GitHub issue has some details: https://github.com/satijalab/seurat/issues/3711. SCT assay datasets are slightly different (see https://satijalab.org/seurat/reference/sctransform); in short, the units are: counts -> (corrected) counts, data -> log1p(counts), scale.data -> Pearson residuals.


It's probably easiest to ask the authors if you're unsure.


== '''desc.conf''' ==
==desc.conf==


Most commonly used desc.conf settings to keep consistent:
Most commonly used desc.conf settings to keep consistent:


<code>title = "First word is capitalized and the rest is all lowercased"</code>
* <code>title = "First word is capitalized and the rest is all lowercased"</code>
 
* <code>paper_url = "http://www.paper_url.com/xxx Last name et al. Journal. Year."</code>
<code>paper_url = "http://www.paper_url.com/xxx Last name et al. Journal. Year."</code>
* <code>biorxiv_url = "https://www.biorxiv.org/content/123/123.full Last name et al. bioRxiv. Year."</code>
 
<code>biorxiv_url = "https://www.biorxiv.org/content/123/123.full Last name et al. bioRxiv Year."</code>
 
 
The additional database links (GSE, Bioproject, SRA accessions, PMID, etc.) can be set to just the number, no author info:


The additional database links (GSE, Bioproject, SRA accessions, PMID, etc.) can be set to just the number/accession, no author info:
<pre>
<pre>
pmid = "12343234"
pmid = "12343234"
Line 81: Line 94:
sra_study = "xxxx"
sra_study = "xxxx"
</pre>
</pre>
[[Category:Cell Browser]]

Latest revision as of 20:44, 26 August 2022

Best Practices

Formatting configuration files

  • Typically you should keep a maximum of 80-120 characters per line; in Vim you can select a paragraph in visual mode and press gq (or use gqap in normal mode) to reflow it into multiple ~80 character lines
  • For special characters, please refer to HTML character encoding: https://ascii.cl/htmlcodes.htm
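
The line-length guideline can be checked mechanically. The following is a minimal sketch, assuming a POSIX shell with awk available; the function name long_lines is hypothetical and not part of the Cell Browser tools.

```shell
# Hypothetical helper: report lines longer than a limit in the given files.
# Usage: long_lines MAX FILE...
long_lines() {
    max=$1
    shift
    # Print file name, line number and length for every line over the limit.
    awk -v max="$max" 'length($0) > max {
        printf "%s:%d: %d characters\n", FILENAME, FNR, length($0)
    }' "$@"
}

# Example (run inside a dataset directory):
#   long_lines 120 desc.conf cellbrowser.conf
```

No output means every line fits within the limit.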

cellbrowser.conf

Putting things into cellbrowser-confs repo

From inside a dataset directory:

git add desc.conf cellbrowser.conf
git commit -m "message"
git push

Only do this for public datasets. If this is a collection, commit the files for each dataset in the collection. For additional help you can refer to Commit cellbrowser/desc.conf files.

Naming datasets

Dataset names should be:

  • all lowercase
  • four words or fewer
  • fewer than 20 characters, with words separated by hyphens

The names need to be lowercase because the Cell Browser website code converts all names to lowercase. There are only a few exceptions for early datasets (e.g. adultPancreas).
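
The naming rules above can be expressed as a small check. This is a sketch under assumptions: the function name check_dataset_name is hypothetical, and allowing digits inside words is an assumption, since the rules only mention lowercase.

```shell
# Hypothetical helper: succeed only if a dataset name follows the naming rules.
check_dataset_name() {
    name=$1
    # Lowercase words (letters or digits, digits being an assumption)
    # joined by hyphens, with at most four words in total.
    printf '%s\n' "$name" |
        LC_ALL=C grep -Eq '^[a-z0-9]+(-[a-z0-9]+){0,3}$' || return 1
    # Fewer than 20 characters overall.
    [ "${#name}" -lt 20 ]
}

# Example:
#   check_dataset_name lung-airway && echo "name is fine"
```

An early exception such as adultPancreas fails this check by design, so it is only meant for new datasets.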

Layout Coordinates

  • Capitalize "UMAP" and "tSNE".
  • Remove extra layout coordinates (e.g. PCA or Harmony): the cbImportTools export every available layout, but only the first two coordinates of each. The CB can only handle two coordinates, so these layouts often look like a clump of cells.

The following two images are examples of PCA plots. For reference, the first image is from the "lung-airway" dataset and the second image is from the "hoc" dataset.

[Images: PCA plot of the "lung-airway" dataset (lungairway_pca.png) and PCA plot of the "hoc" dataset (hoc_pca.png)]

Finding a paper associated with a bioRxiv pub

Sometimes you will have to go back and edit the paper citation for a dataset.

1. Get the bioRxiv URL for your dataset, e.g. https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1

2. Copy this bit of the URL: 10.1101/2020.06.30.174391

3. Feed it to the bioRxiv API: https://api.biorxiv.org/details/biorxiv/. Here's an example command that calls the bioRxiv API via curl and then uses jq to extract just the key 'published' from the JSON response:

curl https://api.biorxiv.org/details/biorxiv/10.1101/2020.06.30.174391 2>/dev/null | jq '.[] | .[] | .published'

4. If it's published, it'll have a DOI (e.g. 10.1038/s41593-021-00872-y); otherwise it'll just say 'NA'.

5. Paste the DOI into https://www.doi.org/ and you'll be taken directly to the paper.
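
Steps 2, 4 and 5 above can be sketched as two small shell functions. The names biorxiv_doi and published_link are hypothetical, and the sketch assumes the bioRxiv URL ends in a version suffix such as "v1", as in the example.

```shell
# Hypothetical helper: cut the DOI out of a bioRxiv URL (step 2).
biorxiv_doi() {
    doi=${1#*/content/}    # drop everything up to and including "/content/"
    doi=${doi%%v[0-9]*}    # drop the version suffix, e.g. "v1" or "v1.full"
    printf '%s\n' "$doi"
}

# Hypothetical helper: turn the 'published' value into a link (steps 4 and 5).
# Fails (returns 1) when the preprint has no published version yet.
published_link() {
    case $1 in
        NA|"") return 1 ;;
        *) printf 'https://doi.org/%s\n' "$1" ;;
    esac
}

# Example, reusing the curl/jq command from step 3:
#   doi=$(biorxiv_doi "https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1")
#   curl "https://api.biorxiv.org/details/biorxiv/$doi" 2>/dev/null | jq '.[] | .[] | .published'
```

The value returned by the API still needs a manual look before the citation in desc.conf is changed.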

Finding all datasets with bioRxiv pubs

If you want to go through and find all datasets with bioRxiv pubs and update them:

cd /hive/data/inside/cells/datasets
find . -name desc.conf | xargs grep "biorxiv" |grep -v "Strange\|\#" 

Which should get you results like:

./cbl-dev/desc.conf:biorxiv_url = "https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1 Aldinger et al. 2020. bioRxiv." 

You can then extract the necessary pieces from the URLs and use them with the steps above.
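
Pulling the DOIs out of that grep output can also be scripted. A minimal sketch, assuming the preprint DOIs use the 10.1101 prefix seen in the example and that grep supports the common -o option; the function name extract_biorxiv_dois is hypothetical.

```shell
# Hypothetical helper: read grep output on stdin, print each bioRxiv DOI once.
extract_biorxiv_dois() {
    # -o prints only the matching part; the character class stops at the
    # version suffix ("v1"), so only the bare DOI is kept.
    grep -oE '10\.1101/[0-9.]+' |
        sed 's/\.$//' |    # drop a trailing sentence period, if any
        sort -u
}

# Example:
#   cd /hive/data/inside/cells/datasets
#   find . -name desc.conf | xargs grep "biorxiv" | grep -v "Strange\|\#" | extract_biorxiv_dois
```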

Providing the Unit for datasets

unit=""

Provide the unit of the values used in the expression matrix. Typical values: "read count/UMI", "log of read count/UMI", "TPM", "log of TPM", "CPM", "FPKM", "RPKM".

For Seurat objects, the 'counts' slot typically holds 'UMI count' for 10x data or 'read count' for Smart-seq2 or similar assays. The 'data' slot is the log-normalized version of the counts slot. This GitHub issue has some details: https://github.com/satijalab/seurat/issues/3711. SCT assay datasets are slightly different (see https://satijalab.org/seurat/reference/sctransform); in short, the units are: counts -> (corrected) counts, data -> log1p(counts), scale.data -> Pearson residuals.

It's probably easiest to ask the authors if you're unsure.
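
The slot-to-unit mapping described above can be written down as a lookup. This is only a sketch: the function name suggest_unit is hypothetical, and the strings returned for the SCT assay are assumed wordings, not values prescribed by the Cell Browser.

```shell
# Hypothetical helper: suggest a unit string for a Seurat assay and slot.
# Usage: suggest_unit ASSAY SLOT
suggest_unit() {
    case "$1:$2" in
        RNA:counts)     echo "read count/UMI" ;;
        RNA:data)       echo "log of read count/UMI" ;;
        # The three SCT strings below are assumed wordings.
        SCT:counts)     echo "corrected count" ;;
        SCT:data)       echo "log1p of corrected count" ;;
        SCT:scale.data) echo "Pearson residual" ;;
        *)              return 1 ;;    # unknown combination: ask the authors
    esac
}

# Example:
#   suggest_unit RNA data
```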

desc.conf

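Most commonly used desc.conf settings to keep consistent:

  • title = "First word is capitalized and the rest is all lowercased"
  • paper_url = "http://www.paper_url.com/xxx Last name et al. Journal. Year."
  • biorxiv_url = "https://www.biorxiv.org/content/123/123.full Last name et al. bioRxiv. Year."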

The additional database links (GSE, Bioproject, SRA accessions, PMID, etc.) can be set to just the number/accession, no author info:

pmid = "12343234"
geo_series = "GSE25097"
dbgap = "phs000424.v7.p2"
arrayexpress = "xxx"
sra_study = "xxxx"
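
To spot database-link settings that carry more than the bare number or accession, something like the following could be used. It is a sketch that assumes one setting per line in the key = "value" form shown above; the function name flag_verbose_accessions is hypothetical.

```shell
# Hypothetical helper: read a desc.conf on stdin and print database-link
# settings whose quoted value contains whitespace (e.g. added author info).
flag_verbose_accessions() {
    grep -E '^(pmid|geo_series|dbgap|arrayexpress|sra_study)[[:space:]]*=[[:space:]]*"[^"]*[[:space:]][^"]*"'
}

# Example:
#   flag_verbose_accessions < desc.conf
```

Any line it prints is a candidate for trimming down to the accession alone.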