Cell Browser best practices: Difference between revisions

From Genecats
(Created page with "== Best Practices == '''Formatting configuration files''' *80-120 chars per line; use (<code>gqgq</code> in VIM to auto format a paragraph into multiple ~80 char lines *For...")
 
(→‎Best Practices: Cleaning up page.)
Line 1: Line 1:
== Best Practices ==
== '''Best Practices''' ==


'''Formatting configuration files'''
'''Formatting configuration files'''


*80-120 chars per line; use (<code>gqgq</code> in VIM to auto format a paragraph into multiple ~80 char lines
*Typically you should keep a maximum of 80-120 characters per line; you can use <code>gqgq</code> in VIM in visual mode to auto format a paragraph into multiple ~80 character lines
*For special characters, please refer to HTML character encoding: https://ascii.cl/htmlcodes.htm
*For special characters, please refer to HTML character encoding: https://ascii.cl/htmlcodes.htm


'''cellbrowser.conf'''
 
== '''cellbrowser.conf''' ==


Put things into cellbrowser-confs repo [[Commit cellbrowser/desc.conf files]](http://genomewiki.ucsc.edu/genecats/index.php/Wrangling_process#Commit_cellbrowser.2Fdesc.conf_files)
Put things into cellbrowser-confs repo [[Commit cellbrowser/desc.conf files]](http://genomewiki.ucsc.edu/genecats/index.php/Wrangling_process#Commit_cellbrowser.2Fdesc.conf_files)


<code>Git add dataset-name
<pre>git add dataset-name
 
git commit -m “message”
Git commit -m “message”
git push</pre>
 
Git push</code>


'''Naming datasets'''
'''Naming datasets'''


Dataset names should be all lowercase, using 4 words or less with less than 20 characters and separated by hyphens.
Dataset names should be all lowercase, using 4 words or less, and less than 20 characters and separated by hyphens.
The names need to be lowercased because the Cell Browser (website) code converts all names lowercase.
The names need to be lowercased because the Cell Browser (website) code converts all names lowercase.
There are only a few exceptions for early datasets [e.g. https://cells-test.gi.ucsc.edu/?ds=adultPancreas adultPancreas]
There are only a few exceptions for early datasets (e.g. [https://cells-test.gi.ucsc.edu/?ds=adultPancreas adultPancreas]).


'''Capitalizing UMAP/tSNE/etc'''
'''Layout Coordinates'''


Capitalize <code>"UMAP"</code> and <code>"tSNE"</code>
Capitalize <code>"UMAP"</code> and <code>"tSNE"</code>.


Remove extra layout coordinates (e.g. PCA or Harmony)  because the cbImportTools export all of the possible layouts. The CB can only handle two coordinates and so these layouts often look like a clump of cells.  
Remove extra layout coordinates (e.g. PCA or Harmony)  because the cbImportTools export all of the possible layouts and they export only the only the first two coordinates. The CB can only handle two coordinates and so these layouts often look like a clump of cells.  
Remove extra layout coordinates (e.g. PCA or Harmony) since cbImport tools only export the first two coordinates


The following two images are examples of PCA plots. For reference, the first image is from the "lung-airway" dataset and the second image is from the "hoc" dataset.
[[File:lungairway_pca.png]]
[[File:lungairway_pca.png]]
[[File:hoc_pca.png]]
[[File:hoc_pca.png]]
Line 34: Line 33:
'''Finding a paper associated with a bioRxiv pub'''
'''Finding a paper associated with a bioRxiv pub'''


https://redmine.soe.ucsc.edu/issues/27316#change-267287
Sometimes you will have to go back and edit the paper citation for a dataset.
Doi.org *remove \ from DOI
 
<pre>
In /hive/data/inside/cells/datasets run
 
find . -name desc.conf | xargs grep "biorxiv" |grep -v "Strange\|\#"
 
Should get results like:
./cbl-dev/desc.conf:biorxiv_url = "https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1 Aldinger et al. 2020. bioRxiv."
 
Copy this bit of the URL: 10.1101/2020.06.30.174391
 
And feed it to the bioRxiv API: curl https://api.biorxiv.org/details/biorxiv/10.1101/2020.06.30.174391
 
In the response, you should see the word "published" and if it's published, it'll have a doi otherwise it'll just say NA.
</pre>
 
This is referenced from the Cells Redmine [https://redmine.soe.ucsc.edu/issues/27316 To Do #27316].
 
You could also paste the DOI into [https://www.doi.org/ doi.org].


'''Providing the Unit for datasets'''  
'''Providing the Unit for datasets'''  


<code>unit=""</code>
<code>unit=""</code>  
Ask the author
 
Search Seurat data slot normalized
The unit of the values in the expression matrix. You can ask the author if needed or search the Seurat data slot "normalized". Typical values: "read count/UMI", "log of read count/UMI", "TPM", "log of TPM", "CPM", "FPKM", "RPKM".
 


'''desc.conf'''
== '''desc.conf''' ==


Paper URL:
<code>title = "First word is capitalized and the rest is all lowercased"</code>


<code>paper_url = “url to paper  last name et al. Journal. Year.”</code>
<code>paper_url = “http://www.paper_url.com/xxx Last name et al. Journal. Year.”</code>
Journal short name from pubmed


For the description page => first word capitalized, rest is lowercase
<code>biorxiv_url = "https://www.biorxiv.org/content/123/123.full Last name et al. bioRxiv Year."</code>


Labels for various desc.conf settings => paper URL or website
The additional database links (GSE, Bioproject, SRA accessions, PMID, etc.) can be set to just the number, no author info.  


GSE, Bioproject, SRA accessions, PMID, just put the number, no author info:
<code>
pmid = "12343234"
geo_series = "GSE25097"
dbgap = "phs000424.v7.p2"
arrayexpress = "xxx"
sra_study = "xxxx"
</code>

Revision as of 20:46, 22 August 2022

Best Practices

Formatting configuration files

  • Keep lines to a maximum of 80-120 characters. In Vim, gqgq reformats the current line and gq in visual mode reformats the selected paragraph into multiple ~80 character lines.
  • For special characters, please refer to HTML character encoding: https://ascii.cl/htmlcodes.htm


cellbrowser.conf

Commit the configuration files to the cellbrowser-confs repo; see "Commit cellbrowser/desc.conf files" (http://genomewiki.ucsc.edu/genecats/index.php/Wrangling_process#Commit_cellbrowser.2Fdesc.conf_files).

git add dataset-name
git commit -m "message"
git push

Naming datasets

Dataset names should be all lowercase and hyphen-separated, with four words or fewer and fewer than 20 characters. The names need to be lowercase because the Cell Browser (website) code converts all names to lowercase. There are only a few exceptions for early datasets (e.g. adultPancreas).
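
The naming rules above can be checked mechanically. The following is a minimal sketch, not part of the Cell Browser code: the function name check_dataset_name is hypothetical, and it assumes that each hyphen-separated word may contain only lowercase letters and digits, which this page does not state.

```python
import re
from typing import List

# Assumption: a "word" is one or more lowercase letters or digits.
_WORD = re.compile(r"[a-z0-9]+")


def check_dataset_name(name: str) -> List[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if name != name.lower():
        problems.append("contains uppercase characters")
    if len(name) >= 20:
        problems.append("has 20 or more characters")
    words = name.split("-")
    if len(words) > 4:
        problems.append("has more than four words")
    if not all(_WORD.fullmatch(word) for word in words):
        problems.append("has an empty word or characters other than a-z, 0-9")
    return problems


if __name__ == "__main__":
    for candidate in ("lung-airway", "adultPancreas"):
        print(candidate, check_dataset_name(candidate))
```

An early exception such as adultPancreas is reported as a violation by this sketch, so it is better suited to checking new names than to auditing existing datasets.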

Layout Coordinates

Capitalize "UMAP" and "tSNE".

Remove extra layout coordinates (e.g. PCA or Harmony). The cbImport tools export all of the possible layouts, but only the first two coordinates of each, and the Cell Browser (CB) can only handle two coordinates, so these layouts often look like a clump of cells.

The following two images are examples of PCA plots. For reference, the first image is from the "lung-airway" dataset and the second image is from the "hoc" dataset.

[Image: lungairway_pca.png]

[Image: hoc_pca.png]
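
As an illustration of the two layout rules, the sketch below filters a list of layout entries and fixes the capitalization of their labels. It assumes the layouts are described as a list of dictionaries with "file" and "shortLabel" keys; that structure, the file names, and the function name clean_coords are assumptions made for this example, so compare them with your own cellbrowser.conf before relying on it.

```python
from typing import Dict, List

# Layouts to drop, matched case-insensitively against the label.
DROP_WORDS = ("pca", "harmony")
# Preferred capitalization for the labels named on this page.
CANONICAL_LABELS = {"umap": "UMAP", "tsne": "tSNE"}


def clean_coords(coords: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop PCA/Harmony layouts and capitalize UMAP/tSNE labels."""
    kept = []
    for entry in coords:
        label = entry["shortLabel"]
        key = label.lower()
        if any(word in key for word in DROP_WORDS):
            continue  # extra layout, remove it
        fixed = dict(entry)  # copy so the input is not modified
        fixed["shortLabel"] = CANONICAL_LABELS.get(key, label)
        kept.append(fixed)
    return kept


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    example = [
        {"file": "umap_coords.tsv", "shortLabel": "umap"},
        {"file": "pca_coords.tsv", "shortLabel": "PCA"},
    ]
    print(clean_coords(example))
```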

Finding a paper associated with a bioRxiv pub

Sometimes you will have to go back and edit the paper citation for a dataset.

In /hive/data/inside/cells/datasets run

find . -name desc.conf | xargs grep "biorxiv" |grep -v "Strange\|\#" 

Should get results like:
./cbl-dev/desc.conf:biorxiv_url = "https://www.biorxiv.org/content/10.1101/2020.06.30.174391v1 Aldinger et al. 2020. bioRxiv." 

Copy this bit of the URL: 10.1101/2020.06.30.174391

And feed it to the bioRxiv API: curl https://api.biorxiv.org/details/biorxiv/10.1101/2020.06.30.174391

In the response, look for the word "published": if the preprint has been published, it will have a DOI; otherwise it will just say NA.

This is referenced from the Cells Redmine To Do #27316.

You could also paste the DOI into doi.org.
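
The copy-and-check steps above can also be scripted. This is a rough sketch under stated assumptions: the function names are hypothetical, and it assumes the API response is JSON with a "collection" list whose records carry the "published" value described above. It makes no network call; pass it the output of the curl command after parsing it with json.loads.

```python
import re
from typing import Optional

# Captures the DOI part of a bioRxiv content URL, without a trailing
# version suffix such as "v1".
_DOI_IN_URL = re.compile(r"biorxiv\.org/content/(10\.1101/[\d.]*\d)")


def doi_from_biorxiv_url(value: str) -> Optional[str]:
    """Extract the bioRxiv DOI from a biorxiv_url setting, if present."""
    match = _DOI_IN_URL.search(value)
    return match.group(1) if match else None


def published_doi(api_response: dict) -> Optional[str]:
    """Return the DOI of the published paper, or None if it says NA.

    Assumption: records live under a "collection" key.
    """
    for record in api_response.get("collection", []):
        doi = record.get("published", "NA")
        if doi and doi != "NA":
            return doi
    return None


if __name__ == "__main__":
    line = ('biorxiv_url = "https://www.biorxiv.org/content/'
            '10.1101/2020.06.30.174391v1 Aldinger et al. 2020. bioRxiv."')
    print(doi_from_biorxiv_url(line))  # 10.1101/2020.06.30.174391
```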

Providing the Unit for datasets

unit=""

This setting gives the unit of the values in the expression matrix. If needed, you can ask the author or search the Seurat data slot "normalized". Typical values: "read count/UMI", "log of read count/UMI", "TPM", "log of TPM", "CPM", "FPKM", "RPKM".


desc.conf

title = "First word is capitalized and the rest is all lowercased"

paper_url = "http://www.paper_url.com/xxx Last name et al. Journal. Year."

biorxiv_url = "https://www.biorxiv.org/content/123/123.full Last name et al. bioRxiv Year."

The additional database links (GSE, Bioproject, SRA accessions, PMID, etc.) should be set to just the number or accession, with no author info.

pmid = "12343234"
geo_series = "GSE25097"
dbgap = "phs000424.v7.p2"
arrayexpress = "xxx"
sra_study = "xxxx"
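
The conventions above can be sketched as two small helpers: one that assembles a paper_url value in the "URL Last name et al. Journal. Year." form, and one that flags accession settings carrying extra text. Both function names are hypothetical, and the patterns (digits only for pmid, "GSE" plus digits for geo_series) are assumptions inferred from the examples on this page, not rules taken from the Cell Browser code.

```python
import re
from typing import Dict, List


def paper_url_value(url: str, last_name: str, journal: str, year: int) -> str:
    """Build a paper_url value: the URL followed by a short citation."""
    return f"{url} {last_name} et al. {journal}. {year}."


# Assumed shapes, based only on the example values shown above.
ACCESSION_PATTERNS = {
    "pmid": re.compile(r"\d+"),
    "geo_series": re.compile(r"GSE\d+"),
}


def bad_accessions(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings that hold more than a bare accession."""
    return [
        key
        for key, pattern in ACCESSION_PATTERNS.items()
        if key in settings and not pattern.fullmatch(settings[key])
    ]


if __name__ == "__main__":
    # "Smith" and "Nature" are placeholder values for illustration.
    print(paper_url_value("http://www.paper_url.com/xxx", "Smith", "Nature", 2020))
    print(bad_accessions({"pmid": "12343234", "geo_series": "GSE25097"}))
```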