Programmatic access to the Genome Browser: Difference between revisions

From genomewiki
(Created page with "* Get the sequence of a genome at a particular place ** Download the tool twoBitToFa from http://hgdownload.cse.ucsc.edu/admin/exe/ ** twoBitToFa http://hgdownload.cse.ucsc.edu/g...")
 
(changed cse to soe and genome-source fixes)
 
(24 intermediate revisions by one other user not shown)
Line 1: Line 1:
The UCSC Genome Browser allows data retrieval via the MySQL command line tool and, for some types of data, via an HTTP RESTful API. To save computational time on the server, the API does not use JSON. Downloading some data formats requires client-side C tools that convert to/from binary files. Data upload uses custom text files.

Here are some common tasks that can be done from scripts with the UCSC Genome Browser. It is assumed that the reader knows the standard Unix command line tools.

== Download data stored in a database table ==
* use Tools - Table Browser -  "Describe schema" to browse the database schema. All fields have a human readable description and the links to other tables are shown.
* to access the public MySQL server, use a command like <code>mysql hg19 --no-defaults -h genome-mysql.soe.ucsc.edu -u genome -A -e 'select * from pubsBingBlat' -NB > out.txt</code>
* the list of data tracks is part of the table trackDb. The first column is the internal name of the track.
* note the "type" field in the table trackDb. Our documentation of file formats at http://genome.ucsc.edu/FAQ/FAQformat.html explains the meaning of the columns in these tables.
* tracks with types that start with "big" are stored in binary files (see below) and require special client programs to extract; all others are stored at least to some extent in MySQL tables.
* the first column in many tables with genomic coordinates is called "bin" and can be stripped for most applications

<pre>
mysql --no-defaults -h genome-mysql.soe.ucsc.edu -u genome -A -e "select tableName, type, priority from trackDb where tableName in ('gold', 'refGene','knownGene', 'ccds', 'clinvar') limit 5" hg19
+-----------+-------------------------------------+----------+
| tableName | type                                | priority |
+-----------+-------------------------------------+----------+
| clinvar   | bigBed 12 .                         |      100 |
| gold      | bed 3 +                             |      100 |
| knownGene | genePred knownGenePep knownGeneMrna |        1 |
| refGene   | genePred refPep refMrna             |        2 |
+-----------+-------------------------------------+----------+

mysql --no-defaults -h genome-mysql.soe.ucsc.edu -u genome -A -e "select * from knownGene limit 3"  hg19
+------------+-------+--------+---------+-------+----------+--------+-----------+--------------------+--------------------+-----------+------------+
| name       | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts         | exonEnds           | proteinID | alignID    |
+------------+-------+--------+---------+-------+----------+--------+-----------+--------------------+--------------------+-----------+------------+
| uc001aaa.3 | chr1  | +      |   11873 | 14409 |    11873 |  11873 |         3 | 11873,12612,13220, | 12227,12721,14409, |           | uc001aaa.3 |
| uc010nxr.1 | chr1  | +      |   11873 | 14409 |    11873 |  11873 |         3 | 11873,12645,13220, | 12227,12697,14409, |           | uc010nxr.1 |
| uc010nxq.1 | chr1  | +      |   11873 | 14409 |    12189 |  13639 |         3 | 11873,12594,13402, | 12227,12721,14409, | B7ZGX9    | uc010nxq.1 |
+------------+-------+--------+---------+-------+----------+--------+-----------+--------------------+--------------------+-----------+------------+
</pre>

== Get the chromosome sequence for a range ==
* Download the tool twoBitToFa from http://hgdownload.soe.ucsc.edu/admin/exe/ e.g. with <code>curl http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa > twoBitToFa; chmod a+x twoBitToFa</code>. For OSX, please adapt the download location to http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/
* To get the DNA sequence from e.g. the human genome hg19, run a command like <code>twoBitToFa http://hgdownload.soe.ucsc.edu/gbdb/hg19/hg19.2bit stdout -seq=chr21 -start=1 -end=10000</code>. You can replace stdout with a filename of your choice.
* for best performance, download the 2bit file for your genome from http://hgdownload.soe.ucsc.edu/gbdb/ to local disk
<pre>
twoBitToFa http://hgdownload.soe.ucsc.edu/gbdb/hg19/hg19.2bit stdout -seq=chr21 -start=15000000 -end=15000050
>chr21:15000000-15000050
agccctgaacaaagacagggcttggcttatataggcaaacttacagaagc
</pre>
 
== Get the "wiggle" (x-y-plot) graph data for a chromosome range ==
* Download bigWigToWig from http://hgdownload.soe.ucsc.edu/admin/exe/ as shown above
* run a command like <code>bigWigToWig http://hgdownload.soe.ucsc.edu/gbdb/hg19/bbi/wgEncodeBroadHistoneK562Cbx2Sig.bigWig -chrom=chr21 -start=0 -end=10000000 stdout</code>. You can also replace stdout with a filename of your choice.
<pre>
bigWigToWig http://hgdownload.soe.ucsc.edu/gbdb/hg19/bbi/wgEncodeBroadHistoneK562Cbx2Sig.bigWig -chrom=chr21 -start=15000000 -end=15000200 stdout
variableStep chrom=chr21 span=25
15000026        0.92
15000051        1
15000076        1
15000101        1
15000126        1
15000151        1.24
15000176        2
</pre>
 
== Get a copy of the current Genome Browser image from a script ==
*  use <code>curl http://genome.ucsc.edu/cgi-bin/hgRenderTracks > test.png</code>. hgRenderTracks understands the same parameters and options as the main hgTracks CGI, e.g. <internalTrackName>=pack
* to get the internal track name of a track, mouse over the track and look at your internet browser status line or go to the track configuration page and look for the value of the variable called "g" in the current URL. You can also use the trackDb table to get a list of all tracks and their names (see above).
* to hide the default tracks when you use hgRenderTracks, make sure that the first track parameter is hideTracks=1
* for example, to download the image for a chromosomal location showing only the RefSeq transcripts and publications tracks in "pack" mode, use this command:  <code>curl 'http://genome.ucsc.edu/cgi-bin/hgRenderTracks?position=chr17:41570860-41650551&hideTracks=1&refGene=pack&pubs=pack' > temp.png</code>
 
== Upload a custom track into the browser ==
* create a custom track file as documented here http://genome.ucsc.edu/goldenpath/help/customTrack.html, e.g.: <code>printf 'track name="TestTrack" description="TestTrack with links on features" url="http://www.google.com/$$"\nchr1 1 1000 testIdForUrl' > temp.bed</code>
* create a link like this: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr21:33038447-33041505&hgct_customText=<UrlToTemp.bed>
 
== Upload a custom track and create multiple links to the genome browser or iteratively upload tracks as they become available  ==
* upload your file with a command like the following; it prints a string to stdout, referred to below as $HGSID: <code>curl -s -F db=hg19 -F 'hgct_customText=@temp.bed' http://genome.ucsc.edu/cgi-bin/hgCustom  | grep -o 'hgsid=[0-9]*_[a-zA-Z0-9]*' | uniq | sed -e 's/hgsid=//'</code>
* you can link to a fresh genome browser session with only this track loaded with http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=$HGSID&position=chr1:1-1000
* you can load more tracks into this session by adding "-F hgsid=$HGSID" to all future curl calls
* you can download the image of this session with hgRenderTracks as shown above by supplying the $HGSID value, like this <code>curl "http://genome.ucsc.edu/cgi-bin/hgRenderTracks?hgsid=$HGSID" > test.png</code>
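
The session handling described in this section can be wrapped in two small shell helpers. This is a minimal sketch: the function names <code>extract_hgsid</code> and <code>session_url</code> are hypothetical, and the HTML fragment used in the demo is made up for illustration rather than real hgCustom output. The pattern is the same grep/sed pipeline as above, with <code>head -n 1</code> in place of <code>uniq</code> so that exactly one ID is returned.

```shell
#!/bin/sh
# extract_hgsid: read an HTML page on stdin and print the first session ID
# found in it (hypothetical helper; same pattern as the pipeline above).
extract_hgsid() {
  grep -o 'hgsid=[0-9]*_[a-zA-Z0-9]*' | head -n 1 | sed -e 's/hgsid=//'
}

# session_url: build an hgTracks link for a session ID and a position.
session_url() {
  printf '%s\n' "http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=$1&position=$2"
}

# Demo on an illustrative HTML fragment (not real server output).
sample='<a href="../cgi-bin/hgTracks?hgsid=123_abC9">Browser</a>'
HGSID=$(printf '%s\n' "$sample" | extract_hgsid)
session_url "$HGSID" "chr1:1-1000"
```

With a live server, the upload step would feed the helper directly, e.g. <code>HGSID=$(curl -s -F db=hg19 -F 'hgct_customText=@temp.bed' http://genome.ucsc.edu/cgi-bin/hgCustom | extract_hgsid)</code>.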

Latest revision as of 07:25, 1 September 2018

The UCSC Genome Browser allows data retrieval via the MySQL command line tool and, for some types of data, via an HTTP RESTful API. To save computational time on the server, the API does not use JSON. Downloading some data formats requires client-side C tools that convert to/from binary files. Data upload uses custom text files.

Here are some common tasks that can be done from scripts with the UCSC Genome Browser. It is assumed that the reader knows the standard Unix command line tools.

Download data stored in a database table

  • use Tools - Table Browser - "Describe schema" to browse the database schema. All fields have a human readable description and the links to other tables are shown.
  • to access the public MySQL server, use a command like mysql hg19 --no-defaults -h genome-mysql.soe.ucsc.edu -u genome -A -e 'select * from pubsBingBlat' -NB > out.txt
  • the list of data tracks is part of the table trackDb. The first column is the internal name of the track.
  • note the "type" field in the table trackDb. Our documentation of file formats at http://genome.ucsc.edu/FAQ/FAQformat.html explains the meaning of the columns in these tables.
  • tracks with types that start with "big" are stored in binary files (see below) and require special client programs to extract, all others are stored at least to some extent in Mysql tables.
  • the first column in many tables with genomic coordinates is called "bin" and can be stripped for most applications
mysql --no-defaults -h genome-mysql.soe.ucsc.edu -u genome -A -e "select tableName, type, priority from trackDb where tableName in ('gold', 'refGene','knownGene', 'ccds', 'clinvar') limit 5"  hg19
+-----------+-------------------------------------+----------+
| tableName | type                                | priority |
+-----------+-------------------------------------+----------+
| clinvar   | bigBed 12 .                         |      100 |
| gold      | bed 3 +                             |      100 |
| knownGene | genePred knownGenePep knownGeneMrna |        1 |
| refGene   | genePred refPep refMrna             |        2 |
+-----------+-------------------------------------+----------+

mysql --no-defaults -h genome-mysql.soe.ucsc.edu -u genome -A -e "select * from knownGene limit 3"  hg19 
+------------+-------+--------+---------+-------+----------+--------+-----------+--------------------+--------------------+-----------+------------+
| name       | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts         | exonEnds           | proteinID | alignID    |
+------------+-------+--------+---------+-------+----------+--------+-----------+--------------------+--------------------+-----------+------------+
| uc001aaa.3 | chr1  | +      |   11873 | 14409 |    11873 |  11873 |         3 | 11873,12612,13220, | 12227,12721,14409, |           | uc001aaa.3 |
| uc010nxr.1 | chr1  | +      |   11873 | 14409 |    11873 |  11873 |         3 | 11873,12645,13220, | 12227,12697,14409, |           | uc010nxr.1 |
| uc010nxq.1 | chr1  | +      |   11873 | 14409 |    12189 |  13639 |         3 | 11873,12594,13402, | 12227,12721,14409, | B7ZGX9    | uc010nxq.1 |
+------------+-------+--------+---------+-------+----------+--------+-----------+--------------------+--------------------+-----------+------------+
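
Two of the points above, dropping the "bin" column and telling file-backed tracks from table-backed ones, can be sketched as small shell filters. This is a minimal sketch under stated assumptions: the helper names <code>strip_bin</code> and <code>storage_for_type</code> are hypothetical, the input is assumed to be a tab-separated dump made with -NB, and the table is assumed to really have "bin" as its first column (knownGene, shown above, does not). The sample row is illustrative only.

```shell
#!/bin/sh
# strip_bin: drop the first tab-separated column of a "mysql -NB" dump.
# Only meaningful for tables whose first column is "bin".
strip_bin() {
  cut -f2-
}

# storage_for_type: apply the rule of thumb from the list above to a
# trackDb "type" value: types starting with "big" live in binary files.
storage_for_type() {
  case "$1" in
    big*) echo "binary file" ;;
    *)    echo "MySQL table" ;;
  esac
}

# Demo on an illustrative row (bin, chrom, start, end).
printf '585\tchr1\t11873\t14409\n' | strip_bin
storage_for_type "bigBed 12 ."
storage_for_type "genePred refPep refMrna"
```

In a real pipeline the filter would sit directly after the mysql call, e.g. <code>mysql ... -NB | strip_bin > out.txt</code>.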

Get the chromosome sequence for a range

twoBitToFa http://hgdownload.soe.ucsc.edu/gbdb/hg19/hg19.2bit stdout -seq=chr21 -start=15000000 -end=15000050
>chr21:15000000-15000050  
agccctgaacaaagacagggcttggcttatataggcaaacttacagaagc
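
twoBitToFa takes a zero-based -start and a non-inclusive -end, whereas positions as shown in the browser (chr:start-end) are 1-based and inclusive. The following minimal sketch converts one into the other; the helper name <code>position_to_twobit_args</code> is hypothetical, and the input is assumed to be a plain position string without thousands separators.

```shell
#!/bin/sh
# position_to_twobit_args: turn a 1-based, inclusive browser position such as
# chr21:15000001-15000050 into twoBitToFa options (zero-based start,
# non-inclusive end). Hypothetical helper, not part of the UCSC tools.
position_to_twobit_args() {
  pos=$1
  chrom=${pos%%:*}      # text before the first ":"
  range=${pos#*:}       # text after the first ":"
  start=${range%-*}     # number before the "-"
  end=${range#*-}       # number after the "-"
  printf '%s\n' "-seq=$chrom -start=$((start - 1)) -end=$end"
}

position_to_twobit_args chr21:15000001-15000050
```

The printed options match the ones used in the command above, so the 50 bases returned there correspond to browser position chr21:15000001-15000050.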

Get the "wiggle" (x-y-plot) graph data for a chromosome range

bigWigToWig http://hgdownload.soe.ucsc.edu/gbdb/hg19/bbi/wgEncodeBroadHistoneK562Cbx2Sig.bigWig -chrom=chr21 -start=15000000 -end=15000200 stdout
variableStep chrom=chr21 span=25
15000026        0.92
15000051        1
15000076        1
15000101        1
15000126        1
15000151        1.24
15000176        2
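
The variableStep output above lists 1-based start positions, each covering "span" bases. For downstream tools that expect intervals, it can be converted to bedGraph-style lines (zero-based start, non-inclusive end). This is a minimal sketch: <code>wig_to_bedgraph</code> is a hypothetical helper name, and only variableStep sections are handled.

```shell
#!/bin/sh
# wig_to_bedgraph: convert variableStep wiggle text on stdin into
# tab-separated chrom, start, end, value lines (bedGraph coordinates).
# Hypothetical helper; fixedStep sections are not handled.
wig_to_bedgraph() {
  awk '
    $1 == "variableStep" {
      span = 1                       # wiggle default when span= is absent
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "chrom") chrom = kv[2]
        if (kv[1] == "span")  span  = kv[2]
      }
      next
    }
    # data line: 1-based position and value
    NF == 2 && chrom != "" {
      printf "%s\t%d\t%d\t%s\n", chrom, $1 - 1, $1 - 1 + span, $2
    }
  '
}

# Demo on the first lines of the output shown above.
printf 'variableStep chrom=chr21 span=25\n15000026\t0.92\n15000051\t1\n' | wig_to_bedgraph
```

With the tool installed, the filter can be appended to the command above: <code>bigWigToWig ... stdout | wig_to_bedgraph</code>.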

Get a copy of the current Genome Browser image from a script

  • use curl http://genome.ucsc.edu/cgi-bin/hgRenderTracks > test.png. hgRenderTracks understands the same parameters and options as the main hgTracks CGI, e.g. <internalTrackName>=pack
  • to get the internal track name of a track, mouse over the track and look at your internet browser status line or go to the track configuration page and look for the value of the variable called "g" in the current URL. You can also use the trackDb table to get a list of all tracks and their names (see above).
  • to hide the default tracks when you use hgRenderTracks, make sure that the first track parameter is hideTracks=1
  • for example, to download the image for a chromosomal location showing only the RefSeq transcripts and publications tracks in "pack" mode, use this command: curl 'http://genome.ucsc.edu/cgi-bin/hgRenderTracks?position=chr17:41570860-41650551&hideTracks=1&refGene=pack&pubs=pack' > temp.png
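
The rules above (hideTracks=1 first, then one <internalTrackName>=pack parameter per track) can be captured in a small URL builder. This is a minimal sketch; the function name <code>render_url</code> is hypothetical, and "pack" is hard-coded as the display mode for simplicity.

```shell
#!/bin/sh
# render_url: build an hgRenderTracks URL for a position and a list of
# internal track names. hideTracks=1 comes first so that only the listed
# tracks are drawn. Hypothetical helper, not part of the UCSC tools.
render_url() {
  pos=$1
  shift
  url="http://genome.ucsc.edu/cgi-bin/hgRenderTracks?position=${pos}&hideTracks=1"
  for track in "$@"; do
    url="${url}&${track}=pack"
  done
  printf '%s\n' "$url"
}

render_url chr17:41570860-41650551 refGene pubs
```

The result is the same URL as in the example above, so the image can be fetched with <code>curl "$(render_url chr17:41570860-41650551 refGene pubs)" > temp.png</code>.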

Upload a custom track into the browser

Upload a custom track and create multiple links to the genome browser or iteratively upload tracks as they become available