Gene id conversion: Difference between revisions

Revision as of 10:41, 24 March 2010

With the UCSC Browser

There are three options for extracting the data: 1) merge/download data from the Table Browser, 2) query the public MySQL database, 3) download text files by FTP.

1) Merge/download data from the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables

a) Set the clade, genome, and assembly.
b) Set the group to Genes and Gene Prediction Tracks.
c) For the first query, use the UCSC Genes track.
d) The default table is knownGene. Click on "View table schema" to see the field contents and order.
e) Set the region: genome for the entire dataset, or filter by region or identifiers. You can upload a list of Entrez Gene names at this point to limit the output, but this is not necessary; you can filter the file later.

This primary table (knownGene) does not contain alternate gene names. To link those in:

f) Set the output format: selected fields from primary table and related tables.
g) Name the output file so that it will download.
h) Add the linked table kgXref, check the columns to download, then submit.

Starting again at step c, do the same for the Ensembl Genes track. Follow the same steps up to step h, where you will first need to link in the table knownToEnsembl, then the table kgXref.

Table Browser Help/FAQ:
http://genome.ucsc.edu/cgi-bin/hgTables#Help
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html

2) Query the public MySQL database. Use the Table Browser to help you understand the database and table names/format, then write your own SQL query to extract the data.

Public MySQL FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads#download29
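As a rough sketch of what such a query could look like, the join below links knownGene to kgXref and knownToEnsembl, the same tables linked in the Table Browser steps. The column names (name, chrom, txStart, txEnd, kgID, geneSymbol, value) are assumptions to be confirmed with "View table schema" for your assembly. So that the example runs without network access, it executes the query against an in-memory SQLite database filled with made-up placeholder rows; against the public server you would send the same SELECT through a MySQL client instead.

```python
import sqlite3

# Join the primary table to its cross-reference tables.
# Column names are assumptions; check them with "View table schema".
QUERY = """
SELECT kg.name  AS ucsc_id,
       kg.chrom,
       kg.txStart,
       kg.txEnd,
       x.geneSymbol,
       e.value  AS ensembl_id
FROM knownGene AS kg
JOIN kgXref AS x ON x.kgID = kg.name
LEFT JOIN knownToEnsembl AS e ON e.name = kg.name
ORDER BY kg.name
"""


def demo_rows():
    """Run QUERY against toy tables holding placeholder (not real) identifiers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT,
                                txStart INTEGER, txEnd INTEGER);
        CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
        CREATE TABLE knownToEnsembl (name TEXT, value TEXT);
        INSERT INTO knownGene VALUES ('ucscA.1', 'chr1', '+', 100, 500);
        INSERT INTO knownGene VALUES ('ucscB.1', 'chr2', '-', 900, 1500);
        INSERT INTO kgXref VALUES ('ucscA.1', 'GENEA');
        INSERT INTO kgXref VALUES ('ucscB.1', 'GENEB');
        INSERT INTO knownToEnsembl VALUES ('ucscA.1', 'ENST_PLACEHOLDER_A');
    """)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


rows = demo_rows()
for row in rows:
    print(row)
```

The LEFT JOIN keeps UCSC transcripts that have no Ensembl mapping (their ensembl_id comes back empty), which is usually what you want when checking how complete a conversion is.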

3) Download text files by FTP. Use FTP to get the complete tables in text file format, then perform data merges to link the aligned transcripts in the primary tables to the gene names (such as Entrez). You would need to use your own shell, perl, or other tools to do the merges. Again, first use the Table Browser navigation tools to help you understand the database and table names/format.

Download FTP FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads#download1
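The merge itself can be small. The following is a minimal sketch in Python, assuming the downloaded tables are headerless tab-separated files and that the transcript ID is the first column of both files, with the gene symbol in the fifth column of the kgXref dump. Those column positions are assumptions; check them against the table schema. The function names and the sample rows are made up for illustration.

```python
import csv
import io


def load_lookup(handle, key_col, value_col):
    """Map key_col -> value_col for a headerless tab-separated table dump."""
    lookup = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            # Keep the first value seen for each key.
            lookup.setdefault(row[key_col], row[value_col])
    return lookup


def annotate(primary, lookup, id_col=0, missing="NA"):
    """Yield each primary-table row with the looked-up name appended."""
    reader = csv.reader(primary, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row + [lookup.get(row[id_col], missing)]


# Placeholder content standing in for the downloaded text files.
xref_txt = "ucscA.1\tmrnaA\tspA\tspA_DISP\tGENEA\n"
primary_txt = "ucscA.1\tchr1\t+\t100\t500\nucscC.1\tchr3\t-\t10\t90\n"

symbols = load_lookup(io.StringIO(xref_txt), key_col=0, value_col=4)
merged = list(annotate(io.StringIO(primary_txt), symbols))
for row in merged:
    print("\t".join(row))
```

With real files, replace the io.StringIO objects with open file handles. Rows without a match get "NA" rather than being dropped, so unmapped transcripts stay visible.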

All annotation tracks are mapped to the genomic assembly using the same coordinate system and so are directly comparable. Be aware that we use a zero-based start coordinate and a 1-based stop coordinate. We also record all alignments with respect to the positive strand, so if an alignment is on the negative strand, the start and stop will be reversed relative to the file/table headers. These links describe our file format conventions in detail:

http://genome.ucsc.edu/FAQ/FAQformat
http://genome.ucsc.edu/FAQ/FAQtracks#tracks1
http://genome.ucsc.edu/FAQ/FAQtracks#tracks17
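To make the coordinate convention concrete, here is a small sketch of the arithmetic. The function names are made up for illustration; the only facts used are the ones stated above (zero-based start, 1-based stop, coordinates always given on the positive strand).

```python
def to_one_based(start0, end):
    """Convert a zero-based start / 1-based stop pair to 1-based inclusive."""
    return start0 + 1, end


def feature_length(start0, end):
    """Length in bases; no +1 is needed with a zero-based start."""
    return end - start0


def five_prime_position(start0, end, strand):
    """1-based position of the feature's 5' end.

    Coordinates are stored on the positive strand, so for a feature on
    the negative strand the stored stop is where the feature begins.
    """
    if strand == "+":
        return start0 + 1
    if strand == "-":
        return end
    raise ValueError("strand must be '+' or '-'")


print(to_one_based(100, 500))              # (101, 500)
print(feature_length(100, 500))            # 400
print(five_prime_position(100, 500, "-"))  # 500
```

Forgetting the +1 on the start is the usual source of off-by-one errors when comparing table output with positions typed into the browser, which are 1-based.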

A final option is to send the data to Galaxy (from the Table Browser or from uploaded text files). The functions for Interval format data are very useful and could aid in grouping the various mapped transcripts together into genes/clusters. It may be worth comparing which transcripts the Interval functions group with which transcripts the Entrez gene name groups.

Link to Galaxy FAQ: http://g2.trac.bx.psu.edu/wiki/GopsDesc
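For readers who want to see what grouping by position amounts to, the sketch below clusters overlapping transcripts on the same chromosome. It is an illustration of the general idea only, not Galaxy's implementation: it ignores strand, treats coordinates as zero-based start / 1-based stop, and uses made-up transcript names.

```python
def cluster_intervals(features):
    """Group overlapping features into clusters.

    features: iterable of (chrom, start0, end, name) tuples.
    Returns a list of dicts with keys chrom, start, end, names.
    """
    clusters = []
    for chrom, start, end, name in sorted(features, key=lambda f: (f[0], f[1], f[2])):
        last = clusters[-1] if clusters else None
        # With a zero-based start, two features overlap when the next
        # start is strictly less than the current cluster's end.
        if last and last["chrom"] == chrom and start < last["end"]:
            last["end"] = max(last["end"], end)
            last["names"].append(name)
        else:
            clusters.append({"chrom": chrom, "start": start, "end": end, "names": [name]})
    return clusters


# Placeholder transcripts, not real annotations.
transcripts = [
    ("chr1", 100, 500, "txA1"),
    ("chr1", 400, 800, "txA2"),
    ("chr1", 900, 1000, "txB1"),
    ("chr2", 100, 200, "txC1"),
]
clusters = cluster_intervals(transcripts)
for c in clusters:
    print(c["chrom"], c["start"], c["end"], ",".join(c["names"]))
```

Comparing the names in each cluster with the gene names from kgXref shows where positional grouping and name-based grouping disagree, for example overlapping genes on opposite strands.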

With external tools

List of some external tools and comparison: http://hum-molgen.org/NewsGen/08-2009/000020.html

DAVID and MatchMiner were the best ones when compared using 100 random identifiers.

With Biomart

BioMart (http://www.biomart.org) is probably the best solution if your source IDs are from Ensembl:

Click-Path:

1. martview (top-right of screen)
2. ensembl56 genes
3. (select your species)
4. "filters"
5. "gene"
6. paste your ids into "id list limit"
7. "attributes"
8. "GENE"
9. uncheck "ensembl transcript id"
10. uncheck "ensembl gene id" if you want to get rid of it
11. "EXTERNAL"
12. check "HGNC symbol" (or "HGNC automatic gene name" if not human)
13. "results"
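The same selection can also be written as a BioMart XML query, which is handy for long ID lists. The sketch below only builds the XML and does not send it anywhere. The dataset name (hsapiens_gene_ensembl), the filter and attribute names (ensembl_gene_id, hgnc_symbol) and the header attributes are assumptions based on the usual BioMart query layout; the exact names for your release should be taken from the query that martview itself produces. The example IDs are placeholders.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(ids,
                        dataset="hsapiens_gene_ensembl",   # assumed dataset name
                        id_filter="ensembl_gene_id",       # assumed filter name
                        attributes=("ensembl_gene_id", "hgnc_symbol")):
    """Return a BioMart-style XML query string for an ID list."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # The ID list is passed as one comma-separated filter value.
    ET.SubElement(ds, "Filter", {"name": id_filter, "value": ",".join(ids)})
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


xml_query = build_biomart_query(["ENSG_PLACEHOLDER_1", "ENSG_PLACEHOLDER_2"])
print(xml_query)
```

For non-human species you would swap the dataset and replace hgnc_symbol with whatever external name attribute that dataset offers.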

@@ Line 1: / Line 1: @@
-* [http://hum-molgen.org/NewsGen/08-2009/000020.html List of some external tools and comparison]
+== With the UCSC Browser ==
-** David and Matchminer were the best ones when compared with 100 random identifiers
+There are three options for extracting the data. 1) merge/download data
+from the Table browser 2) query the public mySQL database 3) ftp text files
-* With biomart http://www.biomart.org (probably the best if your source ids are from Ensembl), click on:
+1) merge/download data from the Table browser
+http://genome.ucsc.edu/cgi-bin/hgTables
+a) set the clade, genome, assembly
+b) set the group to Genes and Gene Prediction Tracks,
+c) for the first query use UCSC Genes
+d) default table is knownGene. Click on "View table schema" to see field
+contents/order.
+e) set region: genomic for entire dataset or filter by region or
+identifiers. You can upload a list of Entrez Gene names at this point to
+limit the output, but it is not necessary, you can filter the file later.
+This primary table (knownGene) does not contain alternate gene names. To
+link those in:
+e) set output format: selected fields from primary table and related tables
+f) name output file so that it will download
+g) add in the linked table kgXref and check columns to download, then submit
+Starting again at step c, do the same for the Ensembl Genes track. Do
+the same steps until step g, where you will first need link in the table
+knownToEnsembl, then the table kgXref.
+Table Browser Help/FAQ:
+http://genome.ucsc.edu/cgi-bin/hgTables#Help
+http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
+2) query the public mySQL database
+Using the Table Browser to help you understand the database and table
+names/format, write your own SQL query to extract data.
+Public mySQL FAQ:
+http://genome.ucsc.edu/FAQ/FAQdownloads#download29
+3) ftp text files
+Use ftp to get the complete tables in text file format and perform data
+merges to link the aligned transcripts in the primary tables to the gene
+names (such as Entrez). You would need to use your own shell, perl, or
+other tools to do the merges. Again, first use the Table Browser
+navigation tools to help you understand the database and table names/format.
+Download ftp FAQ:
+http://genome.ucsc.edu/FAQ/FAQdownloads#download1
+All annotation tracks are mapped using the same coordinate system to the
+genomic assembly and so are directly comparable. Be aware that we use a
+zero-based start coordinate and a 1-based stop coordinate. We also
+record all alignments with respect to the positive strand, so if an
+alignment is on the negative strand, the start and stop will be reversed
+if compared to the file/table headers. These links describe in detail
+our file format conventions:
+http://genome.ucsc.edu/FAQ/FAQformat
+http://genome.ucsc.edu/FAQ/FAQtracks#tracks1
+http://genome.ucsc.edu/FAQ/FAQtracks#tracks17
+A final option is to send the data to Galaxy (from the Table Browser or
+uploaded text files). The functions for Interval format data are very
+useful and could aid in grouping the various mapped transcripts together
+into genes/clusters. It may be worth comparing which transcripts the
+Interval functions will group versus which the Entrez gene name will group.
+Link to Galaxy FAQ:
+http://g2.trac.bx.psu.edu/wiki/GopsDesc
+== With external tools ==
+[http://hum-molgen.org/NewsGen/08-2009/000020.html List of some external tools and comparison]
+* David and Matchminer were the best ones when compared with 100 random identifiers
+== With Biomart ==
+Biomart http://www.biomart.org is probably the best solution if your source ids are from Ensembl:
+Click-Path:
 ## martview (top-right of screen)
 ## ensembl56 genes
@@ Line 16: / Line 80: @@
 ## check "HGNC symbol" (or "HGNC automatic gene name" if not human)
 ## "results"
-* With UCSC tools: [https://lists.soe.ucsc.edu/pipermail/genome/2009-March/018423.html|this thread] (best option if your source IDs are UCSC knownGenes or if you prefer the table browser to biomart)