Genes in gtf or gff format
Revision as of 16:22, 16 August 2011
UCSC does not keep gene structures in GTF format. Instead, we use the genePred format, in which each gene is described on a single line that holds all the information about that gene.
GTF files can be extracted from genePred data with genePredToGtf, a kent command-line utility.
At this time, the genePredToGtf command can produce better GTF files than those available from the Table Browser.
To use the kent commands with the public database server, add this three-line file ".hg.conf" to your home directory:
$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
And set the permissions:
$ chmod 600 .hg.conf
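The two steps above can also be scripted. The following is a minimal sketch, not an official procedure: it writes the same three settings with a here-document and then confirms the resulting permissions. It writes into a scratch directory created with mktemp so that an existing configuration is not overwritten; the variable names conf_dir, conf, and perms are chosen here for illustration only.

```shell
# Scratch location for the demonstration (an assumption of this sketch);
# for real use, set conf="$HOME/.hg.conf" instead.
conf_dir=$(mktemp -d)
conf="$conf_dir/.hg.conf"

# Write the three settings for the public database server.
cat > "$conf" <<'EOF'
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
EOF

# Restrict the file to owner read/write.
chmod 600 "$conf"

# Show the mode string; it should read -rw------- (owner read/write only).
perms=$(ls -l "$conf" | cut -c1-10)
echo "$perms"
```

If the printed mode string shows any group or other permissions, repeat the chmod step on the file.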
Now you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file knownGene.gtf:
$ genePredToGtf hg19 knownGene knownGene.gtf
Note the usage message from the command:
genePredToGtf - Convert genePred table or file to gtf.
usage:
   genePredToGtf database genePredTable output.gtf

If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr - Add 5UTR and 3UTR features
   -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
    codon records
   -source=src set source name to uses
   -addComments - Add comments before each set of transcript records.
    allows for easier visual inspection
Note: use refFlat or extended genePred table to include geneName
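As an illustration of the options listed in the usage message, the sketch below requests UTR features and sets a source name. The source name ucscKnownGene and the output file name knownGene.utr.gtf are arbitrary choices made for this example, and the check for the command in PATH is only there so the snippet fails gracefully where the kent utilities are not installed.

```shell
db=hg19
table=knownGene
out=knownGene.utr.gtf   # hypothetical output file name

if command -v genePredToGtf >/dev/null 2>&1; then
    # -utr adds 5UTR and 3UTR features; -source sets the GTF source name.
    # "ucscKnownGene" is an illustrative value, not a required one.
    genePredToGtf -utr -source=ucscKnownGene "$db" "$table" "$out"
else
    echo "genePredToGtf not found in PATH" >&2
fi
```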
You can also fetch the database text dump of the genePred content for the track to have the file on-hand locally:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz
The SQL structure:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.sql
As noted in the usage message, a genePred file can be used in place of a database table by giving 'file' as the database argument. In this case, beware of files that are only partially in genePred format. For example, the knownGene.txt.gz file has extra columns after the exonEnds column. Therefore, use cut to extract just the first ten columns, which are the genePred columns:
$ zcat knownGene.txt.gz | cut -f1-10 | genePredToGtf file stdin knownGene.gtf
This step is not necessary when reading from the database table, since the command can determine from the table structure which columns to use.
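The effect of the cut step can be checked without the kent utilities. The sketch below builds one made-up, tab-separated row in the shape described above (ten genePred columns followed by two extra columns) and counts the fields before and after trimming. The row values are invented for illustration and are not taken from the real knownGene table.

```shell
# Made-up row: 10 genePred columns plus 2 extra trailing columns.
# (Values are illustrative only.)
row=$(printf 'tx1\tchr1\t+\t100\t900\t200\t800\t2\t100,600,\t300,900,\tP00001\ttx1')

# Count tab-separated fields before and after keeping columns 1-10.
before=$(printf '%s\n' "$row" | awk -F'\t' '{print NF}')
after=$(printf '%s\n' "$row" | cut -f1-10 | awk -F'\t' '{print NF}')

echo "columns before cut: $before"
echo "columns after cut:  $after"
```

The same awk field count can be run on the first line of a downloaded file to see whether it carries columns beyond exonEnds before passing it to genePredToGtf.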