Genes in gtf or gff format
Revision as of 19:33, 17 March 2017
Convert genePred to GTF with the genePredToGtf command line utility
GenePred (http://genome.ucsc.edu/FAQ/FAQformat.html#format9) is a table format commonly used for gene prediction tracks in the UCSC Genome Browser. The genePredToGtf command-line utility can be used to convert genePred to GTF.
While the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) has an option to output query results in GTF, that output is limited and, in some cases, may contain bugs. The best way to convert genePred to GTF is the genePredToGtf command-line utility, which is provided as an operating-system-specific binary. It can be downloaded from the utilities directory (http://hgdownload.soe.ucsc.edu/admin/exe/).
Use genePredToGtf with a downloaded genePred table
You can directly download a table (for example, the knownGene table), which will be in genePred format. You can then use this local file as input for the genePredToGtf conversion.
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz
The SQL structure:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.sql
As noted in the usage message, a file can be given to the command in place of a database table by passing "file" as the database argument. In this case, beware of files that are only partially in genePred format. For example, the knownGene.txt.gz file has extra columns after the exonEnds column. Therefore, use cut to extract just the genePred columns:
$ zcat knownGene.txt.gz | cut -f1-10 | genePredToGtf file stdin knownGene.gtf
This is not necessary in the case of using the database table since the command can determine from the table structure which columns to use.
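The column trimming above can be wrapped in a small shell function. This is only a minimal sketch: the function name first_genepred_columns is made up for illustration, and it assumes tab-separated input whose first ten columns are the genePred fields, as described for knownGene.txt.gz.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kent utilities): keep only the
# first ten tab-separated columns, i.e. the basic genePred fields.
first_genepred_columns() {
    cut -f1-10
}

# Usage, assuming knownGene.txt.gz is in the current directory and
# genePredToGtf is on your PATH:
#   gzip -dc knownGene.txt.gz | first_genepred_columns \
#     | genePredToGtf file stdin knownGene.gtf
```

Here gzip -dc is used as a portable equivalent of zcat; either should work on a typical Linux system.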
Example: Here are detailed steps for converting hg19's refGene table (in genePred format) to GTF.
1. Download your gene set of interest for hg19. For this example, I'll use the refGene table, but you can choose other gene sets, such as the knownGene table from the "UCSC Genes" track.
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ./
2. Unzip
gzip -d refGene.txt.gz
3. Remove the first "bin" column:
cut -f 2- refGene.txt > refGene.input
4. Convert to gtf:
genePredToGtf file refGene.input hg19refGene.gtf
5. Sort output by chromosome and start coordinate (note that the -k4,4 key compares coordinates as text; use -k4,4n if you need numeric order):
sort -k1,1 -k4,4 hg19refGene.gtf > hg19refGene.gtf.sorted
Example output for hg19refGene.gtf.sorted:
$ head hg19refGene.gtf.sorted
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316973"; exon_number "7"; exon_id "NM_001316973.7"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316975"; exon_number "7"; exon_id "NM_001316975.7"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316976"; exon_number "5"; exon_id "NM_001316976.5"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "7"; exon_id "NM_032368.7"; gene_name "LZIC";
chr1 refGene.input CDS 10002739 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input exon 10002739 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input start_codon 10002791 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input exon 10002981 10003083 . + . gene_id "NMNAT1"; transcript_id "NM_001297778"; exon_number "1"; exon_id "NM_001297778.1"; gene_name "NMNAT1";
chr1 refGene.input transcript 10002981 10045556 . + . gene_id "NMNAT1"; transcript_id "NM_001297778"; gene_name "NMNAT1";
chr1 refGene.input exon 10003307 10003485 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "8"; exon_id "NM_032368.8"; gene_name "LZIC";
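Because a plain -k4,4 key compares the start column as text, a start of 9000 would sort after 10000. If true coordinate order matters for a downstream tool, a numeric key is one option. The following is a minimal sketch, assuming a tab-separated GTF file; the function name sort_gtf_numeric is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: sort GTF lines by chromosome name (as text),
# then by start coordinate (numerically).
sort_gtf_numeric() {
    # -t sets the field separator to a tab so that spaces inside the
    # attributes column do not affect field splitting.
    sort -t "$(printf '\t')" -k1,1 -k4,4n "$@"
}

# Usage, assuming hg19refGene.gtf exists in the current directory:
#   sort_gtf_numeric hg19refGene.gtf > hg19refGene.gtf.sorted
```

Chromosome names are still ordered as text here, so chr10 sorts before chr2; whether that matters depends on the tool consuming the file.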
Using kent commands with the public database server
To use the kent commands with the public database server, add this four line file ".hg.conf" to your home directory:
$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
And set the permissions:
$ chmod 600 $HOME/.hg.conf
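The two steps above (creating the file and restricting its permissions) can be combined in one small function. This is a sketch under stated assumptions: the function name write_hg_conf and its optional directory argument are made up for illustration, the four settings are the ones listed above, and the function overwrites any existing .hg.conf in the target directory.

```shell
#!/bin/sh
# Hypothetical helper: write the four-line .hg.conf shown above and
# make it readable and writable by the owner only.
# WARNING: overwrites an existing .hg.conf in the target directory.
write_hg_conf() {
    # Target directory; defaults to the home directory.
    dir=${1:-$HOME}
    (
        # Restrictive umask so the file is never group/world readable.
        umask 077
        cat > "$dir/.hg.conf" <<'EOF'
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
EOF
    )
    # Explicitly set the permissions in case the file already existed.
    chmod 600 "$dir/.hg.conf"
}

# Usage:
#   write_hg_conf            # writes $HOME/.hg.conf
```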
Now you can use the genePredToGtf command to extract GTF files directly from the UCSC database. For example, fetch the UCSC Genes track from hg19 into the local file knownGene.gtf:
$ genePredToGtf hg19 knownGene knownGene.gtf
Note the usage message from the command:
genePredToGtf - Convert genePred table or file to gtf.
usage:
   genePredToGtf database genePredTable output.gtf

If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr - Add 5UTR and 3UTR features
   -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
    codon records
   -source=src set source name to uses
   -addComments - Add comments before each set of transcript records.
    allows for easier visual inspection
Note: use refFlat or extended genePred table to include geneName
Bed format gene tracks (convert bed > genePred > GTF)
Some gene tracks are stored in bed format in the database, perhaps with extra columns past the standard bed columns. In this case, extract the standard bed columns, convert them to genePred, and then convert the genePred to GTF. For example, for wgRna:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N \
  -e "select chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd from wgRna;" hg19 \
  | bedToGenePred stdin stdout | genePredToGtf file stdin wgRna.gtf