Genes in gtf or gff format: Difference between revisions
Cath Tyner (talk | contribs) |
Cath Tyner (talk | contribs) |
||
Line 5: | Line 5: | ||
* Some tables in older genome assemblies are not supported. | * Some tables in older genome assemblies are not supported. | ||
* Tables not in genePred format (e.g., knownCanonical) will produce unexpected GTF output, in addition to the other "known-limitations for Table Browser GTF output" listed here. | * Tables not in genePred format (e.g., knownCanonical) will produce unexpected GTF output, in addition to the other "known-limitations for Table Browser GTF output" listed here. | ||
* [https://genome.ucsc.edu/FAQ/FAQtracks#tracks18 Issue with stop codons in GTF output from Table Browser] | |||
==Warning - using a non genePred table to get GTF output in the Table Browser== | ==Warning - using a non genePred table to get GTF output in the Table Browser== |
Revision as of 22:58, 31 October 2017
Summary of limitations for Table Browser GTF output
- The Table Browser has transcript IDs only, so although it includes both "gene_id" and "transcript_id" fields in its output, the value for transcript ID (e.g., ENST#) is used for both fields.
- The Table Browser adds start and stop codon annotations whether or not the transcript alignment includes proper start and stop codons.
- Some tables in older genome assemblies are not supported.
- Tables not in genePred format (e.g., knownCanonical) will produce unexpected GTF output, in addition to the other "known-limitations for Table Browser GTF output" listed here.
- Issue with stop codons in GTF output from Table Browser
Warning - using a non genePred table to get GTF output in the Table Browser
A genePred table (such as knownGene) is needed to get GTF output in the Table Browser. Below is an example of output for the knownCanonical table, which is NOT in genePred format. Even though the TB GTF says "exons" for knownCanonical GTF output, it's really just a placeholder, not exons at all, but rather start-stop regions of the transcripts.
For example, if you do a cart reset (top menu > Genome Browser > Reset All User Settings) and go to the default region (chr1:11102837-11267747) in hg38, then go to the Table Browser, and then get all fields for knownCanonical (limit to default region, not genome), you'll get this output:
#chrom chromStart chromEnd clusterId transcript protein chr1 11106534 11262507 17297 uc001asd.4 ENSG00000198793.12 chr1 11143897 11149537 24285 uc031plf.2 ENSG00000225602.5 chr1 11152349 11152452 33500 uc057cga.1 ENSG00000253086.1 chr1 11189340 11195981 13013 uc001ase.5 ENSG00000171819.4 chr1 11226253 11226360 20530 uc057cgc.1 ENSG00000207451.1
GTF output for that same region will be:
chr1 hg38_knownCanonical exon 11106535 11262507 0.000000 . . gene_id "gene1"; transcript_id "tx1"; chr1 hg38_knownCanonical exon 11143898 11149537 0.000000 . . gene_id "gene2"; transcript_id "tx2"; chr1 hg38_knownCanonical exon 11152350 11152452 0.000000 . . gene_id "gene3"; transcript_id "tx3"; chr1 hg38_knownCanonical exon 11189341 11195981 0.000000 . . gene_id "gene4"; transcript_id "tx4"; chr1 hg38_knownCanonical exon 11226254 11226360 0.000000 . . gene_id "gene5"; transcript_id "tx5";
Example: Comparing Table Browser GTF output with genePredToGtf utility output
Table Browser output for ENST00000376819.3.
- Table Browser configuration: hg38, Genes and Gene Predictions, All GENCODE V26, Basic (wgEncodeGencodeBasicV26)
- Identifier pasted in: ENST00000376819.3
chr1 hg38_wgEncodeGencodeBasicV26 start_codon 11189580 11189582 0.000000 + . gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 CDS 11189580 11189955 0.000000 + 0 gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 exon 11189341 11189955 0.000000 + . gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 CDS 11192270 11192370 0.000000 + 2 gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 exon 11192270 11192370 0.000000 + . gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 CDS 11193580 11193774 0.000000 + 0 gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 exon 11193580 11193774 0.000000 + . gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 CDS 11194461 11194659 0.000000 + 0 gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 exon 11194461 11194659 0.000000 + . gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 CDS 11194854 11195020 0.000000 + 2 gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 stop_codon 11195021 11195023 0.000000 + . gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; chr1 hg38_wgEncodeGencodeBasicV26 exon 11194854 11195981 0.000000 + . gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
genePredToGtf utility output
$ genePredToGtf hg38 wgEncodeGencodeBasicV26 utilityOutputBasic26.gtf $ cat utilityOutputBasic26.gtf| grep -w ENST00000376819.3
chr1 wgEncodeGencodeBasicV26 transcript 11189341 11195981 . + . gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 exon 11189341 11189955 . + . gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 CDS 11189580 11189955 . + 0 gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 exon 11192270 11192370 . + . gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "2"; exon_id "ENST00000376819.3.2"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 CDS 11192270 11192370 . + 2 gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "2"; exon_id "ENST00000376819.3.2"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 exon 11193580 11193774 . + . gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "3"; exon_id "ENST00000376819.3.3"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 CDS 11193580 11193774 . + 0 gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "3"; exon_id "ENST00000376819.3.3"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 exon 11194461 11194659 . + . gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "4"; exon_id "ENST00000376819.3.4"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 CDS 11194461 11194659 . + 0 gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "4"; exon_id "ENST00000376819.3.4"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 exon 11194854 11195981 . + . gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 CDS 11194854 11195020 . + 2 gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 start_codon 11189580 11189582 . + 0 gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7"; chr1 wgEncodeGencodeBasicV26 stop_codon 11195021 11195023 . + 0 gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7";
Convert genePred to GTF with the genePredToGtf command line utility
GenePred is a table format commonly used for gene prediction tracks in the UCSC Genome Browser. The genePredToGtf command-line utility can be used to convert genePred to GTF.
While the Table Browser does contain an option to output query results in GTF, the output is limited, and in some cases, may contain bugs. The best method to convert genePred to GTF is the genePredToGtf command-line utility. The operating-specific utility can be downloaded from the utilities directory.
Once downloaded (and permissions changed to executable), you can run the utility without arguments to see the usage statement:
$ genePredToGtf genePredToGtf - Convert genePred table or file to gtf. usage: genePredToGtf database genePredTable output.gtf If database is 'file' then track is interpreted as a file rather than a table in database. options: -utr - Add 5UTR and 3UTR features -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end codon records -source=src set source name to use -addComments - Add comments before each set of transcript records. allows for easier visual inspection Note: use a refFlat table or extended genePred table or file to include the gene_name attribute in the output. This will not work with a refFlat table dump file. If you are using a genePred file that starts with a numeric bin column, drop it using the UNIX cut command: cut -f 2- in.gp | genePredToGtf file stdin out.gp
Using Table Browser output as input for the genePredToGtf
You can use Table Browser output as input for the genePredToGtf utility, but you will need to check that the Table Browser output is indeed in the correct GenPred format. In some cases, you may have trailing columns that need to be removed.
For example,
- From the UCSC Genome Browser, click on "Genome Browser" at the top menu bar, then select "Reset All User Settings" to refresh to the default hg38 assembly and its default position.
- Go to the Table Browser, and keeping all options as default, change only 1 setting: region should be set to "position" instead of genome.
- Accept the default drop-down option for "output format" as "all fields from selected table" and
- Type in a name for "output file" to download your file (e.g., "knownGeneABO.txt").
- Click "get output."
Note that you will have 12 columns; and you will need to remove the last two columns to get genePred format:
cat knownGeneABO.txt | cut -f1-10 > knownGeneABO.genePred
Now convert to GTF, using the "file" argument for genePredToGTF:
genePredToGtf file knownGeneABO.genePred knownGeneABO.gtf
Use genePredToGtf with a downloaded genePred table
You can directly download a table (for example, the knownGene table), which will be in genePred format. You can then use this local file as input for the genePredToGtf conversion.
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz
The SQL structure:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.sql
As noted in the usage message, the file can be used with the command in place of the database table specification. In this case, beware of files that are only partially genePred format. For example, the knownGene.txt.gz file has extra columns after the exonEnds column. Therefore, use cut to extract just the columns for genePred:
$ zcat knownGene.txt.gz | cut -f1-10 | genePredToGtf file stdin knownGene.gtf
Example with downloaded refGene.txt.gz
Here are detailed steps for converting a local hg19 refGene table (in genePred format) to GTF.
1. Download your gene set of interest for hg19. For this example, I'll use the refGene table, but you can choose other gene sets, such as the knownGene table from the "UCSC Genes" track.
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ./
2. Unzip
gzip -d refGene.txt.gz
3. Remove the first "bin" column:
cut -f 2- refGene.txt > refGene.input
4. Convert to gtf:
genePredToGtf file refGene.input hg19refGene.gtf
5. Sort output by chromosome and coordinate
cat hg19refGene.gtf | sort -k1,1 -k4,4 > hg19refGene.gtf.sorted
Example output for hg19refGene.gtf.sorted:
$head hg19refGene.gtf.sorted chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316973"; exon_number "7"; exon_id "NM_001316973.7"; gene_name "LZIC"; chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316975"; exon_number "7"; exon_id "NM_001316975.7"; gene_name "LZIC"; chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316976"; exon_number "5"; exon_id "NM_001316976.5"; gene_name "LZIC"; chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "7"; exon_id "NM_032368.7"; gene_name "LZIC"; chr1 refGene.input CDS 10002739 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC"; chr1 refGene.input exon 10002739 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC"; chr1 refGene.input start_codon 10002791 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC"; chr1 refGene.input exon 10002981 10003083 . + . gene_id "NMNAT1"; transcript_id "NM_001297778"; exon_number "1"; exon_id "NM_001297778.1"; gene_name "NMNAT1"; chr1 refGene.input transcript 10002981 10045556 . + . gene_id "NMNAT1"; transcript_id "NM_001297778"; gene_name "NMNAT1"; chr1 refGene.input exon 10003307 10003485 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "8"; exon_id "NM_032368.8"; gene_name "LZIC";
Using kent commands with the public database server
To use the kent commands with the public database server, add this four line file ".hg.conf" to your home directory:
$ cat $HOME/.hg.conf db.host=genome-mysql.cse.ucsc.edu db.user=genomep db.password=password central.db=hgcentral
And set the permissions:
$ chmod 600 .hg.conf
Now you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file knownGene.gtf:
$ genePredToGtf hg19 knownGene knownGene.gtf
Note that this still produces incorrect GTF, where the gene_id and transcript_id match. You will either need to write a custom script to fix these fields, or use the "file" option described above on a custom knownGene table download to get correct GTF.
Bed format gene tracks (convert bed > genePred > GTF)
Some gene tracks are in a bed format in the database, perhaps with extra columns past the standard bed format. In this case, extract the standard bed columns, convert it to a genePred and then to a gtf. For example wgRna:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N \ -e "select chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd from wgRna;" hg19 \ | bedToGenePred stdin stdout | genePredToGtf file wgRna.gtf
Note that in the above methods, it was necessary to cut columns 1 - 10 to remove the extra trailing columns. With the method detailed here, this cut is not necessary in the case of using the database table since the command can determine from the table structure which columns to use.
Get a genePred file from UCSC MySQL public databases, then convert to GTF
Information about MySQl - http://genome.ucsc.edu/goldenPath/help/mysql.html
MySQL query example to get a genePred file:
$ mysql --host=genome-mysql.soe.ucsc.edu --user=genome -Ne "select a.name, a.chrom, a.strand, a.txStart, a.txEnd,\ a.cdsStart, a.cdsEnd, a.exonCount, a.exonStarts, a.exonEnds, 0 as score, b.geneSymbol from knownGene a join \ kgXref b on a.name=b.kgID" hg19 > hg19.genePred
Next, using the genePredToGtf utility:
genePredToGtf file hg38.genePred hg38.knownGene.gtf
genePred output will look like this:
uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, 0 DDX11L1 uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, 0 DDX11L1 uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, 0 DDX11L1
Result:
chr1 hg19.genePred transcript 11874 14409 . + . gene_id "DDX11L1"; transcript_id "uc001aaa.3"; gene_name "DDX11L1"; chr1 hg19.genePred exon 11874 12227 . + . gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "1"; exon_id "uc001aaa.3.1"; gene_name "DDX11L1"; chr1 hg19.genePred exon 12613 12721 . + . gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "2"; exon_id "uc001aaa.3.2"; gene_name "DDX11L1"; chr1 hg19.genePred exon 13221 14409 . + . gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "3"; exon_id "uc001aaa.3.3"; gene_name "DDX11L1";
The opposite direction, GTF to GenePred
There is a utility for this as well: gtfToGenePred. Here are some examples of using this utility:
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.long_noncoding_RNAs.gtf.gz $ gzip -cd gencode.v27.long_noncoding_RNAs.gtf.gz > gencodeLNCRNA.gtf $ head -8 gencodeLNCRNA.gtf ##description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90) - long non-coding RNAs ##provider: GENCODE ##contact: gencode-help@sanger.ac.uk ##format: gtf ##date: 2017-08-01 chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2"; chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1"; chr1 HAVANA exon 29554 30039 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1"; # BASIC USAGE $ gtfToGenePred gencodeLNCRNA.gtf gencodeLNCRNA.genePred $ head -1 gencodeLNCRNA.genePred ENST00000473358.1 chr1 + 29553 31097 31097 31097 3 29553,30563,30975, 30039,30667,31097, # EXTENDED USAGE $ gtfToGenePred -genePredExt gencodeLNCRNA.gtf gencodeLNCRNA.genePred head -1 gencodeLNCRNA.genePred ENST00000473358.1 chr1 + 29553 31097 31097 31097 3 29553,30563,30975, 30039,30667,31097, 0 ENSG00000243485.5 none none -1,-1,-1,
For a full list of options available to gtfToGenePred, run the program with no args:
gtfToGenePred - convert a GTF file to a genePred usage: gtfToGenePred gtf genePred options: -genePredExt - create a extended genePred, including frame information and gene name -allErrors - skip groups with errors rather than aborting. Useful for getting infomation about as many errors as possible. -ignoreGroupsWithoutExons - skip groups contain no exons rather than generate an error. -infoOut=file - write a file with information on each transcript -sourcePrefix=pre - only process entries where the source name has the specified prefix. May be repeated. -impliedStopAfterCds - implied stop codon in after CDS -simple - just check column validity, not hierarchy, resulting genePred may be damaged -geneNameAsName2 - if specified, use gene_name for the name2 field instead of gene_id. -includeVersion - it gene_version and/or transcript_version attributes exist, include the version in the corresponding identifiers.
Mailing List Resources about GTF
Please also try searching our mailing list for previous answers that may be of interest.
- Here is an example of searching for genePredToGtf.
- Here is an example answer about using a MySQL query to build a new genePred format with a query using the kgXref (knownGene cross reference) table.