Genes in gtf or gff format: Difference between revisions

Latest revision as of 00:04, 5 February 2020

GTF Downloads Directory

There are pre-made GTF format files for every assembly that has knownGene, ncbiRefSeq, refGene, and Ensembl data.
These can be found at the following download server address:
```
http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/
```
where $db is the assembly of interest.
For example, the hg38 GTF files: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
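
The address pattern lends itself to scripting. The following is a minimal sketch, assuming a POSIX shell; the helper name gtf_dir_url is hypothetical, and it only builds the directory address without assuming any file names inside it.

```shell
# Hypothetical helper (not a UCSC tool): build the GTF download
# directory URL for a given assembly name.
gtf_dir_url() {
  printf 'http://hgdownload.soe.ucsc.edu/goldenPath/%s/bigZips/genes/\n' "$1"
}

gtf_dir_url hg38
gtf_dir_url hg19
```

The resulting address can be opened in a browser or passed to a download tool to list the available files.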

Summary of limitations for Table Browser GTF output

The Table Browser has transcript IDs only, so although it includes both "gene_id" and "transcript_id" fields in its output, the value for transcript ID (e.g., ENST#) is used for both fields.
The Table Browser adds start and stop codon annotations whether or not the transcript alignment includes proper start and stop codons.
Some tables in older genome assemblies are not supported.
Tables not in genePred format (e.g., knownCanonical) will produce unexpected GTF output, in addition to the other limitations listed here.
See also the FAQ entry "Issue with stop codons in GTF output from Table Browser": https://genome.ucsc.edu/FAQ/FAQtracks#tracks18

Warning - using a non genePred table to get GTF output in the Table Browser

A genePred table (such as knownGene) is needed to get GTF output in the Table Browser. Below is an example of output for the knownCanonical table, which is NOT in genePred format. Although the Table Browser GTF output labels each knownCanonical record as an "exon", that label is only a placeholder: each record is the start-stop region of a whole transcript, not an exon.

For example, if you do a cart reset (top menu > Genome Browser > Reset All User Settings) and go to the default region (chr1:11102837-11267747) in hg38, then go to the Table Browser, and then get all fields for knownCanonical (limit to default region, not genome), you'll get this output:

#chrom	chromStart	chromEnd	clusterId	transcript	protein
chr1	11106534	11262507	17297	uc001asd.4	ENSG00000198793.12
chr1	11143897	11149537	24285	uc031plf.2	ENSG00000225602.5
chr1	11152349	11152452	33500	uc057cga.1	ENSG00000253086.1
chr1	11189340	11195981	13013	uc001ase.5	ENSG00000171819.4
chr1	11226253	11226360	20530	uc057cgc.1	ENSG00000207451.1

GTF output for that same region will be:

chr1	hg38_knownCanonical	exon	11106535	11262507	0.000000	.	.	gene_id "gene1"; transcript_id "tx1"; 
chr1	hg38_knownCanonical	exon	11143898	11149537	0.000000	.	.	gene_id "gene2"; transcript_id "tx2"; 
chr1	hg38_knownCanonical	exon	11152350	11152452	0.000000	.	.	gene_id "gene3"; transcript_id "tx3"; 
chr1	hg38_knownCanonical	exon	11189341	11195981	0.000000	.	.	gene_id "gene4"; transcript_id "tx4"; 
chr1	hg38_knownCanonical	exon	11226254	11226360	0.000000	.	.	gene_id "gene5"; transcript_id "tx5";

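Note that the GTF regions are not exon regions; they are start-stop regions. Note also that GTF is 1-based, unlike the 0-based "all fields" output, which is why each start coordinate in the GTF output is one greater than in the "all fields" output. See the coordinate blog (http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/) for more information about 0-based vs 1-based coordinate systems.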
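
The coordinate difference can be reproduced with a one-line awk filter. This is only a sketch, assuming tab-separated "all fields" output on standard input; the helper name zero_to_one_based is hypothetical.

```shell
# Hypothetical helper: shift a 0-based start (column 2) to a 1-based start.
# The end coordinate (column 3) is the same number in both systems.
# Lines beginning with '#' (the Table Browser header) are skipped.
zero_to_one_based() {
  awk 'BEGIN { FS = OFS = "\t" } !/^#/ { print $1, $2 + 1, $3 }'
}

# First knownCanonical row from the example above.
printf 'chr1\t11106534\t11262507\t17297\tuc001asd.4\tENSG00000198793.12\n' \
  | zero_to_one_based
```

For that row the start becomes 11106535, matching the first GTF line shown above.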

Example: Comparing Table Browser GTF output with genePredToGtf utility output

Table Browser output for ENST00000376819.3.

Table Browser configuration: hg38, Genes and Gene Predictions, All GENCODE V26, Basic (wgEncodeGencodeBasicV26)
Identifier pasted in: ENST00000376819.3

chr1	hg38_wgEncodeGencodeBasicV26	start_codon	11189580	11189582	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	CDS	11189580	11189955	0.000000	+	0	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	exon	11189341	11189955	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	CDS	11192270	11192370	0.000000	+	2	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	exon	11192270	11192370	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	CDS	11193580	11193774	0.000000	+	0	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	exon	11193580	11193774	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	CDS	11194461	11194659	0.000000	+	0	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	exon	11194461	11194659	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	CDS	11194854	11195020	0.000000	+	2	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	stop_codon	11195021	11195023	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3"; 
chr1	hg38_wgEncodeGencodeBasicV26	exon	11194854	11195981	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";

genePredToGtf utility output

$ genePredToGtf hg38 wgEncodeGencodeBasicV26 utilityOutputBasic26.gtf

$ grep -w ENST00000376819.3 utilityOutputBasic26.gtf

chr1	wgEncodeGencodeBasicV26	transcript	11189341	11195981	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3";  gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	exon	11189341	11189955	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	CDS	11189580	11189955	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	exon	11192270	11192370	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "2"; exon_id "ENST00000376819.3.2"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	CDS	11192270	11192370	.	+	2	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "2"; exon_id "ENST00000376819.3.2"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	exon	11193580	11193774	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "3"; exon_id "ENST00000376819.3.3"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	CDS	11193580	11193774	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "3"; exon_id "ENST00000376819.3.3"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	exon	11194461	11194659	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "4"; exon_id "ENST00000376819.3.4"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	CDS	11194461	11194659	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "4"; exon_id "ENST00000376819.3.4"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	exon	11194854	11195981	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	CDS	11194854	11195020	.	+	2	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	start_codon	11189580	11189582	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7";
chr1	wgEncodeGencodeBasicV26	stop_codon	11195021	11195023	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7";
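
One way to tell which kind of file you have is to count the records where gene_id and transcript_id carry the same value, as they do in the Table Browser output. The sketch below assumes a tab-separated GTF whose ninth column holds the attributes; the helper name count_same_ids is hypothetical.

```shell
# Hypothetical helper: count GTF lines whose gene_id equals transcript_id.
# Reads the files given as arguments, or standard input if there are none.
count_same_ids() {
  awk -F'\t' '{
    gene = ""; tx = ""
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
      a = attrs[i]
      sub(/^ +/, "", a)                      # trim leading spaces
      if (a ~ /^gene_id /)       { gene = a; sub(/^gene_id /, "", gene) }
      if (a ~ /^transcript_id /) { tx = a;   sub(/^transcript_id /, "", tx) }
    }
    if (gene != "" && gene == tx) same++
  } END { print same + 0 }' "$@"
}

# One Table Browser style line and one genePredToGtf style line.
printf 'chr1\tsrc\texon\t11189341\t11189955\t.\t+\t.\tgene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";\nchr1\tsrc\texon\t11189341\t11189955\t.\t+\t.\tgene_id "ANGPTL7"; transcript_id "ENST00000376819.3";\n' \
  | count_same_ids
```

A count equal to the number of lines in the file suggests the gene_id column carries no gene-level information.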

Convert genePred to GTF with the genePredToGtf command line utility

GenePred is a table format commonly used for gene prediction tracks in the UCSC Genome Browser. The genePredToGtf command-line utility can be used to convert genePred to GTF.

While the Table Browser does have an option to output query results in GTF, the output is limited and, in some cases, may contain bugs. The best method to convert genePred to GTF is the genePredToGtf command-line utility. The utility for your operating system can be downloaded from the utilities directory (http://hgdownload.soe.ucsc.edu/admin/exe/).

Once downloaded (and permissions changed to executable), you can run the utility without arguments to see the usage statement:

$ genePredToGtf
genePredToGtf - Convert genePred table or file to gtf.
usage:
   genePredToGtf database genePredTable output.gtf
If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr - Add 5UTR and 3UTR features
   -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
    codon records
   -source=src set source name to use
   -addComments - Add comments before each set of transcript records.
    allows for easier visual inspection
Note: use a refFlat table or extended genePred table or file to include
the gene_name attribute in the output.  This will not work with a refFlat
table dump file. If you are using a genePred file that starts with a numeric
bin column, drop it using the UNIX cut command:
    cut -f 2- in.gp | genePredToGtf file stdin out.gp

Using Table Browser output as input for genePredToGtf

You can use Table Browser output as input for the genePredToGtf utility, but you will need to check that the Table Browser output is indeed in the correct genePred format. In some cases, you may have trailing columns that need to be removed.

For example,

From the UCSC Genome Browser, click on "Genome Browser" in the top menu bar, then select "Reset All User Settings" to return to the default hg38 assembly and its default position.
Go to the Table Browser and, keeping all other options at their defaults, change only one setting: set region to "position" instead of "genome".
Keep the default "output format" option, "all fields from selected table".
Type in a name for "output file" to download your file (e.g., "knownGeneABO.txt").
Click "get output."

The file will have 12 columns; remove the last two columns to get genePred format:

cat knownGeneABO.txt | cut -f1-10 > knownGeneABO.genePred

Now convert to GTF, using the "file" argument for genePredToGtf:

genePredToGtf file knownGeneABO.genePred knownGeneABO.gtf
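
The column trimming can be wrapped in a small filter that also drops the "#" header line the Table Browser writes at the top of "all fields" output. This is a sketch only: the helper name tb_to_genepred is hypothetical, and whether genePredToGtf tolerates the header line on its own is not confirmed here, so removing it is a precaution.

```shell
# Hypothetical helper: drop '#' header lines and keep the first
# 10 columns, which are the genePred fields.
tb_to_genepred() {
  grep -v '^#' | cut -f1-10
}

# Illustrative row: the last two values (P1, A1) are made-up placeholders
# standing in for the two extra trailing columns.
printf '#header line\nuc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\tP1\tA1\n' \
  | tb_to_genepred
```

With a real download this would be used as: tb_to_genepred < knownGeneABO.txt > knownGeneABO.genePred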

Use genePredToGtf with a downloaded genePred table

You can directly download a table (for example, the knownGene table), which will be in genePred format. You can then use this local file as input for the genePredToGtf conversion.

ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz

The SQL structure:

ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.sql

As noted in the usage message, the file can be used with the command in place of the database table specification. In this case, beware of files that are only partially genePred format. For example, the knownGene.txt.gz file has extra columns after the exonEnds column. Therefore, use cut to extract just the columns for genePred:

$ zcat knownGene.txt.gz | cut -f1-10 | genePredToGtf file stdin knownGene.gtf

Example with downloaded refGene.txt.gz

Here are detailed steps for converting a local hg19 refGene table (in genePred format) to GTF.

1. Download your gene set of interest for hg19. For this example, I'll use the refGene table, but you can choose other gene sets, such as the knownGene table from the "UCSC Genes" track.

rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ./

2. Unzip

gzip -d refGene.txt.gz

3. Remove the first "bin" column:

cut -f 2- refGene.txt > refGene.input

4. Convert to gtf:

genePredToGtf file refGene.input hg19refGene.gtf

5. Sort output by chromosome and coordinate

cat hg19refGene.gtf  | sort -k1,1 -k4,4n > hg19refGene.gtf.sorted

Example output for hg19refGene.gtf.sorted:

$ head hg19refGene.gtf.sorted
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316973"; exon_number "7"; exon_id "NM_001316973.7"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316975"; exon_number "7"; exon_id "NM_001316975.7"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316976"; exon_number "5"; exon_id "NM_001316976.5"; gene_name "LZIC";
chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "7"; exon_id "NM_032368.7"; gene_name "LZIC";
chr1 refGene.input CDS 10002739 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input exon 10002739 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input start_codon 10002791 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
chr1 refGene.input exon 10002981 10003083 . + . gene_id "NMNAT1"; transcript_id "NM_001297778"; exon_number "1"; exon_id "NM_001297778.1"; gene_name "NMNAT1";
chr1 refGene.input transcript 10002981 10045556 . + . gene_id "NMNAT1"; transcript_id "NM_001297778";  gene_name "NMNAT1";
chr1 refGene.input exon 10003307 10003485 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "8"; exon_id "NM_032368.8"; gene_name "LZIC";
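
Step 3 assumes the downloaded file has a leading "bin" column. When it is unclear whether a given dump has one, a guarded filter can remove it only where present. This sketch relies on a heuristic that is an assumption, not a UCSC rule: a first column made up only of digits is treated as a bin. The helper name drop_bin_if_present is hypothetical.

```shell
# Hypothetical helper: remove column 1 only when it is purely numeric
# (assumed to be a bin column); other lines pass through unchanged.
drop_bin_if_present() {
  awk 'BEGIN { FS = OFS = "\t" }
       $1 ~ /^[0-9]+$/ { $1 = ""; sub(/^\t/, "") }
       { print }'
}

# Made-up rows for illustration: one with a bin column, one without.
printf '585\ttx1\tchr1\t+\t100\t200\n'  | drop_bin_if_present
printf 'tx1\tchr1\t+\t100\t200\n'       | drop_bin_if_present
```

Both calls print the same row, starting with the transcript name.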

Using kent commands with the public database server

To use the kent commands with the public database server, add this four-line file, ".hg.conf", to your home directory. One way is to use the echo command with >> to append each line to $HOME/.hg.conf:

echo db.host=genome-mysql.soe.ucsc.edu >> $HOME/.hg.conf
echo db.user=genomep >> $HOME/.hg.conf
echo db.password=password >> $HOME/.hg.conf
echo central.db=hgcentral >> $HOME/.hg.conf

Check your work with the following command:

cat $HOME/.hg.conf
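
Because >> appends, running the echo commands twice leaves duplicate lines in the file. An alternative sketch writes all four settings in one step and overwrites any earlier copy; the helper name write_hg_conf is hypothetical, and the settings are the same four shown above.

```shell
# Hypothetical helper: write the four settings to the given path,
# replacing any existing content, and restrict the file permissions.
write_hg_conf() {
  cat > "$1" <<'EOF'
db.host=genome-mysql.soe.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
EOF
  chmod 600 "$1"
}

# Demonstrated on a temporary file; pass "$HOME/.hg.conf" for real use.
conf=$(mktemp)
write_hg_conf "$conf"
cat "$conf"
rm -f "$conf"
```

Note that this replaces an existing .hg.conf, so back up the file first if it holds other settings.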

Download the genePredToGtf utility from the command line:

# for macOS
wget http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/genePredToGtf
# for Linux
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf

Set the permissions:

 chmod 600 $HOME/.hg.conf
 chmod +x genePredToGtf

Now you can use the genePredToGtf command to pull gene files directly from the UCSC public database and convert them to GTF format. For example, fetch NCBI's refGene track from hg38 and save to a local file named refGene.gtf:

 ./genePredToGtf hg38 refGene refGene.gtf

Note: The GTF files on the UCSC download server were created using the -utr flag, which adds 5UTR and 3UTR feature records to the output:

 ./genePredToGtf -utr hg38 refGene refGene.gtf
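
To see what a given run produced, the feature types in the third GTF column can be tallied. The sketch below assumes a tab-separated GTF and that the UTR records are named 5UTR and 3UTR, as in the usage statement; the helper name count_features and the sample rows are made up for illustration.

```shell
# Hypothetical helper: tally GTF records by feature type (column 3).
# Reads the files given as arguments, or standard input if there are none.
count_features() {
  cut -f3 "$@" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Made-up sample rows; only the third column matters here.
printf 'chr1\tsrc\texon\nchr1\tsrc\t5UTR\nchr1\tsrc\tCDS\nchr1\tsrc\texon\nchr1\tsrc\t3UTR\n' \
  | count_features
```

On a real file, for example count_features refGene.gtf, the absence of 5UTR and 3UTR rows would indicate the file was made without -utr.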

Bed format gene tracks (convert bed > genePred > GTF)

Some gene tracks are stored in bed format in the database, sometimes with extra columns beyond the standard bed fields. In this case, extract the standard bed columns, convert them to genePred, and then to GTF. For example, for wgRna:

mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N \
  -e "select chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd from wgRna;" hg19 \
   | bedToGenePred stdin stdout | genePredToGtf file stdin wgRna.gtf

Note that the file-based methods above needed cut to remove extra columns (the leading bin column, or trailing columns beyond the genePred fields). When genePredToGtf reads a database table directly, this cut is not necessary, since the command can determine from the table structure which columns to use.

Get a genePred file from UCSC MySQL public databases, then convert to GTF

Information about MySQL access: http://genome.ucsc.edu/goldenPath/help/mysql.html

MySQL query example to get a genePred file:

$ mysql --host=genome-mysql.soe.ucsc.edu --user=genome -Ne "select a.name, a.chrom, a.strand, a.txStart, a.txEnd,\
a.cdsStart, a.cdsEnd, a.exonCount, a.exonStarts, a.exonEnds, 0 as score, b.geneSymbol from knownGene a join \
kgXref b on a.name=b.kgID" hg19 > hg19.genePred

Next, convert the file using the genePredToGtf utility:

genePredToGtf file hg19.genePred hg19.knownGene.gtf

The genePred input file (hg19.genePred) will look like this:

uc001aaa.3    chr1    +    11873    14409    11873    11873    3    11873,12612,13220,    12227,12721,14409,    0    DDX11L1
uc010nxr.1    chr1    +    11873    14409    11873    11873    3    11873,12645,13220,    12227,12697,14409,    0    DDX11L1
uc010nxq.1    chr1    +    11873    14409    12189    13639    3    11873,12594,13402,    12227,12721,14409,    0    DDX11L1

Result:

chr1    hg19.genePred    transcript    11874    14409    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3";  gene_name "DDX11L1";
chr1    hg19.genePred    exon    11874    12227    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "1"; exon_id "uc001aaa.3.1"; gene_name "DDX11L1";
chr1    hg19.genePred    exon    12613    12721    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "2"; exon_id "uc001aaa.3.2"; gene_name "DDX11L1";
chr1    hg19.genePred    exon    13221    14409    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "3"; exon_id "uc001aaa.3.3"; gene_name "DDX11L1";
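
The relationship between the genePred rows and the GTF exon records can be checked by hand: each exon start in the exonStarts list is 0-based, so it is one less than the GTF start, while the exon ends are unchanged. The sketch below assumes tab-separated genePred input with exonStarts and exonEnds in columns 9 and 10; the helper name genepred_exons is hypothetical.

```shell
# Hypothetical helper: list each exon of a genePred row with a
# 1-based start, as chrom, start, end, transcript name.
genepred_exons() {
  awk 'BEGIN { FS = OFS = "\t" } {
    n = split($9, starts, ",")
    split($10, ends, ",")
    for (i = 1; i <= n; i++)
      if (starts[i] != "")          # skip the empty item after the final comma
        print $2, starts[i] + 1, ends[i], $1
  }'
}

# First genePred row from the example above.
printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\t0\tDDX11L1\n' \
  | genepred_exons
```

The three intervals printed (11874-12227, 12613-12721, 13221-14409) agree with the exon records in the GTF result.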

The opposite direction, GTF to GenePred

There is a utility for this as well: gtfToGenePred. Here are some examples of using this utility:

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/MANE.GRCh38.v0.5.select_ensembl_genomic.gtf.gz
$ gunzip MANE.GRCh38.v0.5.select_ensembl_genomic.gtf.gz
$ head -n1 MANE.GRCh38.v0.5.select_ensembl_genomic.gtf 
chr1	ensembl_havana	gene	944203	959309	.	-	.	gene_id "ENSG00000188976.11"; gene_type "protein_coding"; gene_name "NOC2L";

# BASIC USAGE
$ gtfToGenePred MANE.GRCh38.v0.5.select_ensembl_genomic.gtf MANE.GRCh38.v0.5.select_ensembl_genomic.genePred
$ head -1 MANE.GRCh38.v0.5.select_ensembl_genomic.genePred
ENST00000327044.7	chr1	-	944202	959256	944693	959240	19	944202,945056,945517,946172,946401,948130,948489,951126,951999,952411,953174,953781,954003,955922,956094,956893,957098,958928,959214,	944800,945146,945653,946286,946545,948232,948603,951238,952139,952600,953288,953892,954082,956013,956215,957025,957273,959081,959256,

# EXTENDED USAGE
$ gtfToGenePred -genePredExt MANE.GRCh38.v0.5.select_ensembl_genomic.gtf MANE.GRCh38.v0.5.select_ensembl_genomic.genePredExt
$ head -1 MANE.GRCh38.v0.5.select_ensembl_genomic.genePredExt
ENST00000327044.7	chr1	-	944202	959256	944693	959240	19	944202,945056,945517,946172,946401,948130,948489,951126,951999,952411,953174,953781,954003,955922,956094,956893,957098,958928,959214,	944800,945146,945653,946286,946545,948232,948603,951238,952139,952600,953288,953892,954082,956013,956215,957025,957273,959081,959256,	ENSG00000188976.11	cmpl	cmpl	1,1,0,0,0,0,0,2,0,0,0,0,2,1,0,0,2,2,0,

For a full list of options available to gtfToGenePred, run the program with no args:

gtfToGenePred - convert a GTF file to a genePred
usage:
   gtfToGenePred gtf genePred

options:
     -genePredExt - create a extended genePred, including frame
      information and gene name
     -allErrors - skip groups with errors rather than aborting.
      Useful for getting infomation about as many errors as possible.
     -ignoreGroupsWithoutExons - skip groups contain no exons rather than
      generate an error.
     -infoOut=file - write a file with information on each transcript
     -sourcePrefix=pre - only process entries where the source name has the
      specified prefix.  May be repeated.
     -impliedStopAfterCds - implied stop codon in after CDS
     -simple    - just check column validity, not hierarchy, resulting genePred may be damaged
     -geneNameAsName2 - if specified, use gene_name for the name2 field
      instead of gene_id.
     -includeVersion - it gene_version and/or transcript_version attributes exist, include the version
      in the corresponding identifiers.

Mailing List Resources about GTF

Please also try searching our mailing list for previous answers that may be of interest.

Here is an example of searching for genePredToGtf.
Here is an example answer about using a MySQL query to build a new genePred format with a query using the kgXref (knownGene cross reference) table.

Scripting examples

Scripting to add IDs and other fields into the header of an .fa sequence file

Link to an archived help forum topic

Perl script to find and replace the "gene ID" with the Ensembl ID, which is named "transcript_id."

Example output before running the script. This is output from the command genePredToGtf hg38 wgEncodeGencodeCompV24 hg38FileTest.gtf:

chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 exon 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; exon_number "1"; exon_id "ENST00000619216.1.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 transcript 29554 31097 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 29554 30039 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "1"; exon_id "ENST00000473358.1.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 30564 30667 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "2"; exon_id "ENST00000473358.1.2"; gene_name "RP11-34P13.3";

The perl script, run on the GTF file:

perl -wpe 's/gene_id "[^"]+"; transcript_id "([^"]+)"/gene_id "$1"; transcript_id "$1"/;' hg38FileTest.gtf

Example output after running the script:

chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "ENST00000619216.1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 exon 17369 17436 . - . gene_id "ENST00000619216.1"; transcript_id "ENST00000619216.1"; exon_number "1"; exon_id "ENST00000619216.1.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 transcript 29554 31097 . + . gene_id "ENST00000473358.1"; transcript_id "ENST00000473358.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 29554 30039 . + . gene_id "ENST00000473358.1"; transcript_id "ENST00000473358.1"; exon_number "1"; exon_id "ENST00000473358.1.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 30564 30667 . + . gene_id "ENST00000473358.1"; transcript_id "ENST00000473358.1"; exon_number "2"; exon_id "ENST00000473358.1.2"; gene_name "RP11-34P13.3";
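
Where perl is not available, the same substitution can be sketched with sed. This assumes a sed that supports the -E option for extended regular expressions (GNU and BSD sed both do); the helper name gene_id_from_transcript is hypothetical.

```shell
# Hypothetical helper: copy the transcript_id value into gene_id.
# Reads the files given as arguments, or standard input if there are none.
gene_id_from_transcript() {
  sed -E 's/gene_id "[^"]+"; transcript_id "([^"]+)"/gene_id "\1"; transcript_id "\1"/' "$@"
}

# First line of the example output shown before the script.
printf '%s\n' 'chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";' \
  | gene_id_from_transcript
```

As with the perl version, the gene_name attribute is left untouched, so the gene symbol is still available in the output.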

@@ Line 1: / Line 1: @@
-==genePredToGtf==
+==GTF Downloads Directory==
-UCSC does not keep gene structures in GTF format, we use a single line format for a single gene
+* There are pre-made GTF format files for every assembly that has knownGene, ncbiRefSeq, refGene, and Ensembl data.
-with all the information about that gene in the single line:
+* These can be found at the following download server address: <pre>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/ </pre>where $db is the assembly of interest.
-[http://genome.ucsc.edu/FAQ/FAQformat.html#format9 GenePred format.]
+* For example, the hg38 GTF files: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
-Extracting GTF format files from the genePred format can be performed with the '''genePredToGtf''': [http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ kent command utility.]
+==Summary of limitations for Table Browser GTF output==
-At this time, this '''genePredToGtf''' command can provide better GTF files than available from the table browser.
+* The Table Browser has transcript IDs only, so although it includes both "gene_id" and "transcript_id" fields in its output, the value for transcript ID (e.g., ENST#) is used for both fields.
+* The Table Browser adds start and stop codon annotations whether or not the transcript alignment includes proper start and stop codons.
+* Some tables in older genome assemblies are not supported.
+* Tables not in genePred format (e.g., knownCanonical) will produce unexpected GTF output, in addition to the other "known-limitations for Table Browser GTF output" listed here.
+* [https://genome.ucsc.edu/FAQ/FAQtracks#tracks18 Issue with stop codons in GTF output from Table Browser]
-To use the kent commands with the public database server, add this three line file ".hg.conf" to your home directory:
+==Warning - using a non genePred table to get GTF output in the Table Browser==
+A genePred table (such as knownGene) is needed to get GTF output in the Table Browser. Below is an example of output for the knownCanonical table, which is NOT in genePred format.
+Even though the TB GTF says "exons" for knownCanonical GTF output, it's really just a placeholder, not exons at all, but rather start-stop regions of the transcripts.
- $ cat $HOME/.hg.conf
+For example, if you do a cart reset (top menu > Genome Browser > Reset All User Settings) and go to the default region (chr1:11102837-11267747) in hg38, then go to the Table Browser, and then get all fields for knownCanonical (limit to default region, not genome), you'll get this output:
- db.host=genome-mysql.cse.ucsc.edu
+<pre>
- db.user=genomep
+#chrom	chromStart	chromEnd	clusterId	transcript	protein
- db.password=password
+chr1	11106534	11262507	17297	uc001asd.4	ENSG00000198793.12
- central.db=hgcentral
+chr1	11143897	11149537	24285	uc031plf.2	ENSG00000225602.5
+chr1	11152349	11152452	33500	uc057cga.1	ENSG00000253086.1
+chr1	11189340	11195981	13013	uc001ase.5	ENSG00000171819.4
+chr1	11226253	11226360	20530	uc057cgc.1	ENSG00000207451.1
+</pre>
-And set the permissions:
+GTF output for that same region will be:
- $ chmod 600 .hg.conf
-Now you can use the command to extract GTF files directly from the UCSC database.
+<pre>
-For example, fetch the UCSC gene track from hg19 into the local file knownGene.gtf:
+chr1	hg38_knownCanonical	exon	11106535	11262507	0.000000	.	.	gene_id "gene1"; transcript_id "tx1";
+chr1	hg38_knownCanonical	exon	11143898	11149537	0.000000	.	.	gene_id "gene2"; transcript_id "tx2";
+chr1	hg38_knownCanonical	exon	11152350	11152452	0.000000	.	.	gene_id "gene3"; transcript_id "tx3";
+chr1	hg38_knownCanonical	exon	11189341	11195981	0.000000	.	.	gene_id "gene4"; transcript_id "tx4";
+chr1	hg38_knownCanonical	exon	11226254	11226360	0.000000	.	.	gene_id "gene5"; transcript_id "tx5";
+</pre>
- $ genePredToGtf hg19 knownGene knownGene.gtf
+Note that the GTF regions are not exon regions, they are start-stop regions. Note also that GTF is 1-based, unlike 0-based "all fields" output. See the [http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/ coordinate blog] for more information about 0-based vs 1-based coord systems.
-Note the usage message from the command:
+==Example: Comparing Table Browser GTF output with genePredToGtf utility output==
- genePredToGtf - Convert genePred table or file to gtf.
+'''Table Browser output for ENST00000376819.3.'''
- usage:
+* Table Browser configuration: hg38, Genes and Gene Predictions, All GENCODE V26, Basic (wgEncodeGencodeBasicV26)
-     genePredToGtf database genePredTable output.gtf
+* Identifier pasted in: ENST00000376819.3
- If database is 'file' then track is interpreted as a file
- rather than a table in database.
- options:
-    -utr - Add 5UTR and 3UTR features
-    -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
-     codon records
-    -source=src set source name to uses
-    -addComments - Add comments before each set of transcript records.
-     allows for easier visual inspection
- Note: use refFlat or extended genePred table to include geneName
-==text file dumps of gene tracks==
+<pre>
-You can also fetch the database text dump of the genePred content for
+chr1	hg38_wgEncodeGencodeBasicV26	start_codon	11189580	11189582	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
-the track to have the file on-hand locally:
+chr1	hg38_wgEncodeGencodeBasicV26	CDS	11189580	11189955	0.000000	+	0	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	exon	11189341	11189955	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	CDS	11192270	11192370	0.000000	+	2	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	exon	11192270	11192370	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	CDS	11193580	11193774	0.000000	+	0	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	exon	11193580	11193774	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	CDS	11194461	11194659	0.000000	+	0	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	exon	11194461	11194659	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	CDS	11194854	11195020	0.000000	+	2	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	stop_codon	11195021	11195023	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+chr1	hg38_wgEncodeGencodeBasicV26	exon	11194854	11195981	0.000000	+	.	gene_id "ENST00000376819.3"; transcript_id "ENST00000376819.3";
+</pre>
+'''genePredToGtf utility output'''
+<pre>
+$ genePredToGtf hg38 wgEncodeGencodeBasicV26 utilityOutputBasic26.gtf
+$ cat utilityOutputBasic26.gtf| grep -w ENST00000376819.3
+</pre>
+<pre>
+chr1	wgEncodeGencodeBasicV26	transcript	11189341	11195981	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3";  gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	exon	11189341	11189955	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	CDS	11189580	11189955	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	exon	11192270	11192370	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "2"; exon_id "ENST00000376819.3.2"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	CDS	11192270	11192370	.	+	2	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "2"; exon_id "ENST00000376819.3.2"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	exon	11193580	11193774	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "3"; exon_id "ENST00000376819.3.3"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	CDS	11193580	11193774	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "3"; exon_id "ENST00000376819.3.3"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	exon	11194461	11194659	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "4"; exon_id "ENST00000376819.3.4"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	CDS	11194461	11194659	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "4"; exon_id "ENST00000376819.3.4"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	exon	11194854	11195981	.	+	.	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	CDS	11194854	11195020	.	+	2	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	start_codon	11189580	11189582	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "1"; exon_id "ENST00000376819.3.1"; gene_name "ANGPTL7";
+chr1	wgEncodeGencodeBasicV26	stop_codon	11195021	11195023	.	+	0	gene_id "ANGPTL7"; transcript_id "ENST00000376819.3"; exon_number "5"; exon_id "ENST00000376819.3.5"; gene_name "ANGPTL7";
+</pre>
+==Convert genePred to GTF with the genePredToGtf command line utility==
+[http://genome.ucsc.edu/FAQ/FAQformat.html#format9 GenePred] is a table format commonly used for gene prediction tracks in the UCSC Genome Browser. The genePredToGtf command-line utility can be used to convert genePred to GTF.
+While the [http://genome.ucsc.edu/cgi-bin/hgTables Table Browser] does contain an option to output query results in GTF, the output is limited and, in some cases, may contain bugs. The best method to convert genePred to GTF is the genePredToGtf command-line utility. The operating-system-specific utility can be downloaded from the [http://hgdownload.soe.ucsc.edu/admin/exe/ utilities directory].
+Once downloaded (and permissions changed to executable), you can run the utility without arguments to see the usage statement:
+<pre>
+$ genePredToGtf
+genePredToGtf - Convert genePred table or file to gtf.
+usage:
+   genePredToGtf database genePredTable output.gtf
+If database is 'file' then track is interpreted as a file
+rather than a table in database.
+options:
+   -utr - Add 5UTR and 3UTR features
+   -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
+    codon records
+   -source=src set source name to use
+   -addComments - Add comments before each set of transcript records.
+    allows for easier visual inspection
+Note: use a refFlat table or extended genePred table or file to include
+the gene_name attribute in the output.  This will not work with a refFlat
+table dump file. If you are using a genePred file that starts with a numeric
+bin column, drop it using the UNIX cut command:
+    cut -f 2- in.gp | genePredToGtf file stdin out.gp
+</pre>
+==Using Table Browser output as input for the genePredToGtf utility==
+You can use Table Browser output as input for the genePredToGtf utility, but you will need to check that the Table Browser output is in the correct genePred format. In some cases, you may have trailing columns that need to be removed.
+For example,
+# From the UCSC Genome Browser, click on "Genome Browser" at the top menu bar, then select "Reset All User Settings" to refresh to the default hg38 assembly and its default position.
+# Go to the Table Browser, and keeping all options as default, change only 1 setting: region should be set to "position" instead of genome.
+# Accept the default drop-down option for "output format" as "all fields from selected table."
+# Type in a name for "output file" to download your file (e.g., "knownGeneABO.txt").
+# Click "get output."
+Note that the downloaded file will have 12 columns; remove the last two columns to get genePred format:
+<pre>
+cat knownGeneABO.txt | cut -f1-10 > knownGeneABO.genePred
+</pre>
+Now convert to GTF, using the "file" argument for genePredToGtf:
+<pre>
+genePredToGtf file knownGeneABO.genePred knownGeneABO.gtf
+</pre>
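Before cutting, it can help to confirm how many columns the download actually has. The following is a minimal sketch rather than part of the original instructions: it builds a small made-up 12-column, tab-separated file (the file name sample.txt and the values extra1 and extra2 are hypothetical stand-ins for a Table Browser download), reports the distinct column counts, and keeps only the first 10 columns.

```shell
#!/bin/sh
# Hypothetical stand-in for a Table Browser download: one 12-column, tab-separated row.
workdir=$(mktemp -d)
printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\textra1\textra2\n' > "$workdir/sample.txt"

# Report each distinct column count; a consistent file prints a single number.
awk -F'\t' '{ print NF }' "$workdir/sample.txt" | sort -u

# Keep the first 10 columns (the genePred fields) and re-check the count.
cut -f1-10 "$workdir/sample.txt" > "$workdir/sample.genePred"
awk -F'\t' '{ print NF }' "$workdir/sample.genePred" | sort -u
```

If the first check prints more than one number, the rows are not uniform and the file is worth inspecting before conversion.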
+==Use genePredToGtf with a downloaded genePred table==
+You can directly download a table (for example, the knownGene table), which will be in genePred format.
+You can then use this local file as input for the genePredToGtf conversion.
+ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz
-ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz
 The SQL structure:
-ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.sql
+ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.sql
 As noted in the usage message, the file can be used with the command in place of the database table specification.
@@ Line 53: / Line 148: @@
 has extra columns after the exonEnds column.  Therefore, use cut to extract just the columns for genePred:
   $ zcat knownGene.txt.gz | cut -f1-10 | genePredToGtf file stdin knownGene.gtf
-This is not necessary in the case of using the database table since the command can determine from the table
-structure which columns to use.
-==bed format gene tracks==
+===Example with downloaded refGene.txt.gz===
+'''Here are detailed steps for converting a local hg19 refGene table (in genePred format) to GTF.'''
+. Download your gene set of interest for hg19.
+For this example, I'll use the refGene table, but you can choose other gene sets, such as the knownGene table from the "UCSC Genes" track.
+<pre>
+rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ./
+</pre>
+. Unzip
+<pre>
+gzip -d refGene.txt.gz
+</pre>
+. Remove the first "bin" column:
+<pre>
+cut -f 2- refGene.txt > refGene.input
+</pre>
+. Convert to gtf:
+<pre>
+genePredToGtf file refGene.input hg19refGene.gtf
+</pre>
+. Sort output by chromosome and coordinate
+<pre>
+cat hg19refGene.gtf  | sort -k1,1 -k4,4n > hg19refGene.gtf.sorted
+</pre>
+Example output for  hg19refGene.gtf.sorted:
+<pre>
+$ head hg19refGene.gtf.sorted
+chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316973"; exon_number "7"; exon_id "NM_001316973.7"; gene_name "LZIC";
+chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316975"; exon_number "7"; exon_id "NM_001316975.7"; gene_name "LZIC";
+chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316976"; exon_number "5"; exon_id "NM_001316976.5"; gene_name "LZIC";
+chr1 refGene.input exon 10002682 10002840 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "7"; exon_id "NM_032368.7"; gene_name "LZIC";
+chr1 refGene.input CDS 10002739 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
+chr1 refGene.input exon 10002739 10002840 . - . gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
+chr1 refGene.input start_codon 10002791 10002793 . - 0 gene_id "LZIC"; transcript_id "NM_001316974"; exon_number "7"; exon_id "NM_001316974.7"; gene_name "LZIC";
+chr1 refGene.input exon 10002981 10003083 . + . gene_id "NMNAT1"; transcript_id "NM_001297778"; exon_number "1"; exon_id "NM_001297778.1"; gene_name "NMNAT1";
+chr1 refGene.input transcript 10002981 10045556 . + . gene_id "NMNAT1"; transcript_id "NM_001297778";  gene_name "NMNAT1";
+chr1 refGene.input exon 10003307 10003485 . - . gene_id "LZIC"; transcript_id "NM_032368"; exon_number "8"; exon_id "NM_032368.8"; gene_name "LZIC";
+</pre>
+==Using kent commands with the public database server==
+To use the kent commands with the public database server, add this four-line file ".hg.conf" to your home directory. One way is to run the following echo commands from your home directory, using >> to append each line to .hg.conf:
+ echo db.host=genome-mysql.soe.ucsc.edu >> .hg.conf
+ echo db.user=genomep >> .hg.conf
+ echo db.password=password >> .hg.conf
+ echo central.db=hgcentral >> .hg.conf
+Check your work with the following command:
+ cat $HOME/.hg.conf
+Download the genePredToGtf utility from the command line:
+ #for MacOSX
+ wget http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/genePredToGtf
+ #for Linux
+ wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf
+Set the permissions:
+  chmod 600 .hg.conf
+  chmod +x genePredToGtf
+Now you can use the genePredToGtf command to pull gene files directly from the UCSC public database and convert them to GTF format.
+For example, fetch NCBI's refGene track from hg38 and save to a local file named refGene.gtf:
+  ./genePredToGtf hg38 refGene refGene.gtf
+Note: The GTF files on the UCSC download server were created using the -utr flag, which adds 5UTR and 3UTR features to the output:
+  ./genePredToGtf -utr hg38 refGene refGene.gtf
+==Bed format gene tracks (convert bed > genePred > GTF) ==
 Some gene tracks are in a bed format in the database, perhaps with extra columns past the
 standard bed format.  In this case, extract the standard bed columns, convert it
 to a genePred and then to a gtf.  For example wgRna:
-  mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N \
+  mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N \
     -e "select chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd from wgRna;" hg19 \
      | bedToGenePred stdin stdout | genePredToGtf file stdin wgRna.gtf
+Note that the file-based methods above required cutting columns 1-10 to remove the extra trailing columns. That cut is not necessary when genePredToGtf reads a database table directly, since the command can determine from the table structure which columns to use.
+==Get a genePred file from UCSC MySQL public databases, then convert to GTF==
+Information about MySQL access: http://genome.ucsc.edu/goldenPath/help/mysql.html
+MySQL query example to get a genePred file:
+<pre>
+$ mysql --host=genome-mysql.soe.ucsc.edu --user=genome -Ne "select a.name, a.chrom, a.strand, a.txStart, a.txEnd,\
+a.cdsStart, a.cdsEnd, a.exonCount, a.exonStarts, a.exonEnds, 0 as score, b.geneSymbol from knownGene a join \
+kgXref b on a.name=b.kgID" hg19 > hg19.genePred
+</pre>
+Next, using the genePredToGtf utility:
+<pre>
+genePredToGtf file hg19.genePred hg19.knownGene.gtf
+</pre>
+The genePred file from the query (hg19.genePred) will look like this:
+<pre>
+uc001aaa.3    chr1    +    11873    14409    11873    11873    3    11873,12612,13220,    12227,12721,14409,    0    DDX11L1
+uc010nxr.1    chr1    +    11873    14409    11873    11873    3    11873,12645,13220,    12227,12697,14409,    0    DDX11L1
+uc010nxq.1    chr1    +    11873    14409    12189    13639    3    11873,12594,13402,    12227,12721,14409,    0    DDX11L1
+</pre>
+Resulting GTF output:
+<pre>
+chr1    hg19.genePred    transcript    11874    14409    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3";  gene_name "DDX11L1";
+chr1    hg19.genePred    exon    11874    12227    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "1"; exon_id "uc001aaa.3.1"; gene_name "DDX11L1";
+chr1    hg19.genePred    exon    12613    12721    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "2"; exon_id "uc001aaa.3.2"; gene_name "DDX11L1";
+chr1    hg19.genePred    exon    13221    14409    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "3"; exon_id "uc001aaa.3.3"; gene_name "DDX11L1";
+</pre>
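Note that the start coordinates differ by one between the two outputs above (11873 in the genePred file, 11874 in the GTF): genePred start positions are 0-based, while GTF coordinates are 1-based, so the end positions stay the same. Below is a minimal awk sketch of that shift, using the first genePred row above as inline sample data. The variable names row and exons are only illustrative, and this is not how genePredToGtf itself is implemented.

```shell
#!/bin/sh
# First genePred row from the example above, tab-separated.
row=$(printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\t0\tDDX11L1')

# Split exonStarts (column 9) and exonEnds (column 10) on commas, then print
# each exon with a 1-based start and an unchanged end, as in the GTF rows.
exons=$(printf '%s\n' "$row" | awk -F'\t' '{
    n = split($9, starts, ",")
    split($10, ends, ",")
    for (i = 1; i <= n; i++)
        if (starts[i] != "")          # skip the empty item after the trailing comma
            print $2, starts[i] + 1, ends[i]
}')
printf '%s\n' "$exons"
```

The printed start and end pairs should line up with the exon rows in the GTF result above.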
+==The opposite direction, GTF to GenePred==
+There is a utility for this as well: [http://hgdownload.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred gtfToGenePred]. Here are some examples of using this utility:
+<pre>
+$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/MANE.GRCh38.v0.5.select_ensembl_genomic.gtf.gz
+$ gunzip MANE.GRCh38.v0.5.select_ensembl_genomic.gtf.gz
+$ head -n1 MANE.GRCh38.v0.5.select_ensembl_genomic.gtf
+chr1	ensembl_havana	gene	944203	959309	.	-	.	gene_id "ENSG00000188976.11"; gene_type "protein_coding"; gene_name "NOC2L";
+# BASIC USAGE
+$ gtfToGenePred MANE.GRCh38.v0.5.select_ensembl_genomic.gtf MANE.GRCh38.v0.5.select_ensembl_genomic.genePred
+$ head -1 MANE.GRCh38.v0.5.select_ensembl_genomic.genePred
+ENST00000327044.7	chr1	-	944202	959256	944693	959240	19	944202,945056,945517,946172,946401,948130,948489,951126,951999,952411,953174,953781,954003,955922,956094,956893,957098,958928,959214,	944800,945146,945653,946286,946545,948232,948603,951238,952139,952600,953288,953892,954082,956013,956215,957025,957273,959081,959256,
+# EXTENDED USAGE
+$ gtfToGenePred -genePredExt MANE.GRCh38.v0.5.select_ensembl_genomic.gtf MANE.GRCh38.v0.5.select_ensembl_genomic.genePredExt
+$ head -1 MANE.GRCh38.v0.5.select_ensembl_genomic.genePredExt
+ENST00000327044.7	chr1	-	944202	959256	944693	959240	19	944202,945056,945517,946172,946401,948130,948489,951126,951999,952411,953174,953781,954003,955922,956094,956893,957098,958928,959214,	944800,945146,945653,946286,946545,948232,948603,951238,952139,952600,953288,953892,954082,956013,956215,957025,957273,959081,959256,	ENSG00000188976.11	cmpl	cmpl	1,1,0,0,0,0,0,2,0,0,0,0,2,1,0,0,2,2,0,</pre>
+For a full list of options available to gtfToGenePred, run the program with no arguments:
+<pre>
+gtfToGenePred - convert a GTF file to a genePred
+usage:
+   gtfToGenePred gtf genePred
+options:
+     -genePredExt - create a extended genePred, including frame
+      information and gene name
+     -allErrors - skip groups with errors rather than aborting.
+      Useful for getting infomation about as many errors as possible.
+     -ignoreGroupsWithoutExons - skip groups contain no exons rather than
+      generate an error.
+     -infoOut=file - write a file with information on each transcript
+     -sourcePrefix=pre - only process entries where the source name has the
+      specified prefix.  May be repeated.
+     -impliedStopAfterCds - implied stop codon in after CDS
+     -simple    - just check column validity, not hierarchy, resulting genePred may be damaged
+     -geneNameAsName2 - if specified, use gene_name for the name2 field
+      instead of gene_id.
+     -includeVersion - it gene_version and/or transcript_version attributes exist, include the version
+      in the corresponding identifiers.
+</pre>
+==Mailing List Resources about GTF==
+Please also try searching our mailing list for previous answers that may be of interest.
+* Here is an example of searching for [https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!searchin/genome/genePredToGtf%7Csort:date genePredToGtf].
+* Here is an example answer about [https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/kbCPdiLwbC4/YtdtXxrWBAAJ using a MySQL query] to build a new genePred format with a query using the kgXref (knownGene cross reference) table.
+==Scripting examples==
+===Scripting to add IDs and other fields into the header of an .fa sequence file===
+[https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/R8CstMtiJZM/TFeA7iIYAQAJ Link to an archived help forum topic]
+===Perl script to find and replace the "gene ID" with the Ensembl ID, which is named "transcript_id."===
+The example below shows output from the command ''genePredToGtf hg38 wgEncodeGencodeCompV24 hg38FileTest.gtf'', before running the script:
+<pre>
+chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";
+chr1 wgEncodeGencodeCompV24 exon 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; exon_number "1"; exon_id "ENST00000619216.1.1"; gene_name "MIR6859-1";
+chr1 wgEncodeGencodeCompV24 transcript 29554 31097 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; gene_name "RP11-34P13.3";
+chr1 wgEncodeGencodeCompV24 exon 29554 30039 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "1"; exon_id "ENST00000473358.1.1"; gene_name "RP11-34P13.3";
+chr1 wgEncodeGencodeCompV24 exon 30564 30667 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "2"; exon_id "ENST00000473358.1.2"; gene_name "RP11-34P13.3";
+</pre>
+The Perl one-liner, run on the file from the previous step:
+<pre>
+perl -wpe 's/gene_id "[^"]+"; transcript_id "([^"]+)"/gene_id "$1"; transcript_id "$1"/;' hg38FileTest.gtf
+</pre>
+Example output after running it:
+<pre>
+chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "ENST00000619216.1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";
+chr1 wgEncodeGencodeCompV24 exon 17369 17436 . - . gene_id "ENST00000619216.1"; transcript_id "ENST00000619216.1"; exon_number "1"; exon_id "ENST00000619216.1.1"; gene_name "MIR6859-1";
+chr1 wgEncodeGencodeCompV24 transcript 29554 31097 . + . gene_id "ENST00000473358.1"; transcript_id "ENST00000473358.1"; gene_name "RP11-34P13.3";
+chr1 wgEncodeGencodeCompV24 exon 29554 30039 . + . gene_id "ENST00000473358.1"; transcript_id "ENST00000473358.1"; exon_number "1"; exon_id "ENST00000473358.1.1"; gene_name "RP11-34P13.3";
+chr1 wgEncodeGencodeCompV24 exon 30564 30667 . + . gene_id "ENST00000473358.1"; transcript_id "ENST00000473358.1"; exon_number "2"; exon_id "ENST00000473358.1.2"; gene_name "RP11-34P13.3";
+</pre>
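On systems without perl, the same substitution can be sketched with sed, and the result checked with grep. This is a swapped-in equivalent, not the script from the mailing list. The two sample lines are copied from the example above, and the file names before.gtf and after.gtf are hypothetical.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Two records taken from the genePredToGtf example output above.
cat > "$tmp/before.gtf" <<'EOF'
chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 exon 29554 30039 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "1"; exon_id "ENST00000473358.1.1"; gene_name "RP11-34P13.3";
EOF

# sed equivalent of the perl substitution: copy the transcript_id value into gene_id.
sed 's/gene_id "[^"]*"; transcript_id "\([^"]*\)"/gene_id "\1"; transcript_id "\1"/' "$tmp/before.gtf" > "$tmp/after.gtf"

# Count lines where gene_id and transcript_id now carry the same value.
matched=$(grep -c 'gene_id "\([^"]*\)"; transcript_id "\1"' "$tmp/after.gtf")
echo "records with matching IDs: $matched"
```

If the count is lower than the number of records in the file, some lines did not match the expected attribute order and were left unchanged.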