DoEnsGeneUpdate
Ensembl Gene updates for the UCSC genome browser
Source File
Usage
usage: doEnsGeneUpdate.pl -ensVersion=NN <db>.ensGene.ra
required options:
  -ensVersion=NN - must specify desired Ensembl version
                 - possible values: 48, 47
  <db>.ensGene.ra - configuration file with database and other options
Steps of script
- download - fetches the gtf and peptide files from the Ensembl FTP site. Optionally, also fetches the assembly.txt and seq_region.txt MySQL table dumps for GeneScaffold coordinate translation.
- process - transforms the gtf file into a UCSC genePred file with appropriate coordinate transformations
- load - loads tables ensGene, ensGtp, ensPep, and optionally ensemblGeneScaffold
- cleanup - removes temporary files
- makeDoc - prints out what would be in the make doc and does a sanity check on the tables
download
Files are fetched from ftp://ftp.ensembl.org/pub/
Version 48 GTF files are under that URL plus: release-48/homo_sapiens/Homo_sapiens.NCBI36.48.gtf.gz
Peptide files are under that URL plus: release-48/homo_sapiens/pep/Homo_sapiens.NCBI36.48.pep.all.fa.gz
When translating from GeneScaffold coordinates, the two MySQL table dumps, assembly.txt.gz and seq_region.txt.gz, are under that URL plus: release-48/mysql/homo_sapiens_core_48_36j/
Beware, Ensembl may change these locations in the future. These specific file names and URL paths are encoded in the Perl module src/hg/utils/automation/EnsGeneAutomate.pm, keyed by an Ensembl version number and a UCSC database name. The Perl script /cluster/bin/scripts/ensVersions can be used to examine the list of possible Ensembl versions for each UCSC database name. Currently versions 47 and 48 are available. When Ensembl releases a new version, this Perl module must be updated to encode the new version's names.
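As a rough illustration of how the human version 48 paths above are put together, here is a minimal shell sketch. The function name ens_urls and its arguments are hypothetical, and the path pattern is only inferred from the two examples listed above. The real script looks the names up in EnsGeneAutomate.pm, and other versions or organisms may not follow this pattern.

```shell
#!/bin/sh
# Sketch only: rebuild the GTF and peptide URLs shown above for human, v48.
# ens_urls and its argument layout are hypothetical, not part of the
# UCSC scripts; the path pattern is an assumption based on those two examples.
ens_base="ftp://ftp.ensembl.org/pub"

# usage: ens_urls VERSION SPECIES_DIR FILE_PREFIX
ens_urls() {
    version=$1
    species_dir=$2
    prefix=$3
    # GTF file
    echo "$ens_base/release-$version/$species_dir/$prefix.$version.gtf.gz"
    # peptide file
    echo "$ens_base/release-$version/$species_dir/pep/$prefix.$version.pep.all.fa.gz"
}

ens_urls 48 homo_sapiens Homo_sapiens.NCBI36
# prints:
# ftp://ftp.ensembl.org/pub/release-48/homo_sapiens/Homo_sapiens.NCBI36.48.gtf.gz
# ftp://ftp.ensembl.org/pub/release-48/homo_sapiens/pep/Homo_sapiens.NCBI36.48.pep.all.fa.gz
```

The MySQL dump directory is left out of the sketch because its name (homo_sapiens_core_48_36j) carries an extra suffix that cannot be derived from the version and assembly name alone.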
process
This sequence of events is driven by the specified options in the <db>.ensGene.ra configuration file.
liftRandoms yes - currently only on Mouse - mm9. Utilizes the ctgPos table to construct a lift file to lift Ensembl contig coordinates to UCSC chr*_random coordinates.
nameTranslation <"sed commands"> - almost always required. A series of sed commands to translate Ensembl chrom names, which are usually just numbers, into UCSC chrom naming scheme which is usually chr<N>. Unusual transformations can also take place here. Names which UCSC does not have can be filtered out. Example, from chicken galGal3:
nameTranslation "s/^\([0-9EWXYZ][0-9]*\)/chr\1/; s/^MT/chrM/; s/^Un/chrUn/"
To see all examples:
$ grep ^nameTranslation /cluster/data/*/*.ensGene.ra
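To see what such a translation does, the galGal3 sed commands above can be tried on a few sample Ensembl-style names. This is a small sketch: the helper name translate_names and the sample input names are made up for illustration, and the real script applies the commands to the GTF file rather than to a bare list of names.

```shell
#!/bin/sh
# Sketch only: apply the galGal3 nameTranslation sed commands to sample names.
# translate_names is a hypothetical helper, not part of doEnsGeneUpdate.pl.
translate_names() {
    sed -e 's/^\([0-9EWXYZ][0-9]*\)/chr\1/; s/^MT/chrM/; s/^Un/chrUn/'
}

# Sample chrom names, one per line (illustrative input).
printf '1\n28\nW\nZ\nMT\nUn\n' | translate_names
# prints:
# chr1
# chr28
# chrW
# chrZ
# chrM
# chrUn
```

Each sed command is anchored with ^, so only the start of the line (the chrom name column in a GTF file) is rewritten.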
haplotypeLift <path to lift file.lft> - currently only on Human, hg18. Translate Ensembl haplotype full chrom coordinates into UCSC simple haplotype coordinates.
After the above potential lifts and name translations, the Ensembl GTF file is translated to a UCSC genePred file with gtfToGenePred and options -infoOut=infoOut.txt and -genePredExt. The infoOut.txt file is used to extract the peptide and other gene names which may be in the GTF file.
geneScaffolds yes - on the scaffold based assemblies, Ensembl does gene predictions by mapping genes from their own private "GeneScaffold" coordinate system onto the scaffolds. During this mapping, gene predictions can spread out over multiple scaffolds and even reverse the order of exons within a scaffold. The Ensembl MySQL tables assembly and seq_region are used to determine the mapping of GeneScaffolds to UCSC scaffolds. This procedure adds a new track to the UCSC genome browser showing the mapping of the GeneScaffolds to scaffolds. This information could be useful to genome assembly teams. A <db>.ensGene.lft file is built via ensGeneScaffolds.pl and used with the liftAcross command. A GeneScaffold does not necessarily map completely to scaffolds, so not all parts of a gene necessarily map to a scaffold. Genes can therefore be incomplete when viewed in their scaffold context.
liftUp <path to liftUp file> - one potential lift after GeneScaffold conversion to change Ensembl names into UCSC gene names. Currently used on Fugu fr2, Tree Shrew tupBel1 and Bushbaby otoGar1. This is a simple one-to-one name translation where the UCSC names have a pattern, but one that cannot be handled by the previously mentioned sed nameTranslation.
load
- optionally load ensemblGeneScaffold table
- ensGene table loaded from genePred file
- ensPep table loaded from Ensembl peptide file
- verify foreign key relationship between ensGene and ensPep name columns
- optionally load knownToEnsembl table
- insert row in hgFixed.trackVersion for version recording
After the process step, all files are ready to load. Optionally, the ensemblGeneScaffold track is loaded if GeneScaffolds were used. The genePred file is loaded into ensGene, peptides into ensPep, and the name relations (gene, transcript, other name) into ensGtp. The name correspondence/foreign key relationship between the ensGene and ensPep tables is verified to meet the 96% minimum all.joiner requirement.
cleanup
Not much going on here: some of the temp files created during loading are removed.
makeDoc
The <db>.ensGene.ra files were created the first time this process was run and continue to exist in /cluster/data/<db>/<db>.ensGene.ra. They can be used to generate the make doc text entry to document the procedure that would be used to run this process again. This step of the procedure will output those commands, looking something like the following, for ci2, Ciona Intestinalis:
############################################################################
# Adding Ensembl Genes (DONE - 2008-02-22 - Hiram)
ssh kkstore02
cd /cluster/data/ci2
cat << '_EOF_' > ci2.ensGene.ra
# required db variable
db ci2
# optional nameTranslation, the sed command that will transform
# Ensemble names to UCSC names. With quotes just to make sure.
nameTranslation "s/^\([0-9][pq]\)/chr0\1/; s/^\([0-9][0-9][pq]\)/chr\1/; "
'_EOF_'
# << happy emacs
doEnsGeneUpdate.pl -ensVersion=48 ci2.ensGene.ra
A copy of all these entries has been saved in the source tree in src/hg/makeDb/doc/makeEnsembl.txt