DoEnsGeneUpdate: Difference between revisions

Revision as of 23:02, 3 March 2008

Ensembl Gene updates for the UCSC genome browser

Steps of script

download - fetches the GTF and peptide files from the Ensembl FTP site and, optionally, the assembly.txt and seq_region.txt MySQL table dumps used for GeneScaffold coordinate translation.
process - transforms the GTF file into a UCSC genePred file, applying the appropriate coordinate transformations.
load - loads the tables ensGene, ensGtp, ensPep, and optionally ensemblGeneScaffold.
cleanup - removes temporary files.
makeDoc - prints what would go in the make doc and runs a sanity check on the tables.

download

Files are fetched from ftp://ftp.ensembl.org/pub/

Version 48 GTF files are under that URL plus: release-48/homo_sapiens/Homo_sapiens.NCBI36.48.gtf.gz

Peptide files under that URL plus: release-48/homo_sapiens/pep/Homo_sapiens.NCBI36.48.pep.all.fa.gz

When translating from GeneScaffold coordinates, the two MySQL table dumps, assembly.txt.gz and seq_region.txt.gz, are also fetched from under that URL plus: release-48/mysql/homo_sapiens_core_48_36j/
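The paths above can be assembled from a release number, roughly as in the following sketch. The function name ens_urls is hypothetical, the species and assembly strings are hard-coded for human NCBI36 only, and nothing is downloaded; the real logic lives in EnsGeneAutomate.pm and may differ.

```shell
# Sketch only: print the Ensembl FTP locations described above for one release.
# ens_urls is a hypothetical helper, not part of the UCSC scripts.
ens_urls() {
  release=$1
  base='ftp://ftp.ensembl.org/pub'
  # GTF gene predictions
  echo "$base/release-$release/homo_sapiens/Homo_sapiens.NCBI36.$release.gtf.gz"
  # Peptide sequences
  echo "$base/release-$release/homo_sapiens/pep/Homo_sapiens.NCBI36.$release.pep.all.fa.gz"
  # MySQL table dumps, only needed for GeneScaffold translation
  echo "$base/release-$release/mysql/homo_sapiens_core_${release}_36j/assembly.txt.gz"
  echo "$base/release-$release/mysql/homo_sapiens_core_${release}_36j/seq_region.txt.gz"
}

ens_urls 48
```

Each printed line could then be handed to an FTP-capable fetch tool. The "36j" suffix is copied from the version 48 path and is not guaranteed to hold for other releases.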

Beware: Ensembl may change these locations in the future. The specific file names and URL paths are encoded in the Perl module src/hg/utils/automation/EnsGeneAutomate.pm, keyed by Ensembl version number and UCSC database name. The Perl script /cluster/bin/scripts/ensVersions lists the available Ensembl versions for each UCSC database name. Currently versions 47 and 48 are available. When Ensembl releases an update, this Perl module needs to be updated to encode the new version names.

process

The processing steps are driven by the options specified in the <db>.ensGene.ra configuration file.

liftRandoms yes - currently used only for mouse, mm9. Uses the ctgPos table to construct a lift file that lifts Ensembl contig coordinates to UCSC chr*_random coordinates.
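As a rough illustration of how such a lift file could be derived, the sketch below turns ctgPos-style rows into five-column lift lines (offset, old name, old size, new name, new size). It assumes the ctgPos column order contig, size, chrom, chromStart, chromEnd and takes chromosome sizes from a separate two-column file. The contig names and sizes are made-up sample data, make_random_lift is a hypothetical helper, and the actual script may build the file differently.

```shell
# Hypothetical sample data standing in for chromosome sizes and the ctgPos table.
tmp=$(mktemp -d)
printf 'chr1_random\t51500\nchr2\t800\n' > "$tmp/chrom.sizes"
printf 'NT_000001\t1000\tchr1_random\t0\t1000\n'     >  "$tmp/ctgPos.tab"
printf 'NT_000002\t500\tchr1_random\t51000\t51500\n' >> "$tmp/ctgPos.tab"
printf 'NT_000003\t800\tchr2\t0\t800\n'              >> "$tmp/ctgPos.tab"

# make_random_lift <chrom.sizes> <ctgPos.tab>
# Emits one lift line per contig placed on a chr*_random chromosome:
#   offset  contigName  contigSize  chromName  chromSize
make_random_lift() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { chromSize[$1] = $2; next }   # first file: remember chrom sizes
    $3 ~ /_random$/ { print $4, $1, $2, $3, chromSize[$3] }
  ' "$1" "$2"
}

make_random_lift "$tmp/chrom.sizes" "$tmp/ctgPos.tab"
```

Contigs on regular chromosomes (chr2 in the sample) are skipped, since only the random chromosomes need lifting. The temporary directory in $tmp can be removed once the lift file has been inspected.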

nameTranslation <"sed commands"> - almost always required. A series of sed commands that translate Ensembl chrom names, which are usually just numbers, into the UCSC chrom naming scheme, which is usually chr<N>. Unusual transformations can also be handled here, and names that UCSC does not have can be filtered out. An example from chicken, galGal3:

nameTranslation "s/^\([0-9EWXYZ][0-9]*\)/chr\1/; s/^MT/chrM/; s/^Un/chrUn/"

To see all examples:

$ grep ^nameTranslation /cluster/data/*/*.ensGene.ra
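The galGal3 expression shown above can be tried on a few names before it goes into the .ra file. In this sketch the sample input names and the helper translate_names are illustrative only; the sed expression itself is copied from the example.

```shell
# The galGal3 nameTranslation expression, copied from the example above.
nameTranslation='s/^\([0-9EWXYZ][0-9]*\)/chr\1/; s/^MT/chrM/; s/^Un/chrUn/'

# translate_names is a hypothetical helper: it rewrites the start of each line,
# which is where the chrom name sits in a GTF file.
translate_names() {
  sed -e "$nameTranslation"
}

# Sample Ensembl-style names, one per line.
printf '1\n28\nZ\nW\nMT\nUn\n' | translate_names
```

Because every expression is anchored with ^, only the first column of a GTF line is affected and the rest of the line passes through unchanged.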

haplotypeLift <path to lift file.lft> - currently used only for human, hg18. Translates Ensembl haplotype full-chromosome coordinates into UCSC simple haplotype coordinates.

After any of the above lifts and name translations have been applied, the Ensembl GTF file is translated to a UCSC genePred file using gtfToGenePred with the options -infoOut=infoOut.txt and -genePredExt. The infoOut.txt file is used to extract the peptide names and other gene names that may be in the GTF file.
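The conversion step described above would look something like the command assembled below. The input and output file names are hypothetical, and gtf_to_genepred_cmd is an illustrative helper that only prints the command so it can be checked without the UCSC tools installed.

```shell
# Sketch: print the gtfToGenePred invocation with the two options named above.
# $1 = name-translated GTF file, $2 = genePred output file (both hypothetical).
gtf_to_genepred_cmd() {
  printf 'gtfToGenePred -infoOut=infoOut.txt -genePredExt %s %s\n' "$1" "$2"
}

gtf_to_genepred_cmd ens.chr.gtf ensGene.gp
```

With gtfToGenePred on the PATH, the printed line can be run as is. The simple printf quoting assumes file names without spaces.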

geneScaffolds yes - on scaffold-based assemblies, Ensembl makes gene predictions by mapping genes from its own private "GeneScaffold" coordinate system onto the scaffolds. During this mapping, a gene prediction can spread over multiple scaffolds, and the order of exons within a scaffold can even be reversed. The Ensembl MySQL tables assembly and seq_region are used to determine the mapping of GeneScaffolds to UCSC scaffolds. This procedure adds a new track to the UCSC genome browser showing the mapping of the GeneScaffolds to scaffolds, which could be useful to genome assembly teams. A <db>.ensGene.lft file is built by ensGeneScaffolds.pl and used with the liftAcross command.

load
cleanup
makeDoc

