DoEnsGeneUpdate

From genomewiki

Ensembl Gene updates for the UCSC genome browser

Steps of script

  1. download - fetch the GTF and peptide files from the Ensembl FTP site and, optionally,
    the assembly.txt and seq_region.txt MySQL table dumps used for GeneScaffold coordinate translation
  2. process - transform the GTF file into a UCSC genePred file, applying the appropriate coordinate transformations
  3. load - load the tables ensGene, ensGtp, ensPep, and optionally ensemblGeneScaffold
  4. cleanup - remove temporary files
  5. makeDoc - print what would go in the make doc and run a sanity check on the tables

download

Files are fetched from ftp://ftp.ensembl.org/pub/

Version 48 GTF files are under that URL plus: release-48/homo_sapiens/Homo_sapiens.NCBI36.48.gtf.gz

Peptide files are under that URL plus: release-48/homo_sapiens/pep/Homo_sapiens.NCBI36.48.pep.all.fa.gz

When translating from GeneScaffold coordinates, the two MySQL table dumps, assembly.txt.gz and seq_region.txt.gz, are also needed. They are under that URL plus: release-48/mysql/homo_sapiens_core_48_36j/
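The version 48 locations above can be assembled from a few components. The following is only a sketch in Python, not the code of the script itself: the function name ensembl_paths and its parameters are hypothetical, and the layout is assumed to match the version 48 human paths quoted above, which may not hold for other releases or species.

```python
# Base of the Ensembl FTP site, as given in the text above.
BASE_URL = "ftp://ftp.ensembl.org/pub/"


def ensembl_paths(version, species, assembly, core_db):
    """Build download URLs for one release (hypothetical helper).

    Assumes the directory layout of the version 48 human files;
    other releases or species may be laid out differently.
    """
    release = f"release-{version}"
    species_dir = species.lower()                # e.g. homo_sapiens
    prefix = f"{species}.{assembly}.{version}"   # e.g. Homo_sapiens.NCBI36.48
    mysql_dir = f"{BASE_URL}{release}/mysql/{core_db}/"
    return {
        "gtf": f"{BASE_URL}{release}/{species_dir}/{prefix}.gtf.gz",
        "pep": f"{BASE_URL}{release}/{species_dir}/pep/{prefix}.pep.all.fa.gz",
        # Only needed when translating from GeneScaffold coordinates.
        "assembly": mysql_dir + "assembly.txt.gz",
        "seq_region": mysql_dir + "seq_region.txt.gz",
    }


paths = ensembl_paths(48, "Homo_sapiens", "NCBI36", "homo_sapiens_core_48_36j")
for name, url in paths.items():
    print(f"{name}: {url}")
```

The core database name (homo_sapiens_core_48_36j) is passed in explicitly rather than derived, because its suffix is not predictable from the version and assembly alone.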

Beware: Ensembl may change these locations in the future. These specific file names and URL paths are encoded in the Perl module src/hg/utils/automation/EnsGeneAutomate.pm, keyed by Ensembl version number and UCSC database name. The Perl script /cluster/bin/scripts/ensVersions can be used to list the available Ensembl versions against the UCSC database names. Currently versions 47 and 48 are available. When Ensembl releases an update, this Perl module must be updated to encode the file names for the new version.
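To illustrate the idea of a table keyed by Ensembl version and UCSC database name, here is a minimal Python sketch. It does not reproduce the contents or structure of EnsGeneAutomate.pm: the names ENS_FILES, lookup and versions_for are hypothetical, and "hg18" as the UCSC database name for the NCBI36 assembly is an assumption made for the example. Only the version 48 entry is filled in, since those are the only paths given on this page.

```python
# Hypothetical table: (Ensembl version, UCSC database name) -> paths
# relative to ftp://ftp.ensembl.org/pub/
# "hg18" is an assumed UCSC database name, used for illustration only.
ENS_FILES = {
    (48, "hg18"): {
        "gtf": "release-48/homo_sapiens/Homo_sapiens.NCBI36.48.gtf.gz",
        "pep": "release-48/homo_sapiens/pep/Homo_sapiens.NCBI36.48.pep.all.fa.gz",
        "mysql": "release-48/mysql/homo_sapiens_core_48_36j/",
    },
}


def lookup(version, db):
    """Return the recorded paths, or fail clearly for an unknown pair."""
    try:
        return ENS_FILES[(version, db)]
    except KeyError:
        raise KeyError(
            f"no Ensembl file names recorded for version {version}, database {db}"
        ) from None


def versions_for(db):
    """List the Ensembl versions recorded for one UCSC database name."""
    return sorted(v for (v, d) in ENS_FILES if d == db)


print(versions_for("hg18"))
print(lookup(48, "hg18")["gtf"])
```

With a table like this, supporting a new Ensembl release means adding one entry; a request for a version that has not been added fails immediately instead of attempting to download from a guessed path.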

process

load

cleanup

makeDoc