GenbankAlignments

From Genecats
Jump to navigationJump to search

The Genbank Alignment Process

This is a description of the behind the scenes parts of the Genbank alignment process. If you want doc on how to add a new species to the list of aligned assemblies you want to go here.

Overview

The genbank alignment process aligns RNA and EST sequences from NCBI, as well as the RefSeq mRNA's to (almost) all the assemblies that UCSC supports. The process is divided into roughly five parts: download, process, align, database load, and dissemination. The first four parts and the beginning of the fifth happen on the genbank-101 machine, the dissemination part includes hgwdev, hgwbeta, and then to our official mirrors (RR, euro, japan).

GenBank/RefSeq Update Goals

  • Incremental update of mRNAs and ESTs for multiple species and assemblies based on daily updates from NCBI.
  • Automatically run from cron, possibly every night.
  • Only require manual intervention on non-recoverable errors
  • Incremental across GenBank releases; don't force a full realignment every quarter.
  • Allow removal of older genbank full releases (and still not force a full realignment).
  • Avoid corruption of disk files and databases.
  • Recover from failures state, automatically when possible, making manual recover easy.
  • Allow restarting failed steps without restarting the entire process.
  • Don't require the process to be run at defined intervals. When a run is done, data files will be updated to reflect the current state of the NCBI repository.

GenBank/RefSeq Download Step

The download step retrieves files from the NCBI ftp area and store the results in the download/ directory. Algorithm Run the gbDownloadStep script, which:

  • Check if the GenBank release version has change from last version.
  • If version changed, create a new version directory and download the full release and non-cumulative daily files.
  • If the version is unchanged, download only the new, non-cumulative daily files.
  • Downloads RefSeq files, similar to GenBank download

Note that this process isn't mirroring; it doesn't overwrite existing files. This minimizes the danger of leaving the data in an indeterminate state. Only the required subset of files are downloaded.

Downloads Directory structure

The directory structure for GenBank and RefSeq are a subset of the directories at the NCBI ftp site. Release version numbers are added to the database directory names to allow keeping multiple versions. Now that RefSeq is doing versioned releases, it is handled in the same manner as GenBank on an independent releases cycle.

$gbRoot/data/download/ - downloaded files from NCBI ftp genbank/${ver}/

README.genbank - version number is parsed from text
gbrel.txt - release nodes
gb*.seq.gz - sequences, in flat-file format, grouped by division.
full.md5 - md5 checksums for files downloaded with the full releases.
daily-nc/
nc0825.flat.gz - flat-file update for a 08/25.
nc0825.flat.gz.md5 - md5 checksum for update file


refseq.${ver} release/complete/

complete*.rna.gbff.gz - flat-files containing full RefSeq mRNA .
full.md5 - md5 checksum of all files
daily/
rsnc.0101.2002.gbff.gz - Flat-file update for 01/01/2002.
rsnc.0101.2002.gbff.gz.md5 - md5 checksum

...

GenBank/RefSeq Data Processing Step

The data processing step extracts data from the downloaded GenBank files into a format that is ready for import into the database.

Run the gbProcessStep script, which:

Examines each full and daily download files to determine which files need to be created. For each set of source files, check to see if mrna.md5 and est.*.md5 files exist in the appropriate processed/ directory.
For each missing *.md5 file, run the gbProcessSeqs script, which:
Parse flat-files with gbToFaRa into data files that are used to update the browser databases. An index file (*.gbidx) is created to location the each sequence and version. All species remain grouped together; spliting by species at this step would generate a very large number of small files.
Checksum (md5) the data files. The checksum file serves as indicator that the task completed successfully.

Processed Directory structure

$gbRoot/data/processed/ - data extracted from the NCBI flat-files genbank.${ver}/

full/
mrna.ra.gz - meta-data for mRNAs
mrna.fa - fasta sequence data
mrna.gbidx - index file
mrna.md5 - checksums of all mRNA files
est.aa.ra.gz - files for ESTs accessions starting with AA (case insensitive).
est.aa.fa, est.aa.gbidx, est.aa.cksum
est.ab.ra.gz, est.ab.fa, est.ab.gbidx, est.ab.cksum
daily.${date}/
mrna.ra.gz, ...
est.aa.ra, ...
refseq.${ver}/
full/
mrna.ra.gz, ...
daily.${date}/
mrna.ra.gz,

Genbank index file A GenBank index file is a tab-seperated file in the format:

   acc version moddate organism

The name of the file is either mrna.gbidx or est.*.gbidx and is associated with the a *.ra or *.fa files of the same name. The columns are: acc - GenBank or RefSeq accession version - Version number, not including the accession moddate - Modification date, in 2002-22-08 format organism - Organism name

GenBank/RefSeq Alignment Step

This step is done for each browser database (species and assembly) that is being updated. Algorithm

  • Select the most current GenBank (and corresponding RefSeq) full releases from the processed/ directory.
  • If there is not a corresponding aligned/.../full directory for the release:
  • If there is a previous aligned release for this database, find the aligned sequences that have not changed since that release and the new, full release. Save these in a a temporary directory.
  • Copy new and changed sequences to temporary fasta files for alignment.
  • For each processed/.../daily.${ver}/ directory that does not have a completed alignment/ directory:
  • Copy new and changed sequences to temporary fasta files for alignment.
  • If the number of sequences requiring alignment exceeds some configured threshold, send e-mail requesting an alignment on the big cluster and stop the automated process. Normally, this should only be required when a new database is built.
  • If the number of sequences is below the threshold, run BLAT on the mini-cluster, using the parasol make facility.
  • Process the completed alignments:
  • Combine alignments migrated from the previous releases if pending
  • Building an index file
  • Checksum the files.
  • Do some sanity check on the alignment. Check that changed sequences continue to align, at least in most cases and that the number of aligned sequences increases.

Directory structure

  • $gbRoot/data/aligned/ - aligned files
  • genbank.${ver}/
  • ${db}/ - alignments for this genome database (e.g. hg12).
  • full/ - Alignments corrisponding to the full release. This is a combination alignments migrated from previous releases and new alignments.
  • mrna.native.psl.gz, mrna.native.oi.gz, mrna.native.alidx, mrna.native.md5
  • est.aa.native.psl.gz, est.aa.native.oi.gz, est.aa.native.alidx, est.aa.native.md5,
  • mrna.xeno.psl.gz, mrna.xeno.oi.gz, mrna.xeno.alidx, mrna.xeno.md5
  • est.aa.xeno.psl.gz, est.aa.xeno.oi.gz, est.aa.xeno.alidx, est.aa.xeno.md5,
  • daily.${date}/ - Alignments for sequence that were new or modified in the daily update.
  • refseq.${ver}/ - BLAT alignments for RefSeq, same structure as used for GenBank, with only native mRNAs.

Index file Two alignment index files are always created for the corrisponding processed.gbidx file, for native and xeno, the if there are no sequences to align. This supports easy checking for the alignment being completed. The file is a tab-seperated in the format:

   acc version numaligns

The name of the file is either mrna.alidx or est.*.alidx and is associated with the a *.psl file of the same name. The columns are:

  • acc - GenBank or RefSeq accession
  • version - Version number, not including the accession
  • numAligns - Count of the number of alignments for this accession.

GenBank/RefSeq Data Database Update Step

This step is done for each browser database (species and assembly) on each database server. When building a new database, this process is run on the master database (genbank), and then the database pushed to hgwdev.

Genbank Status Table

This table (gbStatus) is used to keep track of the current version of the Genbank data this is loaded in the database. While some of this information is redundant to the mrna table, this data is only used by by the update. This table is updated last, so that it can be used to record form failures during update.

The columns of the tables are:

  • acc - GenBank accession
  • version - GenBank version number.
  • modDate - last modified date.
  • type - the type of the entry: EST or mRNA
  • srcDb - source database: GenBank or RefSeq
  • gbSeq - id in gbSeq table.
  • numAligns - number of alignments of the accession in the approriate track
  • seqRelease - release version where the sequence was obtained
  • seqUpdate - update where sequence was obtained (date or full
  • metaRelease - release version where the metadata was obtained
  • metaUpdate - update where the metadata was obtained (date or full)
  • extRelease - release version containing the external file
  • extUpdate - update containing the external file (date or full)
  • time - time that this entry was inserted

Genbank Loaded Table

This table contains the releases, updates and partitions that have been loaded. For a given partition, the updates containing gbidx or alidx files are compared to the gbLoaded table. If all of the updates are loaded, there is no need to do any more checking. This saves loading the alidx files and querying the gbStatus and gbSeq tables.

The columns of the tables are:

  • srcDb - source database: GenBank or RefSeq
  • type - the type of the entry: EST or mRNA
  • loadRelease - release version
  • loadUpdate - update date or full
  • accPrefix - accession prefix for ESTs
  • time - time entry was added

Algorithm

Table updates are done the the following steps. This is designed to allow restarting the update process on a crash from any point. It prevents display of stale data if the update process aborts. There is a window where a sequence that has change will not be in the database. If the load process crashes, new sequences may be in the other tables, but not in the gbStatus table. To detect this, we check the new sequences against the gbSeq table. These are orphaned sequences that must first be removed before loading. new seqChanged metaChanged deleted orphaned

To minimize the memory require for the update, one partition of the date is loaded at a time. Partitions the RefSeq mRNAs, the GenBank mRNAs, and the GenBank ESTs split on the first two letters of the accession.

  • For each partition of the GenBank and RefSeq data:
  • Determine if this partition could have any data to load by comparing the processed gbidx files with the gbLoaded, skipping the partition if there is no missing updates.
  • Delete any accession in etc/ignore.idx from all relevant tables. Doing this upfront prevents a lot of complexity in other code.
  • Compare accession versions stored in gbStatus table to contents of the processed/ and aligned/ directories (gbIndex). classifying each accessions as:
  • new - Accession is not in gbStatus.
  • seqChanged - The sequence and metadata changed.
  • metaChanged - The metadata changed.
  • extChanged - The release containing the external sequence files has changed and the entry has not changed. This is used to migrate fasta file references to the latest release, to allow cleanup of older releases.
  • deleted - The accession is not in the gbIndex.
  • Check gbSeq table to see if contains any of the new entries, which become orphaned. These are sequences from a failed load.
  • Remove seqChanged, deleted, and orphaned from alignment and orientation information tables.
  • Remove deleted and orphaned accessions from the mrna table.
  • If this is RefSeq, remove deleted and orphaned accessions from the refSeqStatus and refLink tables.
  • Remove deleted and orphanedaccessions from the gbSeq table.
  • Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table. This must be the first table update so that orphans can be detected.
  • Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table.
  • If this is RefSeq, update rows for seqChanged and metaChanged accessions in the refSeqStatus and refLink tables and add new accessions to these tables.
  • Add new strings to the unique string tables (author, library, etc), No attempt is made to remove entries that will no longer be referenced.
  • Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table.
  • Update rows for seqChanged and metaChanged accessions in the mrna table. Add new accessions to the mrna table.
  • Add new and seqChanged rows to the alignment and orientation information tables.
  • Update sequence fastas references the gbSeq and gbExtFile tables for all relChanged entries.
  • Update rows for seqChanged and metaChanged accessions in the gbStatus table. Add new accessions to the table.

GenBank/RefSeq Error Handling

The following approaches are used to make this process as robust as possible:

  • The output of the all scripts is logged; error notification is done by e-mail.
  • Semaphore files are created when a task script is running, which prevents other tasks from being accidental run at the same time. If a task fails, a file is created indicate that this occurred. Tasks will not run as long as a failed semaphore file is in place. The condition must be manually correct and the semaphores removed.
  • There is a defined data flow between each step. A step can be restart from the beginning to synchronize it with results of the previous step without corrupting data.
  • Files are written in an atomic manner where needed, first writing the file in the same directory with a temporary name, then renaming it. This is done any time the existance of a file indicates a step is complete.
  • Data files are not modified after successful creation.
  • Various verifications and sanity checks are used.
  • Ability to explictly exclude incorrect genbank entries (data/ignore.idx files).

GenBank/RefSeq Annoying Issues

  • The entire GenBank directory is replace when a new version is release. Daily releases are relative to this.
  • GenBank daily release don't indicate deleted entries.
  • GenBank daily filenames don't include a year, so daily files between the beginning of the year and next release (probably Jan 15th) will not sort in a simple manner.
  • RefSeq updates it cumulative files daily as well as having separate daily files. There is no concept of a release.
  • RefSeq deleted entries are still in the older daily releases. There no daily records indicating when an entry has been deleted.
  • Ocassionally, there are incorrect genbank entries that break assumptions in this code. These are skipped by placing them an data/ignore.idx acc.
  • MySql ISAM tables don't support foreign keys. Using auto_increment for id columns was a problems because mysqlimport would reset the numbers (or at least not insert zero).
  • Want to use disk files rather than a database to track genbank repository files. This is faster when we need to look at all entries and makes setup and loading multiple database servers easier. It was also easier to implement. However this proved to be a problem for ESTs, which require large amount of memory to handle. To reduce the memory required, ESTs are partitioned by the first two letters of the accession.
  • Don't handle realigning sequences (say to take advantage of changes to the aligner).

Overview of directories

$gbRoot/ - root directory

etc/ - configuration files and scripts
ignore.idx - ignore index file.
genban.conf - configuration file.
data/ - data files
download/ - downloaded files from NCBI ftp
genbank.${ver}/
genbank.${ver}/daily-nc/
refseq.${ver}/cummulative/
refseq.${ver}/daily/
processed/ - data extracted from the NCBI flat-files
genbank.${ver}/
full/
daily.${date}/
refseq.${ver}/
full/
daily.${date}/
aligned/ - aligned sequences
${db}/
var/build/ - files associated with download and build steps. Only on build server
run/ - semaphore files
logs/ - log files
build.time - File contain the time that the last download and alignment steps completed, in seconds since 00:00:00 1970-01-01 UTC. This is used by process running on other systems to poll for completion.
var/copy/ - files associated with copying to the gbdb server.
run/ - semaphore files
logs/ - log files
build.time - copy of build/build.time from the last completed copy.
copy.time - file containing time last copy completed.
var/dbload/$host/ - files associated with that last database load on database server $host.
run/ - semaphore files
logs/ - log files
copy.time - copy of copy/copy.time from the last completed copy.
load.time - file containing time last load completed completed.

Realigning Tracks

It maybe necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straight forward when a genome databases is initially being built. It's more complex if one has to sync up multiple systems.

  • To triger a realignment, on needs to remove the related files for some partation of the data for all updates. These live under either the genbank or refseq alignment directories, for example:
  • data/aligned/genbank.139.0/hg16/
  • data/aligned/refseq.139.0/hg16/
  • To realign native RefSeq mRNAs for hg16, one would remove:
  • data/aligned/refseq.139.0/hg16/*/mrna.native.*
  • To realign xeno GeneBank ESTs for hg16, one would remove:
  • data/aligned/refseq.139.0/hg16/*/est.*.xeno.*
  • Reload the database with the partation of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported, use -drop and -initialLoad instead.
  • nice bin/gbDbLoadStep -reload -srcDb=genbank -type=mrna $db