GenbankAlignments

From Genecats
Revision as of 20:27, 15 January 2019 by Braney

The Genbank Alignment Process

This page describes the behind-the-scenes parts of the GenBank alignment process. If you want documentation on how to add a new species to the list of aligned assemblies, go here.

Overview

The GenBank alignment process aligns RNA and EST sequences from NCBI, as well as the RefSeq mRNAs, to (almost) all the assemblies that UCSC supports. The process is divided into roughly five parts: download, process, align, database load, and dissemination. The first four parts and the beginning of the fifth happen on the genbank-101 machine. Dissemination then moves the results to hgwdev, to hgwbeta, and finally to our official mirrors (RR, euro, japan).

GenBank/RefSeq Update Goals

* Incremental update of mRNAs and ESTs for multiple species and assemblies, based on daily updates from NCBI.
* Run automatically from cron, possibly every night.
* Require manual intervention only on non-recoverable errors or when a large cluster run is needed for a large alignment.
* Be incremental across GenBank releases; don't force a full realignment every quarter.
* Allow removal of older GenBank full releases (still without forcing a full realignment).
* Avoid corruption of disk files and databases.
* Recover from failure states automatically when possible, and make manual recovery easy.
* Allow restarting failed steps without restarting the entire process.
* Don't require the process to be run at defined intervals. When a run is done, the data files reflect the current state of the NCBI repository.
* Include HTS files in the automated download process.

GenBank/RefSeq Download Step

The download step retrieves files from the NCBI ftp area and stores the results in the download/ directory.

Algorithm

Run the gbDownloadStep script, which:

* Downloads GenBank files:
  * Checks whether the GenBank release version has changed since the last download.
  * If the version has changed, creates a new version directory and downloads the full release and the non-cumulative daily files.
  * If the version is unchanged, downloads only the new non-cumulative daily files.
* Downloads RefSeq files, in the same way as the GenBank download.

Note that this process isn't mirroring; it doesn't overwrite existing files. This minimizes the danger of leaving the data in an indeterminate state. Only the required subset of files is downloaded.
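The decision made by the download step can be sketched roughly as follows. This is only an illustration of the logic described above, not the gbDownloadStep code: the function name, the DownloadPlan class, and the way file lists are passed in are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadPlan:
    """Files to fetch in one run (hypothetical structure, for illustration)."""
    new_version_dir: bool
    files: list = field(default_factory=list)


def plan_download(remote_version, local_version, remote_full, remote_daily, local_files):
    """Decide what to fetch without ever re-fetching an existing file."""
    version_changed = remote_version != local_version
    # A new release gets a fresh version directory, so nothing is local yet.
    have = set() if version_changed else set(local_files)
    wanted = list(remote_daily)
    if version_changed:
        # Full release plus the non-cumulative daily files.
        wanted = list(remote_full) + wanted
    # Skip anything already on disk: the step adds files, it never overwrites.
    files = [name for name in wanted if name not in have]
    return DownloadPlan(new_version_dir=version_changed, files=files)
```

Because existing files are filtered out rather than replaced, re-running the plan after an interrupted download only picks up what is still missing.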

Directory structure

The directory structure for GenBank and RefSeq is a subset of the directories at the NCBI ftp site. Release version numbers are added to the database directory names to allow keeping multiple versions. Now that RefSeq is doing versioned releases, it is handled in the same manner as GenBank, on an independent release cycle.

$gbRoot/data/download/ - downloaded files from NCBI ftp
  genbank.${ver}/
    README.genbank - version number is parsed from text
    gbrel.txt - release notes
    gb*.seq.gz - sequences, in flat-file format, grouped by division
    full.md5 - md5 checksums for files downloaded with the full release
    daily-nc/
      nc0825.flat.gz - flat-file update for 08/25
      nc0825.flat.gz.md5 - md5 checksum for the update file
      ...
  refseq.${ver}/
    release/complete/
      complete*.rna.gbff.gz - flat-files containing the full RefSeq mRNAs
      full.md5 - md5 checksum of all files
    daily/
      rsnc.0101.2002.gbff.gz - flat-file update for 01/01/2002
      rsnc.0101.2002.gbff.gz.md5 - md5 checksum
      ...
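The md5 files stored next to the downloads make it possible to re-check a directory at any time. The sketch below shows one way such a check could look. It assumes the checksum files use the md5sum layout (hex digest, whitespace, file name), which this page does not confirm, and the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines ("<hex>  <name>") into {name: hex}.

    ASSUMPTION: the real .md5 files may use a different layout.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # md5sum marks binary mode with a leading '*' on the name.
            expected[parts[-1].lstrip("*")] = parts[0].lower()
    return expected


def find_bad_files(directory, listing_name):
    """Return listed names that are missing or fail their checksum."""
    directory = Path(directory)
    expected = parse_md5_listing((directory / listing_name).read_text())
    bad = []
    for name, checksum in sorted(expected.items()):
        target = directory / name
        if not target.is_file() or md5_of(target) != checksum:
            bad.append(name)
    return bad
```

Under these assumptions, calling find_bad_files on a release directory with "full.md5" as the listing name returns an empty list when every listed file is present and intact.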

GenBank/RefSeq Error Handling

The following approaches are used to make this process as robust as possible:

* The output of all the scripts is logged; error notification is done by e-mail.
* Semaphore files are created while a task script is running, which prevents other tasks from being run accidentally at the same time.
* If a task fails, a file is created to indicate that this occurred. Tasks will not run as long as a failed semaphore file is in place. The condition must be corrected manually and the semaphores removed.
* There is a defined data flow between steps. A step can be restarted from the beginning to synchronize it with the results of the previous step without corrupting data.
* Files are written in an atomic manner where needed: the file is first written in the same directory under a temporary name, then renamed. This is done any time the existence of a file indicates that a step is complete.
* Data files are not modified after successful creation.
* Various verifications and sanity checks are used.
* Incorrect GenBank entries can be explicitly excluded (data/ignore.idx files).
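Two of the techniques above, atomic file creation and semaphore files, can be illustrated with a short sketch. This is not the code used by the GenBank scripts. The function names and the ".run" and ".failed" file suffixes are hypothetical; only the write-then-rename pattern and the idea of run and failure markers come from the description above.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp.")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def acquire_semaphore(run_dir, task):
    """Create run_dir/<task>.run unless a run or failure marker exists.

    The marker file names are hypothetical, chosen for this example.
    """
    failed = os.path.join(run_dir, task + ".failed")
    running = os.path.join(run_dir, task + ".run")
    if os.path.exists(failed):
        raise RuntimeError("failure marker present: " + failed)
    # O_EXCL makes creation fail if another task already holds the semaphore.
    fd = os.open(running, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()) + "\n")
    return running
```

Because the rename only happens after the data is fully written, the presence of the final file name can safely be used as the signal that a step completed.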

GenBank/RefSeq Annoying Issues

* The entire GenBank directory is replaced when a new version is released. Daily releases are relative to this.
* GenBank daily releases don't indicate deleted entries.
* GenBank daily filenames don't include a year, so daily files between the beginning of the year and the next release (probably Jan 15th) will not sort in a simple manner.
* RefSeq updates its cumulative files daily as well as having separate daily files. There is no concept of a release.
* RefSeq deleted entries are still in the older daily releases. There are no daily records indicating when an entry has been deleted.
* Occasionally, there are incorrect GenBank entries that break assumptions in this code. These are skipped by listing their accessions in data/ignore.idx.
* MySQL ISAM tables don't support foreign keys.
* Using auto_increment for id columns was a problem because mysqlimport would reset the numbers (or at least not insert zero).
* We wanted to use disk files rather than a database to track GenBank repository files. This is faster when we need to look at all entries, and it makes setup and loading of multiple database servers easier. It was also easier to implement. However, this proved to be a problem for ESTs, which require a large amount of memory to handle. To reduce the memory required, ESTs are partitioned by the first two letters of the accession.
* Realigning sequences (say, to take advantage of changes to the aligner) is not handled.
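The year-less daily filenames are easiest to understand with an example. The sketch below orders names such as nc1220.flat.gz and nc0105.flat.gz relative to a release date, so that files from after the New Year sort last. It assumes the release month and day are known from elsewhere and that successive releases are less than a year apart. The function names are illustrative, and this is not necessarily how the GenBank scripts solve the problem.

```python
import re

# Matches GenBank non-cumulative daily names such as nc0825.flat.gz.
DAILY_NAME = re.compile(r"^nc(\d{2})(\d{2})\.flat\.gz$")


def daily_sort_key(name, release_month, release_day):
    """Order a daily file by how far it falls after the release date."""
    match = DAILY_NAME.match(name)
    if match is None:
        raise ValueError("not a GenBank daily file name: " + name)
    month, day = int(match.group(1)), int(match.group(2))
    # Dates earlier in the calendar than the release belong to the next year.
    wrapped = (month, day) < (release_month, release_day)
    return (1 if wrapped else 0, month, day)


def sort_dailies(names, release_month, release_day):
    """Return daily file names in chronological order after the release."""
    return sorted(names, key=lambda n: daily_sort_key(n, release_month, release_day))
```

For a hypothetical release dated 12/15, a plain lexical sort would put nc0101.flat.gz before nc1220.flat.gz, while sort_dailies places the December files first.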

Overview of directories

$gbRoot/ - root directory
  etc/ - configuration files and scripts
    ignore.idx - ignore index file
    genbank.conf - configuration file
  data/ - data files
    download/ - downloaded files from NCBI ftp
      genbank.${ver}/
      genbank.${ver}/daily-nc/
      refseq.${ver}/cummulative/
      refseq.${ver}/daily/
    processed/ - data extracted from the NCBI flat-files
      genbank.${ver}/
        full/
        daily.${date}/
      refseq.${ver}/
        full/
        daily.${date}/
    aligned/ - aligned sequences
      ${db}/
  var/build/ - files associated with the download and build steps; only on the build server
    run/ - semaphore files
    logs/ - log files
    build.time - file containing the time that the last download and alignment steps completed, in seconds since 00:00:00 1970-01-01 UTC. This is used by processes running on other systems to poll for completion.
  var/copy/ - files associated with copying to the gbdb server
    run/ - semaphore files
    logs/ - log files
    build.time - copy of build/build.time from the last completed copy
    copy.time - file containing the time the last copy completed
  var/dbload/$host/ - files associated with the last database load on database server $host
    run/ - semaphore files
    logs/ - log files
    copy.time - copy of copy/copy.time from the last completed copy
    load.time - file containing the time the last load completed
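The *.time files let a downstream step find out whether there is anything new to pick up. A minimal sketch of that polling check is shown below. It assumes each file holds a single integer (seconds since the epoch) and uses made-up function names; the actual scripts may read and compare these files differently.

```python
from pathlib import Path


def read_time(path):
    """Return the integer epoch seconds stored in a *.time file, or None."""
    path = Path(path)
    if not path.is_file():
        return None
    text = path.read_text().strip()
    return int(text) if text else None


def copy_is_needed(gb_root):
    """True when a build finished more recently than the last copy recorded."""
    root = Path(gb_root)
    built = read_time(root / "var" / "build" / "build.time")
    copied = read_time(root / "var" / "copy" / "build.time")
    if built is None:
        return False  # nothing has been built yet
    return copied is None or built > copied
```

The same comparison would apply one stage further along, between copy/copy.time and the copy.time kept under var/dbload/$host/.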

Realigning Tracks

It may be necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straightforward when a genome database is initially being built. It's more complex if one has to sync up multiple systems.

1. If automated alignment or update has been enabled for the database, disable it by editing $gbRoot/etc/align.dbs.
2. Make sure an automated alignment isn't currently running.
3. To trigger a realignment, one needs to remove the related files for some partition of the data for all updates. These live under either the genbank or refseq alignment directories, for example:
     data/aligned/genbank.139.0/hg16/
     data/aligned/refseq.139.0/hg16/
   To realign native RefSeq mRNAs for hg16, one would remove:
     data/aligned/refseq.139.0/hg16/*/mrna.native.*
   To realign xeno GenBank ESTs for hg16, one would remove:
     data/aligned/genbank.139.0/hg16/*/est.*.xeno.*
4. Do an initial alignment as described above, restricting with -srcDb and -type.
5. Reload the database with the partition of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported; use -drop and -initialLoad instead.
     nice bin/gbDbLoadStep -reload -srcDb=genbank -type=mrna $db
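Since removing the wrong aligned files forces unnecessary realignment, it can help to preview what a removal pattern matches before deleting anything. The helper below is a hypothetical sketch, not part of the GenBank tools: it expands a pattern such as mrna.native.* across all update directories and defaults to a dry run.

```python
import glob
import os


def realign_targets(gb_root, src_db_dir, db, pattern):
    """List aligned files matching pattern (e.g. 'mrna.native.*') in all updates."""
    spec = os.path.join(gb_root, "data", "aligned", src_db_dir, db, "*", pattern)
    return sorted(glob.glob(spec))


def remove_targets(paths, dry_run=True):
    """Print each path and delete it only when dry_run is False."""
    for path in paths:
        print(("would remove " if dry_run else "removing ") + path)
        if not dry_run:
            os.remove(path)
    return len(paths)
```

For the native RefSeq mRNA example above, realign_targets(gb_root, "refseq.139.0", "hg16", "mrna.native.*") lists the candidate files, and remove_targets is only told to delete them once the list looks right.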