DCC pipeline discussion
To define the functions of the automated pipeline, it is useful to look at the existing manual process of creating ENCODE tracks, as well as at the new features that will be needed for the ENCODE production phase. At the end of this page is a proposed submission process/pipeline for the DCC, developed from discussion of the first two sections.
Existing process (ENCODE pilot phase)
1. Data submitter creates a submission package, consisting of data files and documentation files. The data files are in these formats: BED, wiggle, GTF. Usually the files are tarred and gzipped, with arbitrary directory structure. Files are named to indicate the experiment. Sometimes there is a README. Sometimes there are custom track headers in the files. Sometimes there are multiple custom tracks in a file. Sometimes the documentation is HTML, sometimes it is MS Word. Usually the documentation follows the description page template we provide. Often the descriptions are incomplete. Sometimes submitters provide URLs for cell lines and antibodies. Sometimes they provide URLs for references (but not in our standard format). Often there are multiple tracks for the same experiment (e.g. Signal and Sites).
2. Data submitter posts the submission to our FTP site, or posts in their web space, then emails UCSC to notify of the submission.
3. Kate responds to email, creates a named & dated build dir, transfers the submission to the build dir, and creates an entry in the 'make doc'. She then updates the ENCODE portal Data Status page and notifies the submitter that we have received the submission.
4. Kate requests an engineer assignment from Donna, updates the project list (or dev pushQ) to include the new dataset.
5. Kate/developer decide on track group, track types, and track structure -- Should it be wiggle or bedGraph if it's float valued? Should it have special track display, details, or filtering? Should this be a new track or a new subtrack of an existing track? If it's a new track, should it be a composite? Should it be part of an existing super-track, or should a new super-track be created? Based on these choices and the metadata, labels and table names are chosen.
6. Developer processes files in preparation for loading:
- remove track lines
- split multi-track files
- truncate precision
- trim overlaps
- fix off-by-one
- coordinate conversion
- assign unique item names
- scale scores
- sanity check data distribution with histogram
7. Developer loads data, including wigEncoding, and symlinking wib files.
8. Developer creates track configuration (trackDb):
- ordering
- labels: include submitter, experiment type, distinguishing metadata
e.g. <submitter> <type> <antibody>: Yale ChIP Pol2
- colors: selected to distinguish and draw attention to similar experiments
- wiggle view limits determined by histogram
- data version: MON YYYY
- original assembly: how data was originally submitted
9. Developer edits & installs track description, or updates existing track description to include new subtrack info. Creates or updates super-track description if needed. Optionally passes on for scientific review (e.g. Ting, Rachel, Jim) or technical writing review (Donna).
10. Developer installs on genome-test, and requests review from submitter.
11. Developer posts downloads for any wiggle files.
12. Developer creates pushQ entry, notifies Kate that track is ready.
13. Kate updates internal Project List, external Data Status page, and reviews track.
14. Q/A reviews track and releases. Automation updates the ENCODE Release Log (if track name begins with 'ENCODE').
15. Kate updates the Data Status page.
16. Periodically (ideally quarterly, or when something significant happens), Kate posts a News item on the ENCODE portal, summarizing tracks released or other events. She also emails the summary to the ENCODE Consortium mailing list.
17. Periodically (e.g. for NHGRI Progress Reports), Kate collects overall stats on tracks released and generates a report.
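Some of the file preparation in step 6 is mechanical enough to sketch in code. The following is a minimal illustration, not the scripts actually used: the function names are hypothetical, it treats any line whose first word is 'track' as the start of a new dataset, drops 'browser' lines, blank lines, and '#' comments, and it assumes "truncate precision" can be approximated by rounding the fourth column of a bedGraph-style line.

```python
def split_custom_tracks(lines):
    """Split custom-track text into one list of data lines per track.

    A 'track' line starts a new dataset; 'browser' lines, blank lines,
    and '#' comment lines are dropped, so each chunk holds data only.
    """
    chunks = []
    current = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        first_word = stripped.split()[0]
        if first_word == "browser":
            continue
        if first_word == "track":
            # Close the previous dataset, if it had any data lines.
            if current:
                chunks.append(current)
            current = []
            continue
        current.append(stripped)
    if current:
        chunks.append(current)
    return chunks


def round_bedgraph_value(data_line, digits=3):
    """Round the value column (4th field) of a bedGraph-style line.

    The number of digits is an assumed parameter, not a documented setting.
    """
    chrom, start, end, value = data_line.split()[:4]
    return f"{chrom}\t{start}\t{end}\t{round(float(value), digits)}"
```

Each chunk returned by split_custom_tracks would then be written to its own file before loading. The remaining items in step 6 (overlap trimming, coordinate fixes, score scaling) depend on per-dataset decisions and are not covered here.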
New Features for ENCODE Production phase
- Web-based submission process
- Standardized submission package with formal metadata (controlled vocabulary or URL)
- Track structure defined by submission type and metadata
- Track configuration generated automatically from metadata
- Manual tweaking of generated track configuration (by developer or Q/A?)
- Manual editing of track description (by developer, Q/A, scientific lead, tech writer?)
- Interactive query of submissions and status
- Automated notification to submitter if submission has problems
- Automated request for review by submitter
- Submitter acceptance triggers automated creation of pushQ entry.
- Automated notification to submitter that track has been released
- Regular, automated reporting of submissions and status -- quarterly summary to Consortium members, detailed report to NHGRI
Proposed process (ENCODE production phase)
1. Data submitter registers with the DCC, providing lab/institution information, contact information for the PI and bioinformatics contacts, and overall project description. If there are subgroups in the project that will be working directly with the DCC, they should register separately, but be associated with the overall grant so that submissions can be grouped for reporting purposes. This could be done using a Wiki form or a Ruby/Rails web page. The end result should be a project page on the Wiki and a database table with project information. DCC assigns a primary and alternate development engineer to the project.
2. Data submitter provides methods description for each experiment type, describes experimental parameters (e.g. cell source, antibodies), and proposes data formats/constraints for submission. Examples of constraints are: data precision (default to 5 digits), allowed data range, coordinate adjustment ('data starts at 0/1'). The DCC reviews and amends as needed. When the documentation and data submission formats are finalized, the DCC assigns super-tracks/tracks and submission can proceed.
3. Data submitter creates a compressed archive containing all data files in approved formats, along with metadata files that describe each file.
Some constraints on packaging are:
- Custom track header lines ('track' and 'browser') are allowed, and will be removed
- Only a single dataset (or custom track) can be in a single file
The metadata will describe each data file, including:
- track name
- assembly
- experimental variables (e.g. cell type, antibody, treatment, timepoint)
- optional data version (default is YYYY-MM or YYYY-MM.# if > 1 submission/month)
- optional URL referencing raw data in a public repository, along with name of repository (e.g. GEO, ArrayExpress, short trace repository).
Values for experimental variables must use existing controlled vocabulary, or supply new CV with identifier, short description, and long description or URL.
The archive is uploaded from the web submission page.
4. The automated data validator will confirm that the package is intact (all files present and in the correct format), that the metadata is complete, and that data constraints are met. Data characteristics will be computed and displayed/emailed to the submitter, e.g.:
- #data elements
- data range
- data distribution
The data submission will be logged for reporting purposes (# elements, type, date submitted).
5. Automated data loader loads database table and track configuration. Table names, labels, and version/assembly settings will be generated from metadata. Data limits and view windows will be determined from data range and distribution (e.g. 2nd-3rd quartile viewing window). Track description is automatically updated to include the submitted experiments (e.g. a table of cell types, or other experimental variables is generated from metadata in the trackDb, and inserted into the description). When this is complete, the status will change to 'Loaded', and email will be sent to the DCC and assigned developer.
6. Development engineer reviews the validation and loading results pages, track display, description, and configuration, adjusting niceties (labels, ordering, color, view limits), and checks performance. Developer sign-off causes submission to change status to 'Ready for submitter review', and email is sent to the submitter requesting sign-off.
7. Submitter reviews and signs off on the submission. This results in the creation of a pushQ entry, and the status changes to 'Q/A'.
8. When data is released, status changes to 'released' and email is sent to notify the submitter.
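The metadata checks described in steps 3 and 4 could look roughly like the sketch below. Everything named here is an assumption made for illustration: the field names, the sample vocabulary terms, and the choice to number a second submission in the same month as '.2'. It also does not handle submitter-supplied new CV terms, which the process allows.

```python
import re
from datetime import date

# Hypothetical field names for one metadata record.
REQUIRED_FIELDS = ("track_name", "assembly", "variables")

# Illustrative terms only; the real controlled vocabulary would come from the DCC.
CONTROLLED_VOCABULARY = {
    "cell_type": {"GM12878", "K562"},
    "antibody": {"Pol2", "CTCF"},
}

# YYYY-MM, optionally followed by .# for repeat submissions in a month.
VERSION_PATTERN = re.compile(r"^\d{4}-\d{2}(\.\d+)?$")


def default_data_version(today, prior_submissions_this_month=0):
    """Build the default version string from the submission date."""
    version = today.strftime("%Y-%m")
    if prior_submissions_this_month > 0:
        # Assumption: the second submission in a month becomes YYYY-MM.2.
        version += f".{prior_submissions_this_month + 1}"
    return version


def validate_metadata(record, today):
    """Return (normalized_record, errors) for one data file's metadata."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    for name, value in (record.get("variables") or {}).items():
        allowed = CONTROLLED_VOCABULARY.get(name)
        if allowed is None:
            errors.append(f"unknown experimental variable: {name}")
        elif value not in allowed:
            errors.append(f"{name}={value!r} is not in the controlled vocabulary")
    normalized = dict(record)
    version = normalized.get("data_version")
    if version is None:
        # Data version is optional; fill in the default.
        normalized["data_version"] = default_data_version(today)
    elif not VERSION_PATTERN.match(version):
        errors.append(f"bad data version: {version}")
    return normalized, errors
```

A non-empty error list would be what triggers the automated notification to the submitter; an empty list lets the submission proceed to loading.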
A history of status changes will be maintained, so that summary reports can be generated to characterize DCC throughput and bottlenecks.
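One possible shape for this history is an append-only log of (status, timestamp) pairs per submission, from which time spent in each status can be derived. The class below is a sketch under that assumption; the class and method names are hypothetical, and only the four status names that appear in the proposed process are included.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Status names taken from the proposed process; earlier states are not named there.
KNOWN_STATUSES = ("Loaded", "Ready for submitter review", "Q/A", "released")


class SubmissionHistory:
    """Append-only log of (status, timestamp) pairs for one submission."""

    def __init__(self):
        self.events = []

    def set_status(self, status, when):
        """Record a status change; changes must arrive in time order."""
        if status not in KNOWN_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        if self.events and when < self.events[-1][1]:
            raise ValueError("status changes must be recorded in time order")
        self.events.append((status, when))

    def current_status(self):
        return self.events[-1][0] if self.events else None

    def time_in_status(self):
        """Return {status: timedelta} for every status that has been left."""
        durations = defaultdict(timedelta)
        # Pair each event with the one that follows it.
        for (status, entered), (_next, left) in zip(self.events, self.events[1:]):
            durations[status] += left - entered
        return dict(durations)
```

Summing time_in_status() across submissions would show which stage holds submissions longest. The sketch does not enforce a fixed order of statuses, so a submission sent back for rework could re-enter an earlier status and have both visits counted.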
On a regular basis (quarterly), a summary report of tracks submitted and tracks released will be generated. These reports will be archived at the ENCODE portal/Wiki, and possibly emailed to the Consortium members (or sent on request?). Detailed quarterly reports will be automatically generated for delivery to NHGRI.
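As a rough illustration of the quarterly summary, the counts could be derived from submission and release dates as follows. The input shape (name, submitted date, released date or None) and the function names are assumptions for this sketch; a real report would read from the submission log.

```python
from collections import Counter
from datetime import date


def quarter_of(day):
    """Label a date with its calendar quarter, e.g. 2008-Q1."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def quarterly_summary(tracks):
    """Count tracks submitted and released per quarter.

    tracks: iterable of (name, submitted_date, released_date or None).
    Returns a list of (quarter, n_submitted, n_released), oldest first.
    """
    submitted = Counter()
    released = Counter()
    for _name, submitted_on, released_on in tracks:
        submitted[quarter_of(submitted_on)] += 1
        if released_on is not None:
            released[quarter_of(released_on)] += 1
    quarters = sorted(set(submitted) | set(released))
    # Counter returns 0 for quarters with no entries.
    return [(q, submitted[q], released[q]) for q in quarters]
```

A track is counted in the quarter it was submitted and again in the quarter it was released, so the two columns need not match within a quarter.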