DCC pipeline discussion


In order to define the functions of the automated pipeline, it is useful to look at the existing, manual, process of creating ENCODE tracks, as well as what new features will be needed for the ENCODE production phase. At the end of this page is a proposed submission process/pipeline for the DCC that was developed from discussion of the first two sections.

Existing process (ENCODE pilot phase)

1. Data submitter creates a submission package, consisting of data files and documentation files. The data files are in these formats: BED, wiggle, GTF. Usually the files are tarred and gzipped, with arbitrary directory structure. Files are named to indicate the experiment. Sometimes there is a README. Sometimes there are custom track headers in the files. Sometimes there are multiple custom tracks in a file. Sometimes the documentation is HTML, sometimes it is MS-WORD. Usually they follow the description page template we provide. Often the descriptions are incomplete. Sometimes they provide URLs for cell lines and antibodies. Sometimes they provide URLs for references (but not in our standard format). Often there are multiple tracks for the same experiment (e.g. Signal and Sites).

2. Data submitter posts the submission to our FTP site, or posts in their web space, then emails UCSC to notify of the submission.

3. Kate responds to the email, creates a named & dated build dir, transfers the submission to the build dir, and creates an entry in the 'make doc'. She updates the ENCODE portal Data Status page and notifies the submitter that we have received it.

4. Kate requests an engineer assignment from Donna and updates the project list (or dev pushQ) to include the new dataset.

5. Kate and the developer decide on track group, track types, and track structure -- should it be wiggle or bedGraph if it's float-valued? Should it have special track display, details, or filtering? Should this be a new track or a new subtrack of an existing track? If it's a new track, should it be a composite? Should it be part of an existing super-track, or should a new super-track be created? Based on these choices and the metadata, labels and table names are chosen.

6. Developer processes files in preparation for loading:

  • remove track lines
  • split multi-track files
  • truncate precision
  • trim overlaps
  • fix off-by-one
  • coordinate conversion
  • assign unique item names
  • scale scores
  • sanity check data distribution with histogram
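
Most of these fixes are simple per-line passes over the data files. Below is a minimal sketch in Python for a BED-like file, assuming tab-separated fields and a known raw score maximum; the file names, score range, and column handling are illustrative, not actual pipeline code.

  # cleanup_sketch.py -- illustrative only; actual submissions vary widely in format
  import sys

  MAX_SCORE = 1000     # target BED score range (assumed)

  def clean_bed(in_path, out_path, raw_max):
      """Strip track/browser header lines, assign unique item names, rescale scores."""
      with open(in_path) as src, open(out_path, "w") as dst:
          item = 0
          for line in src:
              if line.startswith(("track", "browser", "#")):
                  continue                              # drop custom track headers
              fields = line.rstrip("\n").split("\t")
              item += 1
              if len(fields) < 4 or not fields[3]:
                  fields = fields[:3] + ["item%d" % item] + fields[4:]
              if len(fields) >= 5:                      # rescale score into 0-1000
                  fields[4] = str(int(round(float(fields[4]) / raw_max * MAX_SCORE)))
              dst.write("\t".join(fields) + "\n")

  # Precision truncation, off-by-one coordinate fixes, and overlap trimming for
  # wiggle/bedGraph data would be handled by a similar per-line pass.

  if __name__ == "__main__":
      clean_bed(sys.argv[1], sys.argv[2], raw_max=float(sys.argv[3]))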

7. Developer loads data, including wiggle encoding (wigEncode) and symlinking the .wib files.
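
For wiggle data this step is typically a couple of command-line invocations plus a symlink; the sketch below wraps them in Python. The database name, table name, and /gbdb layout are placeholders, and the exact wigEncode/hgLoadWiggle arguments should be checked against the tools' usage messages rather than taken from this sketch.

  # load_wiggle_sketch.py -- hypothetical wrapper around the standard loading commands
  import os, subprocess

  def load_wiggle(db, table, wig_in, gbdb_dir="/gbdb"):
      """Encode a wiggle, load it, and symlink the .wib where the browser expects it."""
      wig_out = table + ".wig"
      wib_out = table + ".wib"
      # wigEncode converts ascii wiggle data into .wig/.wib form
      subprocess.check_call(["wigEncode", wig_in, wig_out, wib_out])
      # hgLoadWiggle loads the .wig table into the assembly database
      subprocess.check_call(["hgLoadWiggle", db, table, wig_out])
      # the .wib file is referenced by path, so symlink it under /gbdb/<db>/wib/
      dest = os.path.join(gbdb_dir, db, "wib", wib_out)
      os.symlink(os.path.abspath(wib_out), dest)

  if __name__ == "__main__":
      load_wiggle("hg18", "encodeYaleChipPol2Sig", "submitted.wig")   # names illustrative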

8. Developer creates track configuration (trackDb):

  • ordering
  • labels: include submitter, experiment type, distinguishing metadata
     e.g. <submitter> <type> <antibody>   Yale ChIP Pol2
  • colors: selected to distinguish and draw attention to similar experiments
  • wiggle view limits determined by histogram
  • data version: MON YYYY
  • original assembly: how data was originally submitted
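
As an illustration, a stanza for a single wiggle subtrack might look like the example below; the track name, labels, and values are invented for this example, and only the settings listed above (plus a priority for ordering) are shown.

  track encodeYaleChipPol2Sig
  shortLabel Yale ChIP Pol2
  longLabel Yale ChIP-chip Signal (Pol2 antibody)
  type wig 0 100
  priority 1.1
  color 0,100,150
  viewLimits 0:25
  dataVersion Nov 2007
  origAssembly hg17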

9. Developer edits & installs track description, or updates existing track description to include new subtrack info. Creates or updates super-track description if needed. Optionally passes on for scientific review (e.g. Ting, Rachel, Jim) or technical writing review (Donna).

10. Developer installs on genome-test, and requests review from submitter.

11. Developer posts downloads for any wiggle files.

12. Developer creates pushQ entry, notifies Kate that track is ready.

13. Kate updates internal Project List, external Data Status page, and reviews track.

14. Q/A reviews track and releases. Automation updates the ENCODE Release Log (if track name begins with 'ENCODE').

15. Kate updates the Data Status page.

16. Periodically (ideally quarterly, or when something significant happens), Kate posts a News item on the ENCODE portal, summarizing tracks released or other events. She also emails the ENCODE Consortium mailing list.

17. Periodically (e.g. for NHGRI Progress Reports), Kate collects overall stats on tracks released and generates a report.


New Features for ENCODE Production phase

  • Web-based submission process
  • Standardized submission package with formal metadata (controlled vocabulary or URL)
  • Track structure defined by submission type and metadata
  • Track configuration generated automatically from metadata
  • Manual tweaking of generated track configuration (by developer or Q/A?)
  • Manual editing of track description (by developer, Q/A, scientific lead, tech writer?)
  • Interactive query of submissions and status
  • Automated notification to submitter if submission has problems
  • Automated request for review by submitter
  • Submitter acceptance triggers automated creation of pushQ entry.
  • Automated notification to submitter that track has been released
  • Regular, automated reporting of submissions and status -- quarterly summary to Consortium members, detailed report to NHGRI

Proposed process (ENCODE production phase)

1. Data submitter registers with the DCC, providing lab/institution information, contact information for the PI and bioinformatics contacts, and overall project description. If there are subgroups in the project that will be working directly with the DCC, they should register separately, but be associated with the overall grant so that submissions can be grouped for reporting purposes. This could be done using a Wiki form or a Ruby/Rails web page. The end result should be a project page on the Wiki and a database table with project information.

2. Data submitter provides a methods description for each experiment type, describes experimental parameters (e.g. cell source, antibodies), and proposes data formats/constraints for submission. Examples of constraints are: data precision (default 5 digits), allowed data range, and whether a +/-1 coordinate adjustment is required. The DCC reviews and amends as needed. When the documentation and data submission formats are finalized, the DCC assigns super-tracks/tracks and submission can proceed.
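
For example, the agreed formats and constraints for one experiment type might be captured in a short spec like the sketch below (the field names and values are hypothetical):

  experiment:   ChIP-chip signal
  format:       wig
  precision:    5 decimal digits
  dataRange:    0 to 1000
  coordinates:  0-based half-open; +1 adjustment applied by the DCC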

3. Data submitter creates a compressed archive containing all data files in approved formats, along with metadata files that describe each file. The metadata will describe each data file, including track name, assembly, and values of experimental variables (either a reference to an existing controlled vocabulary term, or a new CV term with identifier, short description, and URL or long description). Optionally, a URL for the raw data in a public repository can be included. Some additional constraints on packaging are:

  • Custom track header ('track' and 'browser') lines are allowed, and will be removed
  • Only a single dataset (or custom track) can be in a single file

The archive is uploaded from the web submission page.
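
As an illustration, the metadata entry for a single data file might look like the sketch below; the field names are hypothetical, since the metadata format is still to be defined:

  file:        YaleChipPol2Signal.wig.gz
  track:       Yale ChIP Pol2 Signal
  assembly:    hg18
  cell:        GM06990        (existing controlled vocab term)
  antibody:    Pol2
  rawDataUrl:  (optional link to the raw data in a public repository)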

4. The automated data validator will confirm package integrity (all files present and in the correct format), that the metadata is complete, and that data constraints are met. Data characteristics will be computed and displayed/emailed to the submitter, e.g.:

  • #data elements
  • data range
  • data distribution

The data submission will be logged for reporting purposes.
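
A sketch of the characteristics computation is shown below, assuming the numeric data values have already been extracted from the submitted file; the bin count and return format are arbitrary choices for illustration.

  # validator_sketch.py -- compute basic data characteristics to report to the submitter
  def summarize(values, bins=20):
      """Return element count, data range, and a coarse histogram of the values."""
      n = len(values)
      lo, hi = min(values), max(values)
      width = (hi - lo) / bins or 1             # avoid zero width for constant data
      hist = [0] * bins
      for v in values:
          i = min(int((v - lo) / width), bins - 1)
          hist[i] += 1
      return {"count": n, "min": lo, "max": hi, "histogram": hist}

  if __name__ == "__main__":
      import random
      print(summarize([random.gauss(10, 2) for _ in range(10000)]))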

<MORE TO COME>