DCC pipeline discussion: Difference between revisions

Revision as of 01:47, 5 November 2007

In order to define the functions of the automated pipeline, it is useful to look at the existing, manual, process of creating ENCODE tracks, as well as what new features will be needed for the ENCODE production phase. At the end of this page is a proposed submission process/pipeline for the DCC that was developed from discussion of the first two sections.

Existing process (ENCODE pilot phase)

1. Data submitter creates a submission package, consisting of data files and documentation files. The data files are in these formats: BED, wiggle, GTF. Usually the files are tarred and gzipped, with arbitrary directory structure. Files are named to indicate the experiment. Sometimes there is a README. Sometimes there are custom track headers in the files. Sometimes there are multiple custom tracks in a file. Sometimes the documentation is HTML, sometimes it is MS Word. Usually the descriptions follow the description page template we provide, but often they are incomplete. Sometimes submitters provide URLs for cell lines and antibodies. Sometimes they provide URLs for references (but not in our standard format). Often there are multiple tracks for the same experiment (e.g. Signal and Sites).

2. Data submitter posts the submission to our FTP site, or posts in their web space, then emails UCSC to notify of the submission.

3. Kate responds to email, creates a named & dated build dir, transfers the submission to the build dir, and creates an entry in the 'make doc'. She then updates the ENCODE portal Data Status page and notifies the submitter that we have received the submission.

4. Kate requests an engineer assignment from Donna and updates the project list (or dev pushQ) to include the new dataset.

5. Kate/developer decide on track group, track types, and track structure -- Should it be wiggle or bedGraph if it's float-valued? Should it have special track display, details, or filtering? Should this be a new track or a new subtrack of an existing track? If it's a new track, should it be a composite? Should it be part of an existing super-track, or should a new super-track be created? Based on these choices and the metadata, labels and table names are chosen.

6. Developer processes files in preparation for loading:

remove track lines
split multi-track files
truncate precision
trim overlaps
fix off-by-one
coordinate conversion
assign unique item names
scale scores
sanity check data distribution with histogram
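
As an illustration of the first three items in this list, here is a minimal Python sketch, not the scripts actually used at UCSC. It assumes the submission is custom-track text in which each track starts with a "track" line, and that the value to shorten is the last column of a bedGraph-style row. The function names and the default of three decimal places are hypothetical, and the precision step rounds rather than strictly truncating.

```python
def split_custom_tracks(text):
    """Split custom-track text into (track_line, data_rows) pairs.

    Rows that appear before the first 'track' line are returned with a
    track_line of None. 'browser' lines and blank lines are dropped.
    """
    tracks = []
    header, rows = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        first_word = stripped.split()[0]
        if first_word == "browser":
            continue
        if first_word == "track":
            # A new track line closes the previous track, if any.
            if header is not None or rows:
                tracks.append((header, rows))
            header, rows = stripped, []
        else:
            rows.append(stripped)
    if header is not None or rows:
        tracks.append((header, rows))
    return tracks


def limit_precision(rows, digits=3):
    """Round the last column of each row to `digits` decimal places."""
    shortened = []
    for row in rows:
        fields = row.split()
        fields[-1] = f"{float(fields[-1]):.{digits}f}"
        shortened.append("\t".join(fields))
    return shortened
```

Keeping the track line alongside its rows, rather than discarding it, leaves the submitter's own name and type hints available when labels and table names are chosen later.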

7. Developer loads data, including wigEncoding, and symlinking wib files.

8. Developer creates track configuration (trackDb):

ordering
labels: include submitter, experiment type, distinguishing metadata

     e.g. <submitter> <type> <antibody>   Yale ChIP Pol2

colors: selected to distinguish and draw attention to similar experiments
wiggle view limits determined by histogram
data version: MON YYYY
original assembly: how data was originally submitted
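
The configuration items above could in principle be derived from submission metadata. The following Python sketch shows one way that might look; it is an assumption-laden illustration, not the DCC's generator. The metadata keys (`submitter`, `type`, `antibody`, `orig_assembly`), the function names, and the example table name are hypothetical. The setting names written out (`track`, `shortLabel`, `longLabel`, `type`, `viewLimits`, `dataVersion`, `origAssembly`) are intended to match trackDb settings, but should be checked against the trackDb documentation before use.

```python
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def make_label(meta):
    """Build a '<submitter> <type> <antibody>' label, skipping missing parts."""
    parts = (meta.get(key) for key in ("submitter", "type", "antibody"))
    return " ".join(part for part in parts if part)


def make_trackdb_stanza(table, meta, track_type, view_limits, year, month):
    """Return a trackDb-style stanza as text, one 'setting value' per line."""
    label = make_label(meta)
    low, high = view_limits  # e.g. chosen by eye from a histogram of the data
    settings = [
        ("track", table),
        ("shortLabel", label),
        ("longLabel", label),
        ("type", track_type),
        ("viewLimits", f"{low}:{high}"),
        ("dataVersion", f"{MONTHS[month - 1]} {year}"),  # MON YYYY
        ("origAssembly", meta["orig_assembly"]),
    ]
    return "\n".join(f"{name} {value}" for name, value in settings)


if __name__ == "__main__":
    meta = {"submitter": "Yale", "type": "ChIP", "antibody": "Pol2",
            "orig_assembly": "hg17"}
    print(make_trackdb_stanza("encodeYaleChipPol2", meta,
                              "bedGraph 4", (0, 50), 2007, 11))
```

Ordering and colors are left out of the sketch because the list above describes them as choices made relative to similar experiments, which a single submission's metadata does not determine.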

9. Developer edits & installs track description, or updates existing track description to include new subtrack info. Creates or updates super-track description if needed. Optionally passes on for scientific review (e.g. Ting, Rachel, Jim) or technical writing review (Donna).

10. Developer installs on genome-test, and requests review from submitter.

11. Developer posts downloads for any wiggle files.

12. Developer creates pushQ entry, notifies Kate that track is ready.

13. Kate updates internal Project List, external Data Status page, and reviews track.

14. Q/A reviews track and releases. Automation updates the ENCODE Release Log (if track name begins with 'ENCODE').

15. Kate updates the Data Status page.

16. Periodically (ideally quarterly, or when something significant happens), Kate posts a News item on the ENCODE portal, summarizing tracks released or other events. She also emails the summary to the ENCODE Consortium mailing list.

17. Periodically (e.g. for NHGRI Progress Reports), Kate collects overall stats on tracks released and generates a report.

New Features for ENCODE Production phase

Web-based submission process
Standardized submission package with formal metadata (controlled vocabulary or URL)
Track structure defined by submission type and metadata
Track configuration generated automatically from metadata
Manual tweaking of generated track configuration (by developer or Q/A?)
Manual editing of track description (by developer, Q/A, scientific lead, tech writer?)
Interactive query of submissions and status
Automated notification to submitter if submission has problems
Automated request for review by submitter
Submitter acceptance triggers automated creation of pushQ entry.
Automated notification to submitter that track has been released
Regular, automated reporting of submissions and status -- quarterly summary to Consortium members, detailed report to NHGRI
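
Several of the features above (status queries, automated notifications, and acceptance triggering a pushQ entry) amount to tracking each submission through a small set of states. The Python sketch below is one hypothetical way to model that. The status names, the `Submission` class, and the in-memory `push_queue` list are all assumptions made for illustration; the list above does not define a status model, and a real pipeline would send email and write to the actual pushQ.

```python
# Hypothetical status model: each status maps to the statuses it may move to.
TRANSITIONS = {
    "submitted": {"validated", "rejected"},
    "rejected": {"submitted"},           # submitter fixes and resubmits
    "validated": {"loaded"},
    "loaded": {"in_review"},
    "in_review": {"accepted", "rejected"},
    "accepted": {"released"},
    "released": set(),
}


class Submission:
    def __init__(self, name, notify):
        self.name = name
        self.status = "submitted"
        self.notify = notify      # callable(message); stands in for email
        self.push_queue = []      # stands in for pushQ entries

    def advance(self, new_status):
        """Move to new_status, firing any automated action tied to it."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.status = new_status
        if new_status == "rejected":
            self.notify(f"{self.name}: submission has problems")
        elif new_status == "in_review":
            self.notify(f"{self.name}: please review the track")
        elif new_status == "accepted":
            self.push_queue.append(self.name)
        elif new_status == "released":
            self.notify(f"{self.name}: track has been released")


if __name__ == "__main__":
    sub = Submission("encodeYaleChipPol2", print)
    for status in ("validated", "loaded", "in_review", "accepted", "released"):
        sub.advance(status)
```

Recording the current status on each submission would also give interactive queries and periodic reports a single field to read from.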

@@ Line 2: / Line 2: @@
 to look at the existing, manual, process of creating ENCODE tracks,
 as well as what new features will be needed for the ENCODE production phase.
-At the end of this page is a proposed submission process/pipeline for the DCC.
+At the end of this page is a proposed submission process/pipeline for the DCC
+that was developed from discussion of the first two sections.
 == Existing process (ENCODE pilot phase) ==