DCC metadata discussion

From genomewiki

One area of development for the ENCODE project at UCSC, to accommodate the production phase, is to expand and formalize the handling of metadata. For some background, see Track metadata handling. Some terminology used in this discussion:

Experiment:     Design of an experiment that will be performed on
                 one or more Samples, measured via an Assay,
                 with data reported from one or more Analyses.
                 (Comparable to 'Series', or 'Study' in other repositories).

Sample:         One experimental run, using defined experimental variables
                 (e.g. cell type, treatment, antibody)

Analysis:       A data transformation on a set of raw data
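
As an illustration only, the three terms above could be modeled as simple record types. This is a sketch, not the DCC's actual schema: the class layout, the attribute names, and the sample values (taken from the examples on this page) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """One experimental run, described by its experimental variables."""
    variables: Dict[str, str]  # e.g. {"antibody": "Pol2"}


@dataclass
class Analysis:
    """A data transformation on a set of raw data."""
    description: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Experiment:
    """A design performed on one or more Samples, measured via an Assay."""
    name: str
    assay: str
    samples: List[Sample] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)


# Hypothetical values, for illustration only.
exp = Experiment(name="example design", assay="ChIP-chip")
exp.samples.append(Sample({"antibody": "Pol2"}))
exp.analyses.append(Analysis("signal", ["Pol2.bed"]))
```

Keeping Samples and Analyses as separate lists under one Experiment mirrors the definitions: a single design can have several runs and several reported data transformations.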

What is collected

Existing repositories have two categories of metadata:

  • Experimental design, including experimental variables, together with investigator and institution information and references to publications. The ENCODE DCC will collaboratively develop a 'data agreement' that specifies the experimental design and how the data will be stored and displayed.

  • Specifics of individual experiments, including which values were used for experimental variables, which specific protocols were used, and the data files for each experiment. The ENCODE DCC will collect the following for each individual experimental dataset in a submission archive:

+Data filename:       Filename in this archive that contains the data (e.g. Pol2.bed)
Data file block:      Sequence number of file when dataset is split (e.g. by chrom)
+Assembly:            Human genome assembly (e.g. hg18)
+Track:               Genome browser track, assigned by DCC (e.g. Yale ChIP Signal)
Raw data accession:   Accession of raw data in public repository (e.g. GEO 1234, if submitted)
Data version:         Internal project version, if any
<Variable1>:          Value of experimental variable used for this sample, from design
<VariableN>:          Some variables are: cell type, treatment, antibody, timepoint
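
A minimal sketch of how one such dataset record might be checked before processing. It assumes that the fields marked with "+" in the list above are the mandatory ones; this page does not state that explicitly, and the function name `missing_required` is hypothetical.

```python
# Assumption: the "+" prefix in the field list marks mandatory fields.
REQUIRED_FIELDS = ("Data filename", "Assembly", "Track")


def missing_required(record):
    """Return the required field names that are absent or blank in record."""
    return [name for name in REQUIRED_FIELDS
            if not str(record.get(name, "")).strip()]


# Values taken from the examples in the field list above.
record = {
    "Data filename": "Pol2.bed",
    "Assembly": "hg18",
    "Track": "Yale ChIP Signal",
    "antibody": "Pol2",  # an experimental variable (illustrative value)
}

# A record with a blank Assembly and no Track at all.
incomplete = {"Data filename": "Pol2.bed", "Assembly": ""}
```

Here `missing_required(record)` returns an empty list, while `missing_required(incomplete)` reports both the blank and the absent field.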

How metadata is specified

One standard is the MAGE-TAB format, designed specifically for microarray data, which collects metadata via three files:

  • IDF (Investigation Description File)
  • SDRF (Sample and Data Relationship Format)
  • ADF (Array Design Format)

The modENCODE project is extending these formats to support a broader range of data types. The ENCODE DCC will adopt a similar approach, using separate files to represent the overall experimental design and the specifics of the individual data sets generated by experiments. However, we will adopt a much simpler set of data submission file formats, as our requirements are narrower (e.g. modENCODE will be submitting raw data to the public repositories, while the ENCODE DCC will not). Also, the bulk of the overall description information will be gathered independently of, and ahead of, the data submission. This will require collaborative effort between the DCC and the investigator, as the display and storage tools provided by UCSC are more specialized than those of the standard repositories. For microarray data submissions, the ADF information will be available via a link to the microarray repositories and will be summarized in the track description.

The DCC will support two metadata files for a given project:

  • Project Information File: PIF.{xls,txt,csv}
  • Dataset Description File: DDF.{xls,txt,csv}

The PIF file will describe experimental parameters that apply to the entire project, and will allow specification of additional values for experimental variables (e.g. new antibodies used for ChIP experiments, or use of additional cell lines beyond the ENCODE standards). Generally, fields in the PIF file can be overridden for individual experiments in the DDF file. Note that most of the project information (experimental design, including description, methods, variables, and data types, as well as investigators and references) will be collected prior to data submission as part of developing the data agreement.

The DDF file contains a list of all files in the project, with fields supplied to indicate how the files will be processed. The DDF format contains required fields, fields required by data agreement, and optional fields. Fields in the PIF can be specified in the DDF file as optional fields to override the values in the PIF.
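
The override rule described above can be sketched as follows. This is an assumption-laden illustration: the tab-delimited layout, the column names (`files`, `antibody`, `assembly`), the value `OtherAb`, and the function `effective_rows` are all hypothetical, since the actual PIF and DDF formats are not defined on this page.

```python
import csv
import io


def effective_rows(pif, ddf_text):
    """Return one dict per DDF row, using PIF values as project-wide defaults."""
    reader = csv.DictReader(io.StringIO(ddf_text), delimiter="\t")
    rows = []
    for row in reader:
        merged = dict(pif)  # start from the project-wide values
        # Only non-empty DDF cells override the PIF value.
        merged.update({k: v for k, v in row.items() if k and v})
        rows.append(merged)
    return rows


# Hypothetical project-wide settings and a two-line DDF.
pif = {"assembly": "hg18", "antibody": "Pol2"}
ddf_text = (
    "files\tantibody\n"
    "a.bed\t\n"         # blank cell: inherits antibody from the PIF
    "b.bed\tOtherAb\n"  # explicit value: overrides the PIF
)
rows = effective_rows(pif, ddf_text)
```

Treating a blank DDF cell as "inherit from the PIF" is one possible design choice; a real pipeline would need to decide whether a blank can also mean "no value".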

How metadata will be stored

  • Database tables associated with submission pipeline
  • trackDb table
  • Doc files (in /gbdb, or on Wiki)

How metadata will be used and displayed

  • Track search tool
  • Track details page:
    • subtrack section will include experimental variables
    • track description will include protocol or other info from Wiki or auxiliary pages

Expected data types

  • ChIP-chip
  • ChIP-seq
  • DNase-chip
  • DNase-seq
  • RNA-chip
  • RNA-seq
  • RNA-IP
  • RACE
  • Motifs
  • Ditags
  • Gene models
  • Methylation-seq
  • Promoters
  • 5C Chromatin conformation