DCC metadata discussion

From genomewiki
Revision as of 21:17, 16 November 2007 by Kate (talk | contribs)


One area of development for the ENCODE project at UCSC, needed to accommodate the production phase, is expanding and formalizing the handling of metadata. For background, see Track metadata handling. Terminology used in this discussion:

Experiment:     Design of an experiment that will be performed on
                 one or more Samples, measured via an Assay,
                 with data reported from one or more Analyses.
                 (Comparable to 'Series', or 'Study' in other repositories).

Sample:         One experimental run, using defined experimental variables
                 (e.g. cell type, treatment, antibody)

Analysis:       A data transformation on a set of raw data

What is collected

Existing repositories have two categories of metadata:

  • Experimental design, including experimental variables, together with investigator and institution information and references to publications. The ENCODE DCC will collaboratively develop a 'data agreement' that specifies the experimental design and how the data will be stored and displayed.

  • Specifics of individual experiments, including the values used for experimental variables, the specific protocols used, and the data files for each experiment. The ENCODE DCC will collect the following for each individual experimental dataset in a submission archive:

+Data filename:       Filename in this archive that contains the data (e.g. Pol2.bed)
Data file block:      Sequence number of file when dataset is split (e.g. by chrom)
+Assembly:            Human genome assembly (e.g. hg18)
+Track:               Genome browser track, assigned by DCC (e.g. Yale ChIP Signal)
Raw data accession:   Accession of raw data in public repository (e.g. GEO 1234), if submitted
Data version:         Internal project version, if any
<Variable1>:          Value of experimental variable used for this sample, from design
...
<VariableN>:          Some variables are: cell type, treatment, antibody, timepoint
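
As a rough illustration of the field list above, the sketch below treats one dataset's metadata as a simple mapping and checks it. It assumes that the "+" prefix marks required fields, which this page does not actually state. The function names and the example variable values are hypothetical and are not part of any DCC tool.

```python
# Field names are copied from the list above. Treating the "+" fields as
# required is an assumption; the page does not define the prefix.
REQUIRED_FIELDS = ("Data filename", "Assembly", "Track")
OPTIONAL_FIELDS = ("Data file block", "Raw data accession", "Data version")


def check_dataset(record):
    """Return a list of problems found in one dataset's metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = str(record.get(field) or "").strip()
        if not value:
            problems.append("missing required field: " + field)
    return problems


def experimental_variables(record):
    """Return every entry that is not a fixed field, i.e. the variables."""
    fixed = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    return {name: value for name, value in record.items() if name not in fixed}


# Example record: the first three values come from the list above; the
# experimental variable values are made up for illustration.
example = {
    "Data filename": "Pol2.bed",
    "Assembly": "hg18",
    "Track": "Yale ChIP Signal",
    "cell type": "HeLa",
    "antibody": "Pol2",
}

print(check_dataset(example))
print(experimental_variables(example))
```

Keeping the experimental variables separate from the fixed fields mirrors the list above, where the variable names come from the experimental design rather than from a fixed schema.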

How metadata is specified

One standard is the MAGE-TAB format, designed specifically for microarray data, which collects metadata via three files:

  • IDF (Investigation Description File)
  • SDRF (Sample and Data Relationship Format)
  • ADF (Array Design Format)

The modENCODE project is extending these formats to support a broader range of data types. The ENCODE DCC will adopt a similar approach, reusing the modENCODE extensions where applicable but restricting them to the subset our requirements call for (e.g. modENCODE will be submitting raw data to the public repositories, while the ENCODE DCC will not). Only the SDRF type of file will be used. The IDF information will be gathered independently, ahead of the data submission, and will require collaborative effort between the DCC and the investigator, because the display and storage tools provided by UCSC are more specialized than those of the standard repositories. The ADF information will be available via links to the microarray repositories and will be summarized in the track description.

The DCC will require an SDRF file in each data submission archive. This file can be in ENCODE MAGE-TAB format (tab-delimited rows, spreadsheet-style), or in UCSC RA format (tag/value blocks, plaintext editor-style).

The DCC will reuse existing MAGE-TAB terms where they exist.
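
To make the two submission formats concrete, here is a minimal sketch that turns tab-delimited, spreadsheet-style rows into tag/value blocks. It assumes an RA layout of one "tag value" pair per line with a blank line between blocks, and it assumes column headers are collapsed to one-word camelCase tags. The column names, the tag naming rule, and the function names are illustrative only and are not the DCC's actual specification.

```python
import csv
import io


def to_tag(column_name):
    """Turn a header such as 'Data filename' into a one-word tag.

    The camelCase convention is an assumption made for this sketch.
    """
    words = column_name.split()
    if not words:
        return ""
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


def tab_to_ra(text):
    """Convert tab-delimited rows (first row = headers) to tag/value blocks."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    stanzas = []
    for row in reader:
        lines = [
            to_tag(column) + " " + value
            for column, value in row.items()
            if column and value  # skip empty cells and stray extra columns
        ]
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"


# Two illustrative rows; the cell type values are made up.
sdrf_text = (
    "Data filename\tAssembly\tcell type\n"
    "Pol2.bed\thg18\tHeLa\n"
    "Pol2_rep2.bed\thg18\tK562\n"
)
print(tab_to_ra(sdrf_text))
```

Each spreadsheet row becomes one block, so the two formats carry the same information and differ only in how a submitter prefers to edit them.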

How metadata will be stored

  • Database tables associated with submission pipeline
  • trackDb table
  • Doc files (in /gbdb, or on Wiki)

How metadata will be used and displayed

  • Track search tool
  • Track details page:
    • subtrack section will include experimental variables
    • track description will include protocol or other info from Wiki or auxiliary pages

Expected data types

  • ChIP-chip
  • ChIP-seq
  • DNase-chip
  • DNase-seq
  • RNA-IP
  • FAIRE
  • RACE
  • Motifs
  • Ditags
  • Gene models
  • Methylation-seq
  • Promoters