DCC metadata discussion: Difference between revisions

From genomewiki
Line 42: Line 42:
    
    
== How metadata is specified ==
== How metadata is specified ==
One standard is the MAGE-TAB format, designed specfically for microarray data,
 
One standard is the MAGE-TAB format, designed specifically for microarray data,
which collects metadata via three files:
which collects metadata via three files:
* IDF (Investigation Description File)
* IDF (Investigation Description File)
Line 50: Line 51:
The modENCODE project is extending these formats to support  
The modENCODE project is extending these formats to support  
a broader range of data types.  The ENCODE DCC will adopt
a broader range of data types.  The ENCODE DCC will adopt
a similar approach, reusing the modENCODE extensions where
a similar approach, using separate files to represent
applicable, but using a restricted subset needed for our requirements
overall experimental design and the specifics of individual
(e.g. modENCODE will be submitting raw data to the public repositories,
data sets generated by experiments, however we will adopt
while the ENCODE DCC will not).  Also, only the SDRF type of file
a much simpler set of data submission file formats, as our
will be used.  The IDF information will be gathered independently
requirements are narrower (e.g. modENCODE will be submitting raw data to the public repositories,
while the ENCODE DCC will not).  Also, the bulk of the overall
description information will be gathered independently
and ahead of the data submission, and will require collaborative effort
and ahead of the data submission, and will require collaborative effort
between the DCC and investigator, as the display and storage
between the DCC and investigator, as the display and storage
tools provided by UCSC are more specialized than the standard  
tools provided by UCSC are more specialized than the standard  
repositories.  The ADF information will be available via link
repositories.  For microarray data submissions,
the ADF information will be available via link
to the microarray repositories, and will be summarized on the
to the microarray repositories, and will be summarized on the
track description.
track description.


The DCC will require an SDRF file in each data submission archive.
The DCC will support two metadata files for a given project:
This file can be in ENCODE MAGE-TAB format (tab-delimited rows, spreadsheet-style),
* Project Information File:  PIF.{xls,txt,csv}
or in UCSC RA format (tag/value blocks, plaintext editor-style).
* Dataset Description File:  DDF.{xls,txt,csv} [http://spreadsheets.google.com/pub?key=pmF9mvzJce1G753HT99d_KA example]
 
The PIF file will describe experimental parameters that apply
to the entire project, and allow specification of additional values for
experimental variables (e.g. new antibodies used for ChIP experiments,
or use of additional cell lines beyond the ENCODE standards).
Generally, fields in the PIF file can be overriden for individual
experiments in the DDF file.  Note that most of the project
information -- experimental design (description, methods,
variables, data types), investigators, and references --  
will be collected prior to data submission as part of
developing the data agreement.


The DCC will reuse existing MAGE-TAB terms where they exist.
The DDF file contains a list of all files in the project, with
fields supplied to indicate how the files will be processed. 
The DDF format contains required fields,
fields required by data agreement, and optional fields.
Fields in the PIF can be specified in the DDF file as optional
fields to override the values in the PIF.


== How metadata will be stored ==
== How metadata will be stored ==
Line 79: Line 99:
** track description will include protocol or other info from Wiki or auxiliary pages
** track description will include protocol or other info from Wiki or auxiliary pages


== Expected data types
== Expected data types ==
* ChIP-chip
* ChIP-chip
* ChIP-seq
* ChIP-seq
* DNase-chip
* DNase-chip
* Dnase-seq
* Dnase-seq
* RNA-chip
* RNA-seq
* RNA-IP
* RNA-IP
* FAIRE
* FAIRE
Line 90: Line 112:
* Ditags
* Ditags
* Gene models
* Gene models
* Methylation-seq
* Methylation-seq  
* Promoters
* Promoters  
* 5C Chromatin conformation

Latest revision as of 06:45, 18 November 2007


One area of development for the ENCODE project at UCSC, needed to accommodate the production phase, is to expand and formalize the handling of metadata. For some background, see Track metadata handling. Some terminology used in this discussion:

Experiment:     Design of an experiment that will be performed on
                 one or more Samples, measured via an Assay,
                 with data reported from one or more Analyses.
                 (Comparable to 'Series', or 'Study' in other repositories).

Sample:         One experimental run, using defined experimental variables
                  (e.g. cell type, treatment, antibody)

Analysis:       A data transformation on a set of raw data      

What is collected

Existing repositories have two categories of metadata:

  • Experimental design, including experimental variables, together with investigator and institution information and references to publications. The ENCODE DCC will collaboratively develop a 'data agreement' that specifies the experimental design and how the data will be stored and displayed.

  • Specifics of individual experiments, including which values were used for experimental variables, which specific protocols were used, and the data files for each experiment. The ENCODE DCC will collect the following for each individual experimental dataset in a submission archive:

+Data filename:       Filename in this archive that contains the data (e.g. Pol2.bed)
Data file block:      Sequence number of file when dataset is split (e.g. by chrom)
+Assembly:            Human genome assembly (e.g. hg18)
+Track:               Genome browser track, assigned by DCC (e.g. Yale ChIP Signal)
Raw data accession:   Accession of raw data in public repository (e.g. GEO 1234), if submitted
Data version:         Internal project version, if any
<Variable1>:          Value of experimental variable used for this sample, from design
...
<VariableN>:          Some variables are:  cell type, treatment, antibody, timepoint
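
As a rough illustration of the per-dataset fields listed above, the record could be modeled as follows. This is only a sketch: the class name, attribute names, and the `missing_required` helper are hypothetical, the sample variable values are made up for illustration, and it assumes that the fields marked with `+` in the list are the required ones, which the list itself does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetRecord:
    """One dataset in a submission archive (hypothetical model of the field list)."""

    # Assumption: fields marked '+' in the list are required.
    data_filename: str                        # e.g. Pol2.bed
    assembly: str                             # e.g. hg18
    track: str                                # assigned by the DCC
    # Remaining fields are treated as optional in this sketch.
    data_file_block: Optional[int] = None     # sequence number when a dataset is split
    raw_data_accession: Optional[str] = None  # accession in a public repository, if submitted
    data_version: Optional[str] = None        # internal project version, if any
    # Experimental variables from the design (cell type, treatment, antibody, ...).
    variables: Dict[str, str] = field(default_factory=dict)

    def missing_required(self) -> List[str]:
        """Return the names of required fields that are empty or whitespace."""
        required = {
            "data_filename": self.data_filename,
            "assembly": self.assembly,
            "track": self.track,
        }
        return [name for name, value in required.items() if not value.strip()]


# Illustrative values only; the variable values are not taken from any real submission.
record = DatasetRecord(
    data_filename="Pol2.bed",
    assembly="hg18",
    track="Yale ChIP Signal",
    variables={"cell type": "HeLa", "antibody": "Pol2"},
)
incomplete = DatasetRecord(data_filename="", assembly="hg18", track="Yale ChIP Signal")
```

Keeping the experimental variables in a free-form dictionary reflects that the set of variables comes from each project's design rather than from a fixed schema.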

How metadata is specified

One standard is the MAGE-TAB format, designed specifically for microarray data, which collects metadata via three files:

  • IDF (Investigation Description File)
  • SDRF (Sample and Data Relationship Format)
  • ADF (Array Design Format)

The modENCODE project is extending these formats to support a broader range of data types. The ENCODE DCC will adopt a similar approach, using separate files to represent the overall experimental design and the specifics of individual data sets generated by experiments. However, we will adopt a much simpler set of data submission file formats, as our requirements are narrower (e.g. modENCODE will be submitting raw data to the public repositories, while the ENCODE DCC will not). Also, the bulk of the overall description information will be gathered independently of, and ahead of, the data submission. This will require collaborative effort between the DCC and the investigator, as the display and storage tools provided by UCSC are more specialized than those of the standard repositories. For microarray data submissions, the ADF information will be available via a link to the microarray repositories, and will be summarized in the track description.

The DCC will support two metadata files for a given project:

  • Project Information File: PIF.{xls,txt,csv}
  • Dataset Description File: DDF.{xls,txt,csv} (example: http://spreadsheets.google.com/pub?key=pmF9mvzJce1G753HT99d_KA)

The PIF file will describe experimental parameters that apply to the entire project, and allow specification of additional values for experimental variables (e.g. new antibodies used for ChIP experiments, or use of additional cell lines beyond the ENCODE standards). Generally, fields in the PIF file can be overridden for individual experiments in the DDF file. Note that most of the project information -- experimental design (description, methods, variables, data types), investigators, and references -- will be collected prior to data submission as part of developing the data agreement.

The DDF file contains a list of all files in the project, with fields supplied to indicate how the files will be processed. The DDF format contains required fields, fields required by data agreement, and optional fields. Fields in the PIF can be specified in the DDF file as optional fields to override the values in the PIF.
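
The override relationship between the two files can be sketched as below. The file layouts here are assumptions, not the DCC's actual formats: the PIF is taken to be tab-delimited name/value rows, the DDF a tab-delimited table with a header row and one row per data file, and the column names (`files`, `cell`, `antibody`) and sample values are hypothetical. An empty DDF cell is treated as "not specified", so the PIF value applies.

```python
import csv
import io


def read_pif(text):
    """Parse tab-delimited name/value rows into a dict of project-wide defaults."""
    defaults = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) >= 2 and row[0].strip():
            defaults[row[0].strip()] = row[1].strip()
    return defaults


def read_ddf(text, defaults):
    """Return one dict per DDF row, with non-empty DDF cells overriding PIF defaults."""
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        merged = dict(defaults)  # start from the project-wide values
        for key, value in row.items():
            # Skip surplus cells (key None), missing cells (value None), and blanks.
            if key is not None and value is not None and value.strip():
                merged[key.strip()] = value.strip()
        records.append(merged)
    return records


# Hypothetical file contents, for illustration only.
PIF_TEXT = "assembly\thg18\nantibody\tPol2\n"
DDF_TEXT = (
    "files\tcell\tantibody\n"
    "Pol2.bed\tHeLa\t\n"      # antibody left blank: the PIF value applies
    "CTCF.bed\tHeLa\tCTCF\n"  # antibody given: overrides the PIF value
)

records = read_ddf(DDF_TEXT, read_pif(PIF_TEXT))
```

In this sketch the first record keeps the PIF antibody while the second carries its own, and both inherit the assembly from the PIF.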

How metadata will be stored

  • Database tables associated with submission pipeline
  • trackDb table
  • Doc files (in /gbdb, or on Wiki)

How metadata will be used and displayed

  • Track search tool
  • Track details page:
    • subtrack section will include experimental variables
    • track description will include protocol or other info from Wiki or auxiliary pages

Expected data types

  • ChIP-chip
  • ChIP-seq
  • DNase-chip
  • DNase-seq
  • RNA-chip
  • RNA-seq
  • RNA-IP
  • FAIRE
  • RACE
  • Motifs
  • Ditags
  • Gene models
  • Methylation-seq
  • Promoters
  • 5C Chromatin conformation