Track metadata handling

From genomewiki
Revision as of 20:38, 12 February 2007 by Kate (talk | contribs)

Background

I have a topic prompted specifically by the ENCODE grant proposal, but it's one that I think could have broad applicability -- how to store and use track 'metadata'. What capabilities do you think we can/should provide relating to metadata? Are there helpful examples at other bioinformatics sites (e.g. NIH DCCs) that you have seen?

For ENCODE, metadata typically includes which cell lines were used for an experiment, which antibodies were used for ChIP-chip, and sometimes the timecourse of an experiment (e.g. at 0, 8, and 24 hrs). ENCODE users may want to locate, for example, all datasets on HeLa cells. More generally at our site, we get mailing-list questions asking whether we have a given type of experimental data on any organism/assembly. We currently keep metadata in trackDb settings and the track description, and have no explicit search mechanism. --Kate 12:38, 12 February 2007 (PST)
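
As a rough illustration of the kind of search described above, here is a minimal Python sketch. It assumes the metadata for each track has already been pulled out into key/value pairs; the track names, the keys (cellLine, antibody, dataType, timepoint) and the find_tracks helper are all hypothetical and are not existing trackDb settings or browser code.

```python
from typing import Dict, List

# Hypothetical per-track metadata; names, keys and values are illustrative only.
TRACKS: Dict[str, Dict[str, str]] = {
    "exampleChipHela0h": {
        "cellLine": "HeLa", "antibody": "antibodyA",
        "dataType": "ChIP-chip", "timepoint": "0h",
    },
    "exampleChipHela8h": {
        "cellLine": "HeLa", "antibody": "antibodyA",
        "dataType": "ChIP-chip", "timepoint": "8h",
    },
    "exampleExprOther": {
        "cellLine": "otherCellLine", "dataType": "expression",
    },
}


def find_tracks(tracks: Dict[str, Dict[str, str]], **criteria: str) -> List[str]:
    """Return sorted names of tracks whose metadata matches every criterion.

    Matching is exact but case-insensitive; a track that lacks a requested
    key never matches.
    """
    matches = []
    for name, meta in tracks.items():
        if all(
            key in meta and meta[key].lower() == str(value).lower()
            for key, value in criteria.items()
        ):
            matches.append(name)
    return sorted(matches)


if __name__ == "__main__":
    # "All datasets on HeLa cells"
    print(find_tracks(TRACKS, cellLine="HeLa"))
```

Keeping the metadata as flat key/value pairs is only one option; it has the advantage that new keys can be added per data type without changing the search code.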

Discussion

From Daryl:

The HapMap DCC site has great metadata examples.  
See the Downloads|Documentation section here 
(the 'Bulk Data Download' link from the main page):

	http://www.hapmap.org/downloads/index.html.en

The Protocols (including versioning) map directly to what we need.  
We'll also need a mechanism for tracking reagents -- individual
cell lines, antibodies, etc.  We should also keep track of chip designs 
in GEO/ArrayExpress.  The HapMap DCC uses XML to communicate
the metadata, and has gone through many updates of their formats 
(http://www.hapmap.org/downloads/xml_docs/).
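
One possible shape for the records suggested above (versioned protocols, reagents, chip designs) is sketched below in Python. This is only an assumption about how such records might be organized; it does not reflect the HapMap XML formats, and every class and field name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Protocol:
    """A named protocol; a revised protocol gets a new version number."""
    name: str
    version: int


@dataclass(frozen=True)
class Reagent:
    """A tracked reagent, e.g. kind='cellLine' or kind='antibody'."""
    kind: str
    name: str


@dataclass
class Experiment:
    """Ties one dataset to its protocol version, reagents and chip design."""
    dataset: str
    protocol: Protocol
    reagents: List[Reagent] = field(default_factory=list)
    # Placeholder for a GEO/ArrayExpress chip design identifier, if any.
    chip_design: str = ""


def experiments_using(experiments: List[Experiment], kind: str, name: str) -> List[str]:
    """Return dataset names, in input order, for experiments using the reagent."""
    wanted = Reagent(kind, name)
    return [e.dataset for e in experiments if wanted in e.reagents]


if __name__ == "__main__":
    chip_v1 = Protocol("exampleChipProtocol", 1)
    chip_v2 = Protocol("exampleChipProtocol", 2)
    exps = [
        Experiment("datasetA", chip_v1, [Reagent("cellLine", "HeLa")]),
        Experiment("datasetB", chip_v2, [Reagent("cellLine", "HeLa"),
                                         Reagent("antibody", "antibodyA")]),
    ]
    print(experiments_using(exps, "antibody", "antibodyA"))
```

Making Protocol and Reagent immutable means two experiments that cite the same protocol version or reagent compare equal, which keeps lookups like the one above simple.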

There are more metadata examples on the HapMap DCC internal site, 
but it is down at the moment.  I can send the access info later.

The main difference between the HapMap and ENCODE DCCs is going 
to be the expansion in data types.  The output of the HapMap project
was primarily diploid genotypes, so this provided a fixed point that 
allowed many inputs (different genotyping platforms and
protocols, different populations and individual samples) and many 
outputs (analyses -- genotype/allele frequencies, phasing, LD,
etc.). The ENCODE DCC will need to be quite a bit more flexible to 
handle all of the various data types.