Track metadata handling: Difference between revisions

From genomewiki

Revision as of 23:59, 13 February 2007

Background

I have a topic prompted specifically by the ENCODE grant proposal, but it's one that I think could have broad applicability -- how to store and use track 'metadata'. What capabilities do you think we can/should provide relating to metadata? Are there helpful examples at other bioinformatics sites (e.g. NIH DCCs) that you have seen?

For ENCODE, metadata typically includes which cell lines were used for an experiment, which antibodies were used for ChIP-chip, and sometimes the time course of an experiment (e.g. at 0, 8, and 24 hrs). ENCODE users may want to locate, for example, all datasets on HeLa cells. More generally at our site, we get mailing list questions asking whether we have experimental data of type XX on any organism/assembly. We currently keep metadata in trackDb settings and the track description, and have no explicit search mechanisms. --Kate 12:38, 12 February 2007 (PST)
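To make the "all datasets on HeLa cells" question concrete, here is a minimal sketch of what an explicit search could look like if the metadata were held as key/value pairs per track. The track names, the keys (cell, antibody, timepoint) and the values are all made up for illustration; they are not actual trackDb settings.

```python
from typing import Dict, List

# Hypothetical metadata records keyed by track name. Names, keys and
# values are illustrative only, not real trackDb settings.
TRACK_METADATA: Dict[str, Dict[str, str]] = {
    "exampleChipHela": {"cell": "HeLa", "antibody": "antibodyA", "timepoint": "0h"},
    "exampleChipHela8h": {"cell": "HeLa", "antibody": "antibodyA", "timepoint": "8h"},
    "exampleChipOther": {"cell": "otherCell", "antibody": "antibodyA"},
}


def find_tracks(metadata: Dict[str, Dict[str, str]], **criteria: str) -> List[str]:
    """Return sorted track names whose metadata matches every criterion.

    Values are compared case-insensitively; a track that lacks a
    requested key does not match.
    """
    matches = []
    for track, fields in metadata.items():
        if all(
            key in fields and fields[key].lower() == value.lower()
            for key, value in criteria.items()
        ):
            matches.append(track)
    return sorted(matches)


if __name__ == "__main__":
    # All tracks on HeLa cells, then only the 8-hour time point.
    print(find_tracks(TRACK_METADATA, cell="hela"))
    print(find_tracks(TRACK_METADATA, cell="HeLa", timepoint="8h"))
```

A flat key/value layout like this is only one option; it says nothing about where the metadata would actually live or how it would be indexed.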

Discussion

From Daryl:

The HapMap DCC site has great metadata examples.  
See the Downloads|Documentation section here 
(the 'Bulk Data Download' link from the main page):

	http://www.hapmap.org/downloads/index.html.en

The Protocols section (including versioning) maps directly to what we need.  
We'll also need a mechanism for tracking reagents -- individual
cell lines, antibodies, etc.  We should also keep track of chip designs 
in GEO/ArrayExpress.  The HapMap DCC uses XML to communicate
the metadata, and has gone through many updates of their formats 
(http://www.hapmap.org/downloads/xml_docs/).

There are more metadata examples on the HapMap DCC internal site, 
but it is down at the moment.  I can send the access info later.

The main difference between the HapMap and ENCODE DCCs is going 
to be the expansion in data types.  The output of the HapMap project
was primarily diploid genotypes, so this provided a fixed point that 
allowed many inputs (different genotyping platforms and
protocols, different populations and individual samples) and many 
outputs (analyses -- genotype/allele frequencies, phasing, LD,
etc.). The ENCODE DCC will need to be quite a bit more flexible to 
handle all of the various data types.

Notes on Metadata Handling at the HapMap DCC

--Kate 15:59, 13 February 2007 (PST)

  • Metadata Format

All HapMap data exchanged between providers and the DCC is formatted as XML, using XML schema files to specify the semantics. An advantage of this approach, they claim, is that file format validity can be verified by the submitter before handing off.
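As a rough illustration of "verify before handing off", the sketch below runs a lightweight pre-submission check using only the Python standard library. It is a stand-in, not XML Schema validation: it checks well-formedness and the presence of a few required elements, whereas validating against the real schema files would need an XSD-aware validator. The protocol root element and the field names are assumptions, not taken from the HapMap schemas.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical required child elements of a <protocol> record; the real
# HapMap schemas define their own element names.
REQUIRED_FIELDS = ("type", "submitter", "short_description")


def check_submission(xml_text: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Not well-formed, so no further checks are possible.
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "protocol":
        problems.append(f"unexpected root element <{root.tag}>")
    for name in REQUIRED_FIELDS:
        child = root.find(name)
        if child is None or not (child.text or "").strip():
            problems.append(f"missing or empty <{name}>")
    return problems


if __name__ == "__main__":
    incomplete = "<protocol><type>genotyping</type></protocol>"
    for problem in check_submission(incomplete):
        print(problem)
```

The point is only that the submitter can run the same check the receiving side would, and fix problems before the hand-off.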

  • Metadata item identifiers

Each item -- data or documentation -- that is tracked by the DCC is assigned an LSID (Life Sciences Identifier). This is a URL-like string (actually, a URN) that is intended to always link to the item, regardless of changes to web sites.

Here's an example:

urn:lsid:pdb.org:1AFT:1

This is the first version of the 1AFT protein in the Protein Data Bank.
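For anyone wanting to handle these identifiers in code, here is a minimal sketch that pulls an LSID apart. It deliberately interprets only the urn:lsid: prefix and the authority that follows it, and returns the remaining colon-separated fields without assuming how many there are or what each one means. The function name split_lsid is made up for this example.

```python
from typing import List, Tuple


def split_lsid(lsid: str) -> Tuple[str, List[str]]:
    """Split an LSID into (authority, remaining fields).

    Only the leading 'urn:lsid:' prefix and the authority are
    interpreted; the meaning of the remaining fields is left to the
    caller.
    """
    parts = lsid.split(":")
    if len(parts) < 4 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty field in LSID: {lsid!r}")
    return parts[2], parts[3:]


if __name__ == "__main__":
    authority, fields = split_lsid("urn:lsid:pdb.org:1AFT:1")
    print(authority, fields)
```

Resolving an LSID to the item it names is a separate step that this sketch does not attempt.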

There is supposedly some browser support, at least in development, to translate the URNs. There's an overview website at SourceForge that seems to have mostly broken links (http://lsid.sourceforge.net/), but there are functional links to software: Perl and Java implementations and a Firefox extension (to map URNs to URLs?).

IBM also has a long page on LSIDs: http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/

Net gossip suggests that LSIDs are probably not widely used, and that support is not quite available yet.

  • Some HapMap metadata types
    • Labgroup (Informatics contact, PI, Institution, etc.)
    • Data submission (Submitter, Comment...)
    • Protocol (type, submitter, short and long descriptions)
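The three metadata types listed above can be sketched as simple record types. The field names follow the items in the list; the version number on Protocol and the links from a submission to its lab group and protocol are assumptions added for illustration (versioning of protocols is mentioned earlier on this page, but not how it is stored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labgroup:
    informatics_contact: str
    pi: str
    institution: str


@dataclass(frozen=True)
class Protocol:
    protocol_type: str
    submitter: str
    short_description: str
    long_description: str
    version: int = 1  # assumption: how versions are recorded is not stated


@dataclass(frozen=True)
class DataSubmission:
    submitter: str
    comment: str
    labgroup: Labgroup  # assumption: a submission references its lab group
    protocol: Protocol  # assumption: and the protocol that produced the data


if __name__ == "__main__":
    # All values below are placeholders.
    lab = Labgroup("contact@example.org", "Example PI", "Example Institute")
    protocol = Protocol("genotyping", "submitter1", "short text", "longer text")
    submission = DataSubmission("submitter1", "first batch", lab, protocol)
    print(submission.protocol.version)
```

Keeping reagents such as cell lines and antibodies as further record types of the same kind would be one way to extend this for ENCODE.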