Track metadata handling

From genomewiki
Revision as of 00:37, 14 February 2007 by Kate (talk | contribs)

Background

I have a topic prompted specifically by the ENCODE grant proposal, but one that I think could have broad applicability: how to store and use track 'metadata'. What capabilities do you think we can/should provide relating to metadata? Are there helpful examples at other bioinformatics sites (e.g. NIH DCCs) that you have seen?

For ENCODE, metadata typically includes which cell lines were used for an experiment, which antibodies were used for ChIP-chip, and sometimes the timecourse of an experiment (e.g. at 0, 8, and 24 hrs). ENCODE users may want to locate, for example, all datasets on HeLa cells. More generally at our site, we get mailing list (ML) questions asking if we have XX type experimental data on any organism/assembly. We currently keep metadata in trackDb settings and the track description, and have no explicit search mechanisms. Some capabilities we should consider:

  • GB display of all tracks/subtracks satisfying a metadata search
  • Download of all data tables for tracks/subtracks satisfying a metadata search
  • Search could be query on predefined categories (dropdown menu of cell lines) and/or free-form (fuzzy search of track description and descriptive settings)

--Kate 12:38, 12 February 2007 (PST)
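
To make the two search styles in the list above concrete, here is a minimal sketch, not an existing Genome Browser feature. The `Track` class, the `cellLine` and `antibody` keys, and the track names are hypothetical stand-ins for trackDb settings; the fuzzy match uses Python's standard `difflib`, which is only one of many ways to do free-form matching.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Track:
    """Hypothetical stand-in for a track/subtrack and its metadata."""
    name: str
    description: str = ""
    # Hypothetical key/value metadata, e.g. {"cellLine": "HeLa"}
    settings: dict = field(default_factory=dict)


def search_by_category(tracks, key, value):
    """Case-insensitive exact match on a predefined category (dropdown style)."""
    wanted = value.lower()
    return [t for t in tracks if t.settings.get(key, "").lower() == wanted]


def search_free_form(tracks, query, threshold=0.8):
    """Fuzzy match of one query word against the words in the
    description and in the settings values."""
    query = query.lower()
    hits = []
    for t in tracks:
        text = " ".join([t.description, *t.settings.values()]).lower()
        if any(SequenceMatcher(None, query, word).ratio() >= threshold
               for word in text.split()):
            hits.append(t)
    return hits


# Made-up example data
tracks = [
    Track("exampleTrack1", "ChIP-chip on HeLa cells, 0 hrs",
          {"cellLine": "HeLa", "antibody": "exampleAntibody"}),
    Track("exampleTrack2", "ChIP-chip on GM06990 cells",
          {"cellLine": "GM06990"}),
]

hela_tracks = search_by_category(tracks, "cellLine", "hela")
fuzzy_tracks = search_free_form(tracks, "hella")  # misspelled on purpose
```

Either result list could then drive a GB display or a bulk download of the matching data tables.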

Discussion

From Daryl:

The HapMap DCC site has great metadata examples.  
See the Downloads|Documentation section here 
(the 'Bulk Data Download' link from the main page):

	http://www.hapmap.org/downloads/index.html.en

The Protocols (including versioning) map directly to what we need.  
We'll also need a mechanism for tracking reagents -- individual
cell lines, antibodies, etc.  We should also keep track of chip designs 
in GEO/ArrayExpress.  The HapMap DCC uses XML to communicate
the metadata, and has gone through many updates of their formats 
(http://www.hapmap.org/downloads/xml_docs/).

There are more metadata examples on the HapMap DCC internal site, 
but it is down at the moment.  I can send the access info later.

The main difference between the HapMap and ENCODE DCCs is going 
to be the expansion in data types.  The output of the HapMap project
was primarily diploid genotypes, so this provided a fixed point that 
allowed many inputs (different genotyping platforms and
protocols, different populations and individual samples) and many 
outputs (analyses -- genotype/allele frequencies, phasing, LD,
etc.).  The ENCODE DCC will need to be quite a bit more flexible to 
handle all of the various data types.

Notes on Metadata Handling at the HapMap DCC

--Kate 15:59, 13 February 2007 (PST)

Metadata Format

All HapMap data exchanged between providers and the DCC is formatted as XML, using XML schema files to specify the semantics. An advantage of this approach, they claim, is that file format validity can be verified by the submitter before handing off.

Metadata item identifiers

Each item -- data or documentation -- that is tracked by the DCC is assigned an LSID (Life Sciences Identifier). This is a URL-like string (actually, a URN) that is intended to always link to the item, regardless of changes to web sites.

Here's an example:

urn:lsid:pdb.org:1AFT:1

This is the first version of the 1AFT protein in the Protein Data Bank.

There is supposedly some browser support, at least in development, for translating the URNs. There's an overview website at SourceForge (http://lsid.sourceforge.net/) that seems to have mostly broken links, but there are functional links to software: Perl and Java implementations and a Firefox extension (to map URNs to URLs?). IBM also has a long page on LSIDs (http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/). Net gossip is skeptical about how broadly LSIDs are used and how widely supported they are.

Some HapMap metadata types

  • Labgroup (Informatics contact, PI, Institution, etc.)
  • Data submission (Submitter, Comment...)
  • Protocol (type, submitter, short and long descriptions)
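
As a rough illustration of how the three types listed above might be modeled on our side, here is a sketch using Python dataclasses. The field names are taken from the list; the `version` field (based on Daryl's note about protocol versioning), the links from a submission to its lab and protocol, and all example values are assumptions, not the HapMap DCC's actual schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class LabGroup:
    informatics_contact: str
    pi: str
    institution: str


@dataclass
class Protocol:
    protocol_type: str
    submitter: str
    short_description: str
    long_description: str = ""
    version: int = 1  # assumption: protocols are versioned


@dataclass
class DataSubmission:
    submitter: str
    lab: LabGroup       # assumption: each submission points at its lab
    protocol: Protocol  # assumption: ...and at the protocol used
    comment: str = ""


# Made-up example values
lab = LabGroup("contact@example.org", "Example PI", "Example Institute")
protocol = Protocol("genotyping", "Example Submitter", "Short description")
submission = DataSubmission("Example Submitter", lab, protocol,
                            comment="first batch")

# asdict() flattens the nested records into plain dicts, which is a
# convenient starting point for writing XML or database rows.
record = asdict(submission)
```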

Search/Retrieval capabilities

I didn't see anything provided (Daryl?)

Notes on Metadata Usage at ENCODEdb/GEO

Laura Elnitski & Andy Baxevanis at NHGRI have developed a web portal that provides access to ENCODE data at UCSC and at GEO (raw microarray data). The GEO access pages allow selecting data based on:

  • Lab
  • Experiment name
  • Cell line
  • Binding site

The output of these searches can produce custom tracks for UCSC, or the data can be directed to Galaxy.

There is also a simple front-end to the GB and TB that allows you to select an ENCODE data track by group/lab/experiment.

The UCSC access pages only allow selecting by Lab and experiment.