Track metadata handling
I have a topic prompted specifically by the ENCODE grant proposal, but it's one that I think could have broad applicability: how to store and use track 'metadata'. What capabilities do you think we can/should provide relating to metadata? Are there helpful examples at other bioinformatics sites (e.g. NIH DCCs) that you have seen?
For ENCODE, metadata typically includes which cell lines were used for an experiment, which antibodies for ChIP-chip, and sometimes the time course of an experiment (e.g. at 0, 8, and 24 hrs). ENCODE users may want to locate, for example, all datasets on HeLa cells. More generally at our site, we get mailing-list questions asking if we have XX type of experimental data on any organism/assembly. We currently keep metadata in trackDb settings and the track description, and have no explicit search mechanisms. Some capabilities we should consider:
- GB display of all track/subtracks satisfying a metadata search
- Download of all data tables for tracks/subtracks satisfying a metadata search
- Search could be query on predefined categories (dropdown menu of cell lines) and/or free-form (fuzzy search of track description and descriptive settings)
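The combined predefined-category plus free-form search could be sketched as below. This is a hypothetical illustration, assuming a simple one-row-per-track table; the table layout, column names, and example tracks are invented, not the actual trackDb schema, and `LIKE` stands in for a real fuzzy-text search.

```python
# Sketch: exact match on a predefined category (cell line) combined with
# free-text search of track descriptions. Schema and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE trackMeta (
    track TEXT, cellLine TEXT, description TEXT)""")
con.executemany("INSERT INTO trackMeta VALUES (?, ?, ?)", [
    ("encodeChipA", "HeLa", "ChIP-chip binding sites, 0/8/24 hr time course"),
    ("encodeChipB", "GM06990", "ChIP-chip on interphase cells"),
    ("encodeTxn",   "HeLa", "transcription levels by tiling array"),
])

def search_tracks(cell_line=None, free_text=None):
    """Return track names matching a dropdown category and/or free text."""
    clauses, args = [], []
    if cell_line:
        clauses.append("cellLine = ?")        # predefined category (dropdown)
        args.append(cell_line)
    if free_text:
        clauses.append("description LIKE ?")  # crude stand-in for fuzzy search
        args.append("%" + free_text + "%")
    where = " AND ".join(clauses) or "1"
    rows = con.execute(f"SELECT track FROM trackMeta WHERE {where}", args)
    return [r[0] for r in rows]

print(search_tracks(cell_line="HeLa"))        # both HeLa tracks
print(search_tracks(free_text="ChIP-chip"))   # both ChIP-chip tracks
```

The result list would feed either the browser display or the bulk-download feature described above.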
Design Discussion (Genecats meeting)
Provide storage of ENCODE (and other) track metadata and search features that use it. Provide a means for ENCODE data providers to submit metadata for multiple tracks in a standard file-based format.
Other projects here (GSID, Phenotype) will also have metadata storage and search aspects (with likely a larger range of metadata items than ENCODE).
XML is used for metadata exchange by the HapMap data center, with a defined schema so data providers can validate their files before submitting to the data center. The HapMap XML schemas are quite verbose.
Search on metadata (a 'search tracks' page) could generate a candidate list of tracks to select from for display in the browser (possibly in a 'user track group'). For ENCODE, it would be good to have a similar feature to search on metadata and produce a list of candidate tables and auxiliary files for data download.
XML seems like a reasonable choice for us to use for metadata exchange. Jim suggests restricting the total number of tags to 10, with a goal of limiting bloat to 3X the original data size. (More on this below.)
Metadata submitted to us in XML files would be loaded into a database table(s) to facilitate searching. Possibly use autoSql here.
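Loading submitted XML into a searchable table might look something like the sketch below. The element names and the (track, variable, value) table layout are assumptions for illustration; a real exchange format would be fixed by the agreed schema (and the table definition could come from autoSql).

```python
# Sketch: load a submitted metadata XML file into a database table.
# Element names (<track>, <cellLine>, ...) are invented for illustration.
import sqlite3
import xml.etree.ElementTree as ET

SUBMISSION = """
<metadata>
  <track name="encodeChipA">
    <cellLine>HeLa</cellLine>
    <antibody>Pol2</antibody>
  </track>
  <track name="encodeChipB">
    <cellLine>GM06990</cellLine>
    <antibody>TAF1</antibody>
  </track>
</metadata>
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metaDb (track TEXT, var TEXT, val TEXT)")

# One row per (track, variable, value) triple keeps the table flexible
# as new metadata types appear, at the cost of joins when searching.
for track in ET.fromstring(SUBMISSION):
    for field in track:
        con.execute("INSERT INTO metaDb VALUES (?, ?, ?)",
                    (track.get("name"), field.tag, field.text))

rows = con.execute(
    "SELECT track FROM metaDb WHERE var='cellLine' AND val='HeLa'").fetchall()
print(rows)
```

The tall-narrow layout makes the "find all HeLa datasets" query above trivial without schema changes per data type.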
We could stratify metadata types for flexible entry and search, e.g.:
- Level 1: System defined (required)
- Level 2: User defined
- Level 3: Undefined?
The defined metadata types could appear in menu pulldowns. The undefined/free-format stuff could be searched as we do for other text fields.
Jim's suggestions for the number of tags/fields that are reasonable at these levels:
- Level 1: 2-5 tags, 10-25 fields
- Level 2: 5-20 tags, 10-100 fields
- Level 3: open
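One way the per-level limits could be enforced at submission time is sketched below; the level table encodes Jim's suggested upper bounds, and everything else (function names, the tag-to-fields mapping) is a hypothetical illustration.

```python
# Sketch: stratified metadata levels with upper limits on tags/fields.
# Only the maxima are checked here; the suggested minima (e.g. 2 tags
# at level 1) could be validated the same way.
LEVEL_LIMITS = {          # level -> (max tags, max fields); None = open
    1: (5, 25),
    2: (20, 100),
    3: None,              # level 3: undefined/free-form text
}

def check_level(level, tags):
    """tags: dict mapping tag name -> list of field values."""
    limits = LEVEL_LIMITS[level]
    if limits is None:
        return True
    max_tags, max_fields = limits
    n_fields = sum(len(values) for values in tags.values())
    return len(tags) <= max_tags and n_fields <= max_fields

assert check_level(1, {"cellLine": ["HeLa"], "antibody": ["Pol2"]})
assert not check_level(1, {f"tag{i}": [] for i in range(6)})  # too many tags
```

Level 1 and 2 tag names would drive the menu pulldowns; level 3 content would fall through to free-text search.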
The HapMap DCC site has great metadata examples. See the Downloads|Documentation section here (the 'Bulk Data Download' link from the main page): http://www.hapmap.org/downloads/index.html.en The Protocols section (including versioning) maps directly to what we need.

We'll also need a mechanism for tracking reagents -- individual cell lines, antibodies, etc. We should also keep track of chip designs in GEO/ArrayExpress.

The HapMap DCC uses XML to communicate the metadata, and has gone through many updates of their formats (http://www.hapmap.org/downloads/xml_docs/). There are more metadata examples on the HapMap DCC internal site, but it is down at the moment. I can send the access info later.

The main difference between the HapMap and ENCODE DCCs is going to be the expansion in data types. The output of the HapMap project was primarily diploid genotypes, so this provided a fixed point that allowed many inputs (different genotyping platforms and protocols, different populations and individual samples) and many outputs (analyses -- genotype/allele frequencies, phasing, LD, etc.). The ENCODE DCC will need to be quite a bit more flexible to handle all of the various data types.
Notes on Metadata Handling at the HapMap DCC
--Kate 15:59, 13 February 2007 (PST)
All HapMap data exchanged between providers and the DCC is formatted as XML, using XML schema files to specify the structure. An advantage of this approach, they claim, is that file validity can be verified by the submitter before handing off.
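Submitter-side validation might look like the sketch below. Full XSD validation needs an external library (e.g. lxml); as a lightweight stand-in, this stdlib-only version checks well-formedness plus a few required elements. The element names and required set are invented for illustration.

```python
# Sketch: pre-submission validation of a metadata file -- well-formedness
# plus required-element checks, standing in for full schema validation.
import xml.etree.ElementTree as ET

REQUIRED = {"cellLine", "antibody"}   # hypothetical required metadata tags

def validate_submission(text):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    try:
        root = ET.fromstring(text)
    except ET.ParseError as e:
        return [f"not well-formed: {e}"]
    for track in root.findall("track"):
        present = {child.tag for child in track}
        for tag in sorted(REQUIRED - present):
            problems.append(f"track {track.get('name')!r} missing <{tag}>")
    return problems

good = ("<metadata><track name='t1'><cellLine>HeLa</cellLine>"
        "<antibody>Pol2</antibody></track></metadata>")
bad = "<metadata><track name='t2'><cellLine>HeLa</cellLine></track></metadata>"
print(validate_submission(good))   # []
print(validate_submission(bad))    # one missing-element problem
```

The point is the workflow, not this particular checker: whatever schema we publish, providers should be able to run the same validation we do before handing files off.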
Metadata item identifiers

Each item -- data or documentation -- that is tracked by the DCC is assigned an LSID (Life Sciences Identifier). This is a URL-like string (actually, a URN) that is intended to always link to the item, regardless of changes to web sites.
Here's an example (the one commonly given in the LSID documentation):

urn:lsid:pdb.org:1AFT:1

This is the first version of the 1AFT protein in the Protein Data Bank.
There is supposedly some browser support, at least in development, to translate the URNs. There's an overview website at SourceForge that seems to have mostly broken links (http://lsid.sourceforge.net/), but there are functional links to software: Perl and Java implementations and a Firefox extension (to map URNs to URLs?). IBM also has a long page on LSIDs: http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/ Net gossip is skeptical about how broadly LSIDs are used and how widely they are supported.
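If we adopted LSIDs, splitting one into its parts is straightforward; the sketch below follows the urn:lsid:authority:namespace:object[:revision] grammar from the LSID specification. The function name and the PubMed-style example identifier are illustrative assumptions.

```python
# Sketch: parse an LSID URN into its components, per the LSID grammar
# urn:lsid:authority:namespace:object[:revision].
def parse_lsid(urn):
    parts = urn.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {urn!r}")
    return {
        "authority": parts[2],                        # e.g. a DNS name
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) > 5 else None,
    }

# Illustrative PubMed-style LSID (hypothetical example identifier):
print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434"))
```

Resolution (mapping the URN to an actual URL) is the hard part, which is what the browser extensions mentioned above attempt.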
Some HapMap metadata types
- Labgroup (Informatics contact, PI, Institution, etc.)
- Data submission (Submitter, Comment...)
- Protocol (type, submitter, short and long descriptions)
I didn't see anything provided (Daryl?)
Notes on Metadata Usage at ENCODEdb/GEO
Laura Elnitski & Andy Baxevanis at NHGRI have developed a web portal that provides access to ENCODE data at UCSC and at GEO (raw microarray data). The GEO access pages allow selecting data based on:
- Experiment name
- Cell line
- Binding site
The output of these searches can produce custom tracks for UCSC, or the data can be directed to Galaxy.
There is also a simple front-end to the Genome Browser and Table Browser that allows you to select an ENCODE data track by group/lab/experiment. The UCSC access pages only allow selecting by lab and experiment.