Browser Agreement Action Plan

From genomewiki
Revision as of 18:41, 20 January 2010 by Hiram (talk | contribs) (adding category)

Comments to this page can be found in the discussion page.

Contributing authors

Paul Kitts, Avi Kimchi, Mike DiCuccio, Karen Clark, Mark Cavanaugh and Deanna Church

Background

In late June 2008 the three major genome browser and data distribution centers (Ensembl, NCBI and UCSC) agreed in principle to a common set of rules for displaying data (see ‘Browser_Genome_Release_Agreement.pdf’). All parties agreed that the agreement would take effect on July 1, 2008, but no action has yet been taken to move it forward. The outstanding actions are:

  • contacting high volume assembly submitters in a formal way
  • posting the document at all three web sites
  • defining a mechanism for distribution of data from the INSDC

Proposal

Contacting high volume assembly submitters in a formal way

We should send an email to high volume assembly submitters informing them of this agreement and providing recommendations for assembly submission. This letter should go to:

  • The Broad Institute (Chad Nusbaum?)
  • The Wellcome Trust Sanger Institute (Richard Durbin?)
  • The Genome Center at Washington University (LaDeana Hillier?)
  • Baylor College of Medicine (Kim Worley)
  • Steven Salzberg’s Group (Steven Salzberg)
  • Joint Genome Institute (Dan Rokhsar)
  • J. Craig Venter Institute (Saul Kravitz)

The letter should be signed by all centers and sent out by Sep 12, 2008.

Posting the document at all three web sites

All three centers should post a copy of the browser agreement on their web sites immediately.

Defining a mechanism for distribution of data from the INSDC

The largest implementation issue concerns the distribution of assembly data. Ideally, all members of the INSDC will produce the same set of files to be distributed to all annotating centers. We make a straw man proposal below and are actively seeking the input of other groups to ensure this structure will work for everyone. It would be useful if we could agree on this data exchange structure by Sep. 12, 2008.

Proposal:

There should be a single master directory for distribution of assemblies. For example: ftp/genbank/genome_assemblies/.

Within this directory, subdirectories will be organized in broad taxonomic groups:

   Fungi\
   Plants\
   Invertebrates\
   Mammals\

Within each high level taxonomic directory will be a series of directories, organized by organism:

   Mammals\Bos_taurus\
   Mammals\Mus_musculus\

Within each organism directory, one directory per assembly:

   Mammals\Bos_taurus\Btau_3.1\
   Mammals\Bos_taurus\Btau_3.2\

Within each assembly directory, assembly information:

   Mammals\Bos_taurus\Btau_3.1\
       initial_release\
           assembly_meta_data (see below)
           ordered\   - all data that is ordered and oriented on a chromosome
               chromosome-from-scaffold AGP
               scaffold-from-component AGP
               chromosome-from-component AGP
               chromosome fasta
               chromosome quality scores (if available)
               scaffold fasta
               scaffold quality scores (if available)
               component fasta
               component quality scores (if available)
           unordered\  - data that is assigned to a chromosome, but not ordered or oriented
               scaffold-from-component AGP
               scaffold fasta
               scaffold quality scores (if available)
               component fasta
               component quality scores (if available)
           unplaced\   - not assigned to a chromosome
               scaffold-from-component AGP
               scaffold fasta
               scaffold quality scores (if available)
               component fasta
               component quality scores (if available)
           alternate_alleles\ - files describing any alternate alleles for a given assembly.
               scaffold-from-component AGP
               scaffold fasta
               scaffold quality scores (if available)
               component fasta
               component quality scores (if available)
               File describing location of alternate with respect to the assembled chromosome.
               Alignment file.
           revision_XX\
               To be determined.

The above directory structure can be made into one tarball for easy downloading.
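
As a minimal sketch of the layout proposed above, the following Python builds the directory paths expected for one assembly. The root `ftp/genbank/genome_assemblies` is taken from the example above; the function name `assembly_paths`, the use of forward slashes throughout, and the decision to list only the four data subdirectories of `initial_release` are assumptions made for illustration, not part of the agreement.

```python
import posixpath

# Root taken from the example path in the proposal; the rest is illustrative.
ROOT = "ftp/genbank/genome_assemblies"

# Data subdirectories proposed under initial_release.
SUBDIRS = ("ordered", "unordered", "unplaced", "alternate_alleles")


def assembly_paths(group, organism, assembly, release="initial_release"):
    """Return the directory paths expected for one assembly release."""
    base = posixpath.join(ROOT, group, organism, assembly, release)
    return [posixpath.join(base, sub) for sub in SUBDIRS]


if __name__ == "__main__":
    for path in assembly_paths("Mammals", "Bos_taurus", "Btau_3.1"):
        print(path)
```

A consumer could use such a helper to check that a downloaded tarball contains every expected subdirectory before processing it.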

Outstanding questions

  • Quality scores: These are typically submitted at the level of the component (either GenBank HTG accession or WGS contig). Would it be acceptable to provide a program that could produce quality files for scaffolds/chromosomes based on the AGP? This would significantly reduce the space requirements for the site.
  • Updates: There are two kinds of updates: 1) small updates, usually only correcting or omitting a handful of scaffolds, and 2) complete assembly updates.
    • For 1) (small updates): Do you want a complete assembly dump, or just the updates?
    • For 1) (small updates): Is a subdirectory (or set of subdirectories) within the main assembly directory OK?
    • For 2) (complete updates): Should these live in the same directory? Do they get a different ‘assembly’ directory? How do you want to manage the handshake? Our assertion is that update/new is a user-defined option.
  • Alternate Alleles: Both human and mouse will have alternate alleles in their next updates. We are proposing a separate directory to handle this data. The set of files would largely be the same, with the exception of the data that puts the allele data in reference chromosome coordinates. We are proposing a tab-delimited file that would describe the locations with respect to the assembled chromosome coordinates. We would also make the alignment file available. Is this sufficient?
  • Handshake: What sort of notification would you like? Would you prefer email, a JIRA system, or would you just set up a job to monitor the FTP site?
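
To make the quality score question above concrete, here is a rough sketch of the kind of program it describes: it projects component-level quality scores onto scaffolds or chromosomes by walking an AGP file. It assumes the standard tab-delimited AGP columns, AGP lines listed in order within each object, and component scores already loaded into a dictionary. The function name `scaffold_quality` and the score of 0 assigned to gap positions are illustrative choices, not something the proposal specifies.

```python
def scaffold_quality(agp_lines, component_quals, gap_score=0):
    """Build per-base quality scores for each object named in an AGP.

    agp_lines: iterable of AGP text lines (tab-delimited).
    component_quals: dict mapping component id -> list of per-base scores.
    gap_score: score used for gap positions (an assumption of this sketch).
    """
    result = {}
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and AGP header comments
        fields = line.split("\t")
        obj, comp_type = fields[0], fields[4]
        scores = result.setdefault(obj, [])
        if comp_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            scores.extend([gap_score] * int(fields[5]))
            continue
        comp_id = fields[5]
        beg, end = int(fields[6]), int(fields[7])  # 1-based, inclusive
        piece = component_quals[comp_id][beg - 1:end]
        if fields[8] == "-":
            piece = piece[::-1]  # minus strand: reverse the score order
        scores.extend(piece)
    return result


if __name__ == "__main__":
    # Invented example data: two contigs joined by a 3-base gap.
    agp = [
        "scaf1\t1\t4\t1\tW\tctgA\t1\t4\t+",
        "scaf1\t5\t7\t2\tN\t3\tscaffold\tyes\tpaired-ends",
        "scaf1\t8\t10\t3\tW\tctgB\t2\t4\t-",
    ]
    quals = {"ctgA": [30, 31, 32, 33], "ctgB": [10, 20, 30, 40]}
    print(scaffold_quality(agp, quals)["scaf1"])
```

With a tool along these lines, only component-level quality files would need to be stored on the site, which is the space saving the question refers to.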

Assembly meta data: Below is an XML schema describing the proposed assembly meta data.

<xs:element name="GC-ProjectList">
  <xs:complexType>
    <xs:sequence>
<!-- Genbank / Refseq -->
      <xs:element name="GC-ProjectList_project-role" type="xs:string"/>
      <xs:element name="GC-ProjectList_project-id" type="xs:integer"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="GC-Assembly">
  <xs:complexType>
    <xs:sequence>
<!--
 The identifier of this assembly
 examples: GC internal id, Assembly-accession.version
-->
      <xs:element name="GC-Assembly_id">
        <xs:complexType>
          <xs:sequence minOccurs="0" maxOccurs="unbounded">
            <xs:element ref="Dbtag"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
<!--
project ID for this genome: 
 this is the ID for this assembly and may reflect the submitter/source
-->
      <xs:element name="GC-Assembly_project" minOccurs="0">
        <xs:complexType>
         <xs:sequence minOccurs="0" maxOccurs="unbounded">
            <xs:element ref="GC-ProjectList"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
<!-- Names of the assembly -->
      <xs:element name="GC-Assembly_name" type="xs:string" minOccurs="0"/>
      <xs:element name="GC-Assembly_submitter-name" type="xs:string" minOccurs="0"/>
<!--
 Various attributes assigned at this level:
 biosrc, comments, publications...
-->
      <xs:element name="GC-Assembly_descr" minOccurs="0">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="Seq-descr"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

</xs:schema>
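
The following sketch shows how a consumer might read an assembly meta data file shaped like the schema above. The instance document and all of its values are invented for illustration. It also assumes the elements carry no XML namespace, since the schema header is not shown, and it leaves out the `Dbtag` and `Seq-descr` content because those elements are referenced but not defined here. The function name `summarize_assembly` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; every value here is made up.
EXAMPLE = """\
<GC-Assembly>
  <GC-Assembly_id/>
  <GC-Assembly_project>
    <GC-ProjectList>
      <GC-ProjectList_project-role>Genbank</GC-ProjectList_project-role>
      <GC-ProjectList_project-id>12345</GC-ProjectList_project-id>
    </GC-ProjectList>
  </GC-Assembly_project>
  <GC-Assembly_name>Btau_3.1</GC-Assembly_name>
  <GC-Assembly_submitter-name>Example Center</GC-Assembly_submitter-name>
</GC-Assembly>
"""


def summarize_assembly(xml_text):
    """Extract the assembly name, submitter and project list from the XML."""
    root = ET.fromstring(xml_text)
    projects = [
        (p.findtext("GC-ProjectList_project-role"),
         int(p.findtext("GC-ProjectList_project-id")))
        for p in root.findall("GC-Assembly_project/GC-ProjectList")
    ]
    return {
        "name": root.findtext("GC-Assembly_name"),
        "submitter": root.findtext("GC-Assembly_submitter-name"),
        "projects": projects,
    }


if __name__ == "__main__":
    print(summarize_assembly(EXAMPLE))
```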