ENCODE Data Wrangler HOWTO


Background

These are the ENCODE2 wrangling instructions from the ENCODE2 DCC Wiki (which is currently hosted by Stanford and is read-only). The original page, with history, is password-protected and available here:

http://oldencode2wiki.encodedcc.org/index.php?title=Data_Wrangler_HOWTO

Definitions

  • <organism> is 'human' or 'mouse'
  • <db> is 'hg19' or 'mm9' or 'hg18' or 'encodeTest'
  • <composite> is the object name of the composite track (wgEncodeLabDataType)
  • <user> is the wrangler's user ID
  • <submissionId> is the submission ID on the ENCODE submit page or in the production pipeline
  • PIPE is /hive/groups/encode/dcc/pipeline/encpipeline_prod
  • TRACKDB is /<user>/kent/src/hg/makeDb/trackDb
  • TRACKDBhg19 is /<user>/kent/src/hg/makeDb/trackDb/human/hg19
  • ALPHA is /<user>/kent/src/hg/makeDb/trackDb/human/<db>/metaDb/alpha
  • <download> is /hive/groups/encode/dcc/analysis/ftp/pipeline/<db> or /usr/local/apache/htdocs-hgdownload/goldenPath/<db>/encodeDCC
  • <notes> is /cluster/home/<user>/kent/src/hg/makeDb/doc/encodeDccHg19
  • <cgi-bin> is /usr/local/apache/cgi-bin-<user>


  • track is an ambiguous term that refers either to an individual line in the browser or to a composite track, which is a collection of such sub-tracks. Often people say "track" when they mean trackDb.ra
  • Composite track
  • Sub-track

Files of a "track object"

  • trackDb.ra
  • mdb.ra
  • files.txt
  • md5sum.txt

Web Page URLs

NEW Composite Track

In this phase the wrangler is contacted by a lab and told that there will be a set of experiments coming in. The wrangler will need to help produce the track description, create the DAF and an example DDF, facilitate the registering of any new controlled vocabulary, and possibly create new file formats.


Create a new redmine ticket

In the ENCODE project (or one of its sub-projects), use the New Issue link in Redmine:

  • Tracker should be Track
  • Subject will be the short label (shortLabel) of the track, followed by (PROPOSED) if this is a proposed track, (Release N) if this is release 2 or higher, or (assembly) in the rare multiple-assembly situation. Examples: HAIB TFBS (Release 2), UW DNase (mm9), HAIB shRNA (PROPOSED)
  • Description contains
  • Status is set to New by default
  • Priority should be set to medium as default
  • Assignee is yourself
  • Category should be left blank, we do not use it for tracks
  • Assemblies will be one of hg18, hg19, mm9
  • QA Start Date is not currently used
  • Date Last Released should be set to the last release or NEVER???
  • Release History should be set to appropriate choice
  • Parent Task should be set to the superTrack if there is one
  • Start Date should be autofilled to today's date
  • Due Date should be set by default to 6 months after the last release date; this will be adjusted by encodeAdmin
  • Estimated Time should be your best estimate. You can start with the estimator VENKATS TOOL GOES HERE
  • % Done The wranglers use 0-50% and QA uses 50-100%
  • Developer is yourself
  • Number of Exps should be filled in with the estimate from the lab.
  • Passed Pre-QA should default to no
  • Check boxes for the appropriate watchers
  • After creating the issue, link to related tracks (previous releases, weekly lab meetings)

Approval from Management

The labs sometimes want to submit more data than we have funding for. When there is a request for a new composite, we need to get approval and new terms from management. After creating a redmine ticket for the composite, ask the following questions.

TO: Kate

Will we be accepting this track?
Will the proposed short label (subject), long label (description), compositeName (proposed link) work for this track?  Please modify as needed.

Produce the composite track description

1) Get familiar with lab project

  • View scientific presentations from Consortium meetings
  • Review previous browser tracks for your lab
  • Review browser tracks from other labs that are similar
  • Review your contact information for this track

2) Create the composite track description

  • Acquire a basic track description from the lab. Provide them with a draft track description if there are previous similar tracks, or provide the Basic Outline of a Track Description.
  • If dealing with someone proficient in html, send them a template {would like link to template}
  • Transfer the documentation to html
  • Use the guidelines in Completing ENCODE track descriptions to complete the description.
  • Check the html file into the revision control system as TRACKDB/<organism>/<db>/wgEncode<lab><dataType>.html
  • Make sure that none of the links are going up two levels like this "../../ENCODE"

Develop the data agreement

We currently use 2 files to submit the data. The first is the DAF, or Data Agreement File, which describes the constants for the composite track. For example, the specifications of the lab, the grant, the experimental variables, the validation settings and which views are to be accepted are in the DAF. Currently these files are stored in the DAFs directory. [put directory here] However, since the DAFs require changing so much, often the best thing to do is to look in the submissions for the most recent one that passed. When these files were first created, the idea was that they would be static. However, they are now somewhat obsolete and hopefully will change into something simpler. The second file is the DDF, or Data Definition File. This is created by the lab and is a list of the files they are submitting and the metadata associated with those files. The wrangler should make an example DDF (a sketch appears after the list below). Currently, we also list the columns expected in the DDF in the DAF. However, we could probably replace that with a link to typeOfTerms.

  1. Create the Data Agreement File
  2. Create a sample Data Definition File
  3. Check in the files to the hg/encode/DAFs/2.0 directory in the source tree.
  4. Send a copy of each to the lab technical contact.
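
A minimal sketch of what a sample DDF might look like, for a hypothetical ChIP-Seq composite whose DAF declares cell and antibody as variables (columns are tab-separated and must match the variables and views declared in the DAF; the file names and values here are purely illustrative):

 files	view	cell	antibody	replicate
 gm12878CtcfRep1.fastq.gz	Fastq	GM12878	CTCF	1
 gm12878CtcfRep1.tagAlign.gz	Alignments	GM12878	CTCF	1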

New file formats

As part of developing a new track, you may need to create new file formats. In general, we push back on those. However, sometimes it is required. Once the format is agreed upon, you need to add it to the portal [NEED more instructions here], and you will need to create an autoSql file for it, which are kept here [where?]. Possibly we could use extraFields?

Attic Files

Attic Files are files that are supplemental to the composite track/data set.

1. Supplemental: Files supplemental to the composite level. Currently not supported by pipeline, manually loaded by wrangler, no metadata.

2. Auxiliary Supplemental: Files supplemental at the experiment level (Validation documents).

* DAF lines
 * type document
 * supplemental yes
* MetaData lines
 * attic auxSup

3. Auxiliary Valid: Files supplemental at the experiment level that are in a format that can be validated.

* DAF lines
 * auxiliary yes
* MetaData lines
 * attic auxValid

4. Auxiliary Experiments: Supplemental experiments or replicates

* DDF lines
 * display no
* MetaData lines
 * attic auxExp

Choose validateSettings

Because there are various validateFiles settings needed for different labs and datasets, and because validateFiles is run by doEncodeValidate.pl as part of the submission pipeline, I have added a new setting to the DAF. Here is an example:

 validationSettings allowReloads;skipAutoCreation;validateFiles.tagAlign:mmCheckOneInN=100,mismatches=3

Reading this setting:

  • when running doEncodeValidate.pl use "-allowReloads -skipAutoCreation"
  • when running validateFiles for tagAligns use "-mmCheckOneInN=100 -mismatches=3"

The doEncodeValidate.pl params (eg allowReloads) are specifically looked for and currently only these are supported: allowReloads, skipAutoCreation, skipValidateFiles, skipOutput.

All validateFiles params are supported and passed to validateFiles without examination. This means that changes to validateFiles to add new options will not require changing doEncodeValidate.pl to support them. Currently validateFiles is run for certain specific file types and could have differing parameters, so the type must be stated in the form "validateFiles.{fileType}" where types currently using validateFiles are: tagAlign, fastq, fasta, csfast, csqual, broadPeak

How to parse validationSettings:

* The major delimiter is ';' (eg "allowReloads;vali...").
* The pair delimiter is ':' (eg "tagAlign:mmChec...").
* The minor delimiter is ',' (eg "mmCheckOneInN=100,mismatches=3") .
* The fileType delimiter is '.' (eg "validateFiles.tagAlign"),
* The '=' is part of the parameter passed on (eg "mismatches=3").
* Also note that the leading " -" is left off (eg "mismatches=3" becomes " -mismatches=3").
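
As a concrete illustration, here is a minimal bash sketch (not part of the pipeline; it exists only to make the delimiters above concrete) that splits the first example setting:

 # Example validationSettings value from a DAF
 settings="allowReloads;skipAutoCreation;validateFiles.tagAlign:mmCheckOneInN=100,mismatches=3"
 IFS=';' read -ra parts <<< "$settings"              # major delimiter ';'
 for part in "${parts[@]}"; do
   if [[ "$part" == *:* ]]; then                     # pair delimiter ':'
     fileType=${part%%:*}
     fileType=${fileType#validateFiles.}             # fileType delimiter '.'
     IFS=',' read -ra params <<< "${part#*:}"        # minor delimiter ','
     echo "validateFiles ($fileType) gets:" "${params[@]/#/-}"
   else
     echo "doEncodeValidate.pl gets: -$part"         # leading '-' added back
   fi
 done

Running this prints "-allowReloads" and "-skipAutoCreation" for doEncodeValidate.pl, and "-mmCheckOneInN=100 -mismatches=3" for validateFiles on tagAligns.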

Certain validateFiles parameters are automatically passed in from doEncodeValidate.pl, so they do not need to be included in the DAF: specifically chromDb for checking chrom lengths and genome for sequence/alignment validation. But you may wish to avoid the validation triggered by these automatically included parameters. There are a couple of extra secret settings, not true parameters of validateFiles, that can override them. If you don't want chromLen validated, include ignoreChromLen in validationSettings. And if you want to avoid validating the alignment, include ignoreAlignment.

Another example (can you read it?):

  validationSettings allowReloads;validateFiles.tagAlign:mmCheckOneInN=100,mismatches=3,ignoreChromLen;validateFiles.fastq:mmCheckOneInN=20

The "validationSettings" line is intentionally messy! This is because we do not want the labs changing these parameters themselves. The messier it is, the more likely they will depend upon us to set it and send them the DAF.

And remember, when they submit and it fails because of these settings (or others), all we need to do is fix the DAF and then go to the submission pipeline web page to restart validation.


Register the controlled vocabulary

Instructions to Register Controlled Vocabulary


Create a composite trackDb.ra

  • Create a composite track in <organism>/<db>/wgEncode<composite>.ra and check in under revision control.

You will need the stanza identifying the track and sub-stanzas for each view you are expecting. If you already have data coming in, the first <subId>/out/trackDb.ra will have a basic outline for you. A sketch of a basic outline follows. Link to the famous readme.
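
A minimal sketch of such an outline, with a single view and placeholder labels/subgroups (real composites need settings specific to the data; copying an existing composite of the same dataType is a better model):

 track <composite>
 compositeTrack on
 shortLabel Lab DataType
 longLabel ENCODE Lab DataType (placeholder long label)
 group regulation
 subGroup1 view Views Signal=Signal
 subGroup2 cellType Cell_Line GM12878=GM12878
 dimensions dimensionX=cellType
 sortOrder cellType=+ view=+
 type bed 3

     track <composite>ViewSignal
     parent <composite>
     shortLabel Signal
     view Signal
     visibility full
     type bigWig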

  • Add the composite to <organism>/<db>/trackDb.wgEncode.ra
      # Lab DataType (wrangler: username)
      include <composite>.ra alpha

Example:

     # UNC/BSU Proteogenomics (wrangler: cline)
     include wgEncodeUncBsuProt.ra alpha

Add composite trackDb.ra to trackDb.wgEncode.ra

Create a mdb.ra file

     cd <organism>/<db>/metaDb/alpha
     edit <composite>.ra with
            metaObject <composite>
            objType composite
            composite <composite>
            expVars lab,dataType,cell,otherExpVar
     edit makefile to add <composite>.ra
     check in makefile and <composite>.ra

Example:

     edit wgEncodeCshlLongRnaSeq.ra with
            metaObject wgEncodeCshlLongRnaSeq
            objType composite
            composite wgEncodeCshlLongRnaSeq
            expVars lab,dataType,cell,localization,rnaExtract

Add mdb.ra file to makefile

NEW RELEASE

  • Create a .releaseN.ra file for the new release by copying the current release.
  • In the .releaseN.ra file, add a line to the composite stanza in the approximate format: "html <composite>.releaseN"
  • Copy the current html over into a new file with .releaseN.html as the extension.
  • Add in any pertinent release information in the release section of the .releaseN.html.
  • Add appropriate tags in trackDb.wgEncode.ra . Usually you must relegate the current release to "beta,public" and the new release to "alpha" only.
  • Check in trackDb.wgEncode.ra <composite>.ra metaDb/alpha/<composite>.ra
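
For example, after relegating release 1 and adding release 2, the include lines in trackDb.wgEncode.ra might look like this (names are placeholders):

 # Lab DataType (wrangler: username)
 include <composite>.release2.ra alpha
 include <composite>.ra beta,public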

SUBMIT

The goal is one experiment per submission with all of the relevant files. However, this is not the reality. There are often multiple experiments per submission or multiple submissions per experiment.

Prepare the lab

Introduction to the Pipeline

  • Loader sends email when data is ready
  • The submission is stored in PIPE/<submissionId>. There you will find:
    • Original tar ball
    • Untarred data
    • Submitted DDF
    • Submitted DAF
    • .report files with validator information
    • out directory
      • trackDb.ra which is a patch for the <composite> trackDb
      • mdb.txt which is a patch for the mdb
      • load.ra which is instructions to the loader
      • unload.ra which are instructions for unloading
  • The data of the submission is stored in the tables of <db> and in the downloads directory under the name of the composite. GBDB pointers are generated for bigWig and BAM files.

Debugging Tips

  • Wrangler (admin) can log in as another user by entering:
    <wrangler ID> as <lab user ID>


  • Possible errors:
  1. validate file settings
  2. CV errors - in DAF or DDF or in the CV
  3. Bigwig version
  4. Gender
  5. Bam errors
  6. Previously submitted


When a user hits a validation error, it's often easier for the wrangler to fix whatever is wrong (e.g. fix a syntax error in a file or add a missing antibody) and then manually validate and load the submission. However, if possible, continuing to use the submission pipeline will save headaches down the line. If a simple correction is made, the validator can be run using '-quick', then the submission can be revalidated and hopefully loaded through the pipeline:

To manually do a quick validate of a submission:

    cd  {submissionDir}
    mv validate_error validate_error.old
    doEncodeValidate.pl x -quick ../{submissionDir} > validate_error 2>&1 &

You can also try options '-skipAutoCreation', '-skipOutput' and/or '-allowReloads' if the tables have already been loaded and need to be reloaded.

To manually validate the syntax of a single file ("test.bed" in this example):

    doEncodeValidate.pl -validateFile -fileType=narrowPeak x test.bed

When you feel confident that the submission will pass validation, you can log into the submission website and restart the submission with the "validation" link. You may need to first use the "unload" link.

HOWEVER, when all else fails, it may be necessary to abandon the pipeline, and run the validator and loader manually.

Use similar syntax to manually load with doEncodeLoad.pl

    cd  {submissionDir}
    mv upload_error upload_error.old
    doEncodeLoad.pl x ../{submissionDir} > upload_error 2>&1 &

When the pipeline goes down

The pipeline should automatically restart if hgwdev is restarted. However, if the pipeline fails to restart you must manually restart it as user encodeteam. This is because whatever ID starts the pipeline is the only ID that can turn it off.

Steps to restart pipeline:

1) ssh encodeteam@hgwdev.gi.ucsc.edu

2) cd /hive/groups/encode/dcc/pipeline/kent-prod/src/hg/encode/hgEncodeSubmit

3) ./status

  This will show the status of the processes running for the various pipelines (prod, beta, development).
  Port 49000 indicates rails processes running for the production pipeline. If you don't see these processes running,
  you will need to kick off the pipeline again.
  /usr/bin/ruby /usr/bin/mongrel_rails start -p 49000 -d
  ruby ./pipeline_runner.rb prod
  /usr/bin/ruby /usr/bin/mongrel_rails start -p 49000 -d
  ruby ./pipeline_runner.rb prod

4) If only a subset of the above processes is running, you will need to stop the pipeline and restart:

a. ./stop
b. ./status

5) ./go: To start the pipeline

6) ./status: To check that all the processes from Step 3 are running

Fix submissions stuck at a given status

1) hgsql encpipeline_prod

2) select * from queued_jobs;

  If the submission ID is not in the queue, then a sub-process failed to update the status of the submission.

3) select * from project_status_logs where project_id=<subId>;

  Check the last status that was updated for the track by sub-processes. This can provide clues as to whether the track was loaded or failed on upload. Double-check against the submission
  directory to get a clear picture of the submission's status.

4) select id,status,run_stat from projects where id=<subId>;

  Use this to check which status the submission displays on the submissions pipeline, and verify that run_stat = 'waiting'. Change run_stat from 'waiting' to NULL in order to get back the submission controls:
  validate, unload, upload.
  
  update projects set run_stat=NULL where id=<subId>;
  If all submission controls are not present, you might need to change the status for the project as well, for example:
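  
  The exact status string depends on where the submission froze; check project_status_logs for the last good status first (the value below is illustrative):
  
  update projects set status='loaded' where id=<subId>;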

5) Continue loading the submission from the point where it froze.

Turning off the pipeline

There are 2 ways to turn off the pipeline.

1. Shut down the pipeline portal so that the UI is not visible.

a) ssh encodeteam@hgwdev.gi.ucsc.edu

b) cd /hive/groups/encode/dcc/pipeline/kent-prod/src/hg/encode/hgEncodeSubmit

c) ./status

  /usr/bin/ruby /usr/bin/mongrel_rails start -p 49000 -d
  ruby ./pipeline_runner.rb prod
  /usr/bin/ruby /usr/bin/mongrel_rails start -p 49000 -d
  ruby ./pipeline_runner.rb prod

d) ./stop: To stop the pipeline

e) ./status: Check that none of the processes in part c are running


2. Close the pipeline to submitters for a 'freeze' of data submissions, or so that wranglers can catch up. This is the most common case.

Go to the production pipeline git repository on hive, specifically the following directory: /hive/groups/encode/dcc/pipeline/kent-prod/src/hg/encode/hgEncodeSubmit/config/

There will be a file called: deadline.yml

    year: 2012
    month: 8
    day: 17
    hour: 0
    minute: 0

Edit the file to set the year, month, day, and time at which you would like the pipeline to close. Note: If there was a previous freeze, copy the deadline.yml file to a new file (deadline.<date>.yml) before editing.

Project Status

Don't forget to change the project status to "loaded" if you manually validate and load.

DISPLAY

(This should be its own page, something like submission structure.) Now that the data is loaded, there will be a submission with a unique submission ID listed on the ENCODE Submissions Website. There will also be a new directory named <submissionId> in the PIPE directory. This directory will include the submitted tarball and all of the files that were included in it. There will also be a subdirectory called <submissionId>/out. In this directory, the loader created:

  • trackDb.ra
  • fileDb.ra
  • mdb.txt
  • README.txt (outdated)
  • pushQ.sql (maybe outdated)

There will also be a downloads directory in /usr/local/apache/htdocs-hgdownload/goldenPath/<db>/encodeDCC/<composite>.


Add the sub tracks to the composite track

For each submission ID, edit the PIPE/<submissionId>/out/trackDb.ra file as follows:

  • In the subtrack lines, verify that the name of the composite track matches the one in trackDb.wgEncode.ra.
  • You will probably have to edit subgroup values to integrate the new tracks into the existing tracks. For example, if a subgroup label has been renamed in the composite track, such as to enforce a certain ordering, the label will need to be edited in the trackDb.ra of the new subtrack. Here the antibody H3K4me1 has been renamed so that it appears (in alphabetical order) before antibodies that it does not precede lexicographically, such as H3K27me3:
 subGroup3 factor Antibody CTCF=CTCF H3K04me1=H3K4me1 
  • Check each subgroup value against the existing subgroups in <organism>/<db>/<composite>.ra or <composite>.new.ra. If some subgroup value is not in the composite track, then add it, in alphabetical order:
 subGroup3 factor Factor FOXP2=FOXP2 KAP1=KAP1

2) Merge the new <submissionId>/out/trackDb.ra into the <organism>/<db>/<composite>.ra

  • Integrate trackDb.ra into <composite>.ra by running encodePatchTdb from TRACKDB/<organism>/<db>:
 encodePatchTdb  PIPE/<submissionId>/out/trackDb.ra  <composite>.ra
  • If you are replacing an existing track, then instead run:
 encodePatchTdb -mode=replace PIPE/<submissionId>/out/trackDb.ra   <composite>.ra

Load the meta data into the metaDb

  • Make sure you are working with a clean mdb.ra
   git pull
   cd TRACKDB
   make DBS=<db>
  • For each submission, update the metaDb in your local copy (metaDb_<user>)
 mdbUpdate <db>  PIPE/<submissionId>/out/mdb.txt
  • Pull the meta data from your sql table into a .ra file that will be stored in the metaDb/alpha
  cd ........./metaDb/alpha
  mdbPrint <db> -vars="composite=wgEncode<composite>" > metaDb/alpha/wgEncode<composite>.ra
  • Make sure that the mdb.ra file (<composite>.ra) is in the makefile
  vi makefile
  • Verify that this is working in your sandbox by viewing the meta data link (the down arrow)
  cd trackDb   
  make DBS=<db>

Wrangle the trackDb.ra

  • Long labels
  • Short labels
  • Colors (should be automated)
  • Default settings for each view
  • View ordering
  • Cell ordering
  • Cell labels
  • Change the default visibility from off to on for TIER 1 cells.
  • Review the Matrix
  • fileSortOrder
  • sortOrder

Review and Checkin

  • Review your own sandbox. Make the <db> from TRACKDB:
 make DBS=<db>
  • Make on alpha and review that your tracks look good outside of your sandbox.
 make alpha DBS=<db>
  • Check in your changes to <composite>.ra
  git pull
  git add <composite>.ra  metaDb/alpha/<composite>.ra
  git commit -m "Adding more data to the composite"

Create experiment IDs

First you need to find new experiments and review the list for consistency with your expectations.

  encodeExp find <db> -composite=<composite> outfile
  more outfile

Then you should have another wrangler review the list. It is hard to back out of the next step.

   mail <otherwrangler> -s"Please review these experiments" < outfile

After everyone is convinced that this is a good list, add these experiments to the accession table.

 encodeExp add outfile

Then you will want to add the expIds to the metaDb. Make sure your metaDb table is up to date with anything that you are working on before doing this.

  mdbUpdate <db> -composite=<composite> -encodeExp
  cd <db>/metaDb/alpha
  mdbPrint <db> -composite=<composite> > <composite>.ra

Update the project status

  • For each submission, change the status of the data from "Loaded" to "Displayed" with encodeStatus.pl
  encodeStatus.pl <subID> displayed

APPROVE

Inventory

  • mdbPrint <db> -composite=<composite> -experimentify
  • metaCheck <db> <composite>
  • Something I came up with while trying to do inventory:
 for j in `cat views`; 
   do for i in `cat list`; 
     do echo -n "$i:"; 
     if mdbPrint hg19 -composite=<composite> -vars="expId=$i view=$j" 1>/dev/null 2>/dev/null; 
       then echo yes; 
       else echo no; 
     fi;
   done | grep no | cut -f1 -d ":" > missing$j; 
 done
    • views is a list of views for a particular track, list is a list of expIds
    • writes a list of expIds that are missing a particular view in a file called missing[View]
    • This is really slow though, and should really be coded in python using ra.py

Ready for QA Checklist

Approval Call

  • Review inventory
  • Check default settings
  • Clarify track descriptions

Give Accession Numbers

  mdbUpdate <db> -composite=<composite> -accession
  mdbPrint <db> -composite=<composite> > <composite>.ra

Make the releaseN dir

  • Go to the downloads directory
  • Make the releaseN directory
 mkdir releaseN  

  • Make the beta and releaseLatest links
  rm beta releaseLatest
  ln -s releaseN beta
  ln -s releaseN releaseLatest
  • Create a hard link of each wgEncode file in the releaseN subdirectory. If you are only doing a subset of the data, you will need to be creative here.
  cd releaseN
  bash
  for i in ../wg*
  do
  ln $i .
  done
  exit
  • Copy the README.txt to this directory from the parent directory
  cp ../README.txt .
  • If there is a supplemental directory
   mkdir supplemental
   cd supplemental
   bash
   for i in ../../supplemental/*
   do
   ln $i .
   done
   exit
   cd ..

NOTE: the top-level directory (e.g. above releaseN) has directory entries (hard links) for all data files in releaseN directories, plus any unreleased data files. The metadata and description files (files.txt, README.txt, md5sum.txt, etc.) should be copies of those in the currently released directory. With the ENCODE submission pipeline closed, these should always be those in 'releaseLatest' (which will be the highest numbered release directory).

Create files.txt

  • Run the script from the downloads directory for your composite. encodeMkFilesList generates files.txt using your current metaDb (metaDb_<user>) and calculates md5sums for each file in the current directory. The md5sums are put in the metaDb if they are not already there. Errors are generated for non-matching md5sums.
 cd <download>/<composite>
 encodeMkFilesList  <db> -md5 &

If you have a large composite, this will take a long time the first time. However, once you have an md5sum.history file in the directory it will go much faster and will only md5sum files that have been touched.

  • Copy the files.txt and md5sum.txt from the main directory <download>/<compositeName> into the releaseN directory:
 cp files.txt md5sum.txt releaseN/
  • If encodeMkFilesList tells you that it updated the .ra file, you will need to check that file in.
  cd <db>/metaDb/alpha
  git add <composite>.ra
  git commit -m "Added md5sums"
  git pull
  git push

Update project status

  • For each subId in the current release run
  cd <db>/metaDb/alpha/
  grep subId wgEncodeLicrRnaSeq.ra | cut -f 2 -d " " | sort -u
 /hive/groups/encode/dcc/pipeline/bin/encodeStatus.pl <subid> Approved
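
The grep and encodeStatus.pl steps above can be combined into one loop; a sketch, assuming the grep pattern matches only subId lines in <composite>.ra:

 for s in $(grep subId <composite>.ra | cut -f 2 -d " " | sort -u); do
     /hive/groups/encode/dcc/pipeline/bin/encodeStatus.pl $s Approved
 done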

Run encodeMkChangeNotes

Use encodeMkChangeNotes to generate a list of all tables and files which are being added in a new release or dropped since the last release. Place that file list in the appropriate dir under <notes> and check it in.

 /cluster/bin/scripts/encodeMkChangeNotes hg19 <composite> 1  (if no previous release)
 /cluster/bin/scripts/encodeMkChangeNotes hg19 <composite> <N>  <N-1> > ~/kent/src/hg/makeDb/doc/encodeDcc<db>/wgEncode<composite>.releaseN.notes
  git add ~/kent/src/hg/makeDb/doc/encodeDcc<db>/wgEncode<composite>.releaseN.notes
  git commit -m "Creating the notes file"

Run encodeQaInit --test

To get a preview of the issues that QA will see, run:

  encodeQaInit <db> <composite> <release #>  <redmine ticket number> -t

MAKE SURE TO RUN IN TEST MODE. Otherwise it writes into the QA directory.

Update Redmine

  • Change the redmine status to approved
  • Change the percent done to 50% (the other 50% is for QA)
  • Ask a question to Jacob to tell him it is ready for preQA

REVIEW

The QA team gets a chance to look over your track and ask questions. Since QA is our current bottleneck, we give this the highest priority.

QA claims your track

  • Someone in QA will claim your redmine ticket away from you.
  • QA will set the subIds to reviewing.

Update the project status

  • QA scripts will automatically update the status for each subid to reviewing

Questions will come through Redmine

  • QA will send questions to you through redmine with issues that they have found.
  • The wrangler's highest priority is to answer those questions.
  • Our convention is to copy the questions over into a new note, strike out finished questions, and list what you did to fix them.

Redo files.txt and notes.txt

  • If any of the questions involve changing the metadata or files list, you will need to regenerate files.txt with encodeMkFilesList <db> -md5
  • If any of the questions involve changes in which files are being pushed, the notes file will need to be regenerated.

RELEASE

After the Q/A process is complete and the track is viewable on the public server:

Update project status

  • For each subId in the current release run
 /hive/groups/encode/dcc/pipeline/bin/encodeStatus.pl <subid> Released
  • Possibly QA will start doing this automatically

Notify

  • Kate is periodically notifying the ENCODE group about data.
  • The wranglers update the monthly report

GEO Submissions

Once the data is released to the public, it needs to be submitted to GEO.

This is the old page: Data Wrangler GEO Submissions. Anything that is salvageable from that old page should be put here in this section.

Add to the redmine ticket

We have two ongoing tickets for the GEO Submissions.

Information about the composite

Submit

Get accession numbers

CORRECTIONS

There are two categories: data that is unreleased and data that is released.

Unreleased Data

For unreleased data, we usually do not create versions. We just replace the data, leaving the same file name. Some exceptions to this have been the Gencode versions, the rnaElements files, and CaltechRnaSeq, where bad data was found in the last part of the push to public. How to replace the data is determined by whether you are replacing or revoking, and whether it is one file or an entire submission.

Replace an entire submission

  • Use allowReloads to get new data and replace old
  • Make sure to update the mdb.txt with dateResubmitted
  • Set old submission to "revoked"

Replace an individual file

  • Use allowReloads with the new file in a new submission
  • Leave the old submission with its status
  • Make sure to update the mdb.txt with dateResubmitted

Revoke an entire submission

  • Unload
  • Set old submission to "revoked"
  • Delete the metaData

Revoke an individual object

  • Remove from trackDb
  • Remove from mdb.ra
  • Remove from gbdb
  • Remove the table
  • Remove file from the downloads directory

MetaData Fix

  • Change the metaDb
  • Change the subgroups in the trackDb
  • Change the labels in the trackDb
  • Use encodeRenameObject for the file/table names and references everywhere.

Released Data

This is data that is already on the RR and something has been found wrong with it. There are a few different cases here. There is "Versioned" data, where there is nothing wrong with the old data, but there is better new data. There is "Revoked" data, where there is actually something wrong with the old data. There is "Renamed" data, where there is nothing wrong with the data, we are just calling it the wrong thing (HeLa versus HUVEC). And there are metaData-only changes, where the update is in a non-expVar (experiment defining) metaData variable.

You should first assess the severity of the issue. If data is clearly wrong, it needs to be replaced. If metadata (and filenames/tablenames incorporating metadata) is clearly wrong, it should be changed. If data is incomplete or metadata is unclear, then just adding explanatory text to the track description or additional detail to the labels should suffice.

Versions

  • Old version table reference is removed from the trackDb (thus hgTrackUi and tableBrowser) by the wrangler
  • Old version metaData object is updated with objStatus=replaced
  • Old version table is left (alpha, beta, public)
  • Old version gbdb is left (alpha, beta, public)
  • Old version download file left on public
  • Old version download file should stay on hgdownloads-test in the main directory and be hard-linked in each successive releaseDir. [ This will make the main alpha dir match the expected RR dir (if/when data gets released). And release dirs to continue to be snapshots of the RR dir at the time of that release.]
  • The release notes will say something about new versions being available.
  • A new release is created. New version is added everywhere (trackDb, metaDb,gbdb,tables,downloads) just like a new experiment or file would be.
  • New version object name has V2 or VN appended to the name.
  • New version metaData object has a "version reason" attached to it
  • New version downloads file is put in the main downloads dir on hgdownloads-test and is hard-linked in releaseN sub-directory just like a normal submission.

The results of this treatment are:

  • Old version does not display in the browser by virtue of trackDb.ra entries being removed.
  • Old version does not show in hgFileUi by virtue of the "objStatus" metadata.
  • Old version will not come up in trackSearch or fileSearch again by virtue of the "objStatus".
  • If old version is referenced in some other track (like an integrated track), it will still be there, but its metaData will say "replaced".
  • If someone really is looking for an old file on the ftp site, it is there with its metaData in files.txt, but if they are not looking, it will be hidden by hgFileUi or index.html (older composites). The metaData will say objStatus=replaced.
  • Old version will be listed in files.txt with the metaDb term objStatus=replaced.
  • Oldest versions (data that was revoked or replaced before this policy) will be listed with no metaData at all. These will also be filtered by hgFileUi and trackSearch. They will show with no metaData in files.txt. Our README now says that data with no metaData is obsolete.

Revoked

This is data that is 'bad.' Until further discussion the policy is to treat this like Version with the following differences:

  • Old version metaData object is updated with objStatus=revoked
  • To push this to public you would just push the mdb.ra, trackDb.ra, files.txt, index.html if still around. There would be no new files attached.

TO IMPLEMENT: objStatus reason field. Do we want this as a field or as part of objStatus?

Removed

There are exceptional cases, for example corrupted fastq files, where the size of the data and the uselessness of the information inside make it prudent to actually remove data. This may make more sense with older data where we are missing lots of pieces and have no idea why it was revoked. If a case for removal comes up, we remove it entirely from:

  • gbdb (alpha, beta, public)
  • table (alpha, beta, public)
  • file (alpha, beta, public)
  • trackDb (browser and table browser)
  • metaDb (hgfileUi)
  • files.txt

Renamed

A renamed object comes from the fact that the metaData is recorded in the objectName (file, table, gbdb pointer, metaObject). If there is a change to the metaData in one of the main expVars then the file gets renamed. EXAMPLE: wgEncodeCshlLongRnaSeqHuvecCellPamAln.bam -> wgEncodeCshlLongRnaSeqHeLaCellPamAln.bam. This should be treated like a replaced situation. The new data coming in or swapping should have a V2 appended even if it is the first true submission for that experiment.

TO IMPLEMENT: The renamer breaks this policy currently.

POLICY QUESTION: Should we correct the metaData on the replaced version? Example: We have the object below and we discover that it is really HeLa. We are leaving this object in place and adding the terms objStatus=replaced and objStatusReason="This is not HUVEC." Do we also change cell=HUVEC to cell=HeLa? Do we change the expId? I think our policy is no, but I am trying to clarify.

metaObject wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd2Rep2
objType file
bioRep 010WC
cell HUVEC
composite wgEncodeCshlLongRnaSeq 
dataType RnaSeq
dataVersion ENCODE Jan 2011 Freeze
dateSubmitted 2010-12-23
dateUnrestricted 2011-09-23
dccAccession wgEncodeEH000188
dccRep 2
expId 188
fileName wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd2Rep2.fastq.gz
geoSampleAccession GSM767856
grant Gingeras
lab CSHL
labExpId LID8789
labProtocolId 010WC-
localization cell
md5sum 594c90e69113f6c486f9efab32e32188
origAssembly hg19
project wgEncode
readType 2x76D
replicate 2
rnaExtract longNonPolyA
seqPlatform Illumina_GA2x
subId 3135
view FastqRd2

MetaData Only Change

If the variable is not an expVar and thus is not in the file name, this is only a push of files.txt and mdb.ra. However, you will need to make sure that there is not a current release in development that would make the files.txt and mdb.ra have other changes than the one you are currently working with.


A Whole Composite

  • The metaData is updated with the objStatus.
  • A new README is created explaining what is going on.
  • A new files.txt is created.
  • The track is removed from public and beta in trackDb.
  • The metaData is removed from beta/makefile and public/makefile

Data in GEO

POLICY QUESTION: If the data that we need to replace is in GEO, I think the policy is: For replaced data, just add newest Version. For revoked data, I think that GEO wants it pulled.

Secret Wrangler Tricks

Clear the Cache for hgFileUi

Often you are not seeing all of the files in hgFileUi that you are expecting. When that is the case, you need to clear the cache by adding &clearCache=yes to the end of your URL.
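
For example, the URL might look something like this (substituting your own database and composite):

 http://genome.ucsc.edu/cgi-bin/hgFileUi?db=<db>&g=<composite>&clearCache=yes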


Out-Of-Memory Errors in hgTracks

Sometimes the data being loaded by hgTracks is so large that it crashes the program. The solution is to find the offending view with giant data files and limit its max window with maxWindowToDraw in trackDb.
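
For instance, adding a line like the following to the offending view's stanza stops hgTracks from drawing the data in windows larger than 10 Mbp (the cutoff here is only an illustration; pick a window size the data can handle):

 maxWindowToDraw 10000000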


Control/Input Data

Q) How do labs upload control/input data?

There are at least two methods:

(A) for ChIP-Seq:

Create a DAF with appropriate views (usually fewer than normal DAF) and have the lab upload data with the antibody (or other relevant variable) equal to "control"; e.g.:

grant   Myers
lab     HudsonAlpha
dataType        ChipSeq
variables       cell, antibody
assembly        hg18
dafVersion      0.2.2

Track/view definition

view             Alignments
longLabelPrefix  HudsonAlpha ChIP-Seq Sites
type             tagAlign
hasReplicates    no
required         yes

(B) for DNase, FAIRE:

Create a DAF with dataType == 'Control'. Create a dummy composite track; e.g. wgEncodeLabControl (we may want to fix the requirement to have a dummy composite track). Then create a DAF similar to the DNase/FAIRE DAF. This will result in tracks like "wgEncodeLabControlCellLine".