Old ENCODE QA

From Genecats

WARNING: This is the old ENCODE QA page that we used before the scripts that resulted from QA bootcamp were written. We are keeping it here in case those scripts break and we need to do things the old-fashioned way.

  • See the current ENCODE QA page for the current protocol.

Getting Started

Claim track (pushQ & redmine)

  1. Select the top-most track from the encodePushQ, which is a sub-pushQ accessed from pushQ Gateway.
  2. Claim the track's Redmine Issue & change status from "Approved" to "Reviewing"
    • Add yourself as a watcher to the redmine ticket (so if you assign it back to the developer you will get updates)
    • Make a question in the redmine ticket for Kate:
      1. let her know you claimed the track
      2. ask her to make a determination about the composite and subtrack labels (may not be necessary for subsequent releases)
    • Make a question for the wrangler asking them to update the ENCODE status to "Reviewing"
    • Use the % Done on Redmine Issue to estimate your QA progress. Kate uses this to check status.

Review PushQ Entry

  • Track: Composite Short Label
    • Finding the composite short label: in hgTrackUi, on the left-hand side of the top light blue bar you'll see "<composite shortLabel> Track Settings"
  • Release Log: Composite Long Label (if not a first release, should also have (Release N) after)
    • Finding the composite long label: in hgTrackUi, the track title next to the logo is the composite longLabel
  • Release Log URL: ../../cgi-bin/hgTrackUi?db=<db>&g=<composite>
  • Databases: hg19, hg18 or mm9
  • Open issues: a link to the redmine ticket (OK to have additional things)
  • Notes: path to the notes file (OK to have additional things)
  • Notes file exists (pre-QA skip)

Review the "notes" file

  • Check out the developer's #Notes file to get a feel for what the track consists of.
    • the path to the notes file will be in the "Notes" section of the pushQ entry
    • trust the notes file over the pushQ entry table/files information
  • If this is a subsequent release, see #Subsequent Release of Data (e.g. Release 2) first.
  • Compare the notes file to the hgTrackUi (to make sure it reflects the notes file).
    • If a Release N, compare the hgTrackUi on dev to the previous release's hgTrackUi on the RR to help verify notes file & new hgTrackUi is correct (e.g. make sure things aren't missing from the new release in comparison to the previous release that aren't accounted for in the notes file).

Create table list

Pre-QA

Some tracks may have already gone through some preliminary QA, see Pre-QA for more information.

Run qaEncodeTracks.csh & check output

The script does the following:

  • countPerChrom
  • check for entry in tableDescriptions table
  • check that shortLabel does not exceed 17 characters
  • check that longLabel does not exceed 80 characters
  • check that there are no underscores in the table names
  • check for indices on the tables
  • check that positional tables are sorted
  • checkTableCoords (checks for any illegal coordinates)

Also, run genePredCheck/pslCheck if applicable (i.e. if your track is a gene prediction track).
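
If qaEncodeTracks.csh is unavailable, three of the checks listed above (the label lengths and the underscore rule) are simple enough to reproduce by hand. The following is a minimal sketch, not the script's actual implementation; the function name and the way labels are passed in are assumptions made for illustration.

```python
MAX_SHORT_LABEL = 17  # limit stated in the checklist above
MAX_LONG_LABEL = 80   # limit stated in the checklist above


def check_labels_and_table(short_label, long_label, table_name):
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    if len(short_label) > MAX_SHORT_LABEL:
        problems.append(
            f"shortLabel is {len(short_label)} characters (limit {MAX_SHORT_LABEL})"
        )
    if len(long_label) > MAX_LONG_LABEL:
        problems.append(
            f"longLabel is {len(long_label)} characters (limit {MAX_LONG_LABEL})"
        )
    if "_" in table_name:
        problems.append(f"table name contains an underscore: {table_name}")
    return problems
```

The labels and table name would be copied from trackDb or hgTrackUi; the sketch does not query the database itself.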

Staging on hgwbeta

Push /gbdb files

Push new and, if applicable, updated /gbdb files (e.g. .wib, .bb, etc.) from hgwdev -> hgnfs1.

Open track on beta (if subsequent release)

Open the track on hgwbeta before staging it.
This way, when you check the track on beta (in the last staging step) you'll be able to tell if the update will cause a cart clash for users who happen to be using it when you release it to the RR (as evidenced by a completely blank screen).

Push tables to hgwbeta

Use bigPush.csh using the table list you created above.

Prepare trackDb (release tags and metaDb)

  1. release tags: see the Three State TrackDb page for more info on release tags and our three-state trackDb
    • In /cluster/home/$usr/trackDb/$species/$db/trackDb.wgEncode.ra, find the include statement for your track's .ra file and change 'alpha' tag to an 'alpha,beta' tag and, if applicable (releaseN), change 'beta,public' to 'public' and then check in these changes.
      If a new track is in a super-track, make sure there are release tags! (See explanation.)
  2. metaDb: starting from /cluster/home/$usr/trackDb/$species/$db/metaDb
    1. copy metaDb .ra file from ~metaDb/alpha -> metaDb/beta
    2. add .ra file name to the makefile in ~metaDb/beta
  3. commit changes
  4. On hgwbeta, make beta DBS=__ from /cluster/home/$usr/trackDb/
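
The release-tag change in step 1 amounts to rewriting the tag field on one include line. The sketch below illustrates that edit under the assumption that an include line has the form `include <file>.ra <tags>`; the function name and the example file name are hypothetical, and the real edit is still made by hand in trackDb.wgEncode.ra.

```python
def retag_include_line(line, ra_file, old_tags, new_tags):
    """Rewrite the release tags on the include line for ra_file.

    Lines that do not match (other files, other tags) are returned unchanged.
    Expects a line without its trailing newline.
    """
    parts = line.split()
    if (
        len(parts) == 3
        and parts[0] == "include"
        and parts[1] == ra_file
        and parts[2] == old_tags
    ):
        return f"include {ra_file} {new_tags}"
    return line


# Hypothetical first-release example: 'alpha' becomes 'alpha,beta'.
staged = retag_include_line(
    "include wgEncodeExample.ra alpha", "wgEncodeExample.ra", "alpha", "alpha,beta"
)
```

For a releaseN track the same helper could be called a second time to turn 'beta,public' into 'public'.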

Check track on beta

Check that the track looks good on beta.
If this is a subsequent release, you already had the track open on beta from #Open track on beta (if subsequent release). Refresh the page to see the changes.
If you get a blank screen:

  1. Don't reset your cart (at least not until you've completed these steps!)
  2. Notify the track wrangler that there is likely a problem with conflicting cart variables when the new data is used with an old cart.
  3. Dump the cart variables (manipulate the URL to: http://hgwbeta.cse.ucsc.edu/cgi-bin/cartDump then hit enter) and save them in a file for people to look at.

hgTrackUi

Functionality (track controls)

Display Modes

  • If in a super-track, by default, composite overall display mode should be set to dense. Super-track should be set to hide.
  • If not in a super-track, by default composite overall display mode should be set to hide.
  • If multiple views, Kate wants these settings by default:
    • Peaks -> pack
    • Alignments -> hide
    • All else -> full
  • changing display mode of views should affect the subtrack list & hgTracks
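
The default display rules above can be written down as a small lookup, which may help when checking many views. This is only a restatement of the list as code; the function names are made up for illustration, and the view names are assumed to match the labels "Peaks" and "Alignments" exactly.

```python
def expected_composite_default(in_super_track):
    """Expected default display mode of the composite."""
    return "dense" if in_super_track else "hide"


def expected_view_default(view_name):
    """Expected default display mode of a view in a multi-view track."""
    defaults = {"Peaks": "pack", "Alignments": "hide"}
    return defaults.get(view_name, "full")  # all else -> full
```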

Config Settings of Views

  • settings function correctly
  • settings of different views are independent
  • Signals, by default, should have the following settings (unless lab has requested otherwise or other good reason):
    • Data view scaling: use vertical viewing range (rather than auto-scale)
      • (Pre-QA skip) in dense, default fixed range should result in meaningful banding at full chromosome (not all gray)
    • Windowing function: mean + whiskers

Matrix

  • By default, matrix boxes should be fully checked or fully unchecked (not grayed), if not, this is trackDb setting issue that the wrangler should fix.
  • Matrix headers:
    • For human, Tier 1 and Tier 2 cell lines:
      • should be listed first (Tier 1 in alphabetical order followed by Tier 2 in alphabetical order)
      • should be labeled as Tier 1 or 2. The tier should follow the cell line name, in parentheses and bolded. No hyperlink, no italics, e.g. cellLineA (Tier 1)
      • matrix headers are links to a working hgEncodeVocab page for the item (cell line, factor, etc)
    • +/- buttons function correctly
    • selections in matrix result in appropriate selection changes in subtrack list
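
The header ordering rule (Tier 1 alphabetically, then Tier 2 alphabetically) can be sketched as below to produce the order you expect to see in the matrix. Assumptions not stated on this page: sorting is case-insensitive, untiered cell lines keep their incoming order, and the tier assignments are passed in as a plain dictionary. Bolding of the tier label is not modeled.

```python
def expected_header_order(cell_lines, tiers):
    """Order cell lines as Tier 1 (alphabetical), Tier 2 (alphabetical), then the rest.

    tiers maps a cell line name to 1 or 2; names missing from it have no tier.
    """
    tier1 = sorted((c for c in cell_lines if tiers.get(c) == 1), key=str.lower)
    tier2 = sorted((c for c in cell_lines if tiers.get(c) == 2), key=str.lower)
    rest = [c for c in cell_lines if tiers.get(c) not in (1, 2)]
    return tier1 + tier2 + rest


def header_label(cell_line, tiers):
    """Label text in the form 'cellLineA (Tier 1)'; untiered names are unchanged."""
    tier = tiers.get(cell_line)
    return f"{cell_line} (Tier {tier})" if tier in (1, 2) else cell_line
```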

Subtrack list

  • adjusts according to matrix & view (hide -> non-hide) selections
  • 'only selected/visible' and 'all' radio selections function
  • sorting functions (clicking on column headings)
  • schema links work and has an "info" column

MetaData

  • make sure metaData is present by clicking on the down arrow (v)
  • check a few to make sure they have somewhat consistent fields
  • spot check a few fields to make sure they make sense
  • check on hgwdev for dccAccession number (aka UCSC Accession), if none, may not be ready for QA; ask wrangler.
    • expId & dccAccession should correspond, dccAccession = wgEncodeE<H or M><expId> (the E=experiment, H=human, M=mouse)
    • these #numbers should be the same among subtracks of the same "experiment," even across assemblies of an organism (e.g. same number on hg18 & hg19)
    • (Pre-QA skip) Check that expId is hidden in "..." on hgwbeta. If it isn't, there is an issue.
      • NOTE: expID should only be displayed on hgwdev.
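
The expId/dccAccession correspondence described above can be checked mechanically. The sketch below assumes the accession is the prefix wgEncodeE, then H or M, then the expId as digits; because this page does not say whether the number is zero-padded, the comparison is numeric and tolerates leading zeros. The function name is hypothetical.

```python
import re

ORGANISM_LETTER = {"human": "H", "mouse": "M"}


def accession_matches_exp_id(dcc_accession, exp_id, organism):
    """True if dcc_accession is wgEncodeE<H or M><expId> for this organism."""
    pattern = rf"wgEncodeE{ORGANISM_LETTER[organism]}(\d+)"
    match = re.fullmatch(pattern, dcc_accession)
    # Compare as integers so that any zero-padding in the accession is ignored.
    return match is not None and int(match.group(1)) == int(exp_id)
```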

Links

  • check that all links work, and (Pre-QA skip) where applicable, are relative

Content (.html description page)

Labels

  • Check labels adhere to Kate's instructions
    • Other resources: Style Guide and the Label spreadsheets on the soe google docs.

Sections

  • Make sure all sections are present, in order, and have the correct headings (the list below has the correct headings and is in the correct order)
  • Check grammar, spelling, readability, completeness, correctness
  • style should be consistent with the rest of the site
    • Description should be in a passive 3rd person voice
    • references to "data" are plural
    • value and units have space between them (e.g. 50 bp rather than 50bp)
  • links should be hyper linked text rather than just plain URLs
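
The value/unit spacing rule is easy to miss by eye. A minimal sketch for flagging candidates in description text follows; the unit list (bp, kb, Mb) is an assumption and would need extending for other units, and every hit should still be reviewed manually.

```python
import re

# Digits immediately followed by a unit, e.g. "50bp"; "50 bp" is not matched.
UNIT_PATTERN = re.compile(r"\b\d+(?:bp|kb|Mb)\b")


def find_unspaced_units(text):
    """Return every value written without a space before its unit."""
    return UNIT_PATTERN.findall(text)
```
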
Description
  • Brief overall summary of track.
Display Conventions and Configuration
  • Contains info about each view in track
  • No description for views only available in downloads
  • link to BROKEN multi-view instructions if there are multiple viewing options.
  • Tracks with Bam alignments (in metadata, fileName will end with ".bam") should have a link to the Sam Format Specification and should explain any non-standard tags, those starting with X, Y or Z or that are not listed in the tag section
Methods
  • Make sure it is detailed enough.
Verification
  • optional
Release Notes
  • Optional for first release
  • Required for subsequent releases
  • Should start with "Release # (Month Year) of this track...."
  • Provides a description of the changes of this particular release.
Credits
  • Must have contact person
  • Name is a hyperlink to email
  • Email must be sanitized (using encodeEmail.pl script)
References
Data Release Policy
  • Standard language, Supertrack:
Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.
  • Standard language, Track:
Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.

Links

  • Check standard links are present, and, where applicable, are relative:
    • ENCODE Data policy (in Data Release Policy section)
    • help for multi-view (Next to "Select views" in track Control Section in Display Conventions and Configuration section)
    • contact email (see #Credits for more info)

hgc details

Check the following for each view:

Accuracy of details

  • details that are displayed correspond with the record in the table

Makes sense

  • tables values seem correct

Useful

  • you understand what is being displayed
  • internal, non-functioning fields are not displayed (e.g. if all values in a field have "-1" as a placeholder, we shouldn't display that field)

Complete

  • all useful information is present (there's nothing important that is missing)

Clear

  • details are presented and labeled clearly
  • layout is user friendly

Links

  • check that all links are relevant, work, and where applicable, are relative
  • standard links: downloads, metadata, schema

hgTracks

Display

Views (zoomed in/out)

  • check the display of all Views in all display modes when zoomed in to the base pair level & zoomed out to 1 million bp

Table coordinates + features

  • an item's coordinates and other display features (exons, etc.) display as expected/correctly based on the table
    • a line from the table for comparing against the display can be obtained from schema or mysql db for regular tables
    • for bam files, the schema will only give you a filePath. You will need to use SamTools to obtain a point to test.
      1. add /hive/data/outside/samtools/svn_${MACHTYPE}/samtools to $PATH in your .bashrc
      2. run samtools on the command line using the fileName found in the schema (see following example). The output gives you the start position and then the read length (in the CIGAR string). If the CIGAR string is simple, e.g. 76M, add the read length, 76, to the start position to get the end position. If the CIGAR string is complicated, e.g. 43S17M494510N16M, just use the actual sequence to determine the length (paste sequence into Word & get the character count) and add this to the start position to get the end position. This will give the point needed to put into the browser for testing purposes.
 samtools view -x filePath chrx:xxxx-xxxxxx | head
 samtools view -x /gbdb/hg19/bbi/wgEncodeCshlLongRnaSeqHuvecCellPapAlnRep1.bam chr1:2000000-3000000 | head
  • for big* files, you can't get an individual record, but you can use bigWigInfo or bigBedInfo to get general stats. Be sure bigWigs are version 4.
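
As an alternative to counting sequence characters by hand for the BAM step above, the lengths can be computed from the CIGAR string. The sketch below follows the SAM specification's rules for which operations consume the read (M, I, S, =, X) and which consume the reference (M, D, N, =, X). The function names are hypothetical. Note that for a spliced read such as 43S17M494510N16M the read length and the span on the reference differ considerably, so which number is appropriate depends on whether you want a nearby test point or the item's full extent. The end returned here is 1-based and inclusive, which is one less than "start plus length".

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume read bases
REF_OPS = set("MDN=X")    # operations that consume reference bases


def cigar_lengths(cigar):
    """Return (read_length, reference_span) for a CIGAR string."""
    ops = CIGAR_OP.findall(cigar)
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR string: {cigar!r}")
    read_length = sum(int(n) for n, op in ops if op in QUERY_OPS)
    reference_span = sum(int(n) for n, op in ops if op in REF_OPS)
    return read_length, reference_span


def reference_end(start, cigar):
    """1-based inclusive end on the reference, given the 1-based SAM start."""
    return start + cigar_lengths(cigar)[1] - 1
```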

Searchable

  • Are items searchable; should they be? Most likely not for ENCODE. (position/search box at the top of the browser image)
  • Do a quick search of a subtrack in track search (button found at the bottom of the browser image) to make sure that it is interacting correctly.

Colors

  • For human, Tier 1 and Tier 2 cell lines should be displayed in a unique color (other than black)
  • it is OK if other tracks are in color, but not necessary

Defaults (composite/subtracks)

  • should this composite track be on by default? (For ENCODE, usually no)
  • check which subtracks are set as default selection, make sure:
    • there aren't too many
    • important cell lines are on by default
  • default Tier 1 and Tier 2 subtracks should display first

Compare to hg18

  • If track is in hg19, compare a point on the hg19 browser of the track to the equivalent position in hg18.
    1. use "convert" from hg19 position to see the equivalent position in hg18.
    2. go back to your region in hg19, open new window and paste in hg18 equivalent position and compare hg19 to hg18.
  • Note: Comparisons to hg18 should be very cursory. Any differences should be noted in the redmine ticket, but not necessarily investigated unless a user also brings up an issue. The thinking behind this is that when there are differences, it is most likely an error with hg18, not hg19, and we would be unnecessarily holding up hg19.

Performance

Chrom 1 Test (signal & experiment)

When position is set to all of chromosome 1, data of interest loads in less than 1 minute:

  • signal: check time of loading first signal subtrack
  • experiment: check time of loading all views for one experiment (e.g. Pol2 in GM12878 cells)?

Defaults at Gene Sized Region Test

Set position to a gene-size region with your track's default subtracks on and the default browser tracks on (easiest to reset cart, turn on track)

  • should display quickly and not be "too much" data

Data make sense

Compare subtracks within views

  • Do all the subtracks within a view somewhat correlate?

Compare subtracks of related views

  • For example:
    • Does the All Signal Raw Signal subtrack of an experiment really seem to consist of the data in the Plus Raw Signal & Minus Raw Signal?
    • Do Peaks really represent the high Signal areas of the Signal View subtracks?

Do the data make sense Biologically?

  • Turn on other tracks to compare.
    • compare to the gene tracks
    • compare to subtracks of similar tracks
    • For example:
      • RNA-seq data should correlate with the exons in a genes track
      • TFBS tracks should correlate with the beginning of gene transcripts

Files

hgFileUi

  • 'Downloads' links on hgTrackUi should now go to hgFileUi
    • if not, ask wrangler to add "fileSortOrder" information to trackDb entry

file count

  • Check # of files displayed is correct (use "notes" file). Pre-QA skip.

download button

  • Make sure download button prompts a download (and doesn't take you to an error page)

useful columns w/ good titles

  • Columns are useful
  • Column titles are correct and make sense (e.g. dccAccession title is "UCSC Accession")

sort columns

  • Check the sorting of columns. Clicking on the title of the column should sort the table on that column.

file filter

  • Check the filtering of files

links

  • Check that the "Track Settings" link takes you back to the track's hgTrackUi page
  • Check that the navigation, file filter title links, and other links work
  • Make sure files.txt & md5Sum.txt links are present and function
  • Make sure the download server link goes to the download directory for that track (see #download server for more info).

download server

  • linked from hgFileUi with the *download server* link
  • have wrangler remove index.html or preamble.html files from the current release directories if they exist (it is OK in older directories, e.g. release1 if this is a release3).

README

  • README should be displayed automatically followed by the list of files in the directory
  • contains a URL to the track's hgFileUi (you can double check by copying link, pasting in a browser, and changing hgdownload to hgdownload-test).
  • there may be more files/directories in here than seen in hgFileUi. This is OK. Because we are not dropping obsolete files, they will be present in this directory. Also, on hgdownload-test there will also be releaseN directories. These are part of the process of preparing a track and are OK. These, however, *won't* be pushed to the hgdownload upon release of the track.

Release to RR

Note: Cc the data wrangler for this track on all your pushes. Cc encode@soe.ucsc.edu on your final push.

Release log

  • Release log field in PushQ:
    • should be the long label (or short if too long) and, if releaseN, release number in parentheses
    • must contain ENCODE (or it won't show up on ENCODE downloads page)
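
The two release log requirements can be checked with a few lines of code before requesting the push. This sketch assumes the word must appear exactly as "ENCODE" (case-sensitive) and that the release marker is written as "(Release N)", as in the Review PushQ Entry section; the function name is hypothetical.

```python
def check_release_log(release_log, release_number=None):
    """Return a list of problems with a pushQ release log entry.

    release_number is None for a first release, otherwise the integer N.
    """
    problems = []
    if "ENCODE" not in release_log:
        problems.append("release log does not contain 'ENCODE'")
    if release_number is not None:
        marker = f"(Release {release_number})"
        if marker not in release_log:
            problems.append(f"release log does not contain '{marker}'")
    return problems
```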

Push download files

  • Push download files, index.html, files.txt and md5sum.txt (from hgwdev to hgdownload)
    • If this is a releaseN, even though there is a releaseN directory on hgwdev, do not create one on hgdownload (see #Download files for specifics)
  • Does this track have supplemental data? This would be data that is linked from the description pages, but isn't currently indicated in the notes file. If so, push this data also.
  • Note, this push can take hours

Prepare trackDb (release tags and metaDb)

  1. release tags:
    • If first release, in trackDb.wgEncode.ra (of the $db directory):
      remove 'alpha,beta' (no release tag necessary) from the <trackName>.ra include line
    • If subsequent release (releaseN):
      (see Three State TrackDb page for more info)
    1. in trackDb.wgEncode.ra
      • delete the include line that states: <trackName>.new.ra alpha,beta
      • remove the "public" tag from this line: <trackName>.ra public (no release tag necessary)
    2. in the $db directory, copy <trackName>.new.ra over <trackName>.ra
    3. copy <trackName>.new.html over <trackName>.html
    4. open <trackName>.ra and remove html line (pointed to <trackName>.new.html)
  2. metaDb, from /cluster/home/$usr/trackDb/$species/$db/metaDb:
    1. double check alpha vs. beta to make sure you have most updated metadata (diff beta/<trackName>.ra alpha/<trackName>.ra)
      • if diffs are due to next release, don't copy to beta, if diffs are for current release, copy to beta & double check in hgTrackUi, etc.
    2. copy metaDb <trackName>.ra file from ~metaDb/beta -> metaDb/public
    3. add <trackName>.ra file name to the make file in ~metaDb/public
  3. commit/push changes
  4. On hgwbeta, make public DBS=$db from /cluster/home/$usr/trackDb/
    • announce it to browser-qa

Check on public

  1. Check track on hgwbeta-public
    • hgwbeta-public uses hgwbeta for the tables, but uses the CGIs that are on the RR.
  2. Run comparePublic.csh to check differences between trackDb_public and RR and hgwbeta.

Push tables

  • Push track tables from hgwbeta -> mysqlrr (not trackDb_public yet)

Push trackDb+friends

  • Push trackDb+friends and tableDescriptions (if forgotten, tableDescriptions is getting pushed out once a week by a designated QA'er) from hgwbeta -> mysqlrr
    • cc wrangler and encode@soe.ucsc.edu

PushQ, check on RR, Redmine

  1. click "push requested" in the pushQ record
  2. once all pushes complete, check track on RR
  3. click "Done!" on pushQ record
  4. Transfer pushQ entry from the L queue of encodePushQ to the Main pushQ.
  5. Make a question for the wrangler asking them to update the ENCODE status to "Released" and then Close the ticket if there aren't any lingering issues.

ENCODE downloads

  • check ENCODE downloads page (human | mouse), if your track isn't there, add it:
    • /kent/src/hg/htdocs/ENCODE/downloads.html to add a line for your track and, if necessary, its super-track
      • super-track title should be a non-underlined link to the super-track hgTrackUi, for example:
<A style="text-decoration:none" HREF="http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeRnaSeqSuper" TARGET=_blank>RNA-seq</A>
  • push the following from hgwbeta -> RR:
/usr/local/apache/htdocs-hgdownload/ENCODE/downloads.html

Other info

Subsequent Release of Data (e.g. Release 2)

Periodically, released ENCODE tracks will be augmented with new data as labs complete experiments on new cell lines, etc. The new data will come in various formats: some will replace existing data, some will be brand new, some old data will be eliminated, etc.

Notes file

The data wrangler will create a notes file using the encodeMkChangeNotes script, check it into git, and place it here: kent/src/hg/makeDb/doc/encodeDcc$db/*.txt

This document should contain complete lists of each table and file and what its disposition is. The tables and files will fall into categories similar to this:

  • A) Untouched - are on public browser and should remain
  • B) Deprecated - are currently on RR but will no longer be needed and should not be referenced by the public site.
NOTE: NO FILES SHOULD BE REMOVED from the downloads directory on hgdownloads (RR).
This list is provided for completeness.
  • C) New - are only currently on test but will need to be pushed to the RR.
  • D) Additional items of note

This document may not match reality. It may be the case that some of the tables/files do not exist at all, the names are incorrect, they are not present on the machine as listed in the file, they do not match the list that is in the pushQ. The first challenge in QAing a subsequent ENCODE release is to determine if/how the notes file diverges from reality. To do this, compare the file to the "snapshot" of what is included in the release (and what you should QA), which can be found in the "release2" list in the downloads directory (hgwdev: /data/apache/htdocs/goldenPath/<db>/<trackName>/release<x>/). If the file differs from the downloads directory, then send that information to the data wrangler and pop the track into the B-queue while they sort it out. Otherwise, QA spends far too much time figuring out exactly what they are expected to QA.
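
The comparison between the notes file and the release directory is a set difference in both directions. The sketch below assumes you have already extracted two lists of file names (one from the notes file, one from a listing of the release<x> directory); how those lists are obtained is left open because the notes file layout is not described here. The function name is hypothetical.

```python
def compare_file_lists(notes_files, release_dir_files):
    """Report names that appear in only one of the two lists."""
    notes, release = set(notes_files), set(release_dir_files)
    return {
        "only_in_notes": sorted(notes - release),
        "only_in_release_dir": sorted(release - notes),
    }
```

If either list in the result is non-empty, that is the information to send to the data wrangler.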

Once the list is finalized, proceed with the QA work as outlined above. Note the additional steps in the #Files section for how to handle the /releaseN directory.

MetaDb changes

There also may be some metadata changes to fix errors or add information such as the GEO Series and GEO Sample.

  • Be sure to QA these metadata changes
  • Check for related redmine issues (addition of GEO accessions should definitely have a related issue for the addition of this info)
  • You can also do a diff to check for any other metadata changes. From kent/src/hg/makeDb/trackDb/<org>/<db>/metaDb, do the following diff:

diff beta/wgEncodeYourTrack alpha/wgEncodeYourTrack
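
If you want to post-process the differences (for example, to list only changed lines), the same comparison can be sketched with Python's standard difflib module. This takes the two files' contents as strings; the function name and the default label are illustrative only, and the command-line diff above remains the simpler option.

```python
import difflib


def metadata_diff(beta_text, alpha_text, name="wgEncodeYourTrack"):
    """Return unified-diff lines between the beta and alpha metadata text."""
    return list(
        difflib.unified_diff(
            beta_text.splitlines(),
            alpha_text.splitlines(),
            fromfile=f"beta/{name}",
            tofile=f"alpha/{name}",
            lineterm="",
        )
    )
```

An empty list means the two versions are identical.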

Pushing Files

Pushing the three main types of files involved in ENCODE tracks.

gbdb files

Files of this form get pushed hgwdev -> hgnfs1

/gbdb/hg18/wib/wgEncode*.wib

Download files

Download files for an original release get pushed hgdownload-test on hgwdev -> hgdownload (list the entire file path as usual)

/usr/local/apache/htdocs-hgdownload/goldenPath/hg18/encodeDCC/wgEncode*/index.html
/usr/local/apache/htdocs-hgdownload/goldenPath/hg18/encodeDCC/wgEncode*/wgEncode*.[bed/wig].gz

When pushing download files for a subsequent release track (e.g. release 2), push files as follows (but in your request, list the from/to paths at the top followed by a list of the file names without the full path)

from hgwdev: /usr/local/apache/htdocs-hgdownload/goldenPath/<db>/encodeDCC/<trackName>/releaseN/
to hgdownload: /usr/local/apache/htdocs/goldenPath/<db>/encodeDCC/<trackName>/
(Note: no releaseN directory on hgdownload)
  • Once the files have been pushed you can check to see if the push was successful using this script: checkPushedFiles.csh
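
The subsequent-release push request format described above (from/to paths at the top, then bare file names) can be generated from a list of full paths. This is a sketch using the path templates shown in this section; the function name is hypothetical and the output should be reviewed before it is sent.

```python
import posixpath


def format_release_push_request(db, track_name, release_n, file_paths):
    """Build the text of a releaseN download-file push request."""
    src = (
        f"/usr/local/apache/htdocs-hgdownload/goldenPath/{db}/encodeDCC/"
        f"{track_name}/release{release_n}/"
    )
    # No releaseN directory on hgdownload.
    dst = f"/usr/local/apache/htdocs/goldenPath/{db}/encodeDCC/{track_name}/"
    names = sorted(posixpath.basename(p) for p in file_paths)
    lines = [f"from hgwdev: {src}", f"to hgdownload: {dst}", ""] + names
    return "\n".join(lines)
```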

Other files

Files of this form get pushed hgwbeta -> RR. Because they used to be omitted from the pushQ entry often, the directories containing these files are now pushed weekly by Katrina on Fridays, so QAers no longer have to worry about pushing these. They are not in source control, so they usually go out well ahead of the track.

/usr/local/apache/htdocs/ENCODE/protocols/cell/*.pdf

Relative Links

In html on our site, you can create relative links (on dev, the link goes to the page on dev, on beta, it goes to beta, etc.) by using part of the path based on your file's location in the source tree relative to the location of the file or cgi you are linking to.
For example from ~trackDb/human/hg19, here is how you point to:

  • a golden path file:
<A HREF="../../goldenPath/help/multiView.html" TARGET=_BLANK>here</A>.
  • cgi:
<A TARGET=_BLANK HREF="/cgi-bin/hgEncodeVocab?type=cellType">cell lines</A>
  • ENCODE protocols:
<A HREF="../../ENCODE/protocols/cell">ENCODE cell culture protocols </A>.
  • ENCODE portal:
<A TARGET=_BLANK HREF="/ENCODE/index.html">Encyclopedia of DNA Elements (ENCODE) Project</A>
  • ENCODE data release policy:
<A TARGET=_BLANK HREF="../ENCODE/terms.html">here</A>
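
To find links in a description page that are not relative, the anchors can be collected with Python's standard html.parser module and filtered for absolute URLs. This is a sketch; the class and function names are hypothetical. Not every hit is an error, since links to external sites are legitimately absolute, so the result is a list of candidates to review.

```python
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (tag and attribute names arrive lowercased)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_absolute_links(html):
    """Return hrefs that start with http:// or https://."""
    collector = _HrefCollector()
    collector.feed(html)
    collector.close()
    return [h for h in collector.hrefs if h.lower().startswith(("http://", "https://"))]
```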

Old info

File Validation

  • No longer run, here are Tim's comments about QA running validateFiles: "To get these things through the pipeline, we run them through validateFiles, so I think your running them through again is one time too many. But if you are going to, then each lab and each file type may have negotiated limits (which may change between submissions). These limits are found in the relevant submission directory DAF files."
  • Old validateFiles process:

Test a smattering of different file types using this tool: validateFiles (type the program name without arguments to see the usage statement). If there are no errors, there will be no output. For example, for files of type tagAlign, invoke the tool like this:

validateFiles -type=tagAlign -genome=/gbdb/hg18/hg18.2bit /usr/local/apache/htdocs/goldenPath/hg18/encodeDCC/wgEncodeHudsonalphaChipSeq/wgEncodeHudsonalphaChipSeqAlignmentsRep1Gm12878Control.tagAlign.gz

For tagAligns there are several relevant validateFiles options:

 mismatches  - frequently 2 but negotiated for each lab.  Set this to 5 to be tolerant
 matchFirst - negotiated.  You should set this to 25 and even then you may need to adjust it
 nMatch - negotiated, but you should always have this parameter set.

If you want to be exact, then the metadata as seen on the downloads page tells which submission directory the file belongs to, and the most recent *.DAF (or *.daf) will have a validationSettings line in it which will include the settings that belong to each file type. Example:

 /hive/groups/encode/dcc/pipeline/encpipeline_prod/773/UtaChIPseqBOonlyV1.DAF

has the line:

 validationSettings allowReloads;validateFiles.tagAlign:mmCheckOnInN=100,mismatches=3,nMatch,matchFirst=25

This means that the tag aligns were validated with -mismatches=3 -nMatch -matchFirst=25
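
Reading the per-file-type options out of a validationSettings line can be sketched as follows. The layout is inferred only from the single example above (entries separated by ";", a "validateFiles.<type>:" prefix, options separated by ","), so treat the parsing rules as assumptions. The function name is hypothetical, and the sketch returns every option it finds without deciding which ones are passed on the validateFiles command line.

```python
def parse_validation_settings(line):
    """Return {file_type: {option: value}} for the validateFiles.* entries.

    Options without '=' (such as nMatch) are stored with the value True.
    """
    keyword, _, rest = line.strip().partition(" ")
    if keyword != "validationSettings":
        raise ValueError("not a validationSettings line")
    settings = {}
    for entry in rest.strip().split(";"):
        tool_and_type, sep, options = entry.partition(":")
        if not sep or not tool_and_type.startswith("validateFiles."):
            continue  # entries such as allowReloads carry no file-type options
        file_type = tool_and_type[len("validateFiles."):]
        parsed = {}
        for option in options.split(","):
            key, eq, value = option.partition("=")
            parsed[key] = value if eq else True
        settings[file_type] = parsed
    return settings
```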