Big Bed/Wig QA: Difference between revisions

From Genecats
Jump to navigationJump to search
(Link changed from .soe or .cse to .gi)
(Undo revision 4728 by Daniel (talk))
Line 29: Line 29:
</pre>
</pre>


More background on phasing out BBI Link tables can be found in Redmine [http://redmine.gi.ucsc.edu/issues/15059 #15059].
More background on phasing out BBI Link tables can be found in Redmine [http://redmine.soe.ucsc.edu/issues/15059 #15059].


== QA-ing bigBed/bigWig tracks ==
== QA-ing bigBed/bigWig tracks ==

Revision as of 16:54, 29 August 2018

Overview of bigBed/bigWig tracks in the Browser

For more information on these two file types see the following links:

There are two different types of bigBed/bigWig tracks in the Genome Browser; those that have BBI Link tables and those that don't. Those with BBI Link tables will often have a single-line table that points to the bigBed/bigWig file. For example, the "tgpPhase1AccessibilityPilotCriteria" table in hg19:

mysql> select * from tgpPhase1AccessibilityPilotCriteria;
+-----------------------------------------------------------+
| fileName                                                  |
+-----------------------------------------------------------+
| /gbdb/hg19/1000Genomes/paired.end.mapping.1000G..pilot.bb |
+-----------------------------------------------------------+

Newer bigBed/bigWig tracks often don't even have an associated BBI link table. For example, the "tgpPhase3AccessibilityPilotCriteria" track for hg19 doesn't have an associated table:

mysql> show tables like "%tgpPhase3AccessibilityPilotCriteria%";
Empty set (0.04 sec)

But rather has a "bigDataUrl" line in trackDb/human/hg19/trackDb.ra that points to the /gbdb/ locaiton of the file:

    track tgpPhase3AccessibilityPilotCriteria  
    subTrack tgpPhase3Accessibility  
    shortLabel 1000G Accs Pilot  
    longLabel 1000 Genomes Project Phase 3 Paired-end Accessible Regions - Pilot Criteria  
    bigDataUrl /gbdb/hg19/1000Genomes/phase3/20141020.pilot_mask.whole_genome.bb

More background on phasing out BBI Link tables can be found in Redmine #15059.

QA-ing bigBed/bigWig tracks

Many of our standard QA scripts (countPerchrom.csh, qaGbTracks, etc.) either don't work or don't really do much when they encounter one of these bigBed/bigWig tracks.

However, you can use some Genome Browser command-line utilities to extract some information from these files and do some basic QA of these files.

validateFiles

First, you can use validateFiles to perform some basic validation of the bigBed/bigWig files. Here's the basic usage for validateFiles:

validateFiles -chromInfo=FILE -options -type=FILE_TYPE file1 [file2 [...]]

You can replace -chromInfo=FILE with -chromDb=$db to tell validate files to use the chromInfo table rather than the chrom.sizes. It's particularly nice to be able to just specify the db rather than having to specify the path to the chrom.sizes file (often in /hive/data/genomes/$db/chrom.sizes).

You will also need to specify the file type using -type=TYPE. For bigBed/bigWig here are the relevant sections from the usage message:

       bigWig       : Big Wig
                      See http://genome.ucsc.edu/goldenPath/help/bigWig.html

       bigBedN[+P]  : bigBED N  or bigBED N+ or bigBED N+P, similar to BED
                      See http://genome.ucsc.edu/goldenPath/help/bigBed.html

The "[+P]" for bigBed files will be the extra fields past the standard N BED files in your track. For example, the file /gbdb/hg19/bbi/patBulk.bb is a bigBed12+13. This means the first 12 fields are the standard BED fields and the last 13 are specialized fields that are specified in an "AutoSQL" or ".as" file. (For more on AutoSQL, see the AutoSQL Genomewiki Page.)

In addition to the required options described above, here are some options that are useful when QA-ing bigBed/bigWig files:

   -as=fields.as                If you have extra "bedPlus" fields, it's great to put a definition
                                of each field in a row in AutoSql format here. Applies to bed-related types.

   -tab                         If set, expect fields to be tab separated, normally
                                expects white space separator. Applies to bed-related types.

   -doReport                    Output report in filename.report

Note that the report files produced by -doReport are placed in the same directory as the bigBed/bigWig file. For example, if you ran validateFiles with the -doReport option on /gbdb/hg19/bbi/patNonBulk.bb, then the report files would end up in /gbdb/hg19/bbi/.

You can run validateFiles without any arguments to see the full usage message.

Here's an example of a command you might use to validate a bigBed file:

validateFiles -chromDb=hg19 -type=bigBed12+13 -as=/hive/data/genomes/hg19/bed/patents/patSummary.as -tab /gbdb/hg19/bbi/patNonBulk.bb

The program will report any errors it finds to stderr. If no problems are found, you should see a message like:

Error count 0

If your track consists of many files, it might be best to run them in a loop like so, where you can specify different autoSql files and types:

for file in $(cat files); do printf "checking file: %s\n" "$file"; \
if [[ "$file" == "/gbdb/danRer10/uniprot/unipAliSwissprot.bb" ]] || [[ "$file" == "/gbdb/danRer10/uniprot/unipAliTrembl.bb" ]]; then \
validateFiles -chromDb=danRer10 -type=bigBed12+33 -as=bigPsl.as -tab $file; \
elif [[ "$file" == "/gbdb/danRer10/uniprot/unipMut.bb" ]]; then \
validateFiles -chromDb=danRer10 -type=bigBed12+9 -as=bigBedMut.as  -tab $file; \
else validateFiles -chromDb=danRer10 -type=bigBed12+6 -as=bigBed.as -tab $file; fi; done
checking file: /gbdb/danRer10/uniprot/unipAliSwissprot.bb
Error count 0
checking file: /gbdb/danRer10/uniprot/unipAliTrembl.bb
Error count 0
...

bigBedInfo and bigWigInfo

Both bigBedInfo and bigWigInfo perform similar functions for their respective file types, which is to provide an overview of the data in the bigBed or bigWig file. While bigBedInfo is primarily dicussed in this section, bigWigInfo and it's options and output should be fairly similar.

To start, we'll take a look at the usage statement for bigBedInfo and then discuss some of the more useful options for QA:

bigBedInfo - Show information about a bigBed file.
usage:
   bigBedInfo file.bb
options:
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -chroms - list all chromosomes and their sizes
   -zooms - list all zoom levels and their sizes
   -as - get autoSql spec
   -extraIndex - list all the extra indexes

bigWigInfo has a similar usage message.

The -chroms option is useful in that it will print a list of all chromosomes names found in the bigBed/bigWig file. You can compare this list to the full list of chromosomes for that database to see which chromosomes might be missing data. If you're not sure whether or not a chromosome should have data on it, ask the engineer responsible for the track/table/file.

Here's an example of a command you could use to get some basic summary info on a bigBed file:

$ bigBedInfo -chroms /gbdb/hg19/bbi/patNonBulk.bb

And here's the output of that command:

version: 4
fieldCount: 25
hasHeaderExtension: yes
isCompressed: yes
isSwapped: 0
extraIndexCount: 0
itemCount: 684,876
primaryDataSize: 134,443,250
zoomLevels: 0
chromCount: 24
        chr1 0 249250621
        chr10 1 135534747
        chr11 2 135006516
        chr12 3 133851895
        chr13 4 115169878
        chr14 5 107349540
        chr15 6 102531392
        chr16 7 90354753
        chr17 8 81195210
        chr18 9 78077248
        chr19 10 59128983
        chr2 11 243199373
        chr20 12 63025520
        chr21 13 48129895
        chr22 14 51304566
        chr3 15 198022430
        chr4 16 191154276
        chr5 17 180915260
        chr6 18 171115067
        chr7 19 159138663
        chr8 20 146364022
        chr9 21 141213431
        chrX 22 155270560
        chrY 23 59373566
basesCovered: 1,152,938,050
meanDepth (of bases covered): 20.639954
minDepth: 1.000000
maxDepth: 9068.000000
std of depth: 186.314036

The itemCount line indicates the number of items in the bigBed file, similar to the number of lines in a table. It may be useful to compare this count with that of the previous version of the file or table.

The basesCovered, as the name suggests, is the number of bases in the genome covered by items in the track. You can divide this number by the number of bases in the genome (found on the hgGateway "Sequences" page) to get an estimate of coverage, similar to featureBits. So in this case, our coverage would be (1,152,938,050/3,137,161,264)*100 = 36.75%.

The meanDepth, minDepth, maxDepth, and std of depth lines convey information about the depth of coverage for items in the bigBed file. For, example, the meanDepth above would indicate that, on average, every base covered by elements in the bigBed file is covered by 20 items.

Replicating featureBits, countPerChrom, and checkTableCoords measurements

Since most of the QA tools don't do much upon encountering bigBed files, you can convert your files to regular bed files and replicate the measurements yourself like so:

featureBits

If your track is a bigBed based track, then featureBits will not work. First you will need to turn your bigBed into a bed file, then, if necessary, account for tabs v. spaces, and lastly, if working with a bed12(+), split your bed into exons before running through featureBits:

$ bigBedToBed bed.bb out.bed
$ # if necessary:
$ # sed -i 's/ /_/g' out.bed
$ # split into exons (must be bed12 or less, change cut command if you have a bed4+ or 8+, etc)
$ cut --complement -f13- out.bed | bedToExons stdin out.exons.bed
$ featureBits [db] -countGaps out.exons.bed [gap]

If your track is a composite or super track of many bigBeds, you can run the above steps in a loop like so:

$ for file in ../*.bed; do printf "%s\n" "$file"; sed 's/ /_/g' $file | \
cut --complement -f13- | bedToExons stdin $file.exons.bed; \
featureBits hg38 -countGaps $file.exons.bed; \
featureBits hg38 -countGaps $file.exons.bed gap; done
../unipChain.bed
33610384 bases of 3209286105 (1.047%) in intersection
0 bases of 3209286105 (0.000%) in intersection

Although the above is much slower than using the poorly optimized awk below instead of featureBits:

$ time for file in bed/*; do fName="${file##*/}"; fName="${fName%.*}"; printf "%s\n" "$file"; \
sed 's/ /_/g' $file | sed 's/         /       tmp_name        /g' | cut --complement -f13- | \
bedToExons stdin exons/$fName.exons.bed; \
printf "featureBits %s -countGaps %s\n" "danRer10" "exons/$fName.exons.bed"; \
awk -v size=$(hgsql -Ne "select sum(size) from chromInfo" danRer10) \
'{cov+=$3-$2}END{printf "%d bases of %d (%0.2f) in intersection\n", cov, size, (100*cov/size)}' \
exons/$fName.exons.bed; featureBits danRer10 -countGaps exons/$fName.exons.bed gap; done
real	3m17.096s
user	2m56.763s
sys	0m13.088s


$ time for file in bed/*; do fName="${file##*/}"; fName="${fName%.*}"; printf "%s\n" "$file"; \ 
sed 's/ /_/g' $file | sed 's/         /       tmp_name        /g' | cut --complement -f13- | \
bedToExons stdin exons/$fName.exons.bed; featureBits -countGaps danRer10 exons/$fName.exons.bed; \
featureBits danRer10 -countGaps exons/$fName.exons.bed gap; done
...
real	6m12.304s
user	5m38.712s
sys	0m26.326s

countPerChrom

countPerChrom.csh runs on MySQL tables, and uses the separate utility graph.csh to actually make the nice histograms. You can run graph.csh manually on a two column file of chromosome counts computed from a grep like so:

for chr in {1..22} X Y M; do printf "chr%s %s\n" "$chr" "$(grep -wc "^chr${chr}" unipChain.bed)"; done > hg38.unipChain.chromCounts
graph.csh hg38.unipChain.chromCounts 35
 chr1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
 chr2 xxxxxxxxxxxxxxxxxxxxx              
 chr3 xxxxxxxxxxxxxxxxxx                 
 chr4 xxxxxxxxxxxxx       
...

If your track is a composite or super track of many bigBeds, you can run the above steps in a loop like so:

$ for file in *.bed; do for chr in {1..22} X Y M; do printf "chr%s %s\n" "$chr" "$(grep -wc "^chr${chr}" $file)"; done > hg38.$file.chromCounts; done
$ for file in *.chromCounts; do graph.csh $file 35 > $file.graph; echo "each x=35" >> $file.graph; done
$ cat hg38.unipChain.bed.chromCounts.graph 

 chr1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
 chr2 xxxxxxxxxxxxxxxxxxxxx              
 chr3 xxxxxxxxxxxxxxxxxx                 
 chr4 xxxxxxxxxxxxx       
...

positionalTblCheck and checkTableCoords

positionalTblCheck checks that tables are sorted via chromosome and chromStart. bedToBigBed does this check before running, so if a bigBed exists it will have to be sorted. Similarly, bedToBigBed also checks that all the coordinates in the input bed file are valid for the chrom.sizes file provided, so the checkTableCoords check can be skipped as well.

Other Tools That May Be Useful for QA of big* tracks

  • bigWigToBedGraph or bigWigToWig— these programs convert a bigWig file to ASCII bedGraph/WIG format.
  • bigWigSummary — this program extracts summary information from a bigWig file.
  • bigBedToBed — this program converts a bigBed file to ASCII BED format.
  • bigBedSummary — this program extracts summary information from a bigBed file.