Big Bed/Wig QA: Difference between revisions
(Link changed from .soe or .cse to .gi) |
|||
Line 29: | Line 29: | ||
</pre> | </pre> | ||
More background on phasing out BBI Link tables can be found in Redmine [http://redmine. | More background on phasing out BBI Link tables can be found in Redmine [http://redmine.soe.ucsc.edu/issues/15059 #15059]. | ||
== QA-ing bigBed/bigWig tracks == | == QA-ing bigBed/bigWig tracks == |
Revision as of 16:54, 29 August 2018
Overview of bigBed/bigWig tracks in the Browser
For more information on these two file types see the following links:
There are two different types of bigBed/bigWig tracks in the Genome Browser; those that have BBI Link tables and those that don't. Those with BBI Link tables will often have a single-line table that points to the bigBed/bigWig file. For example, the "tgpPhase1AccessibilityPilotCriteria" table in hg19:
mysql> select * from tgpPhase1AccessibilityPilotCriteria; +-----------------------------------------------------------+ | fileName | +-----------------------------------------------------------+ | /gbdb/hg19/1000Genomes/paired.end.mapping.1000G..pilot.bb | +-----------------------------------------------------------+
Newer bigBed/bigWig tracks often don't even have an associated BBI link table. For example, the "tgpPhase3AccessibilityPilotCriteria" track for hg19 doesn't have an associated table:
mysql> show tables like "%tgpPhase3AccessibilityPilotCriteria%"; Empty set (0.04 sec)
But rather has a "bigDataUrl" line in trackDb/human/hg19/trackDb.ra that points to the /gbdb/ locaiton of the file:
track tgpPhase3AccessibilityPilotCriteria subTrack tgpPhase3Accessibility shortLabel 1000G Accs Pilot longLabel 1000 Genomes Project Phase 3 Paired-end Accessible Regions - Pilot Criteria bigDataUrl /gbdb/hg19/1000Genomes/phase3/20141020.pilot_mask.whole_genome.bb
More background on phasing out BBI Link tables can be found in Redmine #15059.
QA-ing bigBed/bigWig tracks
Many of our standard QA scripts (countPerchrom.csh, qaGbTracks, etc.) either don't work or don't really do much when they encounter one of these bigBed/bigWig tracks.
However, you can use some Genome Browser command-line utilities to extract some information from these files and do some basic QA of these files.
validateFiles
First, you can use validateFiles to perform some basic validation of the bigBed/bigWig files. Here's the basic usage for validateFiles:
validateFiles -chromInfo=FILE -options -type=FILE_TYPE file1 [file2 [...]]
You can replace -chromInfo=FILE with -chromDb=$db to tell validate files to use the chromInfo table rather than the chrom.sizes. It's particularly nice to be able to just specify the db rather than having to specify the path to the chrom.sizes file (often in /hive/data/genomes/$db/chrom.sizes).
You will also need to specify the file type using -type=TYPE. For bigBed/bigWig here are the relevant sections from the usage message:
bigWig : Big Wig See http://genome.ucsc.edu/goldenPath/help/bigWig.html bigBedN[+P] : bigBED N or bigBED N+ or bigBED N+P, similar to BED See http://genome.ucsc.edu/goldenPath/help/bigBed.html
The "[+P]" for bigBed files will be the extra fields past the standard N BED files in your track. For example, the file /gbdb/hg19/bbi/patBulk.bb is a bigBed12+13. This means the first 12 fields are the standard BED fields and the last 13 are specialized fields that are specified in an "AutoSQL" or ".as" file. (For more on AutoSQL, see the AutoSQL Genomewiki Page.)
In addition to the required options described above, here are some options that are useful when QA-ing bigBed/bigWig files:
-as=fields.as If you have extra "bedPlus" fields, it's great to put a definition of each field in a row in AutoSql format here. Applies to bed-related types. -tab If set, expect fields to be tab separated, normally expects white space separator. Applies to bed-related types. -doReport Output report in filename.report
Note that the report files produced by -doReport are placed in the same directory as the bigBed/bigWig file. For example, if you ran validateFiles with the -doReport option on /gbdb/hg19/bbi/patNonBulk.bb, then the report files would end up in /gbdb/hg19/bbi/.
You can run validateFiles without any arguments to see the full usage message.
Here's an example of a command you might use to validate a bigBed file:
validateFiles -chromDb=hg19 -type=bigBed12+13 -as=/hive/data/genomes/hg19/bed/patents/patSummary.as -tab /gbdb/hg19/bbi/patNonBulk.bb
The program will report any errors it finds to stderr. If no problems are found, you should see a message like:
Error count 0
If your track consists of many files, it might be best to run them in a loop like so, where you can specify different autoSql files and types:
for file in $(cat files); do printf "checking file: %s\n" "$file"; \ if [[ "$file" == "/gbdb/danRer10/uniprot/unipAliSwissprot.bb" ]] || [[ "$file" == "/gbdb/danRer10/uniprot/unipAliTrembl.bb" ]]; then \ validateFiles -chromDb=danRer10 -type=bigBed12+33 -as=bigPsl.as -tab $file; \ elif [[ "$file" == "/gbdb/danRer10/uniprot/unipMut.bb" ]]; then \ validateFiles -chromDb=danRer10 -type=bigBed12+9 -as=bigBedMut.as -tab $file; \ else validateFiles -chromDb=danRer10 -type=bigBed12+6 -as=bigBed.as -tab $file; fi; done checking file: /gbdb/danRer10/uniprot/unipAliSwissprot.bb Error count 0 checking file: /gbdb/danRer10/uniprot/unipAliTrembl.bb Error count 0 ...
bigBedInfo and bigWigInfo
Both bigBedInfo and bigWigInfo perform similar functions for their respective file types, which is to provide an overview of the data in the bigBed or bigWig file. While bigBedInfo is primarily dicussed in this section, bigWigInfo and it's options and output should be fairly similar.
To start, we'll take a look at the usage statement for bigBedInfo and then discuss some of the more useful options for QA:
bigBedInfo - Show information about a bigBed file. usage: bigBedInfo file.bb options: -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs -chroms - list all chromosomes and their sizes -zooms - list all zoom levels and their sizes -as - get autoSql spec -extraIndex - list all the extra indexes
bigWigInfo has a similar usage message.
The -chroms option is useful in that it will print a list of all chromosomes names found in the bigBed/bigWig file. You can compare this list to the full list of chromosomes for that database to see which chromosomes might be missing data. If you're not sure whether or not a chromosome should have data on it, ask the engineer responsible for the track/table/file.
Here's an example of a command you could use to get some basic summary info on a bigBed file:
$ bigBedInfo -chroms /gbdb/hg19/bbi/patNonBulk.bb
And here's the output of that command:
version: 4 fieldCount: 25 hasHeaderExtension: yes isCompressed: yes isSwapped: 0 extraIndexCount: 0 itemCount: 684,876 primaryDataSize: 134,443,250 zoomLevels: 0 chromCount: 24 chr1 0 249250621 chr10 1 135534747 chr11 2 135006516 chr12 3 133851895 chr13 4 115169878 chr14 5 107349540 chr15 6 102531392 chr16 7 90354753 chr17 8 81195210 chr18 9 78077248 chr19 10 59128983 chr2 11 243199373 chr20 12 63025520 chr21 13 48129895 chr22 14 51304566 chr3 15 198022430 chr4 16 191154276 chr5 17 180915260 chr6 18 171115067 chr7 19 159138663 chr8 20 146364022 chr9 21 141213431 chrX 22 155270560 chrY 23 59373566 basesCovered: 1,152,938,050 meanDepth (of bases covered): 20.639954 minDepth: 1.000000 maxDepth: 9068.000000 std of depth: 186.314036
The itemCount line indicates the number of items in the bigBed file, similar to the number of lines in a table. It may be useful to compare this count with that of the previous version of the file or table.
The basesCovered, as the name suggests, is the number of bases in the genome covered by items in the track. You can divide this number by the number of bases in the genome (found on the hgGateway "Sequences" page) to get an estimate of coverage, similar to featureBits. So in this case, our coverage would be (1,152,938,050/3,137,161,264)*100 = 36.75%.
The meanDepth, minDepth, maxDepth, and std of depth lines convey information about the depth of coverage for items in the bigBed file. For, example, the meanDepth above would indicate that, on average, every base covered by elements in the bigBed file is covered by 20 items.
Replicating featureBits, countPerChrom, and checkTableCoords measurements
Since most of the QA tools don't do much upon encountering bigBed files, you can convert your files to regular bed files and replicate the measurements yourself like so:
featureBits
If your track is a bigBed based track, then featureBits will not work. First you will need to turn your bigBed into a bed file, then, if necessary, account for tabs v. spaces, and lastly, if working with a bed12(+), split your bed into exons before running through featureBits:
$ bigBedToBed bed.bb out.bed $ # if necessary: $ # sed -i 's/ /_/g' out.bed $ # split into exons (must be bed12 or less, change cut command if you have a bed4+ or 8+, etc) $ cut --complement -f13- out.bed | bedToExons stdin out.exons.bed $ featureBits [db] -countGaps out.exons.bed [gap]
If your track is a composite or super track of many bigBeds, you can run the above steps in a loop like so:
$ for file in ../*.bed; do printf "%s\n" "$file"; sed 's/ /_/g' $file | \ cut --complement -f13- | bedToExons stdin $file.exons.bed; \ featureBits hg38 -countGaps $file.exons.bed; \ featureBits hg38 -countGaps $file.exons.bed gap; done ../unipChain.bed 33610384 bases of 3209286105 (1.047%) in intersection 0 bases of 3209286105 (0.000%) in intersection
Although the above is much slower than using the poorly optimized awk below instead of featureBits:
$ time for file in bed/*; do fName="${file##*/}"; fName="${fName%.*}"; printf "%s\n" "$file"; \ sed 's/ /_/g' $file | sed 's/ / tmp_name /g' | cut --complement -f13- | \ bedToExons stdin exons/$fName.exons.bed; \ printf "featureBits %s -countGaps %s\n" "danRer10" "exons/$fName.exons.bed"; \ awk -v size=$(hgsql -Ne "select sum(size) from chromInfo" danRer10) \ '{cov+=$3-$2}END{printf "%d bases of %d (%0.2f) in intersection\n", cov, size, (100*cov/size)}' \ exons/$fName.exons.bed; featureBits danRer10 -countGaps exons/$fName.exons.bed gap; done real 3m17.096s user 2m56.763s sys 0m13.088s $ time for file in bed/*; do fName="${file##*/}"; fName="${fName%.*}"; printf "%s\n" "$file"; \ sed 's/ /_/g' $file | sed 's/ / tmp_name /g' | cut --complement -f13- | \ bedToExons stdin exons/$fName.exons.bed; featureBits -countGaps danRer10 exons/$fName.exons.bed; \ featureBits danRer10 -countGaps exons/$fName.exons.bed gap; done ... real 6m12.304s user 5m38.712s sys 0m26.326s
countPerChrom
countPerChrom.csh runs on MySQL tables, and uses the separate utility graph.csh to actually make the nice histograms. You can run graph.csh manually on a two column file of chromosome counts computed from a grep like so:
for chr in {1..22} X Y M; do printf "chr%s %s\n" "$chr" "$(grep -wc "^chr${chr}" unipChain.bed)"; done > hg38.unipChain.chromCounts graph.csh hg38.unipChain.chromCounts 35 chr1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx chr2 xxxxxxxxxxxxxxxxxxxxx chr3 xxxxxxxxxxxxxxxxxx chr4 xxxxxxxxxxxxx ...
If your track is a composite or super track of many bigBeds, you can run the above steps in a loop like so:
$ for file in *.bed; do for chr in {1..22} X Y M; do printf "chr%s %s\n" "$chr" "$(grep -wc "^chr${chr}" $file)"; done > hg38.$file.chromCounts; done $ for file in *.chromCounts; do graph.csh $file 35 > $file.graph; echo "each x=35" >> $file.graph; done $ cat hg38.unipChain.bed.chromCounts.graph chr1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx chr2 xxxxxxxxxxxxxxxxxxxxx chr3 xxxxxxxxxxxxxxxxxx chr4 xxxxxxxxxxxxx ...
positionalTblCheck and checkTableCoords
positionalTblCheck checks that tables are sorted via chromosome and chromStart. bedToBigBed does this check before running, so if a bigBed exists it will have to be sorted. Similarly, bedToBigBed also checks that all the coordinates in the input bed file are valid for the chrom.sizes file provided, so the checkTableCoords check can be skipped as well.
Other Tools That May Be Useful for QA of big* tracks
- bigWigToBedGraph or bigWigToWig— these programs convert a bigWig file to ASCII bedGraph/WIG format.
- bigWigSummary — this program extracts summary information from a bigWig file.
- bigBedToBed — this program converts a bigBed file to ASCII BED format.
- bigBedSummary — this program extracts summary information from a bigBed file.