Wiggle BED to variableStep format conversion

From genomewiki
Revision as of 17:19, 15 June 2007 by Hiram (talk | contribs)

The BED wiggle input format is the least efficient input format for wiggle tracks. Each data line creates chromEnd - chromStart data points, which can add up rapidly to the 300,000,000 data point input limit for wiggle custom tracks.
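
To see how close a BED format file comes to that limit, the per-line counts can be summed before uploading. The following is only a sketch: the function name count_bed_points is made up for this illustration, and it assumes whitespace-separated columns with every data line starting with "chr".

```shell
#!/bin/sh
# count_bed_points FILE   (hypothetical helper, not part of any UCSC tool)
# Sum chromEnd - chromStart over all data lines, i.e. lines whose
# first column starts with "chr". browser and track lines are skipped.
count_bed_points() {
    awk '$1 ~ /^chr/ { total += $3 - $2 }
         END { printf "%.0f\n", total }' "$1"
}
```

Calling count_bed_points inputFile.BED.format.txt prints one number, which can be compared against 300,000,000.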

In many cases the variableStep and fixedStep formats are sufficient to display the same data that may have been encoded in the BED format. If the BED-formatted data has a consistent data item size, for example 400 bases on every data line, it can be converted to a variableStep format with this simple script:

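#!/bin/sh
# Convert fixed-size BED wiggle items to variableStep format.

SPAN=400                                # size in bases of every BED item
S=inputFile.BED.format.txt              # input: BED format wiggle data
R=outputFile.variableStep.format.txt    # output: variableStep format
rm -f "${R}"
# Copy any browser and track lines from the top of the input.
head -45 "${S}" | grep -E "^browser|^track" > "${R}"
# List each chromosome name once (works for tab or space separated columns).
grep "^chr" "${S}" | awk '{ print $1 }' | sort -u > chr.list
while read -r C
do
    echo "variableStep chrom=${C} span=${SPAN}" >> "${R}"
    # Select this chromosome's items, sort them by start position, and
    # print the one-relative start position and the data value.
    awk -v c="${C}" '$1 == c' "${S}" | sort -k2,2n | awk '
{
    printf "%d\t%g\n", $2+1, $4
}
' >> "${R}"
done < chr.list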

This assumes that the file inputFile.BED.format.txt contains a single track definition and that the BED items are properly specified in zero-relative, half-open coordinates, where base position 1 of chr4, for example, is specified as: chr4 0 1. The variableStep and fixedStep input formats use one-relative coordinates, where base position 1 of a chromosome is specified as 1.
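
The conversion also relies on every BED item having the same size as SPAN. One way to check this beforehand is to tabulate the distinct item sizes. This is a rough sketch under the same assumptions as the script (data lines start with "chr", columns are whitespace-separated); the function name bed_span_sizes is hypothetical.

```shell
#!/bin/sh
# bed_span_sizes FILE   (hypothetical helper for illustration)
# Print each distinct item size (chromEnd - chromStart) followed by a
# tab and the number of data lines with that size, smallest size first.
bed_span_sizes() {
    awk '$1 ~ /^chr/ { n[$3 - $2]++ }
         END { for (s in n) print s "\t" n[s] }' "$1" | sort -n
}
```

If the output is a single line, its first column is the value to use for SPAN. More than one line means the items are not a consistent size, and a single span= setting would misrepresent some of them.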

This would convert a BED format line, for example:

chr4  399  799 3.14159

which occupies 400 data points in the internal representation of the wiggle data, to the variableStep format:

variableStep chrom=chr4 span=400
400  3.14159

which occupies a single data point in the internal representation of the wiggle data.
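
To confirm the saving on a real file, the data lines in the converted output can be counted, since each one is a single data point. This is a sketch only: the function name count_variablestep_points is hypothetical, and it assumes every data line begins with an integer position while declaration lines begin with a word such as variableStep, track or browser.

```shell
#!/bin/sh
# count_variablestep_points FILE   (hypothetical helper for illustration)
# Count data lines, i.e. lines whose first column is an integer position.
# Declaration lines (variableStep, track, browser) are not counted.
count_variablestep_points() {
    awk '$1 ~ /^[0-9]+$/ { n++ }
         END { printf "%.0f\n", n }' "$1"
}
```

Running count_variablestep_points outputFile.variableStep.format.txt prints the number of data points the converted track will use.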