Wiggle BED to variableStep format conversion

From genomewiki
Latest revision as of 13:47, 8 September 2016

The BED wiggle input format is the least efficient input format for wiggle tracks. Each data line creates chromEnd - chromStart data points, which can add up rapidly to the 300,000,000 data point input limit for wiggle custom tracks.
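To see how close a file is to that limit, the widths of its data lines can be summed. The following is a minimal sketch, assuming four whitespace-separated columns per data line; count_bed_points is a hypothetical helper name, not a UCSC tool.

```shell
#!/bin/sh
# count_bed_points [file...]: hypothetical helper, not part of any UCSC tool.
# Sums chromEnd - chromStart over the data lines of a BED-format wiggle file.
# Assumes data lines have at least four whitespace-separated fields;
# browser lines, track lines and comments are skipped.
count_bed_points() {
    awk '!/^(browser|track|#)/ && NF >= 4 { n += $3 - $2 }
         END { printf "%.0f\n", n }' "$@"
}
```

For example, `count_bed_points inputFile.BED.format.txt` prints the total number of data points the file would create.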

Please note that the bedGraph format, added in August 2008, can assist in overcoming this limit.
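bedGraph data lines use the same four columns as BED wiggle data lines (chrom, chromStart, chromEnd, value), so in simple cases switching formats may only require changing the track line. The sketch below assumes the input declares type=wiggle_0 on its track line; bed_wiggle_to_bedgraph is a hypothetical helper name, and the bedGraph documentation should be consulted for the remaining track line options.

```shell
#!/bin/sh
# bed_wiggle_to_bedgraph [file...]: hypothetical helper.
# Rewrites type=wiggle_0 to type=bedGraph on track lines only;
# data lines are passed through unchanged.
bed_wiggle_to_bedgraph() {
    sed -e '/^track/s/type=wiggle_0/type=bedGraph/' "$@"
}
```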

In many cases the variableStep and fixedStep formats are sufficient to display the same data that may have been encoded in the BED format. If every data line in the BED formatted data covers the same number of bases, for example 400, it can be converted to variableStep format with this simple script:

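#!/bin/sh
# Convert fixed-width BED wiggle data to variableStep format.

SPAN=400    # width in bases of every BED item; must be identical for all lines
S=inputFile.BED.format.txt
R=outputFile.variableStep.format.txt
rm -f "${R}"
# copy any browser/track lines found near the top of the input
head -45 "${S}" | grep -E "^browser|^track" > "${R}"
# list the chromosomes present (awk handles both tab and space separated input)
grep "^chr" "${S}" | awk '{print $1}' | sort -u > chr.list
while read C
do
    echo "variableStep chrom=${C} span=${SPAN}" >> "${R}"
    # select this chromosome, sort by start, shift 0-based start to 1-based
    awk -v c="${C}" '$1 == c' "${S}" | sort -k2,2n | awk '
{
    printf "%d\t%g\n", $2+1, $4
}
' >> "${R}"
done < chr.list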

A slightly modified version of this script, File:BedToWig.sh, converts all BED files in the current directory.

The script assumes that the file inputFile.BED.format.txt contains a single track definition and that the BED items are properly specified in zero-relative half-open coordinates, where base position 1 of chr4, for example, is specified as: chr4 0 1. The variableStep and fixedStep input formats use one-relative coordinates, where base position 1 of a chromosome is specified as 1. This is why the script adds 1 to each chromStart.
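The conversion also relies on every BED item being exactly SPAN bases wide. A minimal sketch of a check to run before converting follows, assuming four whitespace-separated columns per data line; check_bed_span is a hypothetical helper name.

```shell
#!/bin/sh
# check_bed_span SPAN [file...]: hypothetical helper.
# Exits 0 when every data line is exactly SPAN bases wide; otherwise
# reports the offending lines and exits 1.
check_bed_span() {
    span=$1
    shift
    awk -v span="$span" '
        /^(browser|track|#)/ || NF < 4 { next }   # skip non-data lines
        ($3 - $2) != span {
            printf "line %d: width %d, expected %d\n", NR, $3 - $2, span
            bad = 1
        }
        END { exit bad }
    ' "$@"
}
```

For example, `check_bed_span 400 inputFile.BED.format.txt && echo "safe to convert"` prints the message only when all items are 400 bases wide.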

The conversion script would turn a BED format line, for example:

chr4  400  800 3.14159

which occupies 400 data points in the internal representation of the wiggle data, to the variableStep format:

variableStep chrom=chr4 span=400
401  3.14159

which occupies a single data point in the internal representation of the wiggle data.
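As a sanity check of the coordinate arithmetic, the conversion can be reversed: a one-relative start of 401 with span=400 corresponds to the zero-relative half-open interval 400 to 800. The sketch below expands variableStep lines back into BED lines; wig_to_bed is a hypothetical helper name and only variableStep input is handled.

```shell
#!/bin/sh
# wig_to_bed [file...]: hypothetical helper, variableStep input only.
# Expands each variableStep data line back into a four-column BED line.
wig_to_bed() {
    awk '
        $1 == "variableStep" {
            span = 1                       # span defaults to 1 when omitted
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "chrom") chrom = kv[2]
                if (kv[1] == "span")  span  = kv[2]
            }
            next
        }
        /^(browser|track|#)/ || NF < 2 { next }   # skip non-data lines
        # one-relative start -> zero-relative start; end = start + span
        { printf "%s\t%d\t%d\t%s\n", chrom, $1 - 1, $1 - 1 + span, $2 }
    ' "$@"
}
```

Running it on the variableStep example above reproduces the original BED line (chr4, 400, 800, 3.14159).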


The span specification in variableStep or fixedStep formats should be the same for all data points. You do not want to mix different spans of data together in the same input submission. The span specification in wiggle formats has a very specific use and is not to be used to specify arbitrarily sized data items (that is what the BED format is for). See also: Pre-calculated Zoom (Span) from Wiggle Data Tracks.
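One way to confirm that a submission does not mix spans is to count the distinct span values on its declaration lines. This is a minimal sketch; count_wig_spans is a hypothetical helper name, and a declaration line without span= is treated as span=1, the wiggle default.

```shell
#!/bin/sh
# count_wig_spans [file...]: hypothetical helper.
# Prints the number of distinct span values used on variableStep and
# fixedStep declaration lines; a result above 1 means spans are mixed.
count_wig_spans() {
    awk '
        $1 == "variableStep" || $1 == "fixedStep" {
            span = 1                      # span defaults to 1 when omitted
            for (i = 2; i <= NF; i++)
                if ($i ~ /^span=/) span = substr($i, 6)
            if (!(span in seen)) { seen[span] = 1; n++ }
        }
        END { printf "%d\n", n }
    ' "$@"
}
```

For example, `count_wig_spans outputFile.variableStep.format.txt` should print 1 for output produced by the script above.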

See also: File:FixStepToBedGraph_pl.txt to convert fixedStep wigAscii to bedGraph.