How Liftover Works

From genomewiki
Revision as of 06:40, 15 August 2020 by Galt (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

2020-08-14

I was just looking at the liftOver code, and testing it to answer user questions, some useful notes can be had here for people wondering how to use it or how it works. This is not intended to be exhaustive in detail.

hgLiftOver is a CGI and liftOver command-line utility both share much code through the library file kent/src/hg/lib/liftOver.c. The CGI can be used for interactive liftOvers. The command-line utility can be used by users or automated pipelines.

The main idea is that we have chains which are derived from alignments between two organisms. We can use them to take a region from one assembly version or organism and map or translate it to the equivalent location in the other assembly or orgamism. The alignments that make up the chain are created with BLAT for closely related organisms or different assemblies, and otherwise lastz is used.

We think of this as from and to, so there is a liftOver chain on hg38 that translates from hg38 to panTro6, for example. So that would be /gbdb/hg38/bed/liftOver/hg38ToPanTro6.over.chain.gz This can lift from hg38 region to equivalent region on panTro6.

By convention, for the hg38ToPanTro6.over.chain.gz chain, the target is hg38 (human) and query is panTro6 (chimp).

You can use either coordinates like browser positions, or you can use BED files. I think it can also do PSL files, but lifting only one side. Perhaps some othe formats are supported too.

BED4 is one level with the name, BED6 has score and strand. BED8 has ThickStart and ThickEnd marking where the coding region starts and ends. BED12 has fields for describing multiple exons including number of blocks, block sizes list, and block offsets.

By default liftOver picks only one single chain which has the best score and biggest overlap with the region being lifted. It will not look at other chains at all.

A chain consists of segments which are mapped from target chrom+offset to query chrom+offset. The segment will have a size that is the size of the block on both target and query. The segments within one chain will be sorted in ascending order by target start, and segments will not overlap, but may have gaps between them.

So it loops through all the segments of the chain until it reaches segments that intersect with the region to be lifted. If those segments contain both the region start and end, so that they are well-defined, and the coordinates on the mapped side (panTro6) are thus known, then the region can be lifted, otherwise there is an error. Even if the chain segment only missed containing the region start or end by just one base, it is still considered a failed mapping, so that is strict. It does not care about the middle of that region, such as whether there might be gaps there, it only cares about the start and end points.

For BED4,6,8 it is just a single region chrom, chromStart, chromEnd that must be lifted by successfully finding its start and ends inside the chosen chain's segments.

BED12 (or linked features) have multiple exons defined that are considered linked. Gaps between exons blocks are usually considered introns and are drawn with the herring-bone graphic on the genome browser. liftOver will go exon by exon, and each exon is region that must have both the start and end existing in the chain segments. If any endpoints are missing, the alignment is dropped, unless minBlocks is set very low, in which case the output region may succeed, but the exon with a missing start or end is removed. It makes no effort to say, hey 90% of the exon overlaps with a chain segments, so we will just lift what we can.

-minBlocks=0.5

Setting minBlocks low can allow more BED12 liftovers to succeed, albeit with dropped exons. minBlocks can be lowered with BED12 to allow things to pass. This means the fraction of exons passing over the total exons. If you have two exons, and one passes and the other does not, then that is a rate of 1/2 = 0.5. You have to lower minBlocks from 1 to 0.5 to see it in the output, and even then one exon will have disappeared from the lifted result. It is the minimum limit for the fraction of exons that pass.

-multiple option

If you specify -multiple, it will repeat the search on the rest of the liftOver chains loaded that overlap the region. This means that you can get multiple output rows going to different chromosomes and/or chromosome locations for a single input region, and they will have come from different chains.

-minMatch=0.95 Quick-Discard - Many chains are discarded quickly with an error if the chain->score is too low: chain->score < minMatch * bedSize So set minMatch to something lower if you want more chains explored, but avoid minMatch=0.0 which causes segfault.

liftOver Chains Are Not All The Chains. The liftover chains are just a subset off all the chains in the full chains track in the browser. They are filtered by net-syntenic. So it does not have access to every chain that you can see in the region.

hg38.panTro6.all.chain.gz becomes filtered to hg38ToPanTro6.over.chain.gz which is used by liftover. liftover filtering uses these programs: chainPreNet, chainNet, netSyntenic, netChainSubset, chainStitchId.

The filtered liftover chain looks a lot like the net track (instead of chain track) in the browser, so maybe you can just use that instead of wondering what stuff made it through the filtering.

In case it was not obvious, all lifts that succeed happen on one chain. So if there is not a single chain that covers your entire region, you will either lose exons with minBlocks, or it will fail the lift entirely. Splitting your region into multiple chromsomes or chains cannot happen directly, but you can use -multiple to achieve the same thing if you need to. Related output rows from using -multiple will share the same name from BED column 4.

Of course, the unmapped output is identical to the input, because it was not lifted at all.

"#Boundary problem: need 2, got 1, diff 1, mapped 0.5" "mapped 0.5" tells you what you need to use to set -minBlocks=0.5 to succeed. "need 2, got 1, diff 1," diff=difference, i.e. the between the total number of exons and the number that you got. This is also the number that did not pass, and will be dropped in the output.