Lastz/chain/net/multiz considerations/caveats/restrictions/limitations

From genomewiki
Revision as of 18:25, 18 December 2017 by Hiram (talk | contribs) (continuing with multiz notes)

Introduction

The lastz/chain/net/multiz processing pipeline is the primary alignment procedure used at the U.C. Santa Cruz Genomics Institute to produce multiple alignments. Consumers of the resulting data should take a number of considerations into account, since they can affect the conclusions drawn from analysis of these alignments.

lastz

As with any alignment algorithm, the choice of parameters for lastz is critical to the results produced by this alignment program. Typically the parameters chosen fall into three categories based on the phylogenetic distance and/or clade relationship of target and query sequences.

  1. primate to primate alignments (e.g. human<->chimp)
  2. closer phylogenetic relationship (e.g. human<->mouse)
  3. more distant phylogenetic relationship (e.g. human<->fish)

Parameters used for lastz are recorded in the UCSC Multiple Alignments page.

One parameter of particular interest is --masking=<count> (aka M=<count>), usually set to a count of 254. From the lastz documentation:

 Dynamically mask the target sequence by excluding any positions that appear in too many alignments from further consideration for seeds.

 Specifically, a cumulative count is maintained of the number of times each target location is aligned. After each query sequence and strand is
 processed, any locations that have been output in at least <count> alignment blocks are masked, so they will be excluded from the seeding stage
 for subsequent query sequences. Since repetition discovered while processing one sequence strand is only masked for subsequent sequence
 strands, this option has no effect on the first strand of the first sequence in the query file.

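Therefore, once a target location has appeared in 254 alignment blocks, it is excluded from seeding for all subsequent query sequences, and later query matches that would have seeded at that location are not included in the output. Because the mask is updated only after each query sequence and strand has been processed, the count is not a hard cap within a single query sequence and strand.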
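The bookkeeping described in the quoted documentation can be sketched roughly as follows. This is a simplified illustration, not lastz code: the function name simulate_dynamic_masking and the representation of each alignment block as a single target position are assumptions made for the example. Real lastz masks positions only for the seeding stage, so an alignment seeded elsewhere may still extend across a masked location.

```python
from collections import Counter


def simulate_dynamic_masking(query_hits, mask_count=254):
    """Toy model of dynamic masking (hypothetical helper, not part of lastz).

    query_hits: one list per query sequence/strand, in processing order,
                holding the target positions that query would align to.
    Returns (reported, masked): the positions actually reported per query,
    and the set of target positions masked at the end.
    """
    counts = Counter()   # cumulative number of alignments per target position
    masked = set()       # positions excluded from seeding
    reported = []
    for hits in query_hits:
        # Positions masked by earlier queries cannot seed new alignments.
        kept = [pos for pos in hits if pos not in masked]
        reported.append(kept)
        counts.update(kept)
        # The mask is updated only after the whole query/strand is processed.
        masked.update(pos for pos, n in counts.items() if n >= mask_count)
    return reported, masked


# Small threshold so the effect is visible in a short example.
reported, masked = simulate_dynamic_masking(
    [[100, 100, 100, 100], [100, 200], [200]], mask_count=3
)
print(reported)  # first query keeps all four hits; the second loses position 100
print(masked)
```

In this toy run the first query reports four hits at position 100 even though the threshold is three, because masking takes effect only for the queries that follow.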

chain

net

multiz

The first important point to consider for multiz results is that these are reference-based alignments. Only alignments between each query sequence and the target (reference) sequence are included in the final multiple alignment, because the only inputs to multiz are the pairwise query-to-target alignments. The result carries no information about relationships among the query sequences themselves, so potentially important alignments between query sequences will not be seen in the multiz result.
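To make the consequence of a reference-based design concrete, the short sketch below enumerates which species pairs are covered by pairwise inputs and which are never compared directly. The function name pairwise_coverage and the species list are illustrative assumptions, not part of multiz or of any UCSC tool.

```python
from itertools import combinations


def pairwise_coverage(reference, queries):
    """Split all species pairs into those used as input and those never seen.

    Hypothetical helper: in a reference-based alignment only the
    (reference, query) pairs are aligned; query-to-query pairs are not.
    """
    species = [reference] + list(queries)
    all_pairs = {frozenset(pair) for pair in combinations(species, 2)}
    used = {frozenset((reference, q)) for q in queries}
    return used, all_pairs - used


used, unseen = pairwise_coverage("human", ["chimp", "mouse", "dog"])
print(sorted(sorted(pair) for pair in used))    # pairs aligned directly
print(sorted(sorted(pair) for pair in unseen))  # query-to-query pairs, never aligned
```

With one reference and N query species there are N pairwise inputs but N(N-1)/2 query-to-query pairs that are never aligned to each other, so the unseen fraction grows quickly as species are added.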

The second important point about the multiz alignments is the choice of chain/net used as input to multiz. The choice is one of three types of chain/net results:

  1. syntenic net - higher quality genome assemblies, not too phylogenetically distant
  2. reciprocal best - when the query genome is poor quality, high contig count, lower N50 size
  3. net - the fundamental net alignment, used for phylogenetically distant species

The type of chain/net used in the alignment is captured in the UCSC Multiple Alignments page.

The syntenic net and reciprocal best net are subsets of the fundamental net alignment.