Lastz DEF file parameters


Adjusting SEQ_LAP, SEQ_CHUNK and SEQ_LIMIT parameters for the lastz/chain/net pipeline.

The goal of adjusting these parameters is to obtain a reasonable number of cluster jobs.

  • SEQ1_LAP is always 10000
  • SEQ2_LAP is always 0

The LAP value sets the amount of overlapping sequence in the resulting partitioned sequences. The overlap of 10,000 for SEQ1 helps prevent artificial breaks in alignments due to boundary issues.

Typical CHUNK sizes are 10000000 or 20000000 for both SEQ1_CHUNK and SEQ2_CHUNK. The CHUNK size sets the limit on the total amount of sequence allowed in one alignment. The typical run-time of one alignment and the characteristics of your cluster determine how much sequence you want in one alignment. For cluster efficiency, the run-time of one alignment should be short enough that jobs cycle through the cluster at a reasonable rate, which allows appropriate sharing of the cluster with other users. If a CHUNK size of 20000000 makes the run-time too long, use the smaller size of 10000000.
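How CHUNK and LAP interact on one long sequence can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own partitioning code: the function name split_with_overlap is hypothetical, and the windowing rule (windows start every CHUNK bases and extend LAP bases past that, so neighbours share LAP bases) is an assumption.

```python
def split_with_overlap(seq_len, chunk, lap):
    """Return 0-based, half-open (start, end) windows covering a sequence.

    Assumption (not taken from the pipeline source): windows start every
    `chunk` bases and span `chunk + lap` bases, so each window shares
    `lap` bases with the next one.
    """
    if chunk <= 0 or lap < 0:
        raise ValueError("chunk must be positive and lap must not be negative")
    windows = []
    start = 0
    while start < seq_len:
        end = min(start + chunk + lap, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break  # the sequence is fully covered
        start += chunk
    return windows
```

For example, split_with_overlap(25000000, 10000000, 10000) gives three windows, and the first two each extend 10,000 bases into the next one. With a lap of 0, as used for SEQ2, the windows simply abut.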

The LIMIT is the adjustment that determines the number of chunks, and it is related to the number of contigs in your assembly. When multiple sequences are placed into one alignment, the LIMIT caps how many individual sequences are placed in one CHUNK, even if their total size has not yet reached the CHUNK size limit. Assemblies with very high contig counts could use a LIMIT up to 1000; assemblies with low contig counts could use a LIMIT as low as 5. Again, this is a trade-off between the run-time of one alignment and cluster efficiency. If both genome assemblies have high contig counts, both will require a high LIMIT, and the run will take extra cluster time.
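The combined effect of CHUNK and LIMIT on small sequences can be sketched as a simple grouping routine. The function name group_sequences is hypothetical, and the greedy, in-order packing shown here is an assumption; the partition step of the pipeline may group sequences differently. Sequences longer than CHUNK are not split in this sketch; each one ends up in a group of its own.

```python
def group_sequences(sizes, chunk, limit):
    """Group (name, size) pairs into lists of names, one list per job.

    A group is closed as soon as adding the next sequence would push the
    summed size above `chunk`, or when it already holds `limit` sequences.
    Assumption: sequences are packed greedily in the order given.
    """
    groups = []
    current = []
    current_total = 0
    for name, size in sizes:
        over_size = current_total + size > chunk
        over_count = len(current) >= limit
        if current and (over_size or over_count):
            groups.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += size
    if current:
        groups.append(current)
    return groups
```

With many tiny contigs the LIMIT closes groups long before the CHUNK size is reached, so raising the LIMIT reduces the number of groups and therefore the number of cluster jobs, at the cost of more work per job.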

To see the effect of CHUNK size and LIMIT, use the option -stop=partition to the command, which stops the process after chopping up both genome sequences. Save the output of that procedure to a log file; it will contain the statement:

 cluster batch jobList size: 95567 = 227 * 421

where 227 is the number of lines in run.blastz/target.lst and 421 is the number of lines in run.blastz/query.lst.
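When experimenting with several settings, it can be handy to pull these three numbers out of the saved log automatically. The following is a minimal sketch; the function name parse_joblist_size is hypothetical, and the regular expression assumes the statement appears in exactly the form shown above.

```python
import re

# Matches e.g. "cluster batch jobList size: 95567 = 227 * 421"
JOBLIST_RE = re.compile(
    r"cluster batch jobList size:\s*(\d+)\s*=\s*(\d+)\s*\*\s*(\d+)"
)


def parse_joblist_size(log_text):
    """Return (total_jobs, target_lines, query_lines), or None if absent."""
    match = JOBLIST_RE.search(log_text)
    if match is None:
        return None
    total, target, query = (int(value) for value in match.groups())
    return total, target, query
```

Passing the contents of the log file to parse_joblist_size returns the job count together with its two factors, so the effect of a changed CHUNK or LIMIT on either the target or the query side is easy to compare between runs.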

In general, a cluster batch jobList size of something less than 100000 is reasonable for a cluster of 500 to 1000 CPUs/cores.
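The same check can be made directly from the two list files. This sketch counts the non-empty lines in target.lst and query.lst and compares their product with the guideline above. The function names and the MAX_JOBS constant are hypothetical, and the threshold is only the rule of thumb stated here, not a hard limit of the pipeline.

```python
from pathlib import Path

# Rule of thumb from the text for a cluster of 500 to 1000 CPUs/cores.
MAX_JOBS = 100000


def count_lines(path):
    """Count the non-empty lines in a list file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def job_count(run_dir="run.blastz"):
    """Return the number of target.lst lines times the number of query.lst lines."""
    run_dir = Path(run_dir)
    return count_lines(run_dir / "target.lst") * count_lines(run_dir / "query.lst")


def is_reasonable(jobs, max_jobs=MAX_JOBS):
    """True when the job count is below the guideline threshold."""
    return jobs < max_jobs
```

Calling is_reasonable(job_count()) from the working directory of the run gives a quick yes or no answer after each partition experiment.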

To experiment repeatedly with this partition step, remove the run.blastz directory to start over.