Lastz tuning procedure: Difference between revisions

Revision as of 17:06, 19 March 2015

Introduction

For the lastz/chain/net procedure at UCSC, we attempt to tune the lastz parameters when the target and query species are phylogenetically distant from either human or mouse, since the lastz default parameters are already tuned for human and mouse alignments.

The procedure is:

extract 'genscan' proteins from each pair of species to align (can use any gene table)
blat the proteins to each other, select the highest scoring alignments
for each of the highest scoring alignments, extract the full DNA sequence of each gene, coding and non-coding, plus 5,000 bases upstream of the transcript. Pad the shorter of the two sequences with extra DNA sequence on each end so that both are nearly the same size. Concatenate all the sequences to produce a single sequence representing all of these genes, one file for each species
Run the lastz_D tuning procedure for four different collections of these sequences:
1. top 100 alignments
2. top 200 alignments
3. top 300 alignments
4. top 400 alignments
Compare the output of the four trials to verify that they are consistent and produce similar parameters. Sometimes one of the results will be radically different. If at least three of the results are consistent, choose from among them the one with the largest number of alignments. Usually this is the top-400, sometimes it is the top-300. If none of them are consistent, simply use the lastz standard defaults. This is not a perfect procedure: sometimes the lastz standard defaults will produce more alignment in a full chain/net procedure.
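The four nested collections listed above can be cut from one ranked list. A minimal sketch, assuming a hypothetical input file with one alignment per line and the score in the first column; the function name, file names, and column layout are illustrative and are not taken from topAll.sh:

```shell
#!/bin/sh
# Hypothetical helper: build the four nested "top N" collections
# (100, 200, 300, 400) from a list of scored alignments.
# Assumption: one alignment per line, numeric score in column 1.
make_top_lists() {
    scored=$1    # input list, in any order
    prefix=$2    # outputs are ${prefix}.100.list ... ${prefix}.400.list

    # Sort once, highest score first, then take nested prefixes.
    sort -k1,1nr "$scored" > "${prefix}.sorted.list"
    for n in 100 200 300 400; do
        head -n "$n" "${prefix}.sorted.list" > "${prefix}.${n}.list"
    done
}
```

Because each list is a prefix of the next, a collection that produces radically different parameters points at the alignments added in that step.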

Fetch Protein Fasta

To fetch the protein fasta sequence, you need the kent userApps and a $HOME/.hg.conf file containing:

db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral

For example, on hg38:

hgsql -N -e 'select * from genscan;' hg38 | cut -f2- > hg38.genes.gp
rsync -a -P rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit ./
getRnaPred -peptides -genomeSeqs=hg38.2bit hg38 hg38.genes.gp all hg38.genes.pep
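The same three commands are needed once for the target and once for the query. A minimal sketch that prints them for any assembly name, assuming the assembly has a genscan table and that its 2bit file follows the same download path pattern as hg38; the function name fetch_pep_cmds is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the fetch commands from this section with the
# assembly name substituted. Nothing is executed here; the output is plain
# text that can be reviewed before running it.
# Assumptions: ${db} has a genscan table, and its 2bit file lives at
# goldenPath/${db}/bigZips/${db}.2bit on the download server.
fetch_pep_cmds() {
    db=$1
    cat <<EOF
hgsql -N -e 'select * from genscan;' ${db} | cut -f2- > ${db}.genes.gp
rsync -a -P rsync://hgdownload.cse.ucsc.edu/goldenPath/${db}/bigZips/${db}.2bit ./
getRnaPred -peptides -genomeSeqs=${db}.2bit ${db} ${db}.genes.gp all ${db}.genes.pep
EOF
}
```

Calling fetch_pep_cmds with an assembly name prints the commands; piping that output to sh runs them, provided hgsql, rsync, and getRnaPred are installed.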

BLAT proteins

The blat command is:

blat -prot -oneOff=1 ${target}.genes.pep ${query}.genes.pep -out=maf ${target}.${query}.oneOff.maf

Scan the resulting maf file with the mafScoreSizeScan.pl script (File:MafScoreSizeScan_pl.txt):

mafScoreSizeScan.pl ${target}.${query}.oneOff.maf > mafScoreSizeScan.list
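To get a feel for what a score scan of a maf file involves, here is a rough stand-in. It is not mafScoreSizeScan.pl, whose output format is not described here. The sketch relies only on the MAF convention that each alignment block starts with an "a score=..." line followed by "s" lines naming the aligned sequences; the function name maf_scores is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list each MAF block as
#   score <TAB> first sequence name <TAB> second sequence name
# sorted with the highest score first.
maf_scores() {
    awk '
        function emit() {
            if (have) printf "%s\t%s\t%s\n", score, first, second
            have = 0; first = ""; second = ""
        }
        # An "a" line opens a new alignment block and carries the score.
        $1 == "a" {
            emit()
            score = "NA"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^score=/) score = substr($i, 7)
            have = 1
            next
        }
        # The first two "s" lines of a block name the aligned sequences.
        $1 == "s" && have {
            if (first == "") first = $2
            else if (second == "") second = $2
        }
        END { emit() }
    ' "$1" | sort -k1,1nr
}
```

Running maf_scores on ${target}.${query}.oneOff.maf gives a quick ranked view of the protein alignments before the top groups are chosen.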

running lastz_D tuning

This step requires the 2bit files for target and query sequence. For UCSC assemblies, they can be obtained as indicated above:

rsync -a -P rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit ./

The topAll.sh script (File:TopAll_sh.txt) will run each of the four groups of sequences:

topAll.sh ${target} ${query}

Requires access to the scripts File:SelectedFasta_sh.txt and File:AdjustSizes_pl.txt, the kent command twoBitToFa, the lastz_D binary, and 2bit sequence files for both target and query. Also requires the files create_scores_file.control and expand_scores_file.py from the lastz package.
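Since the run depends on several external pieces, a quick check before starting can save a failed run. A minimal sketch, assuming the commands are expected on PATH and the files in the current directory; where topAll.sh actually looks for them is not stated here, and both function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical pre-flight helpers: print whatever is missing, one per line.
# Empty output means everything named was found.

# Report commands that are not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Report files that do not exist.
missing_files() {
    for f in "$@"; do
        [ -e "$f" ] || echo "$f"
    done
}

# Example calls (assumed locations, adjust paths as needed):
#   missing_tools twoBitToFa lastz_D
#   missing_files "${target}.2bit" "${query}.2bit" \
#       create_scores_file.control expand_scores_file.py
```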

@@ Line 39: / Line 39: @@
 ==running lastz_D tuning==
-This step is going to need 2bit files for each sequence.  For UCSC assemblies, they can be obtained:
+This step requires the 2bit files for target and query sequence.  For UCSC assemblies, they can be obtained as indicated above:
+ rsync -a -P rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit ./
 This topAll.sh script will run each of the four groups of sequences:
@@ Line 45: / Line 46: @@
 Requires access to scripts [[File:SelectedFasta_sh.txt]] and [[File:AdjustSizes_pl.txt]]
-the kent command '''twoBitToFa''' the '''lastz_D''' binary, and unmasked 2bit sequence files for both
+the kent command '''twoBitToFa''' the '''lastz_D''' binary, and 2bit sequence files for both
-target and query.  The '''adjustSizes.pl''' script looks for access to chrom.sizes
+target and query.  Also the files create_scores_file.control and expand_scores_file.py from
-files for both target and query.  This could be improved by getting the chrom.sizes out
+the lastz package.
-of the unmasked 2bit files.
