Lastz tuning procedure: Difference between revisions
(changed cse to soe and genome-source fixes) |
|||
(17 intermediate revisions by one other user not shown) | |||
Line 16: | Line 16: | ||
# Compare the resulting output from each of those four trials to verify they are consistent and produce similar parameters. Sometimes one of those results will be radically different. From the set of at least three results that are consistent, choose the one with the largest number of alignments. Usually this is the top-400, sometimes it is the top-300. If none of them are consistent, simply use lastz standard defaults. This isn't a perfect procedure, sometimes lastz standard defaults will produce more alignment in a full chain/net procedure. | # Compare the resulting output from each of those four trials to verify they are consistent and produce similar parameters. Sometimes one of those results will be radically different. From the set of at least three results that are consistent, choose the one with the largest number of alignments. Usually this is the top-400, sometimes it is the top-300. If none of them are consistent, simply use lastz standard defaults. This isn't a perfect procedure, sometimes lastz standard defaults will produce more alignment in a full chain/net procedure. | ||
==Fetch Protein Fasta== | ==Fetch Protein Fasta== | ||
To fetch protein fasta sequence, assuming you have the kent [http://genome-source. | To fetch protein fasta sequence, assuming you have the kent [http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/userApps/README userApps] and [http://genome.ucsc.edu/goldenPath/help/mysql.html $HOME/.hg.conf] set to: | ||
db.host=genome-mysql. | db.host=genome-mysql.soe.ucsc.edu | ||
db.user=genomep | db.user=genomep | ||
db.password=password | db.password=password | ||
Line 25: | Line 25: | ||
For example on hg38: | For example on hg38: | ||
hgsql -N -e 'select * from genscan;' hg38 | cut -f2- > hg38. | hgsql -N -e 'select * from genscan;' hg38 | cut -f2- > hg38.genes.gp | ||
rsync -a -P rsync://hgdownload. | rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit ./ | ||
getRnaPred -peptides -genomeSeqs=hg38.2bit hg38 hg38. | getRnaPred -peptides -genomeSeqs=hg38.2bit hg38 hg38.genes.gp all hg38.genes.pep | ||
==BLAT proteins== | ==BLAT proteins== | ||
The blat command is: | The blat command is: | ||
blat -prot -oneOff=1 ${target}. | blat -prot -oneOff=1 ${target}.genes.pep ${query}.genes.pep -out=maf ${target}.${query}.oneOff.maf | ||
Scan the resulting maf file: | Scan the resulting maf file: | ||
[[ | [[File:MafScoreSizeScan_pl.txt]] mafScoreSizeScan.pl ${target}.${query}.oneOff.maf > mafScoreSizeScan.list | ||
ave mafScoreSizeScan.list | grep "^Q3" | awk '{print $2}' | sed -e 's/.000000//' > mafScoreSizeScan.Q3 | |||
'''ave''' is a ''kent'' user apps command. | |||
==running lastz_D tuning== | |||
This step requires the 2bit files for target and query sequence. For UCSC assemblies, they can be obtained as indicated above: | |||
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit ./ | |||
This topAll.sh script will run each of the four groups of sequences: | |||
[[File:TopAll_sh.txt]] topAll.sh ${target} ${query} | |||
Requires access to scripts [[File:SelectedFasta_sh.txt]] '''selectedFasta.sh''' and [[File:AdjustSizes_pl.txt]] '''adjustSizes.pl''' | |||
the kent commands '''twoBitToFa, twoBitInfo''' the '''lastz_D''' binary, and 2bit sequence files for both | |||
target and query. Also the files [[File:Create_scores_file_control.txt]] '''create_scores_file.control''' and [[File:Expand_scores_file_py.txt]] '''expand_scores_file.py''' from the lastz package. | |||
==survey results== | |||
The script [[File:MatrixSummary_pl.txt]] '''matrixSummary.pl''' will scan all '''*.txt''' files in a directory to summarize the resulting score matrix values from the four lastz_D tuning operations. It is assuming the only '''*.txt''' files present are the '''$target.$query.tuning.topNNN.txt''' files. It will output the average of the score matrix, the range of the values in the score matrix, and the range in percent of the values. Check this output to see if the values are radically different, or they are smoothly similar. If they are all similar, use the highest number topNNN values. Someones one will be oddly different, ignore it and use the next highest topNNN values. | |||
==see also== | |||
Copies of the scripts for the UCSC compute environment in the [http://genome-source.soe.ucsc.edu/gitlist/kent.git/tree/master/src/hg/utils/automation/lastz_D kent source tree] |
Latest revision as of 07:24, 1 September 2018
Introduction
For the lastz/chain/net procedure at UCSC, we attempt to tune the lastz parameters when the target and query species are phylogenetic distant from either human or mouse since the normal lastz default parameters are already tuned for human and mouse alignments.
The procedure will be:
- extract 'genscan' proteins from each pair of species to align (can use any gene table)
- blat the proteins to each other, select the highest scoring alignments
- for each highest scoring alignment, extract the full DNA sequence for each gene, coding and non-coding, plus 5,000 bases upstream of the transcript plus extra DNA sequence on each end for the shorter sequence to get them nearly the same size. Concatenate all the sequences together to produce one single sequence representing all of these gene sequences, one file for each species
- Run the lastz_D tuning procedure for four different collections of these sequences:
- top 100 alignments
- top 200 alignments
- top 300 alignments
- top 400 alignments
- Compare the resulting output from each of those four trials to verify they are consistent and produce similar parameters. Sometimes one of those results will be radically different. From the set of at least three results that are consistent, choose the one with the largest number of alignments. Usually this is the top-400, sometimes it is the top-300. If none of them are consistent, simply use lastz standard defaults. This isn't a perfect procedure, sometimes lastz standard defaults will produce more alignment in a full chain/net procedure.
Fetch Protein Fasta
To fetch protein fasta sequence, assuming you have the kent userApps and $HOME/.hg.conf set to:
db.host=genome-mysql.soe.ucsc.edu db.user=genomep db.password=password central.db=hgcentral
For example on hg38:
hgsql -N -e 'select * from genscan;' hg38 | cut -f2- > hg38.genes.gp rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit ./ getRnaPred -peptides -genomeSeqs=hg38.2bit hg38 hg38.genes.gp all hg38.genes.pep
BLAT proteins
The blat command is:
blat -prot -oneOff=1 ${target}.genes.pep ${query}.genes.pep -out=maf ${target}.${query}.oneOff.maf
Scan the resulting maf file:
File:MafScoreSizeScan pl.txt mafScoreSizeScan.pl ${target}.${query}.oneOff.maf > mafScoreSizeScan.list ave mafScoreSizeScan.list | grep "^Q3" | awk '{print $2}' | sed -e 's/.000000//' > mafScoreSizeScan.Q3
ave is a kent user apps command.
running lastz_D tuning
This step requires the 2bit files for target and query sequence. For UCSC assemblies, they can be obtained as indicated above:
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit ./
This topAll.sh script will run each of the four groups of sequences:
File:TopAll sh.txt topAll.sh ${target} ${query}
Requires access to scripts File:SelectedFasta sh.txt selectedFasta.sh and File:AdjustSizes pl.txt adjustSizes.pl the kent commands twoBitToFa, twoBitInfo the lastz_D binary, and 2bit sequence files for both target and query. Also the files File:Create scores file control.txt create_scores_file.control and File:Expand scores file py.txt expand_scores_file.py from the lastz package.
survey results
The script File:MatrixSummary pl.txt matrixSummary.pl will scan all *.txt files in a directory to summarize the resulting score matrix values from the four lastz_D tuning operations. It is assuming the only *.txt files present are the $target.$query.tuning.topNNN.txt files. It will output the average of the score matrix, the range of the values in the score matrix, and the range in percent of the values. Check this output to see if the values are radically different, or they are smoothly similar. If they are all similar, use the highest number topNNN values. Someones one will be oddly different, ignore it and use the next highest topNNN values.
see also
Copies of the scripts for the UCSC compute environment in the kent source tree