TextReplace

From genomewiki
[[Category:User Developed Scripts]]

Latest revision as of 23:50, 18 March 2007

Sometimes you want to translate these darn AK.... refseq names into something more readable. Given a file with lines of the form <refseq>tab<name> (from the kgxref-table?), I translate them to normal names with the following script.
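
As a rough sketch of what the replacement list looks like once it is loaded: each line holds a key and a name separated by a tab, and a key that occurs more than once collects all of its names, joined by commas. The function name `build_replacements` and the sample accessions and gene names are made up for illustration and are not taken from any real table.

```python
def build_replacements(lines):
    """Build a {from: to} dict from 'from<TAB>to' lines.

    Repeated keys keep every name, joined by commas.
    Lines starting with '#' and blank lines are skipped.
    """
    repl = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        # split on the first tab only
        from_str, to_str = line.rstrip("\r\n").split("\t", 1)
        if from_str in repl:
            repl[from_str] += "," + to_str
        else:
            repl[from_str] = to_str
    return repl


# hypothetical sample data, not real identifiers
sample = [
    "# refseq<TAB>name\n",
    "AK000001\tgeneA\n",
    "AK000001\tgeneA-alias\n",
    "AK000002\tgeneB\n",
]
table = build_replacements(sample)
# table["AK000001"] is "geneA,geneA-alias"
```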

Unlike an SQL query, this appends ALL names for a given refseq, and it can be used on virtually any text file where you want to translate anything into something else. Isn't there a Unix command for this somewhere?


(I think you are referring to the sed command --Hiram )


No, I'm not referring to sed. I've tried generating sed scripts with gawk (many s/from/to/ constructs) and it was very slow: replacing 18000 strings with one big sed script takes ages. This script is reasonably fast while still being very simple. Or are you referring to a special usage of sed that I'm not aware of? --max
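
To make the difference concrete, here is a small illustration (not a benchmark, and only a loose analogy to what sed does internally). The first function tries every rule on every line, so the work grows with the number of rules; the second looks each word up once in a dictionary, so the size of the list hardly matters. The function names and the sample line are invented for this sketch.

```python
def rule_list_replace(line, rules):
    """sed-like: apply every (from, to) rule to the line, one after another."""
    for src, dst in rules:
        line = line.replace(src, dst)
    return line


def dict_replace(line, table):
    """Look up each space-separated word once in a dictionary."""
    return " ".join(table.get(word, word) for word in line.split(" "))


# hypothetical sample data
rules = [("AK000001", "geneA"), ("AK000002", "geneB")]
table = dict(rules)
line = "chr1 100 200 AK000002"

by_rules = rule_list_replace(line, rules)   # scans the line len(rules) times
by_lookup = dict_replace(line, table)       # one lookup per word
```

Both calls give the same result on this line; only the amount of work per line differs.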


#!/usr/bin/python

from sys import *
from optparse import OptionParser
import re

# === COMMAND LINE INTERFACE, OPTIONS AND HELP ===
parser = OptionParser("%prog [options] replaceList textfile: split lines from textfile into words and try to replace words using a replacement list-textfile (format: from tab to).")
parser.add_option("-s", "--splitChars", dest="splitChars", action="store", help="use these characters to split textfile when searching for matches", default="\t ")

(options, args) = parser.parse_args()
splitChars = options.splitChars
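# treat splitChars as a set of single characters (tab OR space by default),
# not as the literal sequence "tab followed by space"
splitCharsRe = re.compile("[" + re.escape(splitChars) + "]")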

# ----------- MAIN --------------
# both the replacement list and the text file are required
if len(args)!=2:
    parser.print_help()
    exit(1)

replFName = args[0]
txtFName = args[1]

# read repl file into dict
replFile = open(replFName,"r")
repl = {}
for l in replFile:
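    # skip comments and blank lines
    if l.startswith("#") or l.strip()=="":
        continue
    # split on the first tab only, so a name containing a tab does not break unpacking
    (fromStr, toStr) = l.strip().split("\t", 1)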
    if fromStr not in repl:
        repl[fromStr] = toStr
    else:
        repl[fromStr] += "," + toStr

replFile.close()

# iterate over lines of textfile and replace
if txtFName!="stdin":
    txtFile = open(txtFName, "r")
else:
    txtFile = stdin
for l in txtFile:
    if l.startswith("#"):
        continue
    # fs = l.split()
    fs = splitCharsRe.split(l.strip())
    for field in fs:
        if field in repl:
            l = l.replace(field, repl[field])
    # l still carries its own newline; stdout.write works on Python 2 and 3
    stdout.write(l)
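
One thing to keep in mind with the script above: `l.replace(field, ...)` rewrites the field wherever it occurs in the line, including inside a longer word. A possible variant, sketched here under the assumption that only whole fields should be translated, splits with a capturing group so the separators are kept and then puts the line back together. The function name `replace_fields` and the sample identifiers are hypothetical.

```python
import re


def replace_fields(line, repl, split_chars="\t "):
    """Replace whole fields only and keep the original separators."""
    # the capturing group makes re.split return the separators as well
    pattern = re.compile("([" + re.escape(split_chars) + "])")
    body = line.rstrip("\n")
    newline = line[len(body):]          # "" or the trailing newline
    parts = pattern.split(body)
    return "".join(repl.get(part, part) for part in parts) + newline


# hypothetical sample: "AK1" is also a prefix of "AK12"
repl = {"AK1": "geneA"}
whole_fields = replace_fields("AK1\tAK12 x\n", repl)
# whole_fields is "geneA\tAK12 x\n"; plain str.replace would also turn AK12 into geneA2
```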