Unix environment

From genomewiki

Working in the UNIX environment

Editor

The most important tool will most likely be your editor. It doesn't matter which one you use, but whatever it is, learn it well. vi and emacs are the most common editors used in the unix environment. Your choice of editor becomes critical when you use your shell command line in its editing mode, since the command line then responds to your editor's keystrokes. There are very good tutorials on the internet for your editor. There is a VI quick start command listing in genomewiki, or type "vimtutor" at the command line. See also: Editor War. <embedurl>http://www.cse.iitm.ac.in/~osslab/joomla/images/stories/vim-editor_logo.png{width=100}{height=100}</embedurl><embedurl>http://macin.files.wordpress.com/2008/10/carbon-emacs-icon.png{width=100}{height=100}</embedurl>

Shell

There are two shells in common use on unix: bash and tcsh. Next to your editor, your shell command line is going to be a critical element of your efficiency using unix. You will want the command line editing features turned on for your command line to recognize your favorite editor commands. Learn how to use your command line editing feature.

Understand what stdout, stderr and stdin are and how to control their input and output in compound shell commands. There are very good bash and tcsh tutorials on the internet. Once you know command line editing, you will never again compose a long command line, find out it has a typo in it, and then have to type the whole thing in again. Use your command line editor to rapidly fix the typo and repeat the corrected command.
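
As a minimal sketch of the three standard streams, the following works in bash and other Bourne-style shells (tcsh uses a different redirection syntax). The function name noisy and the file names are made up for illustration:

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A made-up command that writes one line to stdout and one to stderr.
noisy() {
    echo "result line"
    echo "warning line" >&2
}

# Send stdout and stderr to separate files.
noisy > out.txt 2> err.txt

# Merge stderr into stdout, then pipe both streams into one command.
both=$(noisy 2>&1 | wc -l)

# Discard stderr and keep only stdout.
only_out=$(noisy 2> /dev/null)

# Feed stdin from a file instead of the keyboard.
count=$(wc -l < out.txt)
```

Here both counts two lines because the pipe receives the merged streams, while only_out holds just the stdout line.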

Also, verify that you can easily cut and paste between your shell command line and other applications on your desktop. This function depends upon what kind of desktop you operate. Each desktop may have different mechanisms for this function.

Customize your login shell with a good .rc file. For example, note customizations in Hiram's file:

~hiram/.bashrc.hiram

Do not place passwords or other sensitive information in these files. chmod them to 644 and allow your work to inspire others.
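
A small sketch of the kind of customization that could go in a ~/.bashrc follows. These lines are illustrative assumptions, not taken from Hiram's file, and the alias name ll is arbitrary:

```shell
# Hypothetical ~/.bashrc fragment; adjust to taste.

# Keep a long command history.
HISTSIZE=10000

# A short alias for a long directory listing.
alias ll='ls -l'

# Only in interactive shells: use vi key bindings for command line editing
# (use "set -o emacs" for emacs key bindings instead).
case $- in
    *i*) set -o vi ;;
esac
```

After editing the file, chmod 644 ~/.bashrc makes it readable by others, as suggested above.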

See also: [Bash vs. Csh]

Regular Expressions

You will be using regular expressions in your editor, your shell and just about every other command, so you will need to know how to use them. You can get pretty far with a minimal familiarity with the basics. Keep a reference handy for the odd cases where you need to use the more extensive operations.
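
To give a feel for the basics, here is a small sketch using grep on made-up input. The function name sample and the sequence names are assumptions for illustration only:

```shell
# Sample input: four made-up sequence names, one per line.
sample() {
    printf '%s\n' chr1 chr10 chrX scaffold_12
}

# ^ anchors to the start of the line, [0-9] is a character class,
# * repeats the previous item zero or more times, $ anchors to the end.
digits_only=$(sample | grep '^chr[0-9][0-9]*$')

# . matches any single character: "chr" followed by exactly one character.
one_char=$(sample | grep '^chr.$')

echo "$digits_only"
echo "$one_char"
```

The first pattern selects chr1 and chr10, the second selects chr1 and chrX.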

Commands

Finding good commands

You can use the apropos command to find a command related to some function you would like to perform. The apropos command performs a simple string search, of a given word, through the documentation pages to output a single line header of commands matching that string. The command may be related to what you want to accomplish.

<embedurl>http://farm1.static.flickr.com/186/451709341_0930c677e0.jpg?v=0{width=510}{height=330}</embedurl>

Use the man someCommand command to view the manual page for someCommand.

[hiram@hgwdev /tmp] apropos apropos
apropos              (1)  - search the whatis database for strings
man                 (rpm) - A set of documentation tools: man, apropos and whatis.
[hiram@hgwdev /tmp] man apropos
apropos(1)                                                          apropos(1)
NAME
      apropos - search the whatis database for strings
... etc ...

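Fun tricks with man. Try this to see the raw formatting characters in a manual page (on many systems, backspace sequences used for bold and underline):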

man -Pcat man | cat -A | less

Then, with col -b removing the backspace sequences:

man -Pcat man | col -b | cat -A | less

The (1) in apropos(1) indicates which section of the manuals the command can be found in. The first set of books were labeled with the integers 1 2 3 4 ... Therefore section (1) indicates the first book. This numbering system broke down after the manuals became electronic documents without any real instantiation in an actual book.

grep

The grep command is used to find lines in text files, or in streaming output from previous pipeline commands. Given a string, any line matching that string is printed out. The output can be the inverse, printing out any lines that do not match the string.

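By default grep interprets your string as a basic regular expression. If you want to use extended regular expression syntax instead, such as the | alternation operator, use grep with the -E argument, or use the egrep command which is equivalent to grep -E.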

Example: select only the bed format lines from a custom track file so they can be used with hgLoadBed. Removing any lines that begin with track or browser, an example of the inverse function with the -v argument:

egrep -v "^track|^browser" customTrack.txt > file.bed

egrep (or grep -E) is needed for this type of "or" pattern with the pipe alternatives "^track|^browser".

Alternatively, if it is known that all chromosome names start with "chr", select only those lines:

grep "^chr" customTrack.txt > file.bed

Both these examples assume there is only one track defined in customTrack.txt
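
The two approaches can be compared on a tiny made-up custom track file. This is only a sketch: the file contents and the output names inverse.bed and direct.bed are assumptions for illustration:

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A made-up custom track file: two header lines, then two bed lines.
{
    printf 'browser position chr1:1-1000\n'
    printf 'track name=example\n'
    printf 'chr1\t100\t200\n'
    printf 'chr2\t300\t400\n'
} > customTrack.txt

# Drop the header lines with an inverse match (-v) ...
grep -E -v '^track|^browser' customTrack.txt > inverse.bed

# ... or keep only the lines that start with chr.
grep '^chr' customTrack.txt > direct.bed

# cmp prints nothing when the two files are identical.
cmp inverse.bed direct.bed
```

For this input both commands produce the same two bed lines.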

To efficiently scan an entire directory hierarchy of files, use the following find | xargs grep pipeline:

find . -type f -print0 | xargs --null grep "<your string>"

The -print0 argument to find combined with the --null argument on xargs makes this pipeline work properly even if the file names include blanks. If you know all your file names have no blanks, omit the -print0 and --null. Blanks in unix file names are discouraged since they always need to be taken care of in special ways when working with a command line which interprets whitespace as the delimiter for separate strings. The usual work-around is to use underscore in place of_blanks_in_file_names.
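
A short sketch of the blank-safe pipeline on a made-up directory tree follows. The directory and file names are assumptions, and -0 is used as the short form of the --null argument to xargs:

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# One file with blanks in its path, one without.
mkdir -p "sub dir"
echo "needle here" > "sub dir/my notes.txt"
echo "nothing" > plain.txt

# -l makes grep print only the names of files that contain a match.
matches=$(find . -type f -print0 | xargs -0 grep -l "needle")
echo "$matches"
```

Without -print0 and -0, xargs would split "sub dir/my notes.txt" at the blanks and grep would be handed file names that do not exist.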

You can also grep the contents of a manual page. For example, find the string apropos in the man manual page:

man -Pcat man | col -b | grep -i apropos

See also: Sieve

sed

The sed command is used to perform batch text editing in files. The simple format:

sed -e "s/search string/replace string/g" someFile.txt > result.txt

This replaces all occurrences of "search string" with "replace string" in someFile.txt, with the result going to result.txt. The "g" means global: every instance of the string on a line is replaced, not just the first. If this looks similar to the perl syntax:

$result =~ s/search string/replace string/g;

it is because perl copied this syntax from sed. The search string can be a regular expression. For example, given a bed file from Ensembl with Ensembl chromosome names (numbers), change them to UCSC chromosome names, prefixed with "chr":

sed -e "s/^\([0-9XY][0-9]*\)/chr\1/; s/^MT/chrM/; s/^Un/chrUn/" ensembl.bed > ucsc.bed

The backslash paren syntax \(...\) captures that match in a sequentially numbered buffer which can later be used in the replace string as \1 where 1 is the sequential buffer count. The caret ^ anchors the match to the beginning of the line. Multiple operations are separated by semicolons. Know your regular expressions well to get the best results from sed. It is a very powerful and useful tool.
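
The chromosome renaming can be tried on a few made-up lines before running it on a real file. In this sketch the function name ensembl_lines and its data are assumptions for illustration:

```shell
# Made-up bed lines with Ensembl-style chromosome names.
ensembl_lines() {
    printf '1\t100\t200\n'
    printf 'X\t300\t400\n'
    printf 'MT\t500\t600\n'
}

# Apply the same three substitutions, then keep only the name column.
converted=$(ensembl_lines \
    | sed -e "s/^\([0-9XY][0-9]*\)/chr\1/; s/^MT/chrM/; s/^Un/chrUn/" \
    | cut -f1)
echo "$converted"
```

The MT line is not touched by the first substitution because M is not in the [0-9XY] character class, so the second substitution turns it into chrM.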

awk

<embedurl>http://upload.wikimedia.org/wikipedia/commons/0/0b/Riesenalk.JPG{width=100}{height=181}</embedurl> The awk command is a simple programming language useful in processing text files or pipeline streams of text.

From the man page for awk, the fundamental definition is pretty simple:

An  AWK program consists of a sequence of pattern-action statements and optional function definitions.
  pattern   { action statements }
  function name(parameter list) { statements }

If pattern is omitted, it matches all input lines. If the { action statements } are omitted, the default action is { print }.

pattern can be a regular expression, or a variety of other matching expressions. The man page tells all. I'm guessing a google search for awk tutorial would lead to some useful exercises.

For example, given a bed file, select out lines where the score is above 650:

awk '$5 > 650' file.bed > result.bed
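
The pattern and action parts can be seen separately on a few made-up bed lines. This is a sketch only: the function name bed_lines and the feature names are assumptions:

```shell
# Made-up bed lines; column 5 is the score.
bed_lines() {
    printf 'chr1\t100\t200\tfeatA\t900\t+\n'
    printf 'chr1\t300\t400\tfeatB\t650\t-\n'
    printf 'chr2\t500\t600\tfeatC\t100\t+\n'
}

# A pattern with no action prints the matching lines unchanged.
high=$(bed_lines | awk '$5 > 650')

# The same pattern with an explicit action that prints only name and score.
names=$(bed_lines | awk '$5 > 650 { print $4, $5 }')

echo "$high"
echo "$names"
```

Only featA is selected, since a score of exactly 650 is not above 650.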

By default, awk separates input lines into fields by white space. To work with tab separated files with tab as the field separator, set the FS and OFS variables to tab for awk. For example, convert an Ensembl gff file into a UCSC bed6 file, fixing the chromosome names and using the frame offset as the bed score column:

zcat Homo_sapiens.GRCh37.57.gtf.gz \
       | awk -F'\t' -v 'OFS=\t' '{
sub(/^MT/,"M",$1)
sub(/^/,"chr",$1)
sub(/\./,0,$8)
# gtf starts are 1-based, bed starts are 0-based
print $1,$4-1,$5,$3,$8,$7
}'

The chromosome names are converted with the first two sub() substitutions. The third sub() converts the frame "." to "0" so that a number is printed in the bed score column as a frame number.
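
To check the conversion without the full Ensembl download, the same awk program can be run on two made-up gtf lines. The function name gtf_lines and its data are assumptions. This sketch subtracts 1 from the start column, since gtf starts are 1-based and bed starts are 0-based:

```shell
# Two made-up, tab-separated gtf lines.
gtf_lines() {
    printf 'MT\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n'
    printf '1\tprotein_coding\tCDS\t300\t400\t.\t-\t2\tgene_id "g2";\n'
}

bed=$(gtf_lines | awk -F'\t' -v 'OFS=\t' '{
    sub(/^MT/, "M", $1)     # MT becomes M
    sub(/^/, "chr", $1)     # prefix every name with chr
    sub(/\./, 0, $8)        # frame "." becomes 0
    print $1, $4 - 1, $5, $3, $8, $7
}')
echo "$bed"
```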

Example: compute a running sum of a column of numbers in a file, such as a chrom.sizes file:

awk -v 'OFS=\t' '{sum += $2; print sum,$0}' chrom.sizes
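
A sketch of the running sum on three made-up sizes, along with a variant that prints only the final total from an END block. The function name sizes and the chromosome names are assumptions:

```shell
# Made-up chrom.sizes content: name, tab, size.
sizes() {
    printf 'chrA\t1000\n'
    printf 'chrB\t500\n'
    printf 'chrC\t250\n'
}

# Running sum in front of every input line.
running=$(sizes | awk -v 'OFS=\t' '{sum += $2; print sum,$0}')

# Only the grand total, printed once after the last line.
total=$(sizes | awk '{sum += $2} END {print sum}')

echo "$running"
echo "$total"
```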

Example: find the longest line in a file:

awk '{print length($0),$0}' chrom.sizes | sort -rn | head -1
28 chr19_gl000209_random        159169

Screen

The screen command is a useful virtual terminal to allow you to take your terminal session from one location to another.

To start a virtual screen, simply enter the screen command:

$ screen

Your existing terminal session will appear to be erased, your command prompt will appear as if you had just started a new login session. You can work in this terminal as if it was an ordinary login session. The benefit is that if you lose your connection, for example from a temporary airport WiFi connection, your terminal will remain active doing whatever you last asked your command line to do, and you can come back to that terminal when your connectivity has been restored.

To reattach yourself to a background screen terminal:

$ screen -r -d

Your existing screen will appear to be erased and the current contents and state of your screen terminal will reappear right where you left off.

Multiple screen sessions: I personally do not like to run more than one screen session at a time. If you can manage keeping track of multiple screens, you can always start a new one with the simple screen command. I keep myself from accidentally starting multiple screens by always trying to start one with screen -r -d which will fail if there are no existing screens, or it attaches to my one running screen session.

If you are in a screen session and you would like to put it in the background to return to your actual terminal session, enter the special control character: Ctrl-a then the letter d. The Control-a keystroke is recognized by screen to enter its command mode, the letter d is the command to detach.

If you need to preserve Ctrl-a as a character you ordinarily use in your editor or some terminal application, you can alter this default character by creating a ~/.screenrc file. For example, use Ctrl-t instead:

# change the Ctrl-a default to Ctrl-t
escape "^Tt"

I must admit, I've found screen to be quite useful even though I only know how to enter and exit it. It has a gazillion options and can do many other fantastic tricks. Right now, I'm quite happy with my limited knowledge of it.

One Problem: Sometimes I forget that I'm in a screen session and I use my normal exit shell keystroke of Ctrl-d which causes that screen session to exit completely. If I had an existing running process in that shell, it may continue to run to completion if the disappearance of its controlling terminal doesn't bother it. But I can't get back control of that orphaned process to see any completion or error messages from it. This can be a problem if those messages weren't being sent to a log file in the first place. Maybe I should remap Ctrl-d to mean something else when I'm in screen. That might help.
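
One possible alternative to remapping Ctrl-d, sketched here for bash only, is to have the shell ignore a few Ctrl-d keystrokes when it is running inside screen. This relies on screen setting the STY environment variable in the shells it starts and on the bash IGNOREEOF variable. The function name in_screen is made up, and this is not something the setup described above already does:

```shell
# Hypothetical helper: true when running inside a screen session.
# screen sets the STY environment variable in the shells it starts.
in_screen() {
    [ -n "${STY:-}" ]
}

# Candidate lines for ~/.bashrc: inside screen, make bash ignore
# 10 consecutive Ctrl-d keystrokes before it exits.
if in_screen; then
    IGNOREEOF=10
fi
```

With this in place an accidental Ctrl-d prints a reminder to use "exit" instead of ending the shell, while Ctrl-a d still detaches as usual.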

See also: screen advice and GNU screen wiki

See also

Unix-Haters Handbook (copy available on Hiram's bookshelf)