Scripting standards

From Genecats
These are some guidelines for creating and editing shell scripts for Genome Browser engineers. Some items are specific to bash; most apply to bash and tcsh (or csh, ksh, etc.). As with all genomewiki pages, editing, adding to, and reorganizing this page is encouraged.

Send error messages to stderr

This is much easier to do in bash than in tcsh; see "Csh Programming Considered Harmful." To send a message to stderr in bash:

 echo "ERROR:  some informative error message" >&2

Use functions

Unlike tcsh, bash lets you define functions in scripts, which is a compelling reason to use bash: http://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html#chap_11. Functions can be defined in the file that uses them, or in a separate library file that is sourced at the top of any file that uses them. We don't currently have a library file of bash functions, but we should consider making one. For some example functions, see the src/utils/qa/doGenbankTests script.

Keep functions short. If any function (in any language) is longer than 15 or so lines, it probably needs to be broken into more than one function. Talk to Markd if you think you have a good case for a longer function. :)
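
As a minimal sketch of short functions defined in the file that uses them, the two helpers below are hypothetical (printError and checkFileNotEmpty are names made up for illustration, not functions from the kent tree):

```shell
# Hypothetical helper, for illustration only: print an error message to stderr.
printError() {
    echo "ERROR:  $*" >&2
}

# Hypothetical helper: succeed only if the named file exists and is not empty.
checkFileNotEmpty() {
    local fileName="$1"
    if [ ! -s "$fileName" ]
    then
        printError "$fileName is missing or empty"
        return 1
    fi
}
```

If helpers like these were moved into a shared library file, each script would source that file near the top instead of redefining them.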

Use meaningful variable, function, and program names

There are two files in the kent tree (in src/utils/qa), qaConfig.csh and qaConfig.bash, that are sourced by almost all of the QA scripts. They are a place to define variable names for things that change periodically and are referenced in many scripts, such as machine names. Use the names defined in the files and add to them.

Also remember to use camelCase instead of underscores and try to follow the abbreviation guidelines outlined for names in kent/src/README.

Comments

Try to make self-documenting code by choosing meaningful variable and function names so that what code is doing can be gleaned just from reading the code itself. That said, add comments whenever they can add clarity. DO NOT check in chunks of commented-out code (you can always retrieve older code with git). There is generally no need to include Redmine ticket numbers in comments.

Use an informative exit code

Exit with status 0 only at the end of a script, after everything has succeeded. Exit with a non-zero status as soon as a problem is detected.
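
One way this could look in bash is sketched below. The main function and the specific codes (2 for a usage problem, 1 for an unreadable file) are assumptions chosen for this example, not conventions required by this page:

```shell
# Hypothetical script body wrapped in a function so each exit path is visible.
# The codes 2 (usage problem) and 1 (runtime problem) are illustrative choices.
main() {
    if [ $# -ne 1 ]
    then
        echo "ERROR:  expected exactly one file name" >&2
        exit 2
    fi
    if [ ! -r "$1" ]
    then
        echo "ERROR:  cannot read $1" >&2
        exit 1
    fi
    # All checks passed: this is the only place the script exits 0.
    exit 0
}
# A real script would end with:  main "$@"
```

A caller can then branch on the value of $? to tell the different failures apart.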

Make scripts composable and modular

It's good if the output from one program can become the input to another program. Avoid adding formatting and comments to the output of a script that would make it difficult for another script to use. Instead of sending output to a file, send it to standard out, so that it can be piped to another program (the user can always redirect it to a file). Make sure errors go to standard error, so that if output is piped or redirected, the errors will still be noticed.

If a script is long and/or doing multiple things, consider separating it into multiple scripts.
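
The sketch below illustrates the idea with two hypothetical functions (listUniqueChroms and countChroms are invented names): data goes to standard out with no decoration, and the error message goes to standard error.

```shell
# Hypothetical filter: read chromosome names on stdin and write the unique
# names to stdout, with no headers or comments, so another program can use them.
listUniqueChroms() {
    sort -u
}

# Hypothetical consumer: count the unique names. The error message goes to
# stderr, so it stays visible when stdout is piped or redirected.
countChroms() {
    local count
    count=$(listUniqueChroms | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]
    then
        echo "ERROR:  no chromosome names on stdin" >&2
        return 1
    fi
    echo "$count"
}
```

Because neither function writes to a file, the caller decides whether to pipe the result onward or redirect it.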

Special parameters

Shells have special variables defined that can give you access to useful information: http://unixhelp.ed.ac.uk/scrpt/scrpt2.2.2.html and http://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html#sect_03_02_05.

One nice special parameter is $0. Use it to refer to the script that is being called (in a usage statement, for instance). If you don't want to see the entire path, you can use the basename command:

 basename "$0"
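
For instance, a usage statement might be sketched as follows. The usage function name and the "database table" arguments are hypothetical, and the -- is an added precaution in case $0 begins with a dash:

```shell
# Print a usage statement that names the script without its directory path.
# "database table" stands in for whatever arguments a real script takes.
usage() {
    local scriptName
    scriptName=$(basename -- "$0")
    echo "usage:  $scriptName database table" >&2
}
```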

Temporary files

Use the special parameter $$, the process ID, to create unique filenames when your script creates a temporary file. For example, instead of sending output to a file called "chromList", send it to a file called "chromList$$". That way you will avoid clobbering any pre-existing file with the same name.

Put temporary files into some known location that is expressly for this purpose (one place is /tmp, but we should ask the admins if there is a preferred location), rather than in the user's current working directory.

If scripts that make temporary files are killed before they finish running, they can leave the temporary files sitting around. To keep this from happening, trap kill signals at the beginning of your script and remove the temp files before exiting.

In tcsh, use onintr: http://docstore.mik.ua/orelly/unix/unixnut/c05_035.htm

In bash, use trap: http://www.linuxjournal.com/content/use-bash-trap-statement-cleanup-temporary-files
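
A minimal bash sketch of this pattern follows. It assumes /tmp is an acceptable location, and the cleanUp function and tmpFile variable are hypothetical names:

```shell
# Temporary file named with the process ID ($$) and kept in /tmp rather than
# in the current working directory.
tmpFile="/tmp/chromList$$"

# Remove the temporary file; -f keeps rm quiet if the file was never created.
cleanUp() {
    rm -f "$tmpFile"
}

# Run cleanUp whenever the script exits, and turn the common kill signals
# into a non-zero exit so that the EXIT trap still runs.
trap cleanUp EXIT
trap 'exit 1' HUP INT TERM

echo "chr1" > "$tmpFile"
```

Setting the traps before the first write means there is no window in which the file exists but would not be cleaned up.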

Set options in bash scripts

It's a good idea to add set -eEu -o pipefail near the top of most bash scripts. These options mean:

  • -e exit immediately if a simple command fails
  • -E inherit error trap functions
  • -u treat unset variables as errors rather than expanding them to be empty variables
  • -o pipefail without this option, a pipeline has the exit status of only its right-most command. With it, if any command in the pipeline fails, the whole pipeline fails.
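
The effect of two of these options can be seen in a small sketch. The variable names pipelineResult and notSetAnywhere are hypothetical:

```shell
#!/bin/bash
set -eEu -o pipefail

# With pipefail, this pipeline reports failure because "false" fails, even
# though the right-most command ("cat") succeeds. Testing it in an "if"
# keeps -e from ending the script here.
if false | cat
then
    pipelineResult="succeeded"
else
    pipelineResult="failed"
fi
echo "pipeline $pipelineResult"

# With -u, expanding an unset variable is an error. The ${name:-default}
# form supplies a fallback value instead of failing.
echo "value: ${notSetAnywhere:-default}"
```

Without pipefail, the same pipeline would be reported as having succeeded.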

Parentheses have different meanings in tcsh and bash

In tcsh, parentheses are frequently a normal part of the syntax (such as in if and foreach statements). In bash, enclosing something in parentheses tells the shell to spawn a subshell (see http://tldp.org/LDP/abs/html/subshells.html). A bracket, [, in bash means "test," so if you need to check for a specific condition (for instance, that a variable is equal to something specific), use brackets. Run man test to see available options and syntax. If you aren't using the test command, there is no reason to use any parentheses or brackets at all. For example:

 if grep someWord someFile
 then
     doAThing
 fi

Parentheses preceded by a dollar sign mean the same thing as backticks in bash: command substitution. See: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_03_04.html#sect_03_04_04.
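
The three bash constructs described in this section can be compared side by side in a short sketch (the variable names and the hg38 and mm10 values are only example data):

```shell
# Brackets run the test command: here, a string comparison.
db="hg38"
if [ "$db" = "hg38" ]
then
    echo "database is hg38"
fi

# Parentheses spawn a subshell: the assignment inside does not change the
# variable in the parent shell.
(db="mm10")
echo "after subshell: $db"

# $(...) and backticks both perform command substitution.
newStyle=$(echo "chr1")
oldStyle=`echo "chr1"`
echo "$newStyle $oldStyle"
```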