Scripting standards: Difference between revisions

Latest revision as of 09:09, 29 August 2013

These are some guidelines for creating and editing shell scripts for Genome Browser engineers. Some items are specific to bash; most apply to bash and tcsh (or csh, ksh, etc.). As with all genomewiki pages, editing, adding to, and reorganizing this page is encouraged.

Send error messages to stderr

This is way easier to do in bash than in tcsh. See "Csh Programming Considered Harmful" (http://www.faqs.org/faqs/unix-faq/shell/csh-whynot/). To send a message to stderr in bash:

 echo "ERROR:  some informative error message" >&2
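The same redirect can be wrapped in a small helper so that every error message in a script is formatted the same way. This is only a sketch; the function name printError is made up for illustration and is not part of any existing library.

```shell
#!/bin/bash

# Hypothetical helper (the name is an assumption): print a message to stderr.
printError() {
    echo "ERROR: $*" >&2
}

# The message goes to stderr, so it is not captured when stdout is piped or redirected.
printError "could not find chromList"
```

Because the message goes to stderr, redirecting the script's standard output to a file still leaves the error visible on the terminal.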

Use functions

This is only for bash, but it is a compelling reason to use bash rather than tcsh, which has no functions: functions can be defined in bash scripts (see http://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html#chap_11). They can either be defined in the file that uses them, or they can be defined in a separate library file that is sourced at the top of any file that uses them. We don't currently have a library file of bash functions, but we should consider making one. For some example functions, see the src/utils/qa/doGenbankTests script.

Keep functions short. If any function (in any language) is longer than 15 or so lines, it probably needs to be broken into more than one function. Talk to Markd if you think you have a good case for a longer function. :)
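As a rough illustration of a short bash function, here is a minimal sketch. The function name fileHasContent and the library file name qaFunctions.bash are assumptions made up for this example; no such library exists yet.

```shell
#!/bin/bash

# Hypothetical function: succeeds (returns 0) if the named file exists
# and is not empty, and fails (returns non-zero) otherwise.
fileHasContent() {
    local fileName=$1
    [ -s "$fileName" ]
}

# If the function lived in a library file instead, a script would load it with:
#   source qaFunctions.bash    # hypothetical file name

# A function's return status can be used directly in an if statement.
if fileHasContent "$0"
then
    echo "this script file has content"
fi
```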

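Use meaningful variable, function, and program names

  • Choose clarity over brevity: http://37signals.com/svn/posts/3250-clarity-over-brevity-in-variable-and-method-names
  • The name of a function or program should say what it does, and it should be obvious what it returns. Examples: databaseExists() returns true or false, getShortLabel() returns a name.
  • Don't use multiple similar-sounding names.
  • There is A LOT more to be said on this topic. Check out http://www.objectmentor.com/resources/articles/Naming.pdf.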

There are two files in the kent tree (in src/utils/qa), qaConfig.csh and qaConfig.bash, that are sourced by almost all of the QA scripts. They are a place to define variable names for things that change periodically and are referenced in many scripts, such as machine names. Use the names defined in the files and add to them.

Also remember to use camelCase instead of underscores and try to follow the abbreviation guidelines outlined for names in kent/src/README.

Comments

Try to make self-documenting code by choosing meaningful variable and function names, so that what the code is doing can be gleaned just from reading the code itself. That said, add comments whenever they can add clarity. DO NOT check in chunks of commented-out code (you can always retrieve older code with git). There is generally no need to include Redmine ticket numbers in comments.

Use an informative exit code

Exit 0 only at the end of a successful script. Exit with some non-zero number if a problem is detected.
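One possible way to keep exit codes informative is to name them and route every failure through a single function. This is a sketch only: the function name errorExit, the variable names, and the specific numbers are all assumptions, not an existing convention.

```shell
#!/bin/bash

# Hypothetical exit codes; the names and numbers are assumptions for illustration.
exitUsage=1
exitMissingFile=2

# Hypothetical helper: print a message to stderr and exit with the given code.
errorExit() {
    local exitCode=$1
    shift
    echo "ERROR: $*" >&2
    exit "$exitCode"
}

# Typical use inside a script (shown as a comment so this sketch does not exit):
#   [ -e "$inputFile" ] || errorExit "$exitMissingFile" "cannot find $inputFile"
#   ...
#   exit 0
```

A calling script or a cron job can then check $? to tell a usage mistake apart from a missing input file.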

Make scripts composable and modular

It's good if the output from one program can become the input to another program. Avoid adding formatting and comments to the output of a script that will make it difficult for another script to use. Instead of sending output to a file, send output to standard out, so it can either be piped to another program or redirected to a file by the user. Make sure errors go to standard error, so that if output is piped or redirected, the errors will still be noticed.

If a script is long and/or doing multiple things, consider separating it into multiple scripts.
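A minimal sketch of the idea, assuming a made-up function listChromNames that stands in for a whole script: data goes to standard out with no decoration, and status messages go to standard error, so the output can be piped straight into another program.

```shell
#!/bin/bash

# Hypothetical stand-in for a script: the name and the chrom list are made up.
listChromNames() {
    echo "listing chrom names" >&2      # status message goes to stderr
    printf '%s\n' chr2 chr1 chrX        # data only, one item per line, on stdout
}

# Only the data reaches sort; the status message still shows up on the terminal.
listChromNames | sort
```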

Special parameters

Shells have special variables defined that can give you access to useful information: http://unixhelp.ed.ac.uk/scrpt/scrpt2.2.2.html and http://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html#sect_03_02_05.

One nice special parameter is $0. Use it to refer to the script that is being called (in a usage statement, for instance). If you don't want to see the entire path, you can use the basename command:

 basename "$0"
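For instance, a usage statement might be sketched like this. The argument names "database" and "table" are hypothetical and only illustrate the pattern.

```shell
#!/bin/bash

# Print a usage statement that names the script without its full path.
# The arguments "database" and "table" are hypothetical.
usage() {
    echo "usage: $(basename "$0") database table" >&2
}

if [ $# -ne 2 ]
then
    usage
    # A real script would exit with a non-zero code here.
fi
```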

Temporary files

Use the special parameter $$, the process ID, to create unique filenames when your script creates a temporary file. For example, instead of sending output to a file called "chromList", send it to a file called "chromList$$". That way you will avoid clobbering any pre-existing file with the same name.

Put temporary files into some known location that is expressly for this purpose (one place is /tmp, but we should ask the admins if there is a preferred location), rather than in the user's current working directory.

If scripts that make temporary files are killed before they finish running, they can leave the temporary files sitting around. To keep this from happening, trap kill signals at the beginning of your script and remove the temp files before exiting.

In tcsh use onintr: http://docstore.mik.ua/orelly/unix/unixnut/c05_035.htm

In bash use trap: http://www.linuxjournal.com/content/use-bash-trap-statement-cleanup-temporary-files
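Putting the pieces of this section together in bash might look roughly like the following. It assumes /tmp is an acceptable location, and the function name removeTempFile is made up for the example.

```shell
#!/bin/bash

# Unique temporary file name built from the process ID ($$).
# /tmp is an assumption; the admins may prefer another location.
tempFile="/tmp/chromList$$"

# Hypothetical cleanup function (the name is an assumption).
removeTempFile() {
    rm -f "$tempFile"
}

# Run the cleanup whenever the script exits...
trap removeTempFile EXIT
# ...and turn common kill signals into an exit so the EXIT trap runs.
trap 'exit 1' HUP INT TERM

printf '%s\n' chr1 chr2 > "$tempFile"
sort -r "$tempFile"
```

Setting the traps before the temporary file is created means there is no window in which the script can be killed and leave the file behind.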

Set options in bash scripts

It's a good idea to add set -eEu -o pipefail near the top of most bash scripts. These options mean:

  • -e exit immediately if a simple command fails
  • -E ERR traps are inherited by shell functions, command substitutions, and subshells
  • -u treat unset variables as errors rather than expanding them to be empty variables
  • -o pipefail if this option isn't set, pipelines will have the exit status of only the right-most command. If it is set, and one of the commands in the middle of a pipeline fails, the whole pipeline will fail.
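The effect of these options can be seen with a few one-line experiments. Each one runs in its own child bash so the options do not affect the calling shell; undefinedVariable is a hypothetical name that is deliberately never set.

```shell
#!/bin/bash

# Without pipefail, a pipeline reports only the status of its right-most command.
bash -c 'false | true; echo "without pipefail: $?"'

# With pipefail, a failure anywhere in the pipeline makes the whole pipeline fail.
bash -c 'set -o pipefail; false | true; echo "with pipefail: $?"'

# With -u, expanding an unset variable is an error instead of an empty string.
bash -c 'set -u; echo "value: $undefinedVariable"' 2>/dev/null \
    || echo "unset variable was treated as an error"

# With -e, the script stops at the first failing simple command.
bash -c 'set -e; false; echo "this line is never reached"' \
    || echo "script exited at the failing command"
```

The first two commands print a status of 0 and 1 respectively, which is the difference pipefail makes.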

Parentheses have different meanings in tcsh and bash

In tcsh, parentheses are frequently a normal part of the syntax (such as in if and foreach statements). In bash, enclosing something in parentheses tells the shell to spawn a subshell (see http://tldp.org/LDP/abs/html/subshells.html). A bracket, [, in bash means "test," so if you need to check for a specific condition (for instance, that a variable is equal to something specific), use brackets. Run man test to see available options and syntax. If you aren't using the test command, there is no reason to use any parentheses or brackets at all. For example:

 if grep someWord someFile
 then
     doAThing
 fi
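To contrast the two constructs, here is a small sketch. The variable name and the values "hg19" and "mm10" are used purely as illustrative strings.

```shell
#!/bin/bash

database="hg19"

# Brackets invoke the test command: use them to check a condition.
if [ "$database" = "hg19" ]
then
    echo "database is hg19"
fi

# Parentheses spawn a subshell: the assignment inside does not persist.
( database="mm10"; echo "inside subshell: $database" )
echo "after subshell: $database"
```

The last line still prints hg19, because the assignment to mm10 happened in the subshell and was discarded when it ended.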

Parentheses preceded by a dollar sign mean the same thing as backticks in bash: command substitution. See: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_03_04.html#sect_03_04_04.
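A short sketch of both forms side by side, using the doGenbankTests path mentioned earlier on this page as sample input. The variable names are made up for the example.

```shell
#!/bin/bash

scriptPath="src/utils/qa/doGenbankTests"

# $(...) and backticks both capture the output of a command.
newStyle=$(basename "$scriptPath")
oldStyle=`basename "$scriptPath"`

# The $(...) form nests without any extra escaping.
parentDir=$(basename "$(dirname "$scriptPath")")

echo "$newStyle $oldStyle $parentDir"
```

This prints "doGenbankTests doGenbankTests qa". The $(...) form is usually easier to read when substitutions are nested.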