Revision as of 13:00, 15 April 2010

Installing a Browser Mirror on Linux

These instructions are based on the official mirror instructions as well as various HOWTOs found in src/unzipped/product of the source tree.

There are scripts in the source tree under src/product/scripts that aid in the mirroring process and can perform many of these steps periodically when set up as cron jobs (see Crontab).

Outline

Download
- Copy all html files to /var/www/html (this path can be adapted to your system, but things are easier if you stick with this default)
- Copy all cgi-bin programs to /var/www/cgi-bin (this path can be adapted to your system, but see above)
- Choose the assemblies you want to mirror, and for these...
  - Copy the corresponding mysql databases to your mysql directory (e.g. /var/lib/mysql) and change their owner to the mysql user
  - Copy the corresponding /gbdb/<assembly> directories to the /gbdb directory (cannot be changed, create a link /gbdb if you have to store them somewhere else)
Configuration
- Tweak your apache configuration to make it execute the cgi-bin scripts, activate XBitHack. Then restart apache.
- Create a trash directory (users can upload their custom tracks into this) and make it owned by the apache user
- Set up mysql permissions for the mysql databases you downloaded
- Add the mysql users you created to /var/www/cgi-bin/hg.conf
- Set a defaultGenome name in your cgi-bin/hg.conf file (see below)
- Tweak your hgcentral database (=user interface) to some reasonable defaults (e.g. default genome, blat servers, ...)

Prerequisites: software and diskspace

It is assumed that you have both an Apache2 web server and MySQL 5.x installed.
A full mirror requires a lot of disk space (March 2010: more than 4.5 terabytes), plus additional space for future database expansion. A partial mirror needs less.
Note that the rsync commands listed below must be executed with sufficient write permissions to create their assigned directories and files.
Make sure you have enough free disk space. You can check the space requirements of a genome by issuing these two commands (in this example, for the assembly hg18) and adding up the two reported "Total file size" lines:

 rsync -nah --stats rsync://hgdownload.cse.ucsc.edu/mysql/hg18
 rsync -nah --stats rsync://hgdownload.cse.ucsc.edu/gbdb/hg18
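The addition can also be scripted. The following is a minimal sketch and not part of the official mirror tooling: the function names sum_total_file_size and assembly_size_bytes are made up for illustration, and it assumes rsync prints its usual "Total file size: N bytes" summary line. The -h flag is left out here so that the sizes are reported in plain bytes and can be added.

```shell
# Sum the "Total file size" lines of one or more `rsync --stats` reports read from stdin.
sum_total_file_size() {
    awk -F': ' '/^Total file size/ {
        size = $2
        gsub(/[^0-9]/, "", size)   # drop thousands separators and the "bytes" suffix
        total += size
    }
    END { printf "%.0f\n", total }'
}

# Hypothetical wrapper: dry-run both rsync modules for one assembly and add up the sizes.
assembly_size_bytes() {
    db="$1"
    {
        rsync -na --stats "rsync://hgdownload.cse.ucsc.edu/mysql/${db}/"
        rsync -na --stats "rsync://hgdownload.cse.ucsc.edu/gbdb/${db}/"
    } | sum_total_file_size
}
```

Calling assembly_size_bytes hg18 should print the combined size in bytes, which can then be compared with the free space reported by df.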

If you want to mirror only a part of an assembly, see Minimal_Browser_Installation

Get Executables

More details for these methods can be found in the official mirror docs and README.building.source.

Option 1: Use rsync to get a copy of the compiled binaries.

  rsync -avzP rsync://hgdownload.cse.ucsc.edu/cgi-bin/ /var/www/cgi-bin

This command will grab the x86_64 binaries, which work on any 64-bit x86 (x86_64) processor. If the binaries work for you, they represent the easiest way forward.

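To make the binaries work in Debian lenny, execute these two commands in /usr/lib as root. Each creates a link named after the library version the binaries expect (.so.6) that points at the version Debian ships (.so.0.9.8):

   ln -s libcrypto.so.0.9.8 libcrypto.so.6
   ln -s libssl.so.0.9.8 libssl.so.6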

Option 2: Get the jksrc files and compile the executables yourself. Compiling the source can be a challenge if things don't work out of the box. However by compiling the jksrc tree you will get many useful tools and scripts which will be of value for bioinformatic and admin tasks.

Suggestion: Try to Download jksrc. Attempt to compile the source files. If this initially proves to be problematic, use rsync to get the pre-compiled executables and compile the source later. You will find several useful scripts and some of the browser documentation in /src/product (in the jksrc archive). The scripts and docs in the source tree are available whether or not you successfully compile the entire tree.

Configure Apache server

In order to support SSI it is necessary to set the XBitHack. Add the following somewhere in /etc/httpd/conf/httpd.conf (redhat) or /etc/apache2/apache2.conf (debian/ubuntu)

      XBitHack on
      <Directory /var/www/html>
      Options +Includes
      </Directory>

If you're already running a webserver, you have to run the genome browser as a virtual host, with its own domain name (ask your system administrators what that means). You have to adapt ScriptAlias and most other paths of your apache config to non-default directories, to avoid conflicts with your main apache files. Here is an example extract httpd.conf, with paths moved to subdirectories of /var/www/genome running on the virtual host genome.myuniversity.com, which means that you can continue to run your existing webserver in /var/www under a different name:

<VirtualHost *:80>
    ServerAdmin mymail@somewhere.com
    DocumentRoot /var/www/genome/html
    ServerName genome.myuniversity.com
    ErrorLog logs/genome-error_log
    CustomLog logs/genome-access_log common
    ScriptAlias /cgi-bin/ /var/www/genome/cgi-bin/
    XBitHack on
    <Directory "/var/www/genome/cgi-bin">
       AllowOverride None
       Order allow,deny
       Allow from all
       Options +ExecCGI
    </Directory>
    <Directory "/var/www/genome/html">
       Options +Includes
    </Directory>
</VirtualHost>

Find the location of your web pages. This should be /var/www/html by default (if you don't use virtual hosts). Set an environment variable if desired.

     export WEBROOT="/var/www/html"

Find the location of your cgi-bin directory. This should be /var/www/cgi-bin (if you don't use virtual hosts). Set an environment variable if desired. The variable is called CGI_BIN because shell variable names cannot contain a hyphen.

     export CGI_BIN="/var/www/cgi-bin"

Next, find the location of your MySQL data. This should be located in /var/lib/mysql. Set an environment variable if desired.

     export MYSQLDATA="/var/lib/mysql"

Note: These variables can be set in /etc/profile so they will be available globally to all users. Also they can be skipped entirely if absolute paths are used instead.
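If the variables go into /etc/profile or a helper script, they can be written so that a value already set by the user is respected. This is only a sketch: the function name set_mirror_defaults is made up, and the variable is spelled CGI_BIN because a hyphen is not allowed in a shell variable name.

```shell
# Fill in the default locations unless the caller has already set them.
# CGI_BIN uses an underscore because "-" is not valid in shell variable names.
set_mirror_defaults() {
    : "${WEBROOT:=/var/www/html}"
    : "${CGI_BIN:=/var/www/cgi-bin}"
    : "${MYSQLDATA:=/var/lib/mysql}"
    export WEBROOT CGI_BIN MYSQLDATA
}
```

On a virtual host setup you would, for example, run WEBROOT=/var/www/genome/html before calling set_mirror_defaults, and the other two variables would still get their defaults.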

Get all the html files

Test the rsync connection:

   rsync -navz --progress rsync://hgdownload.cse.ucsc.edu

Determine the destination of the copy ($WEBROOT) and fire off the production copy. The trailing slash is important!

   rsync -avzP rsync://hgdownload.cse.ucsc.edu/htdocs/ $WEBROOT/

Obtain the /gbdb data file area

You will need the portions of /gbdb used by the browser. Replace XXX with the assemblies you want to mirror. The trailing slash is important.

     rsync -avzP --delete --max-delete=20 rsync://hgdownload.cse.ucsc.edu/gbdb/XXX/ /gbdb/XXX/

In any case, you should always download the /gbdb/genbank directory, as many assemblies link into it.
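When several assemblies are mirrored, the per-assembly command can be generated in a loop. The sketch below only prints the commands so they can be reviewed first; the function name gbdb_sync_commands is made up, and genbank is always appended because of the note above.

```shell
# Print (not run) one rsync command per assembly, plus one for the shared genbank directory.
gbdb_sync_commands() {
    for db in "$@" genbank; do
        printf 'rsync -avzP --delete --max-delete=20 rsync://hgdownload.cse.ucsc.edu/gbdb/%s/ /gbdb/%s/\n' "$db" "$db"
    done
}
```

For example, gbdb_sync_commands hg18 dm3 prints three commands. Once they look right, the output can be piped to sh to run them.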

Download assembly database tables

These instructions should be followed in conjunction with /src/product/README.mysql.setup.

There are two ways to install the tables.

Build from textfiles: The first involves building the tables from the assembly dumps (optionally downloaded above). This method is not covered here; please see the official mirror docs.
Direct syncing: The second and preferable method involves rsyncing the binary tables themselves. This is faster and a lot more convenient. Use this method if possible. You also do not need to create any databases in this case; they are created automatically by downloading the mysql files.

Caveats for direct syncing:

Your MySQL version must be compatible with the table version (currently 5.0.x).
The hgcentral database (and possibly others) found in /var/lib/mysql/ must receive special handling (covered later).
For your own locally created tracks loaded into the database, use the trackDb_localTracks table so that they are not overwritten by the trackDb tables updated by UCSC.
The download is larger than the text dumps of the same assemblies, because the binary tables include their extensive indexes.

To proceed with syncing the tables directly issue the following command:

         rsync -avzP --delete --max-delete=20 rsync://hgdownload.cse.ucsc.edu/mysql/XXX/ /var/lib/mysql/XXX/

where XXX is the name of each database to be mirrored. You will need to generate a list of tables to be mirrored. Note that you can NOT simply sync with hgdownload.cse.ucsc.edu/mysql as a whole, since the mysql directory contains a number of files and subdirectories which are specific to each instance of the mysql database.

An unedited list of potential tables to be mirrored can be found by issuing the command:

         rsync -v --dry-run rsync://hgdownload.cse.ucsc.edu/mysql

This list will then have to be edited so that only the correct tables are mirrored.
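As a starting point for that editing, the directory entries can be pulled out of the listing. This is a sketch that assumes the usual rsync listing layout (permissions, size, date, time, name); the function name list_remote_databases is made up.

```shell
# Read an rsync module listing on stdin and print only the directory names,
# skipping plain files and the "." entry.
list_remote_databases() {
    awk 'NF >= 5 && $1 ~ /^d[rwx-]/ && $NF != "." { print $NF }'
}
```

A possible use is rsync rsync://hgdownload.cse.ucsc.edu/mysql/ | list_remote_databases > databases.txt, followed by deleting every line of databases.txt that you do not want to mirror.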

Download other database tables

You will usually need other databases in addition to just the genome assembly databases:

hgcentral: primary database the browser uses to find everything else, also contains dynamic user/session "cart" data
- Essential
sp090821, etc ... - "Swiss-Prot" aka UniProt database obtained from files at ftp.expasy.org/databases/uniprot/
- used in UCSC genes track on various databases
uniprot: the newest version of the Swiss-Prot databases, can simply be a symlink to the newest sp* database directory
- Used in UCSC genes track on various databases
go - The Gene Ontology database, obtained from: http://www.godatabase.org/dev/database/
- used in the UCSC genes track
proteins090821, etc. - a combination of the UniProt data mentioned above and data from HGNC http://www.genenames.org/
- Used in the human UCSC genes track and proteome browser
visiGene: virtual microscope for mice sections
- Usually not needed
proteome: should merely be a symlink to the most recent proteins090821 database.

Download them, like the assembly database tables above.

You will usually need hgcentral, hgFixed, proteome (symlink), genbank and uniProt. It is best to download all of them now, to avoid error messages later on.

Grant Mysql rights

After the tables have been created it is necessary to add the required users along with their associated permissions. The entire process of MySQL configuration is described in /src/product/README.mysql.setup as found in jksrc. In brief 3 users are required. These users are readonly, readwrite, browser. These users are configured as follows:

User	MySql Permission	Databases	Used by
browser	SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER	All except hgcentral	developers
readonly	SELECT	All except hgcentral	CGI scripts
readwrite	SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER	hgcentral	browser(?)

Do not forget the grant rights for the databases hgFixed, genbank, proteome, uniProt and proteinsxxxx as well (See also Example Mysql Grants).

Each database must have these 3 users added with the associated permissions. The easiest way to accomplish this is to use the script ex.MySQLUserPerms.sh, which can be found in src/product in jksrc. The script sets the permissions on each database listed by name. NOTE: This script must be edited before use! Because the script handles each database explicitly by name, it is likely that it does not contain the latest set of database names. A current list of database names must be generated and any that are missing must be added. Future updates to the databases may also require additional changes to the script. As an alternative, it is possible, at the cost of a small amount of security, to set the permissions globally using *.* edits. An example of the required edit to the script so that permissions are added globally is:

   ${MYSQL} -e "GRANT SELECT, UPDATE on *.* to browser@localhost IDENTIFIED BY 'password';" mysql

After the edits are made, the script will add these 3 users to all of the databases used by the browser. These permissions are limited to localhost for security reasons. ex.MySQLUserPerms.sh is heavily documented and should be read to make sure that the changes discussed above are understood.
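If you prefer to see the per-database grants before anything is changed, they can be generated from a list of database names. The sketch below is not ex.MySQLUserPerms.sh: the function name grant_statements and the password variables are made up, the passwords default to a placeholder that must be replaced, and the GRANT ... IDENTIFIED BY form is assumed to be accepted by your MySQL 5.x server.

```shell
# Print GRANT statements for the databases given as arguments, following the
# user table above: readwrite on hgcentral, browser and readonly on everything else.
grant_statements() {
    full="SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER"
    for db in "$@"; do
        if [ "$db" = "hgcentral" ]; then
            printf "GRANT %s ON %s.* TO 'readwrite'@'localhost' IDENTIFIED BY '%s';\n" \
                "$full" "$db" "${READWRITE_PW:-changeme}"
        else
            printf "GRANT %s ON %s.* TO 'browser'@'localhost' IDENTIFIED BY '%s';\n" \
                "$full" "$db" "${BROWSER_PW:-changeme}"
            printf "GRANT SELECT ON %s.* TO 'readonly'@'localhost' IDENTIFIED BY '%s';\n" \
                "$db" "${READONLY_PW:-changeme}"
        fi
    done
}
```

For example, grant_statements hg18 hgFixed hgcentral prints five statements. After checking them, the output can be piped to mysql -u root -p mysql.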

Setup hg.conf

After adding the MySql Users it is necessary to add hg.conf to the cgi-bin directory. hg.conf contains username/password information and is required by various cgi-bin programs.

A sample hg.conf (ex.hg.conf) is available in src/product of the source tree. Go to the cgi-bin directory and execute this command to download it:

   sudo wget http://genome-test.cse.ucsc.edu/~kent/src/unzipped/product/ex.hg.conf -O hg.conf

The default user/password combinations and permissions can be changed, however doing so will require editing of other scripts which have the user/passwords hardcoded in them (notably ex.MySQLUserPerms.sh). It is probably best to keep the defaults at least until one knows what one is doing.

In hg.conf you will need to set the document root:

   browser.documentRoot=/var/www/html

The actual path may differ depending on your document root. After an appropriate hg.conf has been created, it should be installed in /var/www/cgi-bin and the permissions set to 600 (my setup has the file owner/group of apache/apache).
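The install step can be sketched as follows. The function name install_hg_conf is made up; the chown is left as a comment because the web server user differs between systems (apache on Red Hat, www-data on Debian/Ubuntu) and changing the owner needs root.

```shell
# install_hg_conf SOURCE CGI_BIN_DIR
# Copy a prepared hg.conf into the cgi-bin directory and make it readable by its owner only.
install_hg_conf() {
    src="$1"; dest_dir="$2"
    cp "$src" "$dest_dir/hg.conf" || return 1
    chmod 600 "$dest_dir/hg.conf"
    # As root, also hand the file to the web server user, for example:
    #   chown apache:apache "$dest_dir/hg.conf"
}
```

A typical call would be install_hg_conf ./hg.conf /var/www/cgi-bin.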

If you're not on the UCSC campus: Comment out any bottleneck statements from hg.conf, as they will break the MAF alignment view.
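Commenting out those statements can be done with sed. This sketch assumes that the statements are the lines whose key starts with "bottleneck." (as in bottleneck.host and bottleneck.port) and that "#" starts a comment in hg.conf; check your own file before relying on it. The function name comment_out_bottleneck is made up.

```shell
# Read an hg.conf on stdin and write it to stdout with every
# "bottleneck." setting turned into a comment.
comment_out_bottleneck() {
    sed 's/^[[:space:]]*bottleneck\./#&/'
}
```

For example: comment_out_bottleneck < hg.conf > hg.conf.new, then compare the two files with diff before replacing the original.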

Make sure you activate the custom track database statements (see Using_custom_track_database).

Set up the "hgcentral" tables

Download the schema for the hgcentral database (hgcentral.sql): http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql

Create a hgcentral database

     mysql> create database hgcentral;

Add the hgcentral tables

      mysql -youraccountoptions hgcentral < hgcentral.sql

Create a user/password with the ability to update and insert. This user is currently "readwrite". The script ex.MySQLUserPerms.sh will create this user and add it to hgcentral.

(optional) Set the correct location of the Blat servers. By design the location of the Blat servers is incomplete in hgcentral. This prevents their overuse or abuse. In order to use Blat you will either need to connect to the UCSC servers or set up your own. You will need permission to connect to the UCSC Blat servers. Please see the discussion of the requirements and restrictions in the official docs (TODO: find correct link). The following SQL commands update the table to point at the UCSC servers:

    mysql> USE hgcentral;

    mysql> UPDATE blatServers SET host=CONCAT(host,".cse.ucsc.edu");

Please get permission before using Blat Servers!

If you are not mirroring the primary human database and would like to have a different default genome displayed on your gateway page, enter the defaultGenome specification in the cgi-bin/hg.conf file. For example if you have only dm3:

defaultGenome=D. melanogaster

Find the name to use from defaultDb in hgcentral:

hgsql -e "select genome from defaultDb;" hgcentral
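Setting the value in hg.conf can be scripted so that an existing defaultGenome line is replaced rather than duplicated. This is a sketch under the assumption that the setting sits on a line of its own starting with "defaultGenome="; the function name set_default_genome is made up.

```shell
# set_default_genome HG_CONF GENOME_NAME
# Remove any existing defaultGenome line and append the new one.
set_default_genome() {
    conf="$1"; genome="$2"
    [ -f "$conf" ] || return 1
    tmp=$(mktemp) || return 1
    # grep exits non-zero when nothing is left to print, which is fine here.
    grep -v '^defaultGenome=' "$conf" > "$tmp" || true
    printf 'defaultGenome=%s\n' "$genome" >> "$tmp"
    # Write back through cat so the owner and permissions of hg.conf are kept.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

For the dm3-only example this would be set_default_genome /var/www/cgi-bin/hg.conf "D. melanogaster".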

Create a "trash" directory

The cgi programs use a temporary area to create and store images used by the browser. This directory is by default looked for in /var/www/trash. You should make this directory and allow the user that runs the web server write access to it. As a point of maintenance this directory will need to be cleaned out from time to time.

If you have adapted your /var/www/ directory to something else (e.g. /var/www/genome/html) you should create your trash directory there.

mkdir /var/www/trash
chown apache:apache /var/www/trash

You will also need a symlink from your document root html hierarchy to this trash directory. The trash directory can actually be on any filesystem and /var/www/trash can be a symlink to that location.
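The periodic cleaning mentioned above could look like the sketch below, run from cron. The function name clean_trash and the default age of 2 days are assumptions, not values from the official docs. Files that users still need (for example uploaded custom tracks) may call for a longer retention.

```shell
# clean_trash TRASH_DIR [MAX_AGE_DAYS]
# Delete regular files in the trash directory that were last modified
# more than MAX_AGE_DAYS days ago (default: 2).
clean_trash() {
    trash_dir="$1"; max_age_days="${2:-2}"
    find "$trash_dir" -type f -mtime +"$max_age_days" -exec rm -f {} +
}
```

For example, clean_trash /var/www/trash 3 removes files older than three days and leaves the directory structure in place.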

If it doesn't work...

"Nothing shows up, not even the blue side menu"
- If nothing shows up at all, then the DocumentRoot statement in the Apache configuration file is wrong or you have forgotten to restart your Apache daemon ("/etc/init.d/httpd restart" on Red Hat or "/etc/init.d/apache2 restart" on Debian/Ubuntu)
"There is something, but I cannot see the menu where I can select the right genome"
- Use the script printEnv.pl in your cgi-bin directory to see if it is working
- Verify that Apache is really executing the cgi-bin programs and not just downloading them. When you access cgi-bin/hgGateway with a browser and they are not executed, check this:
  - Does your Apache Configuration file contain a ScriptAlias directive? This directive has to reference the cgi-bin directory.
  - There is no use in trying Apache's "AddHandler cgi-script .cgi" directive as often suggested by the Apache documentation. It doesn't work because the cgi programs do not end in .cgi
  - An alternative is to create a file .htaccess in the cgi-bin directory and add the line "DefaultType application/x-httpd-cgi" (make sure that .htaccess interpretation is activated in apache for this directory, see the apache documentation)
- Have you restarted your web server after you updated your apache configuration file?
- Check if the tables genomes/clades in your hgcentral database make sense (todo: extend this section)
"I am seeing error messages relating to gbdb... "
- Make sure that the apache user can access the /gbdb directory
  - Check the file permissions of the /gbdb directory in respect to the user under which apache is running (debian/ubuntu: www-data)
  - If you are not sure how to interpret file permissions, login as the apache user ("sudo su apache") and try to read any file in /gbdb. Also check if you can write to your trash directory (echo test > test)
"I see mysql-access errors everywhere"
- Verify that the Mysql user you added to hg.conf can read the mysql databases and can write to the customTracks database.
- To test this, log into mysql (mysql -u <your user from hg.conf> -p) and try to access the assembly that does not work ("select * from hg19.chromInfo;")
- For other Mysql error messages, add this line "JKSQL_TRACE=on" in "hg.conf" to debug sql messages
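The two file permission checks from the list above (reading /gbdb and writing to the trash directory) can be bundled into one small function that is run while logged in as the apache user. This is only a sketch, and the function name check_browser_dirs is made up.

```shell
# check_browser_dirs GBDB_DIR TRASH_DIR
# Report whether the current user can read GBDB_DIR and write to TRASH_DIR.
# Returns non-zero if either check fails.
check_browser_dirs() {
    gbdb_dir="$1"; trash_dir="$2"; status=0
    if [ -d "$gbdb_dir" ] && [ -r "$gbdb_dir" ] && [ -x "$gbdb_dir" ]; then
        echo "ok: can read $gbdb_dir"
    else
        echo "FAIL: cannot read $gbdb_dir"
        status=1
    fi
    if [ -d "$trash_dir" ] && [ -w "$trash_dir" ]; then
        echo "ok: can write to $trash_dir"
    else
        echo "FAIL: cannot write to $trash_dir"
        status=1
    fi
    return "$status"
}
```

For example, after "sudo su apache" (or www-data on Debian/Ubuntu), run check_browser_dirs /gbdb /var/www/trash and fix the ownership or permissions of whichever directory is reported as failing.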
