Using custom track database
Using the Custom Track Database
A new feature of the genome browser as of March 2007 is the ability to use a data base for custom tracks. Up to this date, custom track data has been kept in files in the /trash/ct/ directory. This article discusses the steps required to enable this function.
Summary configuration
- database loader binaries hgLoadBed, hgLoadWiggle and wigEncode are installed in /cgi-bin/loader/ - these are installed via the normal 'make cgi' in the source tree kent/src/hg/ directory.
- an empty customTrash database has been created on the MySQL host - create this manually once, the MySQL host name is a configuration item, the database name customTrash is not a configuration item
- temporary read-write data directory /data/tmp has been created with read/write/delete enabled for the Apache server effective user, this directory name is a configuration item
- configuration items are specified in /cgi-bin/hg.conf/ - this will turn on the function
- for command line access to the database, create a special ~/.hg.ct.conf to be used with the environment variable HGDB_CONF
- create a cron job to run a cleaner script to expire and remove older tables from the database - dbTrash command is used for this purpose
Host and database name
For performance and security considerations, the MySQL host for the custom track database can be a separate machine from the ordinary MySQL host that usually serves up the assembly databases or the hgcentral database. It is not required that the custom track database be on a separate MySQL server. The specification of the host machine is placed in the /cgi-bin/hg.conf file, for example a host machine called "ctdbhost":
customTracks.host=ctdbHost
The database name used on this host is fixed at customTrash which is a define in the source tree file hg/inc/customTrack.h
/cgi-bin/hg.conf configuration items
The following items must be specified in /cgi-bin/hg.conf to enable this function:
customTracks.host=ctdbhost customTracks.user=ctdbuser customTracks.password=ctdbpasswd customTracks.useAll=yes
Establish this user account and password in MySQL with db and user privileges:
Select, Insert, Update, Delete, Create, Drop, Alter for example with your MySQL root user account: hgsql -hctdbhost -uroot -p -e "GRANT SELECT,INSERT,UPDATE,DELETE,CREATE,DROP,ALTER" on customTrash TO ctdbuser@yourWebHost IDENTIFIED by 'ctdbpasswd';" mysql
Optionally, a temporary read-write directory used during database loading can be specified:
customTracks.tmpdir=/data/tmp
The default for this is /data/tmp and should be created with read/write/delete access for the Apache server effective user. It should be on a local filesystem for best access speed, not via NFS.
Database loaders
The database loaders used to load custom tracks are the standard loader commands found in the source tree, hgLoadBed, hgLoadWiggle and wigEncode. They are installed into /cgi-bin/loader/ with a 'make cgi' from the source tree directory kent/src/hg/ These loaders are used by the cgi binaries hgCustom, hgTracks, and hgTables to load custom tracks into the database. They are operated in an exec'd pipeline fashion, the code details can be see in src/hg/lib/customFactory.c
Command line access
Since the MySQL host may be different than your ordinary MySQL host, you will need to create a unique $HOME/.ct.hg.conf file to be used in the case where you want to manipulate this separate database with the kent source tree command line tools. This unique .ct.hg.conf is merely a copy of your normal .hg.conf file but with a different host/username/password specified:
db.host=ctdbhost db.user=ctdbuser db.password=ctdbpasswd
Remember to set the priviledges on this hg.conf file at 600:
chmod 600 $HOME/.ct.hg.conf
To enable the use of this file for subsequent command line operations, set the environment variable HGDB_CONF to point to this file, for example in the bash shell:
export HGDB_CONF=$HOME/.ct.hg.conf
With that in place, you can examine the contents of the customTrash database:
hgsql -e "show tables;" customTrash
This unique hg.conf file will also be used by the cleaner command dbTrash
Cleaner script
The database and the temporary data directory /data/tmp need to be kept clean. This is similar to the current cleaner script you have running on your /trash filesystem. In this case there is a specific source tree utility used to access and clean the database. The temporary data directory /data/tmp would stay clean if each and every loaded custom track was successfully loaded. In the case of badly formatted or illegal data submitted for the custom track, the database loaders do not remove their temporary files from /data/tmp This /data/tmp directory can be kept clean with, for example, an hourly cron job that performs:
find /data/tmp -type f -amin +10 -exec rm -f {} \;
This would remove any file not accessed in the past 10 minutes.
The database cleaner command dbTrash should be run as a cron job encapsulated in a shell script something like this, which maintains a record of items cleaned to enable later analysis of custom track database usage statistics:
#!/bin/sh DS=`date "+%Y-%m-%d"` YYYY=`date "+%Y"` MM=`date "+%m"` export DS YYYY MM mkdir -p /data/trashLog/ctdbhost/${YYYY}/${MM} RESULT="/data/trashLog/ctdbhost/${YYYY}/${MM}/${DS}.txt" export RESULT /cluster/bin/x86_64/dbTrash -age=48 -drop -verbose=2 > ${RESULT} 2>&1
Running this once a day will remove any tables not accessed within the past 48 hours. The dbTrash command is found in the source tree in kent/src/hg/dbTrash
The /trash directory can be kept clean with the following two commands,
one to implement an 8 hour expiration time on most files, the second to
implement a 48 hour expiration time on custom track files:
find /trash \! \( -regex "/trash/ct/.*" -or -regex "/trash/hgSs/.*" \) \ -type f -amin +480 -exec rm -f {} \; find /trash \( -regex "/trash/ct/.*" -or -regex "/trash/hgSs/.*" \) \ -type f -amin +2880 -exec rm -f {} \;
metaInfo and history
You will note two special and persistent tables in the customTrash database: metaInfo and history. The metaInfo table records a time of last use for each custom track table and a useCount for statistics. The time of last use is used by the cleaner utility dbTrash to expire older tables. The history table is the same as the history table in the normal assembly databases. The loader commands, hgLoadBed and hgLoadWiggle record into the history table each time they load a track. The cleaner command dbTrash also records in the history table statistics about what it is removing.
Turning On Considerations
Please note, if there are currently existing custom tracks in /trash/ct/ files, at the time of adding the configuration items to /cgi-bin/hg.conf/ those existing tracks will be converted to database versions upon their next use by the user. At the time the hg.conf file is changed, these conversions will take place. Don't do any fancy tricks with the date/time stamp on the hg.conf file, it needs to be the current date/time at the moment of edit. We switched our Round Robin WEB servers with an 'at' job so they would all have the same date/time.
Use of trash files with the database on
When the custom tracks database is in use, there are still small files kept in /trash/ct/ which become the reference pointers to the actual database tables belonging to that custom track. The standard trash cleaner script should still be kept running to clean these files.
Known difficulties
For the case of a custom track submission that contains more than one track set of data, in the case where one of the sets of data is illegal and causes a loading problem, even though some sets of data may have loaded successfully, the submitting user will see an error about the corrupted data, and they would need to correct their data submission to get all tracks successfully loaded.
It remains to be seen just how good the error reporting system is for illegal data.