Trash cleaners

From Genecats
Latest revision as of 21:25, 17 November 2020

Overview

The trash cleaning system at UCSC has evolved from a simple one-line cron job that removed older files from the /trash/ directory into a complex set of interlocking scripts. This discussion outlines the procedures and lock files that keep the system running safely.

diagnosing health

http://genomewiki.ucsc.edu/genecats/index.php/Custom_trash_database_machine

recovery from problems

WARNING: You do not want to go around testing out commands on this system. The trash filesystem can sometimes have literally millions of files in it and a simple ls command can be a huge problem for the performance of the system. Be very wary and careful of how you work on this vital system.

Login to hgnfs1 and check if there are any currently running processes:

ps -ef | grep -i qateam

It may be that a previous instance simply hasn't completed yet. Let it finish; you do not want to interrupt this system.

If there is nothing running, check the most recent log file to see if there is any message about the problem in:

/export/userdata/rrLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt
/export/userdata/betaLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

Or the temporary files under construction in /var/tmp/ may have the error message from a failed command. Typical file names you may find there:

-rw-rw-rw- 1 85743056 Jul 18 10:46 refreshList.O18591
-rw-rw-rw- 1  1663224 Jul 18 10:46 sessionFiles.g18585
-rw-rw-rw- 1      935 Jul 18 10:46 saveList.g18588
-rw-rw-rw- 1  1963973 Jul 18 10:46 alreadySaved.d18582
-rw-rw-rw- 1 31782116 Jul 18 11:00 trash.atime.S24127
-rw-rw-rw- 1 25326398 Jul 18 11:01 one.hour.S24127
-rw-rw-rw- 1  9604147 Jul 18 11:01 eight.hour.S24127

You will always find these two files here:

-rw-rw-rw- 1 5133861 Jul 18 11:01 rr.8hour.egrep
-rw-rw-rw- 1  650758 Jul 18 11:02 rr.72hour.egrep

They are left here for perusal; they are the listings of the files that were removed during the previous cycle of the system. If you only see these two files here, the system should have completed successfully. When it fails, it will leave some of the other temporary files behind. In fact, these removed-file listings are archived as logs in:

/export/userdata/rrLog/removed/YYYY/MM/

When any of these scripts encounter problems and do not remove their lock files, the system remains off until the lock files can be manually removed. Email is sent to hiram,galt,chmalee,braney,jgarcia when they are in this state as a reminder to check them. The log files should be examined to see if there is any real problem. The usual case is that some bottleneck was in place somewhere and the scripts merely ran into themselves after one of them failed. In this case, go to the directory /home/qateam/trashCleaners/hgwbeta and create this file:

  cd ~/trashCleaners/hgwbeta
  date > force.run

This will cause the system to run during the next cycle.

Primary trash directory

The current trash directory NFS server is on the server: hgnfs1

You can login to that machine via the qateam user.

A cron job running under the root user calls the scripts in the qateam directory. It is currently running once every 12 hours, at 04:10 and 16:10. The cluster admins maintain this root cron tab entry; it is a single command:

 /home/qateam/trashCleaners/hgwbeta/trashCleanMonitor.sh searchAndDestroy

This hgwbeta/trashCleanMonitor.sh script cleans the trash files for hgwbeta custom tracks, then calls the primary RR trashCleanMonitor.sh to do the big job of cleaning the RR custom tracks.

WARNING: You do not want to go around testing out commands on this system. The trash filesystem can sometimes have literally millions of files in it and a simple ls command can be a huge problem for the performance of the system. Be very wary and careful of how you work on this vital system.

Cleaner lock file

The trashCleanMonitor.sh script uses a lock file to prevent it from overrunning an already-running instance of these scripts. When this lock file exists, the system will not start a new instance of the cleaners. It sends email to hiram,galt,chmalee,braney,jgarcia as an alert that the cleaners are overrunning themselves. They normally will not overrun themselves if everything is OK. If a previous instance failed, the lock file remains in place to keep the cleaners off until the error is recognized and taken care of. The complete cleaner system must finish successfully to remove the lock file.
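This lock-file discipline is a common shell pattern; a minimal sketch (using a throwaway lock path, not the real /export/userdata/cleaner.pid, and with the cleaning work elided) might look like:

```shell
#!/bin/sh
# Minimal sketch of the lock-file guard pattern described above.
# LOCKFILE is a throwaway path for illustration only.
LOCKFILE="${TMPDIR:-/tmp}/cleaner.demo.pid.$$"

if [ -f "$LOCKFILE" ]; then
    # A previous instance is still running (or failed and left its lock):
    # refuse to start rather than overrun it, and let a human investigate.
    echo "lock file $LOCKFILE exists, not starting" >&2
    exit 1
fi

echo $$ > "$LOCKFILE"      # record our PID as the lock

# ... the actual cleaning work would happen here ...

rm -f "$LOCKFILE"          # removed only on successful completion
echo SUCCESS
```

Because the lock is removed only on the success path, any failure leaves it behind, which is exactly what keeps the cleaners off until someone looks at the logs.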

hgwbeta cleaner

This first script hgwbeta/trashCleanMonitor.sh has become very simple with recent (2019) updates to the custom track database system. It does call the trashCleaner.csh script, which used to have the job of moving files that belonged to sessions, but this is no longer necessary; the script has effectively become a no-op.

There is a log created by this process in:

/export/userdata/betaLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

where YYYY is the year, MM the month, DD the date, HH the hour at the time the script runs.
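Such a timestamped log path can be built directly with date(1); a small illustrative sketch, using a throwaway stand-in for the /export/userdata/betaLog base directory:

```shell
# Sketch: building a YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt log path with date(1).
# BASE is a throwaway stand-in for /export/userdata/betaLog.
BASE="${TMPDIR:-/tmp}/betaLog.demo.$$"
LOG="$BASE/$(date +%Y/%m)/cleanerLog.$(date +%Y-%m-%dT%H).txt"

mkdir -p "$(dirname "$LOG")"   # create the YYYY/MM subdirectory as needed
date > "$LOG"                  # a cleaner would append its activity here
echo "$LOG"
```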

Upon successful completion of the hgwbeta/trashCleaner.csh script, the monitor script runs an exec command for the primary RR cleaning script:

exec /home/qateam/trashCleaners/rr/trashCleanMonitor.sh searchAndDestroy

the RR cleaner

The same monitor calling script setup is working for the RR cleaner. The primary script:

/home/qateam/trashCleaners/rr/trashCleanMonitor.sh

requires the lock file initiated by the beta cleaner to exist; it will not run if the lock file /export/userdata/cleaner.pid does not exist.

This called script:

/home/qateam/trashCleaners/rr/trashCleaner.csh

performs the job of running dbTrash to clean up the customTrash database tables with a 72 hour timeout limit.

After the custom trash database tables are cleaned, the removal of trash files begins. For performance purposes, the scanning of files and times in /export/trash/ needs to be done with a minimum of impact to the filesystem. There is a single find -type f command run on the /export/trash/ filesystem performed by a called script:

/home/qateam/cronScripts/trashMonV3.sh

That file list is used by a perl script to discover the last access times of the files in trash via a stat function in:

/home/qateam/dataAnalysis/betterTrashMonitor/fileStatsFromFind.pl

This method has been tested to show that it works very rapidly through very large file listings.

Those measuring scripts, as a side effect, maintain logs of data sizes for everything in trash. Those logs accumulate in:

/home/qateam/trashLog/YYYY/MM/YYYY-MM-DD.HH:MM:SS

The result of the scanning scripts is a file listing with the last access time in seconds as temporary files in /var/tmp/

A simple awk of that last-access-time listing against the threshold expiration time produces a list of files to remove from the trash directory. Two different expiration times are in effect for different sections of the trash directory. Short-lived files that are one-time use only by the browser are removed after a one-hour expiration time. Custom track trash files and other files associated with browser-generated data that can be used repeatedly by a user session expire on a 64 hour timeout.
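That awk selection step can be sketched as follows; the listing, the field layout, and the one-hour limit here are illustrative stand-ins, not the production values:

```shell
# Sketch of the awk expiration step: given "atime path" lines, print the
# paths whose last access is older than a threshold.  The listing is
# fabricated here for illustration.
LIST="${TMPDIR:-/tmp}/atime.demo.$$"
NOW=$(date +%s)
printf '%s /export/trash/hgt/old.png\n' "$((NOW - 7200))"  > "$LIST"   # 2 hours old
printf '%s /export/trash/ct/new.bed\n'  "$NOW"            >> "$LIST"   # just accessed

# One-hour expiration: emit paths last accessed over 3600 seconds ago.
awk -v now="$NOW" -v limit=3600 'now - $1 > limit { print $2 }' "$LIST"
# prints /export/trash/hgt/old.png
```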

The RR trashCleaner.csh script accumulates log files into:

/export/userdata/rrLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

The udcCache is cleaned by the script HOME/cronScripts/udcClean.sh, which leaves a number of empty directories that are in turn cleaned by the script HOME/trashCleaners/rr/cleanUdcEmptyDirs.sh

All symLinks in trash directories are recorded by the script HOME/trashCleaners/rr/symLinkRecording.sh to facilitate recovery if the trash filesystem were to fail and be lost. And file size measurements are made on session files with the script HOME/trashCleaners/rr/sessionMeasurement.sh

When this script completes successfully, it removes the lock file: /export/userdata/cleaner.pid

The caller trashCleanMonitor.sh verifies a successful return code from trashCleaner.csh and a SUCCESS message in the cleanerLog file. If anything is failing, email is sent to hiram,galt,chmalee,braney,jgarcia.

hourly cleaning

Most files in the trash filesystem are not related to custom tracks or sessions. They are completely temporary, such as .png images used in the genome browser graphic display. There are a lot of these files and they accumulate rapidly. They are cleaned on an hourly basis with the script:

HOME/trashCleaners/quickRelease/cleanVolatiles.sh

called from a root user cron job, accumulating activity logs in:

/export/userdata/quickLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

This script protects itself with a lock file in: /export/userdata/quickCleaner.pid

This script cleans files only in the trash directories hgtIdeo, hgtSide, and hgt, and makes its listings of files with the command: ls -U

Verification that this script is running properly is checked by the qateam user cron job:

48 * * * * ~/trashCleaners/quickRelease/watchIt.sh

sending email to hiram if it is found to not be running properly.

trash measurement

To keep track of use statistics on the trash filesystem, the script mentioned above:

/home/qateam/cronScripts/trashMonV3.sh

is used by the trash cleaners and is also run on its own to periodically measure the trash filesystem.

Since the trash cleaners are only running once every 12 hours, this measurement script is run during hours when the cleaners are not running. It is on the crontab of the qateam user on hgnfs1:

43 0,13 * * * nice -n 19 ~/cronScripts/measureTrashV3.sh

This measureTrashV3.sh script calls /home/qateam/cronScripts/trashMonV3.sh and removes the temporary access-time file created in /var/tmp/

It also honors the lock file used by the trash cleaners, to prevent it from overrunning their use of the measurement system: /export/userdata/cleaner.pid

The script trashMonV3.sh also has a lock file to prevent it from overrunning itself:

/var/tmp/qaTeamTrashMonitor.pid

There is an additional measurement script running that has nothing to do with the trash cleaning:

2,22,42 * * * * /home/qateam/cronScripts/ctFileMon.sh

It makes a simple measurement of custom track files with the command:

ls -U /export/trash/ct | wc -l

These measurements are accumulating in log files in

/home/qateam/trashLog/ct/YYYY/ctFileCount.YYYY-MM.txt
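A sketch of that unsorted count, run against a throwaway stand-in directory:

```shell
# Sketch of the ctFileMon.sh measurement: count directory entries without
# sorting.  ls -U skips the sort, which matters when a directory holds
# millions of files; CTDIR is a throwaway stand-in for /export/trash/ct.
CTDIR="${TMPDIR:-/tmp}/ct.demo.$$"
mkdir -p "$CTDIR"
touch "$CTDIR/ct_aaa.bed" "$CTDIR/ct_bbb.bed" "$CTDIR/ct_ccc.bed"

COUNT=$(ls -U "$CTDIR" | wc -l)
echo "$COUNT"    # 3 files in this demo directory
```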

A measurement is made of the total data sizes consumed by sessions with the script:

HOME/cronScripts/footPrint.sh

once a day with the cron job:

20 19 * * * /home/qateam/cronScripts/footPrint.sh

customdb

The custom track database server is the customdb machine. You can login there with the qateam user.

This MySQL server has a couple of cron jobs running to help keep the customTrash database cleaned. These are qateam user cron jobs.

The customTrash database accumulates lost tables from failed custom track loads on the RR system. Their meta information doesn't get added to the metaInfo table in customTrash. Thereby, they are not cleaned out by the above-mentioned dbTrash command in the trash cleaner system running on hgnfs1. The cron job running here:

53 1,5,9,13,17,21 * * * /data/home/qateam/customTrash/cleanLostTables.sh

finds these lost tables by comparing the file listing of MySQL table files in:

/data/mysql/customTrash/

with the information in the metaInfo table. Files found that do not have metaInfo entries are candidates for removal. They are candidates because they are not removed immediately, but rather timed out from their last accessed time, just in case they are in process and may become legitimate tables. The expire time is 72 hours. The script cleanLostTables.sh uses a perl script to do the file finding and comparison with metaInfo:

/data/home/qateam/customTrash/lostTables.pl -age=72
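The on-disk-versus-metaInfo comparison can be sketched with comm(1) over two sorted name lists; both lists are fabricated here for illustration (the real script does this in perl):

```shell
# Sketch of the "lost tables" comparison: table names derived from the
# MySQL files on disk versus names recorded in the metaInfo table.
# Both lists are fabricated stand-ins.
WORK="${TMPDIR:-/tmp}/lost.demo.$$"
mkdir -p "$WORK"

printf 't_aaa\nt_bbb\nt_ccc\n' | sort > "$WORK/onDisk"    # stand-in for /data/mysql/customTrash/
printf 't_aaa\nt_ccc\n'        | sort > "$WORK/metaInfo"  # stand-in for metaInfo rows

# Names only in the on-disk list have no metaInfo row: removal candidates
# (subject to the 72-hour last-access timeout before an actual drop).
comm -23 "$WORK/onDisk" "$WORK/metaInfo"    # prints t_bbb
```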

Log files are maintained of this cleaning activity in:

/data/home/qateam/customTrash/log/YYYY/MM/

The sizes of all data tables in all custom track databases are measured by the cron job:

47 23 * * * /data/home/qateam/cronScripts/measureDb.sh

which uses the script HOME/cronScripts/measureDbSizes.pl, maintaining a log of activity in:

HOME/trashLog/YYYY/MM/dbSizes.YYYY-MM-DDTHH.txt

This log file is copied to hgnfs1 to be used by the session footprint measurement script to add these database table sizes to the session file sizes.

euroNode

Same system in place on the euroNode machine. Script called from root cron tab:

/home/qateam/trashCleaners/euroNode/trashCleanMonitor.sh

lockFile maintained in:

/data/userdata/cleaner.pid
/var/tmp/qaTeamTrashMonitor.pid

Logs accumulate in: /data/userdata/euroNodeLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

Running three times a day, at 00:10, 08:10, and 16:10

Additional qateam account cron jobs:

03 0 * * * /home/qateam/cronScripts/measureDb.sh
15 20 * * * /home/qateam/customTrash/cleanLostTables.sh
15 7 * * * /home/qateam/trashCleaners/euroNode/symLinkRecording.sh
2,7,12,17,22,27,32,37,42,47,52,57 * * * * ~/cronScripts/ctFileMon.sh
28 1 * * 3 /home/qateam/cronScripts/sessionFootPrint.sh goForIt

The session footprint data recorded once a week is copied to hgwdev into:

/hive/data/inside/euroNode/sessionFootPrint/YYYY/MM/

asiaNode

Same system in place on the asiaNode machine. Script called from root cron tab:

/home/qateam/trashCleaners/asiaNode/trashCleanMonitor.sh

lockFile maintained in:

/data/userdata/cleaner.pid
/var/tmp/qaTeamTrashMonitor.pid

Logs accumulate in: /data/userdata/asiaNodeLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

Running once a day at a variable start time, somewhere between approximately 04:00 and 09:00. I don't know why the start time varies.

Additional qateam account cron jobs related to the cleaning process:

03 7 * * * ~/cronScripts/measureDb.sh
15 7 * * * ~/trashCleaners/asiaNode/symLinkRecording.sh
2,12,22,32,42,52 * * * * ~/cronScripts/ctFileMon.sh
23 7 * * * ~/trashCleaners/asiaNode/recordSessionScan.sh
29 8 * * 3 ~/cronScripts/sessionFootPrint.sh goForIt
53 2 * * * ~/customTrash/cleanLostTables.sh

The session footprint data recorded once a week is copied to hgwdev into:

/hive/data/inside/asiaNode/sessionFootPrint/YYYY/MM/

hgwdev

Same system in place on the hgwdev machine. Script called from root cron tab:

/cluster/home/qateam/trashCleaners/hgwdev/trashCleanMonitor.sh

with logs accumulating in:

/data/apache/userdata/log/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

lockFile:

/data/apache/userdata/cleaner.pid

hgwalpha

Same system in place on the hgwalpha machine. Script called from root cron tab:

/cluster/home/qateam/trashCleaners/hgwalpha/trashCleanMonitor.sh

with logs accumulating in:

/data/apache/userdata/hgwalphaLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt

lockFile:

/data/apache/userdata/cleaner.pid

log analysis

There is a vast network of cron jobs running on Hiram's account on hgwdev that processes the logs produced by all these trash cleaners and measurement scripts. It constructs the bigBed and bigWig files, saved in a session, that display updating tracks in the browser showing all this activity, and even more from the processed Apache logs and MySQL server process-list measurements.