Custom trash database machine


Hiram gave the following brain dump in response to Brian Lee's question "is there a way to do a quick check of customDb's health?":

This discussion applies to any of our machines where qateam cronjobs are running that make these measurements. Just about every machine has these kinds of measurements.

You should have a login qateam@customdb available.

On that machine, there are various load monitors. If you can't remember any of this, and I certainly can't from one day to the next, check the cronjobs running:

  $ crontab -l | grep -v "^#" | sort -u

Notice the first two entries in that output:

* * * * * /data/home/qateam/loadLog/measureLoad.sh
* * * * * /data/home/qateam/loadLog/sqlProcess/measureProcessList.sh

Every minute of every day, the system load and the SQL process load are measured. To find out where the log files are, check those scripts to see where they deposit their results.
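
One quick way to do that is to grep the scripts for output redirections; the pattern below is just a guess at what to look for, not something taken from the scripts themselves:

  # look for output redirections or log-path strings inside the measurement scripts
  grep -nE '>>|log' /data/home/qateam/loadLog/measureLoad.sh \
    /data/home/qateam/loadLog/sqlProcess/measureProcessList.sh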

For example, the history of the generic system load from the measureLoad.sh script is available in customdb.load.txt:

   cd $HOME/loadLog/2020/11/

This is a record of the load averages reported by the 'w' command:

  $ tail -3 *.txt
  1605642001 2020-11-17 11:40:01 0.04, 0.36, 0.42
  1605642061 2020-11-17 11:41:01 0.11, 0.32, 0.40
  1605642121 2020-11-17 11:42:01 0.58, 0.39, 0.42

The three numbers after the time are the system load average for:

  1. 1 minute
  2. 5 minutes
  3. 15 minutes
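
The same three values are also available in /proc/loadavg on any Linux machine, which is handy for a quick spot check of the current load (this is generic Linux, not something specific to the qateam scripts):

  # print the current 1, 5 and 15 minute load averages with labels
  awk '{printf "1min: %s  5min: %s  15min: %s\n", $1, $2, $3}' /proc/loadavg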

You can consider these numbers as a measure of the CPUs in use during those averaging times. To see the number of CPUs on the computer, run the 'top' command and, while it is running, press the '1' key to expand the display to show each CPU. This won't work if your text window is too small and there are too many CPUs (such as on hgwdev). In that case, count them directly:

  grep -w processor /proc/cpuinfo | wc -l
  16

There are 16 CPUs on this computer, so any of the load averages would need to reach 16 or above to indicate the computer is completely busy. This raises the question: how busy has it been during November?

For the 1 minute average:

  $ ave -col=4 customdb.load.txt

  Q1 0.170000
  median 0.540000
  Q3 2.450000
  average 1.586044
  min 0.000000
  max 8.030000
  count 23812
  total 37766.880000
  standard deviation 1.921336

Note the highest load was 8.03, i.e. the computer became about half busy at some point. For the 5 minute average:

  $ ave -col=5 customdb.load.txt
  Q1 0.220000
  median 0.570000
  Q3 2.220000
  average 1.589406
  min 0.020000
  max 7.130000
  count 23813
  total 37848.520000
  standard deviation 1.883710

This indicates a sustained load of about 7.13 over a 5-minute window at some point. And the 15 minute load average:

  $ ave -col=6 customdb.load.txt

  Q1 0.250000
  median 0.570000
  Q3 2.000000
  average 1.593567
  min 0.050000
  max 6.560000
  count 23814
  total 37949.210000
  standard deviation 1.867608

This indicates a sustained load of about 6.56 over a 15-minute window at some point. To see this high load period, sort the file on the load average numbers:

  $ sort -k6,6n customdb.load.txt | tail
  1604220841 2020-11-01 01:54:01 7.53, 6.79, 6.39
  1604215381 2020-11-01 00:23:01 7.00, 6.54, 6.40
  1604221201 2020-11-01 01:00:01 6.12, 6.42, 6.40
  1604305681 2020-11-02 00:28:01 6.17, 6.54, 6.40
  1604215261 2020-11-01 00:21:01 6.96, 6.64, 6.42
  1604220901 2020-11-01 01:55:01 7.35, 6.89, 6.45
  1604221081 2020-11-01 01:58:01 5.89, 6.68, 6.47
  1604305621 2020-11-02 00:27:01 7.37, 6.81, 6.47
  1604221021 2020-11-01 01:57:01 6.72, 6.97, 6.55
  1604220961 2020-11-01 01:56:01 7.80, 7.13, 6.56

This indicates there was a high load in the early morning hours of Nov 1 and Nov 2. Perhaps Russian/Iranian/Chinese hacking just before the election; call Trump's lawyers.
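
Another way to scan for genuinely busy periods, given the 16 CPUs counted above, is to filter the 1-minute column against a threshold and count the matching minutes per day; the threshold of 6 here is an arbitrary choice for illustration, not anything the scripts use:

  # count minutes per day where the 1 minute load average exceeded 6
  awk '$4+0 > 6 {print $2}' customdb.load.txt | sort | uniq -c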

The other load of interest here is the SQL process load from the measureProcessList.sh script. Example logs are in:

  $HOME/loadLog/sqlProcess/customdb/2020/11

  -rw-r--r-- 1 127372 Nov 15 11:29 customdb.overLoaded.txt
  -rw-r--r-- 1 791202 Nov 17 11:56 customdb.processCount.txt

The processCount log records a single number, the count of SQL processes at each measurement:

  $ tail customdb.processCount.txt
  1605642504 2020-11-17 11:48:24 5
  1605642564 2020-11-17 11:49:24 6
  1605642625 2020-11-17 11:50:25 3
  1605642684 2020-11-17 11:51:24 4
  1605642744 2020-11-17 11:52:24 4
  1605642804 2020-11-17 11:53:24 3
  1605642864 2020-11-17 11:54:24 4
  1605642924 2020-11-17 11:55:24 2
  1605642984 2020-11-17 11:56:24 3
  1605643044 2020-11-17 11:57:24 3

Again, to see the extremes:

  ave -col=4 customdb.processCount.txt

  Q1 4.000000
  median 6.000000
  Q3 9.000000
  average 7.319073
  min 2.000000
  max 315.000000
  count 23819
  total 174333.000000
  standard deviation 7.485191

The max indicates a highest load of 315 processes. Typically this is under 10, as can be seen from the average and median measurements. When did those high numbers occur?

  $ sort -k4,4n customdb.processCount.txt | tail
  1604991504 2020-11-09 22:58:24 150
  1604991444 2020-11-09 22:57:24 152
  1604991564 2020-11-09 22:59:24 156
  1604991324 2020-11-09 22:55:24 166
  1605468264 2020-11-15 11:24:24 213
  1605468564 2020-11-15 11:29:24 252
  1605468384 2020-11-15 11:26:24 313
  1605468324 2020-11-15 11:25:24 315
  1605468444 2020-11-15 11:27:24 315
  1605468504 2020-11-15 11:28:24 315

This indicates bursts of activity on Nov 09 and Nov 15.
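
As a quick cross-check, the same timestamps can be looked up in the system load log to see whether those SQL bursts also show up as general load; for example, for the Nov 15 burst around 11:24-11:29:

  # system load during the Nov 15 burst of SQL processes
  grep '2020-11-15 11:2' $HOME/loadLog/2020/11/customdb.load.txt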

For customdb, the other thing that can slow it down is the trash system failing in some silent way so that files accumulate in the customTrash database. All the customTrash databases are measured at least once a day to see how many tables they have.

These counts are summarized somewhere else (I now forget where, probably on hgnfs1), where the database tables are combined with user session information to produce a disk-size footprint for each user and each session.
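
If you want a live number rather than the logged one, and assuming the qateam account on this machine can reach the customTrash database with hgsql, a one-liner like this gives the current table count (the access assumption is mine, not something from the cron scripts):

  # count the tables currently present in the customTrash database (assumes hgsql access)
  hgsql -N -e 'show tables' customTrash | wc -l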

To check the number of customTrash tables in the 2020*.txt.gz files, look in the directory $HOME/customTrash/log/2020/11/ and run:

  zgrep "customTrash table count" 2020-11-*.txt.gz | sed -e 's/:53.*.gz://;'

  2020-11-01T01# customTrash table count: 128618, lost table count: 60119
  2020-11-01T05# customTrash table count: 129412, lost table count: 60653
  2020-11-01T09# customTrash table count: 127547, lost table count: 58556
  2020-11-01T13# customTrash table count: 127735, lost table count: 58573
  2020-11-01T17# customTrash table count: 127220, lost table count: 57942
  2020-11-01T21# customTrash table count: 128394, lost table count: 58963
  2020-11-02T01# customTrash table count: 127843, lost table count: 58139
  2020-11-02T05# customTrash table count: 128385, lost table count: 57852
  2020-11-02T09# customTrash table count: 129591, lost table count: 58658
  2020-11-02T13# customTrash table count: 132086, lost table count: 60637
  ...
  2020-11-15T21# customTrash table count: 193959, lost table count: 92526
  2020-11-16T01# customTrash table count: 193642, lost table count: 91766
  2020-11-16T05# customTrash table count: 192187, lost table count: 89449
  2020-11-16T09# customTrash table count: 191281, lost table count: 88223
  2020-11-16T13# customTrash table count: 188919, lost table count: 85484
  2020-11-16T17# customTrash table count: 187278, lost table count: 83584
  2020-11-16T21# customTrash table count: 185332, lost table count: 81439
  2020-11-17T01# customTrash table count: 179484, lost table count: 75197
  2020-11-17T05# customTrash table count: 175944, lost table count: 71139
  2020-11-17T09# customTrash table count: 174185, lost table count: 69039

This indicates there are about 50,000 more tables by mid-month than there were at the beginning of the month. That is an interesting increase; it should be interpreted in context to see whether other months show a similar increase.
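
To get that context, the same zgrep can be run over a different month's directory; a sketch, assuming the October logs follow the same layout and naming as the November ones above:

  # same table-count extraction for the previous month, for comparison
  zgrep "customTrash table count" $HOME/customTrash/log/2020/10/2020-10-*.txt.gz | sed -e 's/:53.*.gz://;'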

Total custom trash database data sizes are recorded in the dbSizes.*.txt.gz files. For example, the growth of the volatile customTrash database:

zgrep "^customTrash" dbSizes.2020-11-*.gz | sed -e "s/T23.txt.gz//;" dbSizes.2020-11-01:customTrash 126551 456884857435 54386554120391 dbSizes.2020-11-02:customTrash 135921 474613595009 54408243234027 dbSizes.2020-11-03:customTrash 146404 486740207086 54430975320234 dbSizes.2020-11-04:customTrash 152239 511814924204 54492084892069 dbSizes.2020-11-05:customTrash 155152 536851590272 54562325429356 dbSizes.2020-11-06:customTrash 152610 537054864197 54582122028555 dbSizes.2020-11-07:customTrash 151503 539655525249 54594384240545 dbSizes.2020-11-08:customTrash 145158 536174098173 54596668242866 dbSizes.2020-11-09:customTrash 150068 548705077822 54626614572709 dbSizes.2020-11-10:customTrash 161357 580738630621 54667650436160 dbSizes.2020-11-11:customTrash 168511 608853763357 54712208402398 dbSizes.2020-11-12:customTrash 176309 613404617039 54724222211728 dbSizes.2020-11-13:customTrash 185797 629828952644 54750836619522 dbSizes.2020-11-14:customTrash 193967 632577360578 54758332752420 dbSizes.2020-11-15:customTrash 191794 684326729777 54815540942291 dbSizes.2020-11-16:customTrash 178136 672132558933 54864740256784

The second column is the table count, the third column is the total data size, and the fourth column is a running sum of the total data sizes across *all* custom databases, including those saved in sessions.
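
A small sketch for making that third column easier to read and for showing the day-over-day growth; it assumes the size columns are plain byte counts, which is my reading of the numbers rather than something stated in the logs:

  # customTrash data size per day in GiB, with the day-over-day change
  zgrep "^customTrash" dbSizes.2020-11-*.gz | sed -e "s/T23.txt.gz//;" \
    | awk '{ gib = $3/1024/1024/1024;
             printf "%s %8.1f GiB", $1, gib;
             if (NR > 1) printf "  %+7.1f GiB", gib - prev;
             printf "\n"; prev = gib }'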

For example, to see all the databases for one measurement:

zgrep "^cus" dbSizes.2020-11-16T23.txt.gz customData01 17027 1733623853211 1733623853211 customData02 16542 1583661469806 3317285323017 customData03 16616 1505077569155 4822362892172 customData04 17343 1942792764551 6765155656723 customData05 17231 2038955718209 8804111374932 customData06 15297 1818661140667 10622772515599 customData07 16224 1747626883443 12370399399042 customData08 15074 1778492464429 14148891863471 customData09 16252 1615325673918 15764217537389 customData10 15567 1765177995654 17529395533043 customData11 15793 1825295432730 19354690965773 customData12 13257 1553064680189 20907755645962 customData13 17289 1735929427934 22643685073896 customData14 16036 1595059807477 24238744881373 customData15 14314 1681875719223 25920620600596 customData16 15916 1506578284153 27427198884749 customData17 14883 1484911766640 28912110651389 customData18 16804 1558753395873 30470864047262 customData19 15970 1670341988368 32141206035630 customData20 16906 1990258887228 34131464922858 customData21 16025 1713172309009 35844637231867 customData22 18756 1817254998867 37661892230734 customData23 16382 2229348900580 39891241131314 customData24 16759 2209291331072 42100532462386 customData25 17106 1747200198566 43847732660952 customData26 16743 1797549572142 45645282233094 customData27 17395 2113219548418 47758501781512 customData28 15610 1572740030335 49331241811847 customData29 16865 1892634727883 51223876539730 customData30 19081 1464686339024 52688562878754 customData31 14461 1504044819097 54192607697851 customTrash 178136 672132558933 54864740256784

Let's see if the number of tables is increasing in saved session data:

  $ for F in dbSizes.*; do printf "%s\t" "$F"; zgrep "^customData" ${F} | ave -col=2 stdin | grep total; done | sed -e 's/dbSizes.//; s/T23.txt.gz//;'
  2020-11-01 total 502450.000000
  2020-11-02 total 502869.000000
  2020-11-03 total 503204.000000
  2020-11-04 total 503473.000000
  2020-11-05 total 503773.000000
  2020-11-06 total 504038.000000
  2020-11-07 total 504068.000000
  2020-11-08 total 504144.000000
  2020-11-09 total 504451.000000
  2020-11-10 total 504659.000000
  2020-11-11 total 504830.000000
  2020-11-12 total 504961.000000
  2020-11-13 total 505137.000000
  2020-11-14 total 505217.000000
  2020-11-15 total 505275.000000
  2020-11-16 total 505524.000000

About three thousand new saved session tables in the first half of November. Not entirely unusual.
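
The same loop can be pointed at the data-size column instead of the table count, to see whether saved-session data volume is growing along with the number of tables (col=3 is the per-database data size shown in the listing above):

  # total data size across the customData* databases, one line per dbSizes file
  for F in dbSizes.*; do
    printf "%s\t" "$F"
    zgrep "^customData" ${F} | ave -col=3 stdin | grep total
  done | sed -e 's/dbSizes.//; s/T23.txt.gz//;'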

As you can see, it would be possible to delve into these logs to identify trends and specific hot spots. I see nothing particularly unusual except for the growth of about 50,000 tables during the month, which seems to be due to increased activity in loading temporary custom tracks that do not go into sessions.