Apache error log output
Here are some examples of errors we often see but that are of no concern:
This is seen right after a push of new CGIs and is caused by browsers having a stale version of the CGIs/CSS files. A bunch of these will show up right after the push, but they should subside after several minutes as browsers grab new versions.
<pre>
Cannot find correct version of file 'HGStyle.css'; this is due to an installation error, referer: http://hgw0.gi.ucsc.edu/
File does not exist: /data/apache/htdocs/js/jquery.imgareaselect-v291.js, referer: http://genome.ucsc.edu/cgi-bin/hgTracks
</pre>
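To confirm they are subsiding, you can count occurrences in the live error log and re-run a few minutes later; a minimal sketch using the first message above:
<pre>
# count of stale-CGI complaints so far; re-run after a few minutes and
# the number should stop growing once browsers have the new files
$ grep -c "Cannot find correct version" /usr/local/apache/logs/error_log
</pre>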
We also see these often, but they tend to right themselves as the load on the browser waxes and wanes.
<pre>
Too many popAbortHandlers, referer: http://genome.ucsc.edu/cgi-bin/hgTables?position=chr4:150237197-150248873&hgsid=353372405&refGene=pack&hgFind.matches=NM_023119,
</pre>
== How to look at logs ==
When you are on a machine, you can look at the logs by following this file:
<pre>
/usr/local/apache/logs/error_log
</pre>
Older logs have names like this:
<pre>
/usr/local/apache/logs/error_log.1
/usr/local/apache/logs/error_log.2
</pre>
Example tail command:
<pre>
$ tail -f /usr/local/apache/logs/error_log | grep -v "CGI_TIME\|trackLog"
</pre>
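To watch for errors from a single CGI, you can add a positive grep on top of the same tail; hgTables here is only an example:
<pre>
# follow the live log, drop the noisy CGI_TIME/trackLog lines, and keep
# only messages mentioning hgTables
$ tail -f /usr/local/apache/logs/error_log | grep -v "CGI_TIME\|trackLog" | grep "hgTables"
</pre>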
== How to find historic logs ==
If you want to look at historical logs, they are kept here on hive:
<pre>
/hive/data/inside/wwwstats/RR/2018/
/hive/data/inside/wwwstats/RR/2019/
</pre>
Let's say you wanted to look at specific events on Feb. 14th on the hgw1 machine for 2019:
<pre>
$ zgrep "Feb 14" /hive/data/inside/wwwstats/RR/2019/hgw1/error_log.20190211.gz | less
</pre>
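To check the same date across several machines at once, a small loop works; this sketch assumes the per-machine directories are named hgw1, hgw2, and so on, as shown elsewhere on this page:
<pre>
# run the same zgrep against each machine's copy of that week's log
$ for m in hgw1 hgw2; do
    echo "== $m =="
    zgrep "Feb 14" /hive/data/inside/wwwstats/RR/2019/$m/error_log.20190211.gz
  done | less
</pre>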
Ask Matt for more ways to use these logs, such as building a temporary directory of symlinks to the logs for specific date ranges to do more complicated queries.
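A minimal sketch of that symlink approach, assuming you want hgw1's error logs for all of February 2019 (the scratch directory name is illustrative):
<pre>
# gather one machine's February 2019 error logs in a scratch directory,
# then query them all at once
$ mkdir -p /hive/users/$USER/FebLogs
$ cd /hive/users/$USER/FebLogs
$ ln -s /hive/data/inside/wwwstats/RR/2019/hgw1/error_log.201902*.gz .
$ zgrep "Feb 14" *.gz | less
</pre>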
== Example of accessing /hive/data/inside/wwwstats/RR/ for an IP ==
Error log location for hgw2:
<pre>
/hive/data/inside/wwwstats/RR/2019/hgw2
</pre>
The last two path components are $year and $machine. Within that directory:
<pre>
-rw-rw-r-- 1 hiram genecats 498M Mar 28 10:40 access_log
-rw-rw-r-- 1 hiram genecats 891M Mar 28 10:40 error_log
-rw-r--r-- 1 hiram protein   54M Jan 14 03:10 access_log.20190106.gz
-rw-rw-r-- 1 hiram genecats  48M Jan 20 03:30 access_log.20190114.gz
-rw-rw-r-- 1 hiram genecats  58M Jan 28 03:41 access_log.20190120.gz
...
</pre>
There is an access_log, which is the latest log (within the last week); the same goes for the error_log. Once a log is more than a week old, it gets compressed, and the file name becomes the year, month, and day of the start of that week, e.g. access_log.20190120.gz.
If you want to research what a recent offending IP is doing, check each machine in turn. In this case we saw some activity on hgw2, so we do the following:
<pre>
$ cat access_log | awk '{print $1}' | sort | uniq -c | sort -r | head
  33386 54.82.172.170
  28818 130.91.194.123
  16993 130.223.121.12   (Offender) (16993 hits on hgw2)
</pre>
=== Ex. Summarizing specific 130.223.121.12 IP actions ===
<pre>
$ cat access_log | grep "130.223.121.12" | awk '{print $7}' | cut -f 3 -d '/' | cut -f 1 -d "?" | sort | uniq -c | sort -r | head
  16905 hgTables
     18 hgt
     14 hgTracks
</pre>
So this 130.223.121.12 IP is hammering hgTables.
=== Ex. Checking specific 130.223.121.12 IP requests ===
<pre>
$ cat access_log | grep "130.223.121.12" | head -n 3
130.223.121.12 - - [25/Mar/2019:08:12:27 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 1222212 "-" "Python-urllib/2.7" 1257605 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:30 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 167045 "-" "Python-urllib/2.7" 2063636 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:31 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 681933 "-" "Python-urllib/2.7" 2170569 microseconds
</pre>
The results show Python-urllib/2.7, which tells us they are using Python to query programmatically, and POST passes parameters without using the URL, so we don't know what they're searching for. We do see that they are issuing multiple queries a minute without waiting for a response, which is made more evident by the almost 60 minute backlog they have currently built up.
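To quantify how fast an IP is hitting the server, you can bucket its requests by minute; this sketch assumes the access_log layout shown above, where field 4 is the timestamp:
<pre>
# requests per minute for this IP; characters 2-18 of the timestamp
# field ([25/Mar/2019:08:12:27) cover the date through the minute
$ grep "130.223.121.12" access_log | awk '{print $4}' | cut -c 2-18 | sort | uniq -c | sort -rn | head
</pre>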
== Ways to track Genome Browser usage stats ==
=== Using cooccurrenceCounts.c and filterTop.py ===
This method was developed by Chris Lee and Max Haussler. It uses a combination of coocurrenceCounts.c, which creates a usage matrix out of the error logs, and filterTop.py, which filters the matrix down to a certain number of tracks to make the data more digestible.
==== cooccurrenceCounts ====
This program can be found in the following directory:
<pre>
kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts
</pre>
You may have to run 'make' in the directory to compile it. If you get an error, it may mean you are missing an x86_64/ directory in your bin; Hiram was able to quickly diagnose this.
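A sketch of the fix and compile step; the ~/bin/x86_64 location is an assumption based on the missing-directory error just described:
<pre>
# create the bin directory the kent build installs into, then compile
$ mkdir -p ~/bin/x86_64
$ cd kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts
$ make
</pre>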
The program can be run with no arguments to show a usage statement that includes examples of how to run it:
<pre>
$ cooccurrenceCounts
coocurrenceCounts - Experiment to build a coocurrence matrix out of lists.
usage:
cooccurrenceCounts -db=dbName inFile outFile
options:
-db=XXX = ucsc db name to filter on
-tc=XXX = min (ie >=) filter for track counts
-hc=XXX = min (ie >=) filter for hgsid counts
-hgsids = don't make track matrix, instead print a tab separated
file of hgsids:hgsidUseCount and track usages, for example:
hgsid:hgsidUseCount trackCount1 trackCount2 ...
-verbose=X = logging level, log messages print to stderr
...
</pre>
In order to run the program, you must first put together a directory of the error logs you are interested in extracting from. The best way to do this is with makeLogSymLinks.sh. First, make a directory on hive:
<pre>
$ mkdir /hive/users/lrnassar/ErrorLogs
</pre>
Next, run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:
<pre>
$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error
</pre>
This will create symlinks to all the error logs starting with January 7, 2018 (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months. The 'error' argument states that we want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).
Next, run cooccurrenceCounts. It is recommended you make a new directory for the output; in this case I chose /hive/users/lrnassar/MatrixData.
<pre>
$ cd /hive/users/lrnassar/MatrixData
$ cooccurrenceCounts -db=hg19 -hc=400 -tc=100 ../ErrorLogs/ out.txt -verbose=2
</pre>
The -hc flag stipulates how many times an hgsid needs to be present for its data not to be filtered out. In this case I chose a very high number to try to filter out bots, since I was using the entire year of 2018's worth of data; note the example usage statements use lower cutoffs. The -tc flag filters out tracks that do not occur very often; likewise the cutoff chosen here is higher than in the examples. For reference, the above command took about 15 minutes to run.
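Before moving on to filterTop.py, it can help to sanity-check the matrix; this sketch assumes out.txt is tab-separated, which is a guess based on the usage statement above:
<pre>
# rough size check on the matrix: number of rows and number of columns
$ wc -l out.txt
$ head -1 out.txt | awk -F'\t' '{print NF}'
</pre>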