Usage Statistics
Ways to track Genome Browser usage stats
It is worth noting that the first method described (cooccurrenceCounts.c + filterTop.py) makes a matrix of track usage, and can therefore provide relational information between tracks. For general statistics (e.g. total track usage, total dbs used, usage by month) the second method (generateUsageStats.py) is recommended, as it is more streamlined.
- Note: For more information on the error logs, see the wiki page (http://genomewiki.ucsc.edu/genecats/index.php/Apache_error_log_output).
Using cooccurrenceCounts.c and filterTop.py
This method was developed by Chris Lee and Max Haussler. It uses cooccurrenceCounts.c, which builds a usage matrix out of the error logs, followed by filterTop.py, which filters a certain number of tracks out of the matrix to make the data more digestible. Keep in mind that every instance of an hgsid that meets the threshold is counted on each refresh, so a user who browses heavily is represented more heavily (as opposed to generateUsageStats.py, which counts each hgsid once). Most of this section explains how to find which tracks are commonly used with each other; an example of how to get general usage stats appears at the end.
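The difference between the two counting schemes can be illustrated with a toy example (hypothetical hgsids and track names, not the real log format or either program's actual code):

```python
from collections import Counter

# Toy per-request records: (hgsid, track) pairs, one per page refresh.
# A heavy user ("A") refreshes ten times; a light user ("B") once.
requests = [("A", "knownGene")] * 10 + [("B", "knownGene")]

# cooccurrenceCounts-style: every refresh counts, so heavy users dominate.
per_refresh = Counter(track for _, track in requests)

# generateUsageStats-style: each hgsid counts once per track.
per_user = Counter(track for _, track in set(requests))

print(per_refresh["knownGene"])  # 11
print(per_user["knownGene"])     # 2
```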
cooccurrenceCounts
This program can be found in the following directory:
kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts
You may have to run 'make' in the directory to compile it. If you get an error, it may mean you are missing an x86_64/ directory in your bin directory; Hiram was able to quickly diagnose this. The program can be run with no arguments to show a usage statement that includes examples of how to run it:
$ cooccurrenceCounts
cooccurrenceCounts - Experiment to build a coocurrence matrix out of lists.
usage:
   cooccurrenceCounts -db=dbName inFile outFile
options:
   -db=XXX = ucsc db name to filter on
   -tc=XXX = min (ie >=) filter for track counts
   -hc=XXX = min (ie >=) filter for hgsid counts
   -hgsids = don't make track matrix, instead print a tab separated
             file of hgsids:hgsidUseCount and track usages, for example:
             hgsid:hgsidUseCount trackCount1 trackCount2 ...
   -verbose=X = logging level, log messages print to stderr
...
In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:
$ mkdir /hive/users/lrnassar/ErrorLogs
Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:
$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error
This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months from there. The final argument, error, indicates that we want the error logs, which is what the data is extracted from. This particular command created symlinks to ~200 files (all of 2018).
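The span of months covered by a start date and month count can be sanity-checked with some simple date arithmetic. This is only an illustration of which months "20180107 12" covers under the assumed semantics, not the script's actual implementation:

```python
from datetime import date

# Start date and month count, as passed to makeLogSymLinks.sh (assumed semantics).
start = date(2018, 1, 7)
n_months = 12

# Enumerate the (year, month) pairs covered going forward from the start.
months = [((start.year + (start.month - 1 + i) // 12),
           (start.month - 1 + i) % 12 + 1)
          for i in range(n_months)]

print(months[0], months[-1])  # (2018, 1) (2018, 12)
```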
Next you will want to run cooccurrenceCounts. It is recommended that you make a new directory for the output; in this case I chose /hive/users/lrnassar/MatrixData.
$ cd /hive/users/lrnassar/MatrixData
$ cooccurrenceCounts -db=hg19 -hc=400 -tc=100 ../ErrorLogs/ 2018UsageStats.txt -verbose=2
The -hc flag stipulates how many times an hgsid needs to be present for its data not to be filtered out. In this case I chose a very high number to try to filter out bots, since I was using an entire year's worth of data; note that the cutoffs in the example usage statements are not as high. The -tc flag filters out tracks that do not occur very often; likewise, the chosen cutoff is higher than in the examples. For reference, the above command took about 15 minutes to run.
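The effect of a high -hc cutoff can be sketched with toy counts (hypothetical hgsids; the real filtering happens inside the C program). Bots that pick up a fresh hgsid on every hit never accumulate many uses under a single id, so they fall below the threshold:

```python
from collections import Counter

# Toy hgsid use counts; crawlers that get a new hgsid per request
# never accumulate many uses under a single id.
hgsid_counts = Counter({"real_user": 5000, "casual_user": 12,
                        "crawler_hit_1": 1, "crawler_hit_2": 1})

hc = 400  # mirrors -hc=400: keep hgsids with >= 400 uses
kept = {h: n for h, n in hgsid_counts.items() if n >= hc}
print(sorted(kept))  # ['real_user']
```

Note that a cutoff this high also drops light real users, which is the tradeoff accepted here for a full year of data.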
At this point, you should have a list of tracks you want stats on (for an example of just general usage, see the bottom of this section). In my example, we had created a session based around 'clinical' tracks, and we wanted to see whether users often turned on other tracks in combination with them. My track list is as follows (found in /hive/users/lrnassar/MatrixData/clinicalThemeTrackNames.txt):
- knownGene
- ncbiRefSeqCurated
- pubsBlat
- pubsMarkerSnp
- iscaPathGainCum
- iscaPathLossCum
- iscaPathogenic
- clinvarMain
- clinvarCnv
- omimAvSnp
- spMut
- snp150Common
- snp150Flagged
- tgpPhase3
- evsEsp6500
- exacVariants
- decipher
- decipherSnvs
- hgmd
- lovd
Note: You have to input the track name, not the long/short label
You can now create a list of all the tracks in your usage stats to make sure that your tracks made the cutoff. For example, if you used a very high -tc cutoff for cooccurrenceCounts, many tracks could have been filtered out. You will get an error if you try to run the Python filter script on a track that doesn't exist in your matrix.
$ wc -l clinicalThemeTrackNames.txt
20
$ head -1 2018UsageStats.txt > 2018TrackList.txt
$ grep -owf clinicalThemeTrackNames.txt 2018TrackList.txt | wc -l
20
Here we can see that all 20 of our tracks are in the matrix. The next step is optional, but in my case I wanted to remove all default tracks from my track list. Default tracks tend to be associated with many other tracks (especially other default tracks), as they are on... by default. You can create your own list; the list we put together for hg19 is in /hive/users/lrnassar/MatrixData/hg19Defaults.txt.
$ grep -vwf hg19Defaults.txt clinicalThemeTrackNames.txt > clinicalThemeTrackNamesNoDefaults.txt
$ wc -l clinicalThemeTrackNamesNoDefaults.txt
16 clinicalThemeTrackNamesNoDefaults.txt
filterTop.py
Here we see that 4 default tracks were filtered out of the original 20. Next we want to create a submatrix for each of our chosen tracks. This is done with filterTop.py. A usage statement for the script can be seen with the --help flag:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py --help
For our purposes, we will want to run filterTop.py on each of our tracks. We can loop the script as follows:
# don't include the default tracks
for track in $(cat temp.txt); do
    $HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py \
        -n 50 -t $track --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt \
        /hive/users/lrnassar/MatrixData/2018UsageStatsNoPrefix.txt $track.top50.nodefaults
done
Keep in mind that --default-track-list is an optional flag; if you use it, you follow the flag with a file of all default tracks (or any tracks you want excluded from the output). In my case, this outputs 16 submatrices, one per input track, each containing the 50 tracks most often used in conjunction with that input track. These matrices can be freely explored, but for my purposes I wanted a more digestible output, so instead I pulled out the header line (which lists the 50 tracks most often seen with each input track), changed all the tabs into newlines, and appended them all to a single file.
for each in $(ls *.nodefaults); do
    head -n 1 $each | sed 's/\t/\n/g' >> usageResults.txt
done
Now we have a file (usageResults.txt) containing a list of all the tracks in the top 50 for each of our input tracks. Next, we sort and count the unique entries:
$ cat usageResults.txt | sort | uniq -c | sort -r | head -n 30 > usageResults.final.txt
$ cat usageResults.final.txt | head -n 10
     16 dgvPlus
     16 dgvMerged
     15 omimLocation
     15 decipher
     15 cnvDevDelay
     15 clinvarMain
     15 clinvarCnv
     15 clinvar
     14 omimAvSnp
     13 genomicSuperDup
Here we have the final results; the highest possible count is 16 (the original number of input tracks). In this case, dgvMerged and dgvPlus are both in the top 50 usage for all 16 input tracks, meaning they are very often used together with those tracks. This was useful for me to see which tracks users like to see together, but it has other applications as well.
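The sort | uniq -c tally above can equivalently be done in Python. A minimal sketch, with toy top lists standing in for the *.nodefaults header lines:

```python
from collections import Counter

# Each sublist stands in for the header line of one *.top50.nodefaults
# file: the tracks most often co-occurring with one input track (toy data).
top_lists = [
    ["dgvPlus", "dgvMerged", "decipher"],
    ["dgvPlus", "dgvMerged", "clinvarMain"],
    ["dgvPlus", "omimAvSnp", "clinvarMain"],
]

# Equivalent of: cat usageResults.txt | sort | uniq -c | sort -r
tally = Counter(track for lst in top_lists for track in lst)
for track, n in tally.most_common(3):
    print(n, track)  # dgvPlus appears in all 3 lists, so it tops the tally
```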
General Stats Example
If you just want to generate a usage matrix of, say, the top 100 tracks used overall, you can use the same Python script without designating a specific track to filter by.
# See the top 100 tracks that are used, from a single error log:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 2018UsageStatsNoPrefix.txt top100Tracks2018.txt

# Or with no defaults:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt 2018UsageStatsNoPrefix.txt top100Tracks2018NoDefaults.txt
Using generateUsageStats.py
This method was put together by Matthew Speir. It generates usage statistics for dbs, tracks, and hub tracks using Apache error_log files. This script counts each hgsid only a single time (as opposed to the previous method); that is to say, heavy browser users are only represented once, as long as their hgsid remains the same.
This program can be found in the following directory:
~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py
The program can be run with no arguments to show a usage statement:
$ ./generateUsageStats.py
usage: generateUsageStats.py [-h] [-f FILENAME] [-d DIRNAME] [-p] [-m] [-j]
                             [-t] [-o OUTDIR]

Generates usage statistics for dbs, tracks, and hubs tracks using Apache
error_log files

optional arguments:
  -h, --help            show this help message and exit
  -f FILENAME, --fileName FILENAME
                        input file name, must be space-separated Apache
                        error_log file
  -d DIRNAME, --dirName DIRNAME
                        input directory name, files must be space-separated
                        error_log files. No other files should be present in
                        this directory.
  -p, --perMonth        output file containing info on db/track/hub track use
                        per month
  -m, --monthYear       output file containing month/year pairs (e.g. "Mar 2017")
  -j, --jsonOut         output json files for summary dictionaries
  -t, --outputDefaults  output file containing info on default track usage
                        for top 15 most used assemblies
  -o OUTDIR, --outDir OUTDIR
                        directory in which to place output files
In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:
$ mkdir /hive/users/lrnassar/ErrorLogs
Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:
$ cd /hive/users/lrnassar/ErrorLogs
$ ~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/makeLogSymLinks.sh 20180107 12 error
This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months from there. The final argument, error, indicates that we want the error logs, which is what the data is extracted from. This particular command created symlinks to ~200 files (all of 2018).
Next you can call the python script to generate the usage stats. I ran the program in its own directory using the following command:
$ pwd
/hive/users/lrnassar/UsageStatsPython
$ ~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py -d ../ErrorLogs/ -o . -pm
The -d flag directs the script to the directory with the error logs, and -o designates the output directory. The combined -pm flag is shorthand for -p and -m: -p tells it to include 'perMonth' numbers (as opposed to just the year), and -m tells it to include month/year pairs in the output.
For all of the 2018 error logs, generateUsageStats.py took about 3.5 hours to run. With the flags chosen, it created the following files (some with pretty large sizes):
-rw-rw-r-- 1 lrnassar protein 102K Apr 16 13:00 dbCounts.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  24K Apr 16 12:57 dbCounts.tsv
-rw-rw-r-- 1 lrnassar protein 153M Apr 16 13:00 dbUsers.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein 127M Apr 16 12:57 dbUsers.tsv
-rw-rw-r-- 1 lrnassar protein  117 Apr 16 13:00 monthYear.tsv
-rw-rw-r-- 1 lrnassar protein 172M Apr 16 13:04 trackCounts.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  68M Apr 16 13:00 trackCounts.tsv
-rw-rw-r-- 1 lrnassar protein  27M Apr 16 13:04 trackCountsHubs.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein 4.2M Apr 16 13:00 trackCountsHubs.tsv
-rw-rw-r-- 1 lrnassar protein 5.8G Apr 16 13:04 trackUsers.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein 5.1G Apr 16 13:00 trackUsers.tsv
-rw-rw-r-- 1 lrnassar protein 285M Apr 16 13:04 trackUsersHubs.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein 258M Apr 16 13:00 trackUsersHubs.tsv
This output can be simplified to 6 primary outputs:

dbCounts.tsv
dbUsers.tsv
trackCounts.tsv
trackCountsHubs.tsv
trackUsers.tsv
trackUsersHubs.tsv
dbCounts.tsv
dbCounts.tsv shows which databases came up for each hgsid counted. This provides a general outline of UCSC GB assembly usage:
$ sort dbCounts.tsv -rnk2 | head
hg19      1327811
hg38      854810
mm10      248086
mm9       122045
hg18      85716
dm2       28618
dm3       24265
dm6       19343
rn6       14095
danRer10  13565
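The same ranking can be done in Python when you want to post-process the counts further. A minimal sketch, using an in-memory stand-in for dbCounts.tsv (the real file lives on hive):

```python
import io

# Stand-in for dbCounts.tsv: one "<db>\t<count>" row per assembly.
tsv = io.StringIO("mm10\t248086\nhg19\t1327811\nhg38\t854810\n")

# Equivalent of: sort dbCounts.tsv -rnk2 | head
rows = [line.rstrip("\n").split("\t") for line in tsv]
rows.sort(key=lambda r: int(r[1]), reverse=True)
print(rows[0])  # ['hg19', '1327811']
```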
dbUsers.tsv
dbUsers.tsv tracks the number of times each hgsid refreshed the page or the hgTracks image (e.g. zoomed, changed location), and which database was being used. It provides information on how much browsing users do on a particular hgsid before changing it.
$ sort dbUsers.tsv -rnk3 | head
hg19 xxxxxxxxx_hgsid 1178311
hg19 xxxxxxxxx_hgsid 580230
mm9  xxxxxxxxx_hgsid 321340
hg16 xxxxxxxxx_hgsid 210375
hg19 xxxxxxxxx_hgsid 164179
hg19 xxxxxxxxx_hgsid 139952
hg19 xxxxxxxxx_hgsid 109248
hg19 xxxxxxxxx_hgsid 103357
hg19 xxxxxxxxx_hgsid 103320
hg38 xxxxxxxxx_hgsid 93817
trackCounts.tsv
trackCounts.tsv shows the number of times a track was observed and which database it was seen with:
$ sort trackCounts.tsv -rnk3 | head
hg19 knownGene         1022394
hg19 cytoBandIdeo      1001358
hg19 cons100way        790623
hg38 knownGene         775122
hg19 pubs              756332
hg38 cytoBandIdeo      756204
hg38 refSeqComposite   754162
hg19 rmsk              749855
hg19 refSeqComposite   743841
hg38 ncbiRefSeqCurated 731005
trackCountsHubs.tsv
trackCountsHubs.tsv shows the number of times a public hub track was seen (only UCSC public hubs are counted, not individually added hubs), along with the name of the hub, the track within the hub, and the database it belongs to. This may be useful for seeing which public hubs are popular and which ones do not get much use:
$ sort trackCountsHubs.tsv -rnk4 -t $'\t' | head
Roadmap Epigenomics Data Complete Collection at Wash U VizHub  hg19  RoadmapConsolidatedH  8526
FANTOM5  hg19  FANTOM_CAT_lv4_stringe  7743
Roadmap Epigenomics Data Complete Collection at Wash U VizHub  hg19  RoadmapUniS  7440
Roadmap Epigenomics Data Complete Collection at Wash U VizHub  hg19  RoadmapBIHistoneModificati  7187
Roadmap Epigenomics Integrative Analysis Hub  hg19  RoadmapConsolidatedAssaya220  6865
Roadmap Epigenomics Integrative Analysis Hub  hg19  RoadmapUnconsolidatedAssaya230  6688
Roadmap Epigenomics Integrative Analysis Hub  hg19  RoadmapUnconsolidatedAssaya220  6686
Roadmap Epigenomics Data Complete Collection at Wash U VizHub  hg19  RoadmapUCSFHistoneModificati  5901
Roadmap Epigenomics Data Complete Collection at Wash U VizHub  hg19  RoadmapMeth  5659
Roadmap Epigenomics Data Complete Collection at Wash U VizHub  hg19  RoadmapDNa  5170
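To rank whole hubs rather than individual hub tracks, the per-track counts can be summed per hub. A sketch with toy rows mirroring the four trackCountsHubs.tsv columns (hub name, db, track, count); the hub and track names here are shortened stand-ins:

```python
from collections import defaultdict

# Toy rows in trackCountsHubs.tsv column order: (hubName, db, track, count).
rows = [
    ("Roadmap VizHub", "hg19", "RoadmapConsolidatedH", 8526),
    ("FANTOM5", "hg19", "FANTOM_CAT_lv4", 7743),
    ("Roadmap VizHub", "hg19", "RoadmapUniS", 7440),
]

# Sum track counts within each hub to get a per-hub usage total.
totals = defaultdict(int)
for hub, _db, _track, count in rows:
    totals[hub] += count

print(max(totals, key=totals.get))  # Roadmap VizHub
```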
trackUsers.tsv
trackUsers.tsv is similar to dbUsers.tsv, except that it also records which tracks were seen for each hgsid and database.
$ sort trackUsers.tsv -rnk4 -t $'\t' | head
hg16 refGene                 xxxxxxxxx_hgsid 210375
hg16 stsMap                  xxxxxxxxx_hgsid 210373
hg16 snp                     xxxxxxxxx_hgsid 210373
hg16 rmsk                    xxxxxxxxx_hgsid 210373
hg16 mzPt1Mm3Rn3Gg2_pHMM_wig xxxxxxxxx_hgsid 210373
hg16 mzPt1Mm3Rn3Gg2_pHMM     xxxxxxxxx_hgsid 210373
hg16 mrna                    xxxxxxxxx_hgsid 210373
hg16 intronEst               xxxxxxxxx_hgsid 210373
hg16 hg17Kg                  xxxxxxxxx_hgsid 210373
hg16 gap                     xxxxxxxxx_hgsid 210373
trackUsersHubs.tsv
trackUsersHubs.tsv is similar to trackCountsHubs.tsv, but it includes hgsids. It shows which hub tracks were used by each hgsid, how many times hgTracks or its image was refreshed, and which database was being used.
$ sort trackUsersHubs.tsv -rnk5 -t $'\t' | head
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneK562H3k4me3StdAlnsign     xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneK562H3k4me3StdA           xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneK562H3k36me3StdAlnsign    xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneK562H3k36me3StdA          xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneK562H3k27me3StdAlnsign    xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneK562H3k27me3StdA          xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneGm12878H3k4me3StdAlnsign  xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneGm12878H3k4me3StdA        xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneGm12878H3k36me3StdAlnsign xxxxxxxxx_hgsid  97365
ENCODE Analysis Hub  hg19  wgEncodeUwHistoneGm12878H3k36me3StdA       xxxxxxxxx_hgsid  97365