Usage Statistics

From Genecats

Ways to track Genome Browser usage stats

It is worth noting that the first method described (cooccurrenceCounts.c + filterTop.py) makes a matrix of track usage and can therefore provide relational information between tracks. For general statistics (e.g. total track usage, total db usage, usage by month), the second method (generateUsageStats.py) is recommended as it is more streamlined.

Using cooccurrenceCounts.c and filterTop.py

This method was developed by Chris Lee and Max Haeussler. It combines cooccurrenceCounts.c, which creates a usage matrix out of the error logs, with filterTop.py, which reduces the matrix to a chosen number of tracks to make the data more digestible. Keep in mind that every instance of an hgsid that meets the threshold is counted upon refresh. This means that a user browsing heavily will be represented more heavily (as opposed to generateUsageStats.py). Most of this section explains how to find which tracks are commonly used with each other; an example of how to get general usage stats is at the end.

cooccurrenceCounts

This program can be found in the following directory:

kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts

You may have to do a 'make' in the directory to compile it. If you get an error, it may mean you are missing an x86_64/ directory in your bin. Hiram was able to quickly diagnose this. The program can be run with no arguments to show a usage statement that includes examples of how to run it:

$ cooccurrenceCounts 
coocurrenceCounts - Experiment to build a coocurrence matrix out of lists.
usage:
   cooccurrenceCounts -db=dbName inFile outFile
options:
   -db=XXX    = ucsc db name to filter on
   -tc=XXX    = min (ie >=) filter for track counts
   -hc=XXX    = min (ie >=) filter for hgsid counts
   -hgsids    = don't make track matrix, instead print a tab separated 
                file of hgsids:hgsidUseCount and track usages, for example: 
                hgsid:hgsidUseCount trackCount1 trackCount2 ... 
   -verbose=X = logging level, log messages print to stderr
...

In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:

$ mkdir /hive/users/lrnassar/ErrorLogs

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:

$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error

This will create symlinks to all the error logs starting with the 7th of January 2018 (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months from there. The final argument, error, requests the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).

Next you will want to run cooccurrenceCounts. It is recommended that you make a new directory for the output; in this case I used /hive/users/lrnassar/MatrixData.

$ cd /hive/users/lrnassar/MatrixData
$ cooccurrenceCounts -db=hg19 -hc=400 -tc=100 ../ErrorLogs/ 2018UsageStats.txt -verbose=2

The -hc flag sets the minimum number of times an hgsid must appear for its data to be kept. In this case I chose a very high number to try to filter out bots, as I was using all of 2018's data; the examples in the usage statement use lower values. The -tc flag filters out tracks that occur fewer than the given number of times. Likewise, the cutoff chosen here is higher than in the examples. For reference, the above command took about 15 minutes to run.
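To make the effect of the two thresholds concrete, here is a minimal Python sketch of building a co-occurrence matrix with an hgsid-count filter and a track-count filter. It is an illustration only, not the logic of cooccurrenceCounts.c: the input format (a list of hgsid/track-list pairs, one per logged page load), the function name, and the order in which the filters are applied are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(page_loads, min_hgsid_count=1, min_track_count=1):
    """Build a symmetric track co-occurrence count from page loads.

    page_loads: list of (hgsid, [track, ...]) pairs, one per logged page load
    (a hypothetical, already-parsed form of the error logs).
    """
    page_loads = list(page_loads)

    # Analogous to -hc: keep only hgsids seen at least min_hgsid_count times.
    hgsid_counts = Counter(hgsid for hgsid, _ in page_loads)
    kept = [set(tracks) for hgsid, tracks in page_loads
            if hgsid_counts[hgsid] >= min_hgsid_count]

    # Analogous to -tc: keep only tracks seen at least min_track_count times.
    track_counts = Counter(t for tracks in kept for t in tracks)
    keep_tracks = {t for t, n in track_counts.items() if n >= min_track_count}

    matrix = Counter()
    for tracks in kept:
        present = sorted(tracks & keep_tracks)
        for t in present:
            matrix[(t, t)] += 1          # diagonal: total uses of the track
        for a, b in combinations(present, 2):
            matrix[(a, b)] += 1          # off-diagonal: tracks seen together
            matrix[(b, a)] += 1
    return matrix


# Toy data: hgsid "a" loads the page twice, hgsid "b" once.
example = [
    ("a", ["knownGene", "rmsk"]),
    ("a", ["knownGene", "rmsk"]),
    ("b", ["knownGene", "pubs"]),
]
print(cooccurrence(example, min_hgsid_count=2, min_track_count=2))
```

Note that every page load of a retained hgsid adds to the counts, which is why heavy users weigh more in this method.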

At this point, you should have a list of tracks you want to get stats on (for an example on just general usage, see the bottom of this section). In my example, we had created a session based around 'clinical' tracks, and we wanted to see if users often turned on other tracks in combination with them. My track list is as follows (found in /hive/users/lrnassar/MatrixData/clinicalThemeTrackNames.txt):

  • knownGene
  • ncbiRefSeqCurated
  • pubsBlat
  • pubsMarkerSnp
  • iscaPathGainCum
  • iscaPathLossCum
  • iscaPathogenic
  • clinvarMain
  • clinvarCnv
  • omimAvSnp
  • spMut
  • snp150Common
  • snp150Flagged
  • tgpPhase3
  • evsEsp6500
  • exacVariants
  • decipher
  • decipherSnvs
  • hgmd
  • lovd

Note: You have to input the track name, not the long or short label.

You can now create a list of all the tracks in your usage stats to make sure that your tracks made the cutoff. For example, if you used a very high -tc cutoff for cooccurrenceCounts, many tracks could have been filtered out. You will get an error if you try to run the python filter script on a track that doesn't exist in your matrix.

$ wc -l clinicalThemeTrackNames.txt
20 clinicalThemeTrackNames.txt
$ head -1 2018UsageStats.txt > 2018TrackList.txt
$ grep -owf clinicalThemeTrackNames.txt 2018TrackList.txt | wc -l
20

Here we can see that 20/20 tracks are in the matrix. The next step is optional, but in my case I wanted to remove all default tracks from my track list. Default tracks tend to be associated with many other tracks (especially other default tracks) simply because they are on by default. You can create your own list; the list we put together for hg19 is at /hive/users/lrnassar/MatrixData/hg19Defaults.txt.

$ grep -vwf hg19Defaults.txt clinicalThemeTrackNames.txt > clinicalThemeTrackNamesNoDefaults.txt
$ wc -l clinicalThemeTrackNamesNoDefaults.txt
16 clinicalThemeTrackNamesNoDefaults.txt

Here we see that 4 default tracks were filtered out of the original 20.

filterTop.py

Next we want to create a submatrix for each of our chosen tracks. This is done with filterTop.py. A usage statement for the script can be seen with the --help flag:

$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py --help

For our purposes, we will want to run filterTop.py on each of our tracks. We can loop the script as follows:

# temp.txt lists the input tracks, one per line
# (presumably the 16 non-default tracks selected above)
for track in $(cat temp.txt); do
        # don't include the default tracks
        $HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py \
        -n 50 -t "$track" --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt \
        /hive/users/lrnassar/MatrixData/2018UsageStatsNoPrefix.txt "$track.top50.nodefaults"
done

Keep in mind that --default-track-list is an optional flag. If you use it, follow the flag with a file listing all default tracks (or any tracks you want excluded from the output). In my case, this outputs 16 submatrices, one per input track, each containing the 50 tracks most often used in conjunction with that track. These matrices can be freely explored, but for my purposes I wanted a more digestible output. So instead I pulled out the header line of each submatrix, which lists the 50 tracks most often seen with that input track, changed all the tabs into newlines, and appended the results to a single file.

for each in *.nodefaults; do
        head -n 1 "$each" | tr '\t' '\n' >> usageResults.txt
done

Now we have a file (usageResults.txt) which has a list of all the tracks in the top 50 for each of our input tracks. Next, we sort and count the unique entries:

$ sort usageResults.txt | uniq -c | sort -rn | head -n 30 > usageResults.final.txt
$ head -n 10 usageResults.final.txt
     16 dgvPlus
     16 dgvMerged
     15 omimLocation
     15 decipher
     15 cnvDevDelay
     15 clinvarMain
     15 clinvarCnv
     15 clinvar
     14 omimAvSnp
     13 genomicSuperDup

Here we have the total results. The highest possible count is 16 (the number of input tracks). In this case, dgvMerged and dgvPlus are both in the top 50 for all 16 input tracks, which means they are very often used together with those tracks. This was useful for me to see what tracks users like to see together, but it has other applications as well.
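The header-tallying step above can also be sketched in Python. This is a minimal illustration that assumes the header line of each submatrix has already been split into a list of track names; the function name and the toy data are hypothetical.

```python
from collections import Counter


def tally_top_tracks(top_lists):
    """Count in how many top-N headers each track appears.

    top_lists: dict mapping an input track to the list of track names
    from the header line of its submatrix.
    """
    counts = Counter()
    for header in top_lists.values():
        counts.update(header)
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy data standing in for two *.top50.nodefaults header lines.
toy_headers = {
    "clinvarMain": ["dgvPlus", "decipher"],
    "hgmd": ["dgvPlus", "lovd"],
}
for track, n in tally_top_tracks(toy_headers):
    print(n, track)
```

A track whose count equals the number of input tracks appears in every top-N list.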

General Stats Example

If you just want to generate a usage matrix of say, the top 100 tracks users use overall, you can use the same python script but not designate a specific track to filter by.

#See the top 100 tracks that are used from a single error log:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 2018UsageStatsNoPrefix.txt top100Tracks2018.txt
#or with no defaults
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt 2018UsageStatsNoPrefix.txt top100Tracks2018NoDefaults.txt

Using generateUsageStats.py

This method was put together by Matthew Speir. It generates usage statistics for dbs, tracks, and hub tracks using Apache error_log files. Unlike the previous method, this script counts each hgsid only once; that is, heavy browser users are represented a single time as long as their hgsid remains the same.
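The difference between the two counting schemes can be shown with a small Python sketch. It is not taken from either program; the input format (hgsid/track pairs, one per track per logged page load) and the function name are assumptions made for illustration.

```python
from collections import Counter


def count_tracks(page_loads, once_per_hgsid):
    """Count track usage from (hgsid, track) pairs.

    once_per_hgsid=False counts every page load (as in the matrix method);
    once_per_hgsid=True counts each hgsid/track pair a single time.
    """
    if once_per_hgsid:
        page_loads = set(page_loads)  # collapse repeated loads per hgsid
    return Counter(track for _, track in page_loads)


# Toy data: user u1 refreshes three times, user u2 once.
loads = [("u1", "knownGene")] * 3 + [("u2", "knownGene"), ("u2", "rmsk")]
print(count_tracks(loads, once_per_hgsid=False))
print(count_tracks(loads, once_per_hgsid=True))
```

With per-refresh counting the heavy user dominates the total for knownGene, while once-per-hgsid counting gives each user equal weight.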

This program can be found in the following directory:

~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py

The program can be run with no arguments to show a usage statement that includes examples of how to run it:

$ ./generateUsageStats.py 
usage: generateUsageStats.py [-h] [-f FILENAME] [-d DIRNAME] [-p] [-m] [-j]
                             [-t] [-o OUTDIR]

Generates usage statistics for dbs, tracks, and hubs tracks using Apache error_log files

optional arguments:
  -h, --help            show this help message and exit
  -f FILENAME, --fileName FILENAME
                        input file name, must be space-separated Apache
                        error_log file
  -d DIRNAME, --dirName DIRNAME
                        input directory name, files must be space-separated
                        error_log files. No other files should be present in
                        this directory.
  -p, --perMonth        output file containing info on db/track/hub track use
                        per month
  -m, --monthYear       output file containing month/year pairs (e.g. "Mar
                        2017")
  -j, --jsonOut         output json files for summary dictionaries
  -t, --outputDefaults  output file containing info on default track usage for
                        top 15 most used assemblies
  -o OUTDIR, --outDir OUTDIR
                        directory in which to place output files

In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:

$ mkdir /hive/users/lrnassar/ErrorLogs

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:

$ cd /hive/users/lrnassar/ErrorLogs
$ ~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/makeLogSymLinks.sh 20180107 12 error

This will create symlinks to all the error logs starting with the 7th of January 2018 (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months from there. The final argument, error, requests the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).

Next you can call the python script to generate the usage stats. I ran the program in its own directory using the following command:

$ pwd
/hive/users/lrnassar/UsageStatsPython
$ ~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py -d ../ErrorLogs/ -o . -pm

The -d flag directs the script to the directory with the error logs, -o designates the output directory, -p tells it to include 'perMonth' numbers (as opposed to just year), and -m tells it to include month/year pairs in the output.

For all of the 2018 error logs, generateUsageStats.py took about 3.5 hours to run. With the flags chosen, it created the following files (some with pretty large sizes):

-rw-rw-r-- 1 lrnassar protein  102K Apr 16 13:00 dbCounts.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein   24K Apr 16 12:57 dbCounts.tsv
-rw-rw-r-- 1 lrnassar protein  153M Apr 16 13:00 dbUsers.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  127M Apr 16 12:57 dbUsers.tsv
-rw-rw-r-- 1 lrnassar protein   117 Apr 16 13:00 monthYear.tsv
-rw-rw-r-- 1 lrnassar protein  172M Apr 16 13:04 trackCounts.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein   68M Apr 16 13:00 trackCounts.tsv
-rw-rw-r-- 1 lrnassar protein   27M Apr 16 13:04 trackCountsHubs.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  4.2M Apr 16 13:00 trackCountsHubs.tsv
-rw-rw-r-- 1 lrnassar protein  5.8G Apr 16 13:04 trackUsers.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  5.1G Apr 16 13:00 trackUsers.tsv
-rw-rw-r-- 1 lrnassar protein  285M Apr 16 13:04 trackUsersHubs.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  258M Apr 16 13:00 trackUsersHubs.tsv

This output can be simplified to 6 primary outputs (each also has a perMonth version):

dbCounts.tsv
dbUsers.tsv
trackCounts.tsv
trackCountsHubs.tsv
trackUsers.tsv
trackUsersHubs.tsv

dbCounts.tsv

dbCounts.tsv shows which databases came up for each hgsid counted. This provides a general outline of UCSC GB assembly usage:

$ sort dbCounts.tsv -rnk2 | head
hg19	1327811
hg38	854810
mm10	248086
mm9	122045
hg18	85716
dm2	28618
dm3	24265
dm6	19343
rn6	14095
danRer10	13565
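As a small follow-up, the counts can be turned into usage shares per assembly. The sketch below assumes dbCounts.tsv is a two-column, tab-separated file (database, count) as in the output above; the function name is hypothetical, and rows that do not fit that shape (such as a possible header line) are skipped.

```python
def db_shares(lines):
    """Return {db: fraction of total count} from dbCounts.tsv-style lines."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip anything that is not "db<TAB>integer".
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        counts[fields[0]] = int(fields[1])
    total = sum(counts.values())
    return {db: n / total for db, n in counts.items()} if total else {}


# Toy input in the same format; with the real file one could instead use:
#   with open("dbCounts.tsv") as fh: shares = db_shares(fh)
toy_lines = ["hg19\t3\n", "hg38\t1\n"]
for db, share in db_shares(toy_lines).items():
    print(f"{db}\t{share:.1%}")
```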

dbUsers.tsv

dbUsers.tsv tracks the number of times an hgsid refreshed the page or the hgTracks image (e.g. zoom, change location), and what database was being used. It provides information on how much browsing users do on a particular hgsid before changing it, etc.

$ sort dbUsers.tsv -rnk3 | head
hg19	xxxxxxxxx_hgsid	1178311
hg19	xxxxxxxxx_hgsid	580230
mm9	xxxxxxxxx_hgsid	321340
hg16	xxxxxxxxx_hgsid	210375
hg19	xxxxxxxxx_hgsid	164179
hg19	xxxxxxxxx_hgsid	139952
hg19	xxxxxxxxx_hgsid	109248
hg19	xxxxxxxxx_hgsid	103357
hg19	xxxxxxxxx_hgsid	103320
hg38	xxxxxxxxx_hgsid	93817
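Because a handful of hgsids account for very large refresh counts, a per-database summary can be easier to read than the raw file. The following sketch assumes dbUsers.tsv has three tab-separated columns (database, hgsid, refresh count) as shown above; the function name is hypothetical and malformed rows are skipped.

```python
from collections import defaultdict
from statistics import median


def summarize_db_users(lines):
    """Return {db: (number of hgsids, median refresh count)}."""
    per_db = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Expect "db<TAB>hgsid<TAB>integer"; skip anything else.
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        db, _hgsid, count = fields
        per_db[db].append(int(count))
    return {db: (len(c), median(c)) for db, c in per_db.items()}


# Toy input; dbUsers.tsv itself is large, so pass the open file handle
# (which is read line by line) rather than loading it all at once.
toy_users = ["hg19\ta\t10\n", "hg19\tb\t2\n", "mm9\tc\t5\n"]
print(summarize_db_users(toy_users))
```

The median is used here rather than the mean so that a few extremely active hgsids do not dominate the summary.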

trackCounts.tsv

trackCounts.tsv shows the number of times a track was observed and which database it was seen with:

$ sort trackCounts.tsv -rnk3 | head
hg19	knownGene	1022394
hg19	cytoBandIdeo	1001358
hg19	cons100way	790623
hg38	knownGene	775122
hg19	pubs	756332
hg38	cytoBandIdeo	756204
hg38	refSeqComposite	754162
hg19	rmsk	749855
hg19	refSeqComposite	743841
hg38	ncbiRefSeqCurated	731005

trackCountsHubs.tsv

trackCountsHubs.tsv lists, for each public hub track, the name of the hub, the database it belongs to, the track within the hub, and the number of times it was seen. Only UCSC public hubs are counted, not individually added hubs. This may be useful to see which public hubs are popular and which ones do not get much use:

$ sort trackCountsHubs.tsv -rnk4 -t $'\t' | head
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapConsolidatedH	8526
FANTOM5	hg19	FANTOM_CAT_lv4_stringe	7743
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapUniS	7440
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapBIHistoneModificati	7187
Roadmap Epigenomics Integrative Analysis Hub	hg19	RoadmapConsolidatedAssaya220	6865
Roadmap Epigenomics Integrative Analysis Hub	hg19	RoadmapUnconsolidatedAssaya230	6688
Roadmap Epigenomics Integrative Analysis Hub	hg19	RoadmapUnconsolidatedAssaya220	6686
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapUCSFHistoneModificati	5901
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapMeth	5659
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapDNa	5170

trackUsers.tsv

trackUsers.tsv is similar to dbUsers.tsv, except it also records which tracks were seen by each hgsid and database.

$ sort trackUsers.tsv -rnk4 -t $'\t' | head
hg16	refGene	xxxxxxxxx_hgsid	210375
hg16	stsMap	xxxxxxxxx_hgsid	210373
hg16	snp	xxxxxxxxx_hgsid	210373
hg16	rmsk	xxxxxxxxx_hgsid	210373
hg16	mzPt1Mm3Rn3Gg2_pHMM_wig	xxxxxxxxx_hgsid	210373
hg16	mzPt1Mm3Rn3Gg2_pHMM	xxxxxxxxx_hgsid	210373
hg16	mrna	xxxxxxxxx_hgsid	210373
hg16	intronEst	xxxxxxxxx_hgsid	210373
hg16	hg17Kg	xxxxxxxxx_hgsid	210373
hg16	gap	xxxxxxxxx_hgsid	210373

trackUsersHubs.tsv

trackUsersHubs.tsv is similar to trackCountsHubs.tsv, but it includes hgsids. It shows which tracks and hubs were used by each hgsid, how many times the page or the hgTracks image was refreshed, and what database was being used.

$ sort trackUsersHubs.tsv -rnk5 -t $'\t' | head
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k4me3StdAlnsign	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k4me3StdA	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k36me3StdAlnsign	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k36me3StdA	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k27me3StdAlnsign	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k27me3StdA	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k4me3StdAlnsign	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k4me3StdA	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k36me3StdAlnsign	xxxxxxxxx_hgsid	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k36me3StdA	xxxxxxxxx_hgsid	97365