HgFindSpec


hgFindSpec How-To

Introduction

When a user types a search term into the position box in hgTracks or hgTables, the search is performed by kent/src/hg/lib/hgFind.c's findGenomePos() or findGenomePosWeb(). The set of tables to be searched and the methods for searching each table used to be hardcoded into hgFind.c. Since the spring of 2004, the trackDb.ra files have defined the set of tables to be searched, the order of search, and the method of searching each table. The goal of this change is to allow developers to add or change searches by making small edits to the trackDb.ra files instead of large edits (with much copy-paste-modify code) to hgFind.c.

When you run a "make update" in kent/src/hg/makeDb/trackDb/, the hgFindSpec program is invoked to read search specifications from the trackDb.ra files and to create an hgFindSpec_$USER table in each database. "make alpha" causes an hgFindSpec table to be created in each database. This parallels the generation of trackDb_$USER and trackDb tables from the same trackDb.ra files and the same make targets.

This document assumes familiarity with trackDb and the kent/src/hg/ tree, and describes how to go about adding a new search to a trackDb.ra file. It also contains a bit of reference info about the new table and programs, including a diagnostic program checkHgFindSpec which helps test the searches.

How to create a search

1. Identify the table(s) and field(s) to search.

Usually, you'll have just added a track, and you'll want to make a search on the names or IDs from the track table. So you'll start like this:

searchTable mySpiffyTrack

In more complicated cases, you might have nice names in a non-positional table (e.g. stsAlias) which you would like to make searchable by cross-referencing them to a positional/track table (e.g. stsMap). In that case, you'll need to add a line to identify the cross-referencing table:

xrefTable mySpiffyAlias

If there is already a search defined for your table, but you want to add another, then you will have to make up a different searchName to distinguish the two searches (by default, searchName = searchTable; searchTable does not have to be unique among searches, but searchName does):

searchName mySpiffyTrackSpecial
searchTable mySpiffyTrack


2. What is the type (or query format) of the table(s)?

If the main positional/track table for the search is a common type such as genePred, bed or psl, then you can just declare a searchType and hgFind will know how to write the query for the table:

searchType bed

Some tables/searches already have special search code written for them, so you can name them as the searchType (but you usually won't have to write new descriptions for them since most are in the top-level trackDb.ra):

  • knownGene
  • refGene
  • cytoBand
  • gold
  • mrnaAcc
  • mrnaKeyword
  • sgdGene

If your table is not one of the common or special-code table types, then you'll have to write a SQL query to return the chrom, start, end and name of items matching a given search term like this:

query select myChrom, myChromStart, myChromEnd, myName from %s where myName like '%s'

If you want your query format to end with '%s%%' (so that a search term of "hox" will result in a query on 'hox%'), then add this line to your search spec:

searchMethod prefix

Similarly, if you want to use '%%%s%%' (=> '%hox%'), add this:

searchMethod fuzzy

The default searchMethod is exact ('%s'). If you don't write your own query, then hgFindSpec will use searchMethod to pick an ending for the default query for searchType. If you write your own query and don't have an xrefTable/xrefQuery, then your query must end with a pattern that's consistent with searchMethod.
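The link between searchMethod and the query's ending can be sketched in Python, whose % operator treats %s and %% the same way sprintf does in these cases. This is only an illustration: LIKE_ENDINGS and fill_query are made-up names, not hgFind functions, and the query is the example one from above.

```python
# Illustrative only: mimic how a sprintf-style query format is filled in.
# LIKE_ENDINGS and fill_query are hypothetical names, not part of hgFind.

LIKE_ENDINGS = {
    "exact": "'%s'",      # term must equal the name
    "prefix": "'%s%%'",   # term followed by anything
    "fuzzy": "'%%%s%%'",  # term anywhere in the name
}

def fill_query(query_format, table, term):
    """Substitute the table name and the search term into the two %s slots."""
    return query_format % (table, term)

base = "select myChrom, myChromStart, myChromEnd, myName from %s where myName like "

for method, ending in LIKE_ENDINGS.items():
    # Each %% in the format collapses to a single SQL wildcard %.
    print(method, "=>", fill_query(base + ending, "mySpiffyTrack", "hox"))
```

Running it shows the three LIKE patterns ('hox', 'hox%' and '%hox%') that the three searchMethod values correspond to.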

If you have defined an xrefTable above, then you will definitely have to define an xrefQuery for it. Here's what that will look like:

xrefQuery select trackName, searchName from %s where searchName like '%s'

Use "searchMethod prefix" or "searchMethod fuzzy" if that's what you want for xrefQuery. If you define an xrefQuery, then searchMethod applies to xrefQuery only, not query, and query has to be an exact search.

Here is another example -- the following settings let the user search for a geneReviews item using either gene symbol or disease name:

searchName geneReviews
searchTable geneReviews
searchType bed
searchPriority 50
xrefTable geneReviewsRefGene
xrefQuery select geneSymbol,diseaseName from %s where diseaseName like '%%%s%%'
searchBoth on

In the xrefQuery, diseaseName is the column of the geneReviewsRefGene table that is compared against the search term, and geneSymbol is the ID that is then looked up in the primary table (geneReviews).
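The two-step nature of an xref search can be tried out with a toy SQLite database in Python. This is a rough sketch, not how hgFind is implemented: the rows are invented, xref_search is a hypothetical helper, and the form of the default exact query for a bed table is an assumption.

```python
import sqlite3

# Toy in-memory tables; every row below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table geneReviews (chrom text, chromStart int, chromEnd int, name text)")
conn.execute("create table geneReviewsRefGene (geneSymbol text, diseaseName text)")
conn.execute("insert into geneReviews values ('chr1', 1000, 2000, 'GENE1')")
conn.execute("insert into geneReviewsRefGene values ('GENE1', 'example syndrome type 1')")

# xrefQuery copied from the spec above.
XREF_QUERY = "select geneSymbol,diseaseName from %s where diseaseName like '%%%s%%'"
# Assumed shape of the exact query on the primary (bed) table.
QUERY = "select chrom,chromStart,chromEnd,name from %s where name like '%s'"

def xref_search(term):
    """Step 1: term -> IDs via the xref table. Step 2: IDs -> positions."""
    # Plain string formatting mirrors the sprintf-style specs; a real
    # implementation would need to escape the term first.
    xref_rows = conn.execute(XREF_QUERY % ("geneReviewsRefGene", term)).fetchall()
    hits = []
    for gene_symbol, _disease_name in xref_rows:
        hits.extend(conn.execute(QUERY % ("geneReviews", gene_symbol)).fetchall())
    return hits

print(xref_search("syndrome"))
```

The fuzzy match happens only in the first step; the lookup in the primary table is exact, which is the division of labor described above for searches that have an xrefQuery.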


3. Define a regular expression for search terms

Often, the names that you're searching will follow a pattern, and we can exploit that to save a little time when searching. For example, if you're adding a search for a track where every name starts with "NT_" and then has 6 numbers, then we know that this track should be searched only when the user has typed in a search term that follows the same pattern. If the user types in "HOX", then there is no point in searching through names composed of "NT_" plus 6 numbers. So we add this line to the search spec:

termRegex NT_[0-9]{6}

Ultimately, the pattern we need to describe is that of the user's search terms that should be applied to our table, which is almost always the same as the pattern of the names in the table. (Exceptions: when the user types in a prefix that is not found in the table's names, e.g. "HG-U95:", or when the user omits a suffix that is found in the table's names.)

Human beings are pretty good at recognizing patterns in names. We've even written little languages to describe text patterns as "regular expressions", or regexes, which are easy for computers to parse and then evaluate on arbitrary input (like users' search terms). hgFindSpec's termRegex field uses the regular expression language "regex". If you have used egrep before, you already know regex. If you have used fancy glob commands, you have a good headstart. If you use Perl regexps a lot, you are spoiled but regex will be straightforward enough.

One way to learn "regex" is by example:

  • [NX]M_[0-9]+: An N or an X, then an "M_", then at least one number.
  • [a-z][a-z0-9][0-9]+: A letter (lowercase, but hgFind does case-insensitive regex search), then a letter or a number, then at least one number.
  • (RP|CT|GS)[[:alnum:]]+-[[:alnum:]]+: Either RP, CT, or GS; then some alphanumeric character(s); then a hyphen; then some alphanumeric character(s).
  • [^[:space:]]+: One or more characters, none of which is a space (or tab or newline). This will match single-word queries (but not multi-word because hgFindSpec adds "^" at the beginning and "$" at the end of termRegexes, so that they are forced to match the entire search term).
  • A\.chr.*-.*\.[0-9]+: An A; a . (backslash-escaped because otherwise it would mean "any character"); chr; anything up to a hyphen; anything up to a . and some number(s).
  • (x|y|[1-9][0-9]?)(p|q)[0-9]+(\.[0-9]+)?: An x, y or one- or two-digit number; a p or q; some number(s); an optional .number suffix.
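The effect of the added "^" and "$" and of case-insensitive matching can be tried out in Python. Note that this is only an approximation: Python's re module is not the POSIX regex library, and it does not understand bracket classes such as [[:alnum:]], so the sketch sticks to patterns that mean the same thing in both. term_matches is a made-up helper name.

```python
import re

def term_matches(term_regex, term):
    """Anchor the regex at both ends and match case-insensitively,
    as the text describes for termRegex."""
    anchored = "^" + term_regex + "$"
    return re.search(anchored, term, re.IGNORECASE) is not None

# The term must match as a whole, not just contain a match.
print(term_matches("NT_[0-9]{6}", "NT_012345"))    # six digits after NT_
print(term_matches("NT_[0-9]{6}", "HOX"))          # wrong pattern entirely
print(term_matches("[0-9]+", "abc123"))            # digits present, but not the whole term
print(term_matches("[NX]M_[0-9]+", "nm_000546"))   # lowercase input
print(term_matches(r"(x|y|[1-9][0-9]?)(p|q)[0-9]+(\.[0-9]+)?", "12q24.31"))
```

The third call is the interesting one: an unanchored search would find "123" inside "abc123", but the anchored form rejects the term.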

You can refer to man pages for regex syntax and for the C routines used to parse and execute regex searches:

man 7 regex
man regex

There are also numerous references on regex out there, e.g. http://www.delorie.com/gnu/docs/regex/regex_toc.html... wow, looks like there's even a GUI wizard/coach: http://www.weitz.de/regex-coach/. And you can always ask an old UNIX-head like me for help.

Here's my favorite way to define a termRegex, and make sure that it really covers all the names in a table:

hgsql $db -N -e "select name from $table limit 10"

Eyeball the results and write a regex. Then try out that regex in this command (substitute it in for "__TERMREGEX__"):

hgsql $db -N -e "select name from $table" | egrep -vi '^__TERMREGEX__$' | head

If that returns any results, then your regex needs to be loosened up to incorporate those. Keep on playing with the regex and running that command until it comes back clean - there's your termRegex*!

* In those rare cases mentioned above when the user types in something a little different from what's in the table, use the regex you just derived as the dontCheck setting, so "checkHgFindSpec -checkTermRegex" won't complain. Write a termRegex to match what the user types in. Do some extra testing to make sure that your termRegex encompasses all user search terms that should match.
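If the names are already in a file or a list, the same coverage check can be sketched in Python. The names below are made up and uncovered_names is a hypothetical helper; as with any Python translation, POSIX bracket classes such as [[:alnum:]] would have to be rewritten first.

```python
import re

def uncovered_names(names, term_regex, limit=10):
    """Return up to `limit` names that the anchored regex fails to match
    (the same idea as piping the names through egrep -vi and head)."""
    pattern = re.compile("^" + term_regex + "$", re.IGNORECASE)
    misses = [name for name in names if not pattern.search(name)]
    return misses[:limit]

sample = ["NT_000001", "NT_123456", "NT_1234567", "nt_654321"]  # invented names

print(uncovered_names(sample, "NT_[0-9]{6}"))     # one name has seven digits
print(uncovered_names(sample, "NT_[0-9]{6,7}"))   # loosened regex comes back clean
```

An empty list plays the role of the egrep pipeline "coming back clean".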

4. When is a shortCircuit a good thing?

Some search terms are easily identified as accessions. Returning to our example of "NT_" and 6 numbers, we can be fairly certain that the user wants the NCBI physical map contig for that accession. So if we find a match for such a search term in the ctgPos table, we're done! No need to search other tables, even if their regexes are loose enough to accept the accession (for example, the mrnaKeyword or knownGene searches will take anything, but they also take a long time).

If we have a nice clear-cut case like that, we can make it a shortCircuit search:

shortCircuit 1

... but be extra-sure that terms found there won't have interesting matches anywhere else. For example, the snpMap table contains a bunch of IDs that start with rs and end with one or more digits. But there is a gene rs10, so we don't want to shortCircuit because then the user couldn't search for that gene -- they'd be zapped to the SNP whether they wanted it or not. So we define two searches for snpMap: one that shortCircuits for rs followed by a bunch of numbers (unambiguously a SNP ID), and one that doesn't shortCircuit but searches for rs followed by a small number of numbers.

hgFind performs shortCircuit searches first, stopping if it gets a match. If no shortCircuit search produces a match, then hgFind performs all other (additive/non-shortCircuit) searches.

A slight twist to this mechanism is the semiShortCircuit setting:

semiShortCircuit 1

That allows other shortCircuit or semiShortCircuit searches to be performed even if a match is found for this search, and is for use when we need the speed but are not absolutely sure that this track contains the only correct result for the search term.
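The control flow described above can be summarized with a small Python sketch. It is written from the description in this section, not transcribed from hgFind.c; Spec, find and the lambda lookups are all hypothetical stand-ins, and treating a semiShortCircuit search as part of the short-circuit group is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Spec:
    name: str
    priority: float
    term_regex: str
    lookup: Callable[[str], List[str]]   # stand-in for running the SQL query
    short_circuit: bool = False
    semi_short_circuit: bool = False

def find(term, specs):
    ordered = sorted(specs, key=lambda s: s.priority)
    quick = [s for s in ordered if s.short_circuit or s.semi_short_circuit]
    additive = [s for s in ordered if not (s.short_circuit or s.semi_short_circuit)]

    def run(spec):
        # Only query the table if the term fits the spec's termRegex.
        if re.search("^" + spec.term_regex + "$", term, re.IGNORECASE):
            return spec.lookup(term)
        return []

    results = []
    for spec in quick:
        hits = run(spec)
        if hits:
            results.extend(hits)
            if not spec.semi_short_circuit:
                break                    # plain shortCircuit: stop right here
    if results:
        return results                   # additive searches are skipped
    for spec in additive:
        results.extend(run(spec))
    return results

specs = [
    Spec("contig", 1, "NT_[0-9]{6}", lambda t: ["contig:" + t], short_circuit=True),
    Spec("keyword", 50, ".+", lambda t: ["keyword:" + t]),
]
print(find("NT_012345", specs))   # only the contig search contributes
print(find("hox", specs))         # falls through to the additive search
```

The point of the sketch is the ordering: a hit in the short-circuit group means the slow, accept-anything searches never run.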

5. Decide on a searchPriority

hgFindSpec.searchPriority, like trackDb.priority, is a relative thing. Run this command to see the order in which tables are searched in $db:

checkHgFindSpec $db

Figure out where your search should fit in (this is not nearly as important as whether it's shortCircuit or not! but ask an old-timer if you're having trouble deciding). For additive/non-short-circuit searches, if there are a bunch of matches from various tracks, in what order should those tracks' matches be presented to the user?

Then look at the searchPriorities of the searches between which your search should fit, and pick a (floating-point) number between those two numbers.

searchPriority 42

6. Test!

First do a "make update" so that the hgFindSpec program will process your trackDb.ra definition into the hgFindSpec_$USER table in each applicable database:

cd kent/src/hg/makeDb/trackDb
make update DBS=$db1 ZOO_DBS=
# or if your search applies to more than one db
make update DBS="$db1 $db2" ZOO_DBS=

hgFindSpec can catch some problems with search definitions, such as missing fields or improperly formatted queries or termRegexes.

Next, use the checkHgFindSpec utility to try out an example search and see if there are any incomplete termRegexes.

checkHgFindSpec $db $exampleTerm
checkHgFindSpec $db -checkTermRegex

Then open up a browser window on hgwdev-$USER and try a bunch of examples. If it looks OK, check in your trackDb.ra changes, go to a clean updated tree, and do a "make alpha" in kent/src/hg/makeDb/trackDb/ .

Some special cases

User search terms not exactly the same as names in table

Sometimes the search terms that users type in are not quite the same as the name values in the tables to be searched. For example, for our affy* tracks, we tell users to prefix probe IDs with chip IDs, but the affy* tables contain just probe IDs. So the user may type in "HG-U95:1003_s_at", but the item name in the affyU95 table is just "1003_s_at". To tell hgFind (and "checkHgFindSpec -checkTermRegex") that search terms (and termRegex) have a prefix that does not appear in the table, add a line like this to the trackDb.ra search spec:

termPrefix HG-U95:
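Conceptually, the prefix is part of what the user types but not part of what is looked up in the table. A minimal Python sketch of that idea follows; strip_term_prefix is a made-up name, and the case-sensitive comparison and the choice to leave unprefixed terms untouched are assumptions, not statements about hgFind's behavior.

```python
def strip_term_prefix(term, term_prefix):
    """Drop term_prefix from the user's term, if present, to get the
    value that would be compared against names in the table."""
    if term_prefix and term.startswith(term_prefix):
        return term[len(term_prefix):]
    return term

print(strip_term_prefix("HG-U95:1003_s_at", "HG-U95:"))
```

Here the user's "HG-U95:1003_s_at" is reduced to the probe ID "1003_s_at" that the affyU95 table actually contains.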

For all other cases where user search terms (and therefore termRegex) don't match the actual values in the table, or are a subset of the actual values in the table, add a line like this with a regex that will cover the table values not covered by termRegex, so that "checkHgFindSpec -checkTermRegex" doesn't flag it as an error:

dontCheck [[:alnum:]]+\.[0-9]+

Adding padding

For small features such as STS markers or SNPs, we often want to display the larger genomic context of the requested feature. If that is the case for your search, add a line like this to the trackDb.ra search spec:

padding 5000

That will cause 5000 to be subtracted from the start and added to the end of search results (unless the user has entered multiple search terms separated by ";" in order to get the range between them).
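The arithmetic is simple enough to show in a few lines of Python. pad_range is a hypothetical helper; clamping the start at 0 is an assumption added so the sketch never returns a negative coordinate, and no clamping against the chromosome size is attempted.

```python
def pad_range(start, end, padding, is_multi_term=False):
    """Widen a search result by `padding` bases on each side.

    When the user gave several terms separated by ";" to get the range
    between them, the range is returned unchanged.
    """
    if is_multi_term:
        return start, end
    return max(0, start - padding), end + padding

print(pad_range(12000, 12500, 5000))                      # (7000, 17500)
print(pad_range(1000, 2000, 5000))                        # (0, 7000)
print(pad_range(12000, 12500, 5000, is_multi_term=True))  # (12000, 12500)
```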

xrefTable names may overlap searchTable names

An xrefTable may relate totally distinct types of names/IDs, such as gene names to accessions. However, in the case of alias tables, there may be some names that are found both in the searchTable and in the xrefTable. Undesirable duplicate results can arise when performing separate searches on searchTable alone and on searchTable via xrefTable. If that's the case, the xref search can be made to also search for the term in searchTable (instead of giving up if xrefTable doesn't contain the search term) by adding this to the trackDb.ra search spec:

searchBoth 1
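The fallback can be pictured with two dictionaries standing in for the tables. Everything in this Python sketch is invented for illustration (the marker and alias names, the coordinates, and the xref_then_table helper); it only shows the decision that searchBoth changes.

```python
def xref_then_table(term, xref, table, search_both=False):
    """Look the term up in the xref table first; with search_both set,
    fall back to the term itself when the xref table has no entry."""
    ids = xref.get(term, [])
    if not ids and search_both:
        ids = [term]
    hits = []
    for item_id in ids:
        if item_id in table:
            hits.append((item_id, table[item_id]))
    return hits

# Stand-ins for an alias (xref) table and a positional table.
xref = {"aliasA": ["markerA"]}
table = {"markerA": ("chr1", 100, 200), "markerB": ("chr2", 300, 400)}

print(xref_then_table("aliasA", xref, table))                     # found via the alias
print(xref_then_table("markerB", xref, table))                    # no alias: nothing found
print(xref_then_table("markerB", xref, table, search_both=True))  # found directly
```

With the fallback in place, a single search covers both aliases and primary names, so a separate search on searchTable alone (the source of the duplicates) is not needed.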

Alternate description for table/search in HTML for multiple results

When there are multiple results for a search term, hgFind outputs HTML with the various choices organized by the searches that produced them, with links to the browser. By default, the description of the search is the searchTable name plus its trackDb.longLabel value (if any). If you want something different (e.g. if you want to make it clear that this was an xref search), add a line like this to the trackDb.ra search spec:

searchDescription Alias of STS Marker

hgFindSpec (the table) fields and settings

Both proper fields of hgFindSpec and optional settings appear in trackDb.ra search specs, one per line. Settings are for search parameters that are rarely used, or that were added after hgFindSpec.as was frozen.

Here's the kent/src/hg/lib/hgFindSpec.as description of the fields:

   string searchName;		"Unique name for this search.  Defaults to searchTable if not specified in .ra."
   string searchTable;		"(Non-unique!) Table to be searched.  (Like trackDb.tableName: if split, omit chr*_ prefix.)"
   string searchMethod;	        "Type of search (exact, prefix, fuzzy)."
   string searchType;		"Type of search (bed, genePred, knownGene etc)."
   ubyte shortCircuit;		"If nonzero, and there is a result from this search, jump to the result instead of performing other searches."
   string termRegex;		"Regular expression (see man 7 regex) to eval on search term: if it matches, perform search query."
   string query;		"sprintf format string for SQL query on a given table and value."
   string xrefTable;		"If search is xref, perform xrefQuery on search term, then query with that result."
   string xrefQuery;		"sprintf format string for SQL query on a given (xref) table and value."
   float searchPriority;	"0-1000 - relative order/importance of this search.  0 is top."
   string searchDescription;	"Description of table/search (default: trackDb.{longLabel,tableName})"

Here is a description of currently supported settings:

  • dontCheck: a regex for checkHgFindSpec -checkTermRegex to use in place of termRegex. This is for those cases when the termRegex (for user search terms) does not encompass all items to be searched in searchTable.
  • padding: an integer to pad the results range with, i.e. to subtract from the start and add to the end (unless the user has entered multiple search terms separated by ";" in order to get the range between them).
  • searchBoth: if non-null (present in the trackDb.ra spec), and the spec has an xrefTable/xrefQuery, but the search term is not found in xrefTable, then look for the search term in searchTable too.
  • termPrefix: a string found at the beginning of the user's search term (and termRegex) for this search, but not at the beginning of items in the table.
  • semiShortCircuit: if a match is found, don't halt immediately (like shortCircuit would) but instead allow other shortCircuit or semiShortCircuit searches to be performed before halting.

checkHgFindSpec and how to use it

checkHgFindSpec is a diagnostic program for examining the order of search, testing searches from the command line, and looking for possible problems with search specs. You need a ~/.hg.conf file in order to run it. If ~/.hg.conf specifies trackDb_$USER as your trackDb table, then the hgFindSpec_$USER table will be examined.

  checkHgFindSpec database [options | termToSearch]

If given a termToSearch, displays the list of tables that will be searched
and how long it took to figure that out; then performs the search and the
time it took.
options:
  -showSearches       Show the order in which tables will be searched in
                      general.  [This will be done anyway if no
                      termToSearch or options are specified.]
  -checkTermRegex     For each search spec that includes a regular
                      expression for terms, make sure that all values of
                      the table field to be searched match the regex.  (If
                      not, some of them could be excluded from searches.)
  -checkIndexes       Make sure that an index is defined on each field to
                      be searched.

The most common uses:

checkHgFindSpec $db
checkHgFindSpec $db $searchTerm
checkHgFindSpec $db -checkTermRegex

hgFindSpec (the program)

In general, you won't have to run hgFindSpec on the command line; "make update" or "make alpha" in kent/src/hg/makeDb/trackDb/ will do the right thing. However, hgFindSpec does check for several error conditions in trackDb.ra search specs, which you'll need to know about so you can fix them if you ever come across them:

  • hfsPolish: search %s: termRegex "%s" got regular expression compilation error ... : this implies a regular expression syntax problem in termRegex. Unfortunately the error description from regerror() is not always very clear. In particular it won't always point out simple things like unbalanced parentheses. Try the egrep command trick given above... if it works in egrep, it should work in hgFindSpec.
  • hfsPolish: search %s: query needs to be of the format ... : hgFindSpec is very picky about the format of the query (and xrefQuery). It actually uses a regex to make sure your query/xrefQuery looks just as expected, and is consistent with the searchMethod (exact, prefix, or fuzzy; exact is the default).
  • hfsPolish: search %s: there is an xrefQuery so query needs to end with ... : If your search has an xrefTable/xrefQuery, then the searchMethod applies to your xrefQuery. query must be exact.
  • hfsPolish: search %s: searchMethod is fuzzy so query needs to end with %s. : This type of message is what you'll see if searchMethod is inconsistent with your query's (or xrefQuery's) ending. searchMethod exact => '%s', prefix => '%s%%', fuzzy => '%%%s%%'.
  • hfsPolish: search %s: if searchType is not defined, then query must be defined. : A default query is provided for recognized searchTypes. If a searchType isn't specified, then you have to write a suitable query.
  • hfsPolish: search %s: can't define xrefTable without xrefQuery or vice versa. : yup.
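To get a feel for these rules before running "make update", a few of them can be approximated in Python. This is a loose sketch written from the messages listed above: check_spec and its wording are made up, a spec is represented as a plain dict, only the ending of query is examined, and the real hgFindSpec checks are regex-based and stricter.

```python
ENDINGS = {"exact": "'%s'", "prefix": "'%s%%'", "fuzzy": "'%%%s%%'"}

def check_spec(spec):
    """Return a list of problems found in a search spec (a dict of
    trackDb.ra-style settings). Only a subset of the checks is covered."""
    problems = []
    method = spec.get("searchMethod", "exact")   # exact is the default
    query = spec.get("query")
    has_xref = "xrefTable" in spec or "xrefQuery" in spec

    if method not in ENDINGS:
        problems.append("unknown searchMethod " + method)
        return problems
    if ("xrefTable" in spec) != ("xrefQuery" in spec):
        problems.append("xrefTable and xrefQuery must be defined together")
    if "searchType" not in spec and query is None:
        problems.append("query must be defined when searchType is not")
    if query is not None:
        # With an xref, searchMethod applies to xrefQuery and query is exact.
        wanted = ENDINGS["exact"] if has_xref else ENDINGS[method]
        if not query.rstrip().endswith(wanted):
            problems.append("query needs to end with " + wanted)
    return problems

bad = {
    "searchTable": "mySpiffyTrack",
    "searchMethod": "prefix",
    "query": "select myChrom, myChromStart, myChromEnd, myName from %s where myName like '%s'",
}
print(check_spec(bad))
```

The example spec declares searchMethod prefix but leaves the query ending at '%s', which is the kind of mismatch the "query needs to end with" messages report.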

Here's the hgFindSpec usage, just for completeness:

   hgFindSpec [options] orgDir database hgFindSpec hgFindSpec.sql hgRoot

Options:
   -strict              Add spec to hgFindSpec only if its table(s) exist.
   -raName=trackDb.ra - Specify a file name to use other than trackDb.ra
    for the ra files.