AutoSql

From genomewiki
Jump to navigationJump to search

Jim Kent wrote autoSql and has documented it in the source (kent/src/hg/autoSql/autoSql.doc) and in a Linux Journal article.

Our typical usage of autoSql is this:

  • write a .as file that describes a new (type of) table. This almost always goes in kent/src/hg/lib/ but .as files exist in other parts of the tree. Exceptions are becoming more common as new non-genomic, self-contained databases (e.g. uniProt, visiGene) are added.
  • run autoSql on the .as file to produce a code module with the same root name as the .as file. This results in the generation of 3 files with .h, .c and .sql suffixes.
  • Move the .h file into the appropriate inc/ directory.
  • Either add custom code to the end of the .c file, or add a new ...Custom.c file.
  • Edit the .sql file to add the desired INDEXes and (for positional tables) a bin field. (We do not include the bin field in the .as because we do not need it in the C structure -- bin is used only for querying.)

The .as files used to be ignored after running autoSql -- however, we now use the .as files to create the tableDescriptions table in each database which is used by the Table Browser's "describe table schema" page to display descriptions of table fields.

The source code also contains sqlToAs in case you have mysql tables but need autoSql files.

The longer story

Here is a longer, more narrative description about autoSql files in the kent/src tree, taken from an email from Angie to browser-qa 8/29/2007:

Jim wrote autoSql to generate both SQL and C code from the same spec, so the code would be guaranteed to match the table and then we wouldn't have to write a bunch of repetitive code for reading and writing data for each new table.

autoSql reads a specification from a .as file (in the autoSql format, defined in the plaintext file kent/src/hg/autoSql/autoSql.doc). It generates .sql, .c and .h files in the same directory. We usually put .as files in hg/lib, leave the .c and .sql in hg/lib, and move the .h to hg/inc; or we may simply remove the .c and .h if only the .sql is necessary. If we keep the generated .c file, its .o target should be added to hg/lib/makefile. .as files may be found in other parts of the kent/src tree, especially if they are not for libified Genome Browser database tables. Initially, .as files served only as the original spec for code generation.

We almost always manually add index statements to the generated .sql. Rarely, some custom code is added to the end of the .c/.h. The normal usage is to write .as once, run autoSql to generate .sql/.c/.h only once, and then add indexes to .sql. Sometimes before release of a new track, a developer will realize they need to change the table, so they'll tweak .as, re-run autoSql and then have to re-tweak the .sql.

[Opinionated interlude aimed at developers: Making a new .as file is not necessary when there is already one that has the exact same fields. Sometimes, it is useful to make a new .as to generate .sql for a new type of table, but especially in the case of non-positional tables or bed with one extra field, the .c/.h are not necessary. hg/lib has a *huge* number of files already and that's one of my low-priority pet peeves, when new files are checked in unnecessarily.]

I've seen some cases where a .as file was written for a new type of table, used to generate .sql, and then the .sql was coded into a loading program for that table type. In some of those cases, the .as and/or .sql was never checked in (only the loading program).

When the bin column was added, that introduced a new twist. Usually, the bin column is not included in the .as(.c,.h), and is manually added to the .sql. Because it's manually added, you'll sometimes see slightly different types used (usually bin is a short, but sometimes people use int out of habit). hgLoadBed/hgLoadPsl take care of adding the bin column when loading the table. When reading from a table with a bin column, the code has lots of tweaks here and there to ignore the first column, and populate the C struct for a row starting at the second column from "select *".

Then along came the tableDescriptions table, and suddenly the .as files had some new requirements. Previously, the .as file became vestigial after autoSql was run, so there could be some drift between the .as and the implementation without any ill effects. However, with tableDescriptions, now it's important that some checked-in .as accurately describes each table (ignoring the bin column).

Another twist: .sql may be tweaked to use types not yet supported by autoSql. For example, the set and enum sql types were not supported by autoSql but were edited into .sql, so the .as could not exactly match the .sql -- until MarkD added support for those types. So this is not an issue currently, but could possibly happen again in the future if someone hacks the .sql to use some exotic type that we haven't used so far. Putting an unsupported type in the .as isn't a workaround because the Table Browser uses autoSql lib code to parse the .as contents retrieved from tableDescriptions, and it will errAbort if it encounters an unsupported type.

The tableDescriptions table is used only by the 'describe table schema' links used in the Table Browser and by hgTrackUi.

For genome databases, where we frequently add new tables, the buildTableDescriptions.pl script should be run nightly to keep the tableDescriptions table up to date. I have a cron job to do this on hgwdev, and qateam (Ann) has a cron job on hgwbeta. The script slurps in all .as files in the kent/src/ tree and tries to match each table in the database to the appropriate .as file using table name, table type, and field names and types. It runs on all databases whose names look like genome database names, skipping others (so sometimes we have to tweak the pattern for "others"). If it finds multiple .as files that describe the same table name, it will die, and someone will have to figure out which .as should be used. The duplicate .as should be either edited to rename its table, removed from the source tree, or added to buildTableDescriptions.pl's internal list of .as files to ignore.

For highly relational (usually external) databases that are loaded once and then pretty much left alone, Jim's makeTableDescriptions program should be run once, just after the database has been loaded.


Navigation: back to Implementation_Notes