Migration to hive

From genomewiki

Latest revision as of 19:24, 20 January 2010

Our godlike cluster-admins have provided us with a giant and hopefully scalable virtual disk, /hive. We hope to use this superdisk for data build directories and cluster run I/O, replacing /cluster/store* as well as bluearc and san. We expect this to greatly simplify our build processes: all files will be in a logical place under /hive. No more hunting around /cluster/* and following symlinks to the actual storage! No more genome build stuff stored on /san and bluearc for lack of space! No more rsyncing files to and from san and bluearc for cluster runs!

However, in order to enjoy those benefits, we have some work to do:

  • move everything (of value) from /cluster/store*, san and bluearc to their new logical places under /hive.
  • update symlinks in /gbdb, htdocs/goldenPath and /cluster/data to point to the new /hive locations.
  • update all script references to the old fileservers so that they use the new /hive paths.
  • when using old doc/*.txt templates, replace old file paths with new ones; also, don't stage stuff on other disks for cluster runs (except for the cluster nodes' local /scratch disk). Just use /hive.
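
For the script-reference item, a first pass might simply list the files that still mention an old fileserver path. This is only a minimal sketch: the function name list_old_refs is hypothetical, and the pattern is an assumption about how the old paths are spelled (for example, it assumes the san is always referred to as /san/...), so adjust it to match your own scripts.

```shell
#!/bin/bash
# Hypothetical helper: list files under a directory that still mention
# one of the old fileserver paths. The pattern is an assumption; extend
# it to cover whatever old paths your scripts actually use.
list_old_refs() {
    local dir="$1"
    # -r: recurse, -l: print matching file names only, -E: extended regex
    grep -rlE '/cluster/(store[0-9]+|bluearc)|/san/' "$dir"
}
```

Call it with the directory that holds your scripts or doc templates, then edit each listed file by hand; automatic rewriting is risky because the new /hive location depends on what the data is.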

When the migration to /hive is complete, /cluster/store* will disappear, having been completely subsumed by /hive. /cluster/data will stick around a bit longer, but ultimately all uses of /cluster/data will be replaced by their corresponding /hive paths, and /cluster/data will be retired as well.

Who's responsible?

In true genecats style, the migration has been mostly at-will to date. Jorge et al. have sent warning emails to owners of /cluster/store* directories that are not obvious genome database build directories, and owners are expected to move those or lose them (to tape archive). Angie is officially responsible for moving stuff (or not) that belongs to people who no longer work here. Most genome database build directories still reside on /cluster/store*; a few have been moved to /hive, so far without any apparent deleterious effects (except for some broken links that still need to be updated, more on those below).

Suggestion for going forward: each genome build directory will be moved by its owner, after the owner has made a reasonable determination that nobody is actively working in the directory (no cluster runs, track builds, downloads in progress).

Old stuff: on /hive/archive (for now)

Currently, /cluster/storeN is an NFS-mounted version of /hive/archive/storeN. /san/sanvol1 has been backed up as /hive/archive/SanVol1. /cluster/bluearc is gone; /hive/archive/bluearc is a fairly recent backup.

These archive copies will not last forever -- we have a limited time to move the stuff that we want to keep out of /hive/archive/* and into one of the new /hive/ paths. (The deadline is still being negotiated with cluster-admin.) After the cutoff date(s), /hive/archive/* will disappear: its contents will be archived to tape and the disk space freed up.

GPFS vs. NFS, or how to move directories

For years we have been mindful of the difference between local disk (fast) and NFS (slow but bigger). Now we have a new filesystem to consider: /hive uses GPFS, while all /cluster/* paths are still NFS-mounted. For mv to act as a rename (as we expect), the OS must see the source and destination as being on the same filesystem. Although the old /cluster/store* contents have been physically moved to /hive/archive/store*, the OS does not recognize /cluster/store* and /hive as the same filesystem, so mv between them doesn't behave as we'd like! So when you move data out of /hive/archive/* and into /hive/.../, make sure you use /hive paths for both the source and the destination, on a machine with native GPFS support for hive (like hgwdev or swarm), while nobody is modifying anything in the directory:

% df -h /hive
Filesystem            Size  Used Avail Use% Mounted on
/dev/hivedev          160T   59T  102T  37% /hive

% mv /hive/archive/store6/myProject /hive/users/me/

That is just a rename operation, and it's practically instantaneous. (Please do not mv from /cluster to /hive -- this results in a copy instead of a move; it is slow, all file ownership information is lost, and you might not have permissions to move some of the files, resulting in errors and incomplete results.)
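
One way to guard against an accidental cross-filesystem mv is to compare device numbers before moving. The following is a minimal sketch, not an existing genecats tool: the function names same_filesystem and safe_mv are hypothetical, and it assumes bash and GNU coreutils stat (the -c option).

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both paths are on the same
# filesystem, i.e. mv between them would be a rename rather than a copy.
# Returns 2 if either path cannot be examined.
same_filesystem() {
    local src_dev dst_dev
    # No -L for the source: mv renames a symlink itself, not its target.
    src_dev=$(stat -c %d "$1") || return 2
    # -L for the destination: the target directory may be a symlink.
    dst_dev=$(stat -L -c %d "$2") || return 2
    [ "$src_dev" = "$dst_dev" ]
}

# Hypothetical wrapper: refuse to mv when the move would become a copy.
safe_mv() {
    local src="$1" dst_dir="$2"
    if same_filesystem "$src" "$dst_dir"; then
        mv "$src" "$dst_dir"/
    else
        echo "refusing: $src and $dst_dir are on different filesystems" >&2
        return 1
    fi
}
```

With these defined, the move shown above would be written as safe_mv /hive/archive/store6/myProject /hive/users/me, and the same call with a /cluster/store6 source would be expected to refuse rather than start a slow copy.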

New /hive paths

/hive has several subdirectories, to provide some organization for the new namespace:

  • /hive/data, which in turn has a few subdirectories:
    • /hive/data/genomes: genome database build directories, e.g. /hive/data/genomes/sacCer1
    • /hive/data/outside: external database downloads, e.g. /hive/data/outside/ncbi
    • /hive/data/inside: internally-built non-genome databases, e.g. /hive/data/inside/visiGene
  • /hive/users: personal projects, e.g. /hive/users/kent
  • /hive/groups: group projects, e.g. /hive/groups/qa

How to move part 2: after the mv command

The mv from /hive/archive/... to /hive/... takes no time at all, but you may need to do some follow-up.

  1. If you move a genome database build directory $db, update its /cluster/data/$db symlink to point to /hive/data/genomes/$db. This applies to external-db directories too, e.g. /cluster/data/ncbi -> /hive/data/outside/ncbi.
  2. On hgwdev, look for any symlinks to the old location in /gbdb/$db and /usr/local/apache/htdocs/goldenPath/$db, and update if necessary:
     find /gbdb/$db /usr/local/apache/htdocs/goldenPath/$db -type l -ls | grep /cluster/store
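
Once the find command has shown which links are stale, re-pointing them could be scripted along these lines. This is a sketch under assumptions rather than a tested site procedure: the function name repoint_symlinks is hypothetical, it assumes bash with GNU find and readlink, and it assumes the links store absolute targets that begin with the old directory path.

```shell
#!/bin/bash
# Hypothetical helper: re-point every symlink under the given directories
# whose target starts with old_prefix so that it starts with new_prefix.
# Prints each link it changes; other links are left alone.
repoint_symlinks() {
    local old_prefix="$1" new_prefix="$2"
    shift 2
    local link target new_target
    find "$@" -type l -print0 |
    while IFS= read -r -d '' link; do
        target=$(readlink "$link")
        case "$target" in
            "$old_prefix"*)
                new_target="$new_prefix${target#"$old_prefix"}"
                # -n: replace the link itself even if it points at a directory
                ln -sfn "$new_target" "$link"
                echo "$link -> $new_target"
                ;;
        esac
    done
}
```

For a build directory that used to live at, say, /cluster/store6/$db (the store number here is only an example), the call would look like repoint_symlinks /cluster/store6/$db /hive/data/genomes/$db /gbdb/$db /usr/local/apache/htdocs/goldenPath/$db. For a dry run, temporarily comment out the ln line so the function only prints what it would change.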

san and bluearc: what to move?

Short answer: anything that should have been in /cluster/store*, but wasn't because space was tight.

The bluearc is gone due to hardware failure, so any run results that we care about must be moved from /hive/archive/bluearc to /hive/{data,groups,users}.

The san is still up and running, but many of us got into the bad habit of storing large datasets there. cluster-admin threatens (rightly) to remove old stuff from there in order to free up some space for san's intended use: temporary disk for cluster runs. So, arduous as it may be, we really should dig through the old stuff on san and move genome db build pieces into /hive/data/genomes/$db/bed/, where they will be safe.

And the san will not live forever... but the plan is that the hive will.  :)