UdcFuse: Difference between revisions

(FUSE doesn't do seek... bailing. :()
(unbailing - Mark found that kernel handles seek by passing offset to read function.)
Revision as of 20:59, 15 October 2009

The udc (Url Data Cache - kent/src/lib/udc.c) module is the URL random access and sparse-file caching mechanism underlying the bigBed and bigWig custom track implementation. Each bigBed/bigWig custom track's track line includes the bigDataUrl parameter, which is set to the URL of the user's bigBed/bigWig file, e.g. "track name=myBB type=bigBed bigDataUrl=http://my.edu/myBigBed.bb".

Similar to bigBed/bigWig, the BAM alignment format (the binary compressed flavor of SAM) is indexed for random access, which makes it suitable for track display. The samtools-C library includes code to do HTTP and FTP random access using the BAM index, so it is easy to implement basic custom track support by simply passing the bigDataUrl to samtools-C access functions.

However, samtools-C lacks SSL (https/ftps) and caching. For each access, the entire index file is downloaded to a file in the current directory. SSL is a valuable feature for users who want to display unpublished data (one alpha-tester is waiting for SSL support), and the lack of caching slows down the genome browser display (a constant 4-second track load time for a 1000 Genomes test BAM file whose index file is smaller than average).

MarkD made the most excellent suggestion to place udc underneath the file handles used by samtools-C, as a Filesystem in Userspace (FUSE) module. FUSE provides an efficient kernel interface for userspace code to implement a fully functional file system. udcFuse is a userspace program built on FUSE that mounts a filesystem that is actually a wrapper around udc functionality. Paths within the udcFuse-mounted filesystem can be passed to samtools, which will treat them as local files. File system accesses (from samtools-C, ls, cd, etc.) to the udcFuse-mounted filesystem will result in calls to udcFuse FUSE method bindings, which will call udc methods. The udcFuse filesystem will be read-only, and will simply reflect the state of udc's local cache of the files.

udcFuse filesystem structure

A udcFuse filesystem is created by executing the udcFuse program:

  udcFuse mountPoint [udcCacheDir]

mountPoint is an empty directory with permissions that don't exclude the user running udcFuse. In practice I expect this to be $TMPDIR/udcFuse. Actual files are stored in udc's local cache directory, so I expect little or no space to be taken up by mountPoint.

The optional udcCacheDir specifies a non-default location for udc's local cache directory. (The default, defined in udc.c, is /tmp/udcCache.)
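As a rough illustration of the command line above, here is a minimal sketch of the argument handling. The struct and function names (udcfsArgs, udcfsParseArgs) are made up for this example, and the real program would also have to hand the mount point to FUSE, which is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default cache location named in the text (defined in udc.c). */
#define UDCFS_DEFAULT_CACHE_DIR "/tmp/udcCache"

/* Hypothetical holder for the two command-line values. */
struct udcfsArgs
{
    const char *mountPoint;  /* required: empty directory to mount on */
    const char *cacheDir;    /* optional: udc local cache directory */
};

/* Parse "udcFuse mountPoint [udcCacheDir]".
 * Returns 0 on success, -1 if the argument count is wrong. */
int udcfsParseArgs(int argc, char *argv[], struct udcfsArgs *out)
{
    assert(out != NULL);
    if (argc < 2 || argc > 3)
        return -1;
    out->mountPoint = argv[1];
    out->cacheDir = (argc == 3) ? argv[2] : UDCFS_DEFAULT_CACHE_DIR;
    return 0;
}
```

Called with only a mount point, the sketch falls back to /tmp/udcCache; a second argument overrides it.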

The directory structure beneath mountPoint will mirror the directory structure of udc's local cache directory:

  mountPoint/urlProtocol/urlHost/urlRestOfPath

samtools-C will be passed filenames underneath mountPoint. For example, if a custom track's bigDataUrl is http://my.edu/myBam.bam, samtools-C will be passed mountPoint/http/my.edu/myBam.bam as if it were a local file that already existed. When samtools-C opens a file handle on that path, udcFuse code will reconstruct the URL, open a udcFile object on the URL, and store the udcFile for later use when samtools-C wants to seek and read.
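The URL reconstruction could look roughly like the sketch below. FUSE hands callbacks a path relative to the mount point (starting with "/"), so the input here is something like /http/my.edu/myBam.bam. The function name udcfsPathToUrl is hypothetical, and the sketch assumes the plain protocol/host/rest layout described above, ignoring any escaping udc may apply to its cache paths.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of udc or FUSE): rebuild a URL from a path
 * relative to the mount point, e.g. "/http/my.edu/myBam.bam".
 * Returns 0 on success, -1 if the path lacks protocol/host/rest components
 * or the URL does not fit in urlSize bytes. */
int udcfsPathToUrl(const char *path, char *url, size_t urlSize)
{
    assert(path != NULL && url != NULL);
    if (path[0] != '/')
        return -1;
    const char *protocol = path + 1;
    const char *slash = strchr(protocol, '/');
    if (slash == NULL || slash == protocol)
        return -1;                      /* missing or empty protocol */
    const char *hostAndRest = slash + 1;
    const char *hostEnd = strchr(hostAndRest, '/');
    if (hostEnd == NULL || hostEnd == hostAndRest || hostEnd[1] == '\0')
        return -1;                      /* need a host and a non-empty rest of path */
    int protocolLen = (int)(slash - protocol);
    /* Put "://" back between the protocol and the host. */
    int n = snprintf(url, urlSize, "%.*s://%s", protocolLen, protocol, hostAndRest);
    return (n < 0 || (size_t)n >= urlSize) ? -1 : 0;
}
```

Paths that are only a protocol or host directory (for example /http) are rejected, since they correspond to cache directories rather than to a remote file.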

File system accesses will be mapped onto udc methods where possible, and mapped onto accesses to the corresponding udc cache directories and files otherwise.

FUSE methods

udcfs_getattr: called before almost every other function (see http://sourceforge.net/apps/mediawiki/fuse/index.php?title=FuseInvariants). If the corresponding path doesn't exist yet, create it; then probably just return a read-only stat() of the corresponding path.

udcfs_open: if the file is being opened read-only ("r"), call udcFileMayOpen, store the returned udcFile, and return 0 on success or a negative errno value (-EWHATEVER) on failure.

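udcfs_read: look up the udcFile for the given path, udcSeek to the specified offset, then call udcRead. (FUSE has no separate seek operation; the kernel handles seek by passing the offset to the read function.)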
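The seek-then-read pattern in udcfs_read can be sketched as below. To keep the example self-contained, an in-memory buffer (fakeFile, fakeSeek, fakeRead, all invented here) stands in for the udcFile and for udcSeek/udcRead; the real calls fetch and cache remote data. The wrapper takes the buffer, size, and offset that a FUSE read callback receives, minus the path and file-info arguments.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Stand-in for an open udcFile: a memory buffer plus a current position. */
struct fakeFile
{
    const char *data;
    size_t size;
    size_t pos;
};

/* Stand-in for udcSeek: just remember the new position. */
static void fakeSeek(struct fakeFile *f, size_t offset)
{
    f->pos = offset;
}

/* Stand-in for udcRead: copy up to size bytes from the current position,
 * advance the position, and return the number of bytes copied. */
static size_t fakeRead(struct fakeFile *f, void *buf, size_t size)
{
    if (f->pos >= f->size)
        return 0;
    size_t avail = f->size - f->pos;
    size_t n = (size < avail) ? size : avail;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return n;
}

/* Read handler body: every read carries its own offset, so seek first,
 * then read. Returns the byte count, or a negative errno value. */
int udcfsReadAt(struct fakeFile *f, char *buf, size_t size, off_t offset)
{
    assert(f != NULL && buf != NULL);
    if (offset < 0)
        return -EINVAL;
    fakeSeek(f, (size_t)offset);
    return (int)fakeRead(f, buf, size);
}
```

Because each call seeks explicitly, the handler does not depend on where the previous read left off.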

udcfs_release: call udcFileClose, clean up internal state

udcfs_init: create internal data structures (hash of paths to open udcFile pointers)

udcfs_destroy: free stuff allocated during init. (leave udcCleanup for udc!)
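The internal state shared by udcfs_init, udcfs_open, udcfs_read, udcfs_release and udcfs_destroy is a map from path to open file pointer. Below is a minimal sketch of such a table with chained buckets. All names are hypothetical, and the stored value is a void pointer standing in for a udcFile pointer; the kent library has its own hash module that would likely be used instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define UDCFS_BUCKETS 64

struct openEntry
{
    char *path;               /* private copy of the path */
    void *file;               /* stand-in for an open udcFile pointer */
    struct openEntry *next;
};

struct openTable
{
    struct openEntry *buckets[UDCFS_BUCKETS];
};

static unsigned hashPath(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % UDCFS_BUCKETS;
}

/* init: allocate an empty table (NULL on allocation failure). */
struct openTable *openTableNew(void)
{
    return calloc(1, sizeof(struct openTable));
}

/* open: remember the file for this path. Returns 0, or -1 if out of memory. */
int openTableAdd(struct openTable *t, const char *path, void *file)
{
    assert(t != NULL && path != NULL);
    struct openEntry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->path = malloc(strlen(path) + 1);
    if (e->path == NULL)
        {
        free(e);
        return -1;
        }
    strcpy(e->path, path);
    e->file = file;
    unsigned h = hashPath(path);
    e->next = t->buckets[h];
    t->buckets[h] = e;
    return 0;
}

/* read: look up the file stored for this path, or NULL if none. */
void *openTableFind(struct openTable *t, const char *path)
{
    assert(t != NULL && path != NULL);
    struct openEntry *e;
    for (e = t->buckets[hashPath(path)]; e != NULL; e = e->next)
        if (strcmp(e->path, path) == 0)
            return e->file;
    return NULL;
}

/* release: drop the entry and hand back the file so the caller can close it. */
void *openTableRemove(struct openTable *t, const char *path)
{
    assert(t != NULL && path != NULL);
    struct openEntry **pe;
    for (pe = &t->buckets[hashPath(path)]; *pe != NULL; pe = &(*pe)->next)
        if (strcmp((*pe)->path, path) == 0)
            {
            struct openEntry *e = *pe;
            void *file = e->file;
            *pe = e->next;
            free(e->path);
            free(e);
            return file;
            }
    return NULL;
}

/* destroy: free the table and its entries (does not close the files). */
void openTableFree(struct openTable *t)
{
    if (t == NULL)
        return;
    for (int i = 0; i < UDCFS_BUCKETS; i++)
        while (t->buckets[i] != NULL)
            {
            struct openEntry *e = t->buckets[i];
            t->buckets[i] = e->next;
            free(e->path);
            free(e);
            }
    free(t);
}
```

One thing to weigh: keying by path means two simultaneous opens of the same path would share or shadow an entry. FUSE's struct fuse_file_info also has an fh field that open can fill with a per-open handle, which might make the path lookup unnecessary.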