Cloud-storage providers and byte-range requests of UCSC big* files

This page explains why cloud-backup providers are not a good choice for storing UCSC track hubs or bigBed/bigWig/BAM custom tracks. I'm not 100% sure of everything below, as I'm not an expert on storage either; feel free to check or correct.

Commercial data centers are pretty big (see https://www.youtube.com/watch?v=XZmGGAbHqa0); they use thousands of servers.

Cloud storage providers store data across these thousands of servers using distributed blob storage. Most of these systems are proprietary, like Amazon S3, Azure Storage or Google Cloud Storage. A famous open-source version of such a system is Ceph, developed by Sage Weil for his PhD at UCSC in 2007, I believe in Eng2 on floor 4: https://en.wikipedia.org/wiki/Ceph_(software) Sage is now rich and works for Red Hat. Ceph is, I believe, one of the storage options in the iRODS system, which is used by the academic project CyVerse; CyVerse uses not a single data center but spreads its data servers across various US campuses.

You can see that distributed storage systems use thousands of cheap servers in a data center, which is quite different from a single webserver like our RR or hgwdev. Each server is connected to many cheap-ish spinning disks. Each incoming file is split into chunks, typically 64 MB (this number can vary, e.g. 8 MB or 1 MB). Each server stores a certain number of chunks. Each file is stored multiple times, typically three times, to compensate for failing hard disks. Humans on roller skates (?) replace broken hard disks. After a replacement, the software restores the chunks from the other two copies. I believe that many storage servers have two network connections, one for data balancing/replication and one for serving data to the outside world. Storage servers are "dumb", in the sense that they only store pieces of data and do nothing else: other servers send them the identifier of a piece and they send back the piece.
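To make the chunking concrete, here is a small Python sketch of how a byte range of a file maps onto 64 MB chunks. The chunk size, function name and field names are invented for illustration; real systems differ.

 # Hypothetical sketch: which chunks of a file does a byte range touch?
 CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB; real systems use various sizes
 
 def chunks_for_range(start, end, chunk_size=CHUNK_SIZE):
     """Return the chunk indices needed to serve bytes [start, end] of a file."""
     first = start // chunk_size
     last = end // chunk_size
     return {
         "chunk_indices": list(range(first, last + 1)),
         "offset_in_first_chunk": start % chunk_size,
         "bytes_needed": end - start + 1,
     }
 
 # A 4 KB index read near the 1 GB mark of a big* file usually touches one chunk:
 print(chunks_for_range(1_000_000_000, 1_000_004_095))

A small index read from a big* file therefore usually touches only one chunk on one storage server, but the system still has to find out which one.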

A second type of server, the metadata servers, store the file names and record on which servers the chunks are stored. The metadata servers are the weak spot of the system, so there are many more than three copies of them, and they probably use SSDs or RAM, not spinning disks. The metadata servers store the file names, the access rights, where the chunks are stored, the owner, the file size and a lot of other metadata.
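As a toy illustration, a metadata record could look roughly like the following; the field names, file path and server names are invented, not taken from any real system:

 # Invented example of a metadata record: name, ownership, and for every
 # chunk the (typically three) storage servers holding a replica.
 metadata = {
     "hubs/hg38/myTrack.bb": {
         "size": 3_221_225_472,
         "owner": "user42",
         "access": ["user42"],
         "chunks": [
             {"id": "c0001", "replicas": ["store-017", "store-233", "store-871"]},
             {"id": "c0002", "replicas": ["store-044", "store-310", "store-902"]},
             # ...one entry per chunk
         ],
     }
 }
 
 def locate(path, chunk_index):
     """Answer the question the dumb storage servers cannot: where is this chunk?"""
     return metadata[path]["chunks"][chunk_index]["replicas"]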

There are also internet-facing servers; they act as the gateway between the internal servers and the outside internet.

When a request for a file comes in from a client on the internet, the gateway asks the metadata servers where the file's chunks are stored. The metadata servers try to pick storage servers that are close to the client, are not used too much and have a working hard disk. The metadata servers also count usage and block the request if a file has been requested too often (it may be an illegal video, and this is a backup provider, not a webspace provider). This is also a cost issue, as the cloud provider has to pay per TB transferred.
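Purely illustrative gateway-side logic in Python, showing the two decisions described above: block files that are fetched too often, then pick a healthy, lightly loaded replica. The rate limit, time window and helper callables are all invented.

 import time
 from collections import defaultdict, deque
 
 MAX_REQUESTS_PER_HOUR = 1000          # invented threshold
 recent_requests = defaultdict(deque)  # file path -> timestamps of recent hits
 
 def pick_server(path, replicas, load, healthy):
     """replicas: server names; load(name) -> current load; healthy(name) -> bool."""
     now = time.time()
     hits = recent_requests[path]
     while hits and now - hits[0] > 3600:
         hits.popleft()                # forget requests older than one hour
     if len(hits) >= MAX_REQUESTS_PER_HOUR:
         return None                   # too popular: blocked, outbound traffic costs money
     hits.append(now)
     candidates = [s for s in replicas if healthy(s)]
     if not candidates:
         return None                   # every replica sits on a broken or rebuilding disk
     return min(candidates, key=load)  # least-loaded working replica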

As an additional layer of security, backup cloud storage providers make sure that the client is not an internet browser trying to stream a cat video or MP3. So instead of sending the file as it is, they reply with an HTTP redirect, or they show an HTML page where the user must do something to get the file. All links to the file include a secure token that is only valid for a few minutes, and they encode the location of the storage servers, so these links are typically very long.
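A generic sketch of such a time-limited signed link, roughly in the spirit of pre-signed URLs; the secret, host name, file path and parameter names below are all made up:

 import hashlib, hmac, time
 from urllib.parse import urlencode
 
 SECRET = b"provider-internal-secret"   # invented; known only to the provider
 
 def signed_url(path, storage_host, ttl_seconds=300):
     """Build a link that encodes the storage server and expires after a few minutes."""
     expires = int(time.time()) + ttl_seconds
     msg = f"{path}:{storage_host}:{expires}".encode()
     token = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
     query = urlencode({"expires": expires, "host": storage_host, "token": token})
     return f"https://{storage_host}/{path}?{query}"
 
 # The result is a long URL that stops working after ~5 minutes:
 print(signed_url("hubs/hg38/myTrack.bb", "store-017.example.com"))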

Backup cloud providers have no interest in supporting video or audio streaming, so most do not support byte-range requests. Those that do support them only for paid accounts, fulfill them relatively slowly, or have added the feature only recently and only for some customers (e.g. box.com).
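A quick way to probe this (assuming a plain, unauthenticated HTTP/HTTPS URL) is to ask for the first 100 bytes and check whether the server answers 206 Partial Content; a backup provider typically answers 200 with the whole file, or redirects to an HTML page instead:

 import urllib.request
 
 def supports_byte_ranges(url):
     """Return True if the server honors a Range header with 206 Partial Content."""
     req = urllib.request.Request(url, headers={"Range": "bytes=0-99"})
     with urllib.request.urlopen(req) as resp:
         return resp.status == 206
 
 # Example against a file on a normal webserver:
 # print(supports_byte_ranges("https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes"))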

Overall, one can see why backup storage providers cannot be as fast as a normal webserver. Every request has to go through at least one or two redirection layers to find the right storage server, do the authentication, look up whether the data is cached somewhere, etc. Chunks have to be found across servers and put together. The system is intentionally built not for speed but for low cost of storage, and it uses redirects and tokens to protect against abuse.

We could make our UDC code work with most of these systems if we tolerated a single redirect, which we currently don't. Making our UDC follow redirects would, however, impact the performance of all of hgTracks, as currently the slowest track holds up the whole display. Also, we might get blocked by some of these providers if we issue many requests to them.
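To see how a given hosting URL behaves, the following sketch (not UDC code; the class and function names are invented) checks whether the very first response is a redirect, which the UDC layer currently does not follow:

 import urllib.error
 import urllib.request
 
 class NoRedirect(urllib.request.HTTPRedirectHandler):
     def redirect_request(self, req, fp, code, msg, headers, newurl):
         return None  # refuse to follow, so we get to see the 3xx status
 
 def first_response_status(url):
     """Return the status code of the first response, without following redirects."""
     opener = urllib.request.build_opener(NoRedirect)
     req = urllib.request.Request(url, method="HEAD")
     try:
         with opener.open(req) as resp:
             return resp.status
     except urllib.error.HTTPError as e:
         return e.code  # 301/302/303/307 here means the hub/bigWig load would fail today
 
 # print(first_response_status("https://example.com/some/shared/file.bigWig"))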