Galileo: Scalable Storage of Time-Series Multidimensional Data




Overview
The Galileo distributed file system is designed for high-throughput storage and retrieval of multidimensional data. The storage system is incrementally scalable and can assimilate new storage nodes as they become available. How data is stored and dispersed affects the efficiency of subsequent retrievals. Galileo's data dispersion scheme places similar data items in network proximity without introducing storage imbalances at individual storage nodes. This significantly reduces the search space for range queries performed on the stored data.
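The dispersion idea can be illustrated with a small two-level placement sketch. This is not Galileo's actual scheme. It assumes that "similar" means "falls in the same coarse latitude/longitude cell", that nodes are organized into groups in network proximity, and that hashing the item identifier is enough to spread load within a group. The names `cell_of`, `place`, and the node labels are hypothetical.

```python
import hashlib


def _stable_hash(text: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def cell_of(lat: float, lon: float, cell_deg: float = 10.0) -> tuple[int, int]:
    """Map a coordinate to a coarse grid cell (assumed similarity key)."""
    return (int(lat // cell_deg), int(lon // cell_deg))


def place(item_id: str, lat: float, lon: float, groups: list[list[str]]) -> str:
    """Choose a storage node for one data item.

    Level 1: items in the same cell map to the same node group, so a
             spatial range query only needs to contact a few groups.
    Level 2: the item id is hashed to pick a node inside the group, so
             no single node in the group absorbs the whole cell.
    """
    group = groups[_stable_hash(repr(cell_of(lat, lon))) % len(groups)]
    return group[_stable_hash(item_id) % len(group)]


# Hypothetical cluster: two groups of three nodes each.
groups = [["n0", "n1", "n2"], ["n3", "n4", "n5"]]
node_a = place("obs-1", 40.5, -105.1, groups)
node_b = place("obs-2", 41.2, -104.9, groups)  # same 10-degree cell as obs-1
```

Both observations fall in the same cell, so they land in the same group, though possibly on different nodes within it.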

These range queries take on different meanings depending on the type associated with the data dimension. Range queries can be used to:
(1) Constrain or expand the geographical scope of the region under consideration.
(2) Specify the chronological range of interest within the time-series data.
(3) Specify the range of numeric values that the dimension can take.
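The three kinds of range constraint listed above can be combined in a single query. The following is a minimal in-memory sketch, not Galileo's query API. The `Observation` and `RangeQuery` types, their field names, and the `temperature` dimension are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    lat: float
    lon: float
    time: datetime
    temperature: float  # example numeric dimension (hypothetical)


@dataclass(frozen=True)
class RangeQuery:
    lat: tuple[float, float]           # (1) geographical scope
    lon: tuple[float, float]
    time: tuple[datetime, datetime]    # (2) chronological range
    temperature: tuple[float, float]   # (3) numeric value range

    def matches(self, obs: Observation) -> bool:
        """True when the observation lies inside every inclusive range."""
        return (
            self.lat[0] <= obs.lat <= self.lat[1]
            and self.lon[0] <= obs.lon <= self.lon[1]
            and self.time[0] <= obs.time <= self.time[1]
            and self.temperature[0] <= obs.temperature <= self.temperature[1]
        )


def run_query(query: RangeQuery, observations: list[Observation]) -> list[Observation]:
    """Linear scan; a real store would prune the search space first."""
    return [obs for obs in observations if query.matches(obs)]


data = [
    Observation(40.5, -105.1, datetime(2012, 6, 1), 24.0),
    Observation(40.6, -105.0, datetime(2012, 12, 1), -3.0),   # outside time range
    Observation(10.0, 20.0, datetime(2012, 6, 15), 30.0),     # outside region
]
query = RangeQuery(
    lat=(35.0, 45.0),
    lon=(-110.0, -100.0),
    time=(datetime(2012, 5, 1), datetime(2012, 8, 31)),
    temperature=(15.0, 35.0),
)
results = run_query(query, data)
```

Widening or narrowing any one tuple expands or constrains the query along that dimension without touching the others.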

The system relies on a metadata graph that provides a dynamic indexing scheme, which helps the system respond to differing load conditions. The system exploits heterogeneity among the available nodes, some of which may have greater disk capacity or I/O throughput than others. Our benchmarks demonstrate the feasibility of building high-throughput data storage from commodity nodes by exploiting such heterogeneity. Query results are streamed back to the requester. The system also incorporates a framework that allows a number of data formats (observational or otherwise) to be read and understood by the system.

To tolerate failures and recover from corruption of specific blocks, the system relies on replication. The replication factor associated with data blocks is configurable.
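A minimal sketch of block replication with a configurable factor follows. It is an illustration under stated assumptions rather than Galileo's implementation: replicas are placed on consecutive nodes starting from a hashed position, corruption is detected with a SHA-256 checksum stored next to each copy, and nodes are modeled as in-memory dictionaries. The `ReplicatedStore` class and its methods are hypothetical.

```python
import hashlib


class ReplicatedStore:
    """Toy block store that keeps several checksummed copies of each block."""

    def __init__(self, node_names: list[str], replication_factor: int = 3):
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication factor must be between 1 and the node count")
        self.names = list(node_names)
        self.nodes = {name: {} for name in self.names}  # node -> {block_id: (data, digest)}
        self.replication_factor = replication_factor

    def _replica_nodes(self, block_id: str) -> list[str]:
        """Pick `replication_factor` distinct nodes, starting at a hashed offset."""
        digest = hashlib.sha256(block_id.encode()).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.names)
        return [self.names[(start + i) % len(self.names)]
                for i in range(self.replication_factor)]

    def put(self, block_id: str, data: bytes) -> None:
        """Write the block and its checksum to every replica node."""
        checksum = hashlib.sha256(data).hexdigest()
        for name in self._replica_nodes(block_id):
            self.nodes[name][block_id] = (data, checksum)

    def get(self, block_id: str) -> bytes:
        """Return the first replica that is present and passes its checksum."""
        for name in self._replica_nodes(block_id):
            entry = self.nodes[name].get(block_id)
            if entry is None:
                continue  # replica missing, e.g. after a node failure
            data, checksum = entry
            if hashlib.sha256(data).hexdigest() == checksum:
                return data  # intact copy found
        raise KeyError(f"no intact replica of {block_id!r}")


store = ReplicatedStore(["n0", "n1", "n2", "n3"], replication_factor=3)
store.put("block-7", b"sensor readings")
```

With a factor of 3, a read still succeeds after one replica is lost and another is corrupted, because the third copy passes its checksum.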