Monday, January 4, 2010

Open research questions in IS4

I'm working on building a sensor storage system with features that will make it easy to integrate multiple streams of data into a database for long-term storage, real-time streaming, and perhaps real-time trending/search. The name of the system is the Integrated Sensor-stream Storage System, or IS4, and there is more information here.

Long-term storage is trivial. The challenge or contribution in this area is really the protocols and associated schemas for integrating data streams from a heterogeneous set of sensors. There are also challenges that have been directly addressed in the literature with respect to the integration of data that does not have a unifying condition. In other words, there's no column whose values match across sources, so there is nothing to run a join on. It's important to note that in this work, we assume that the data is all timeseries data. Furthermore, the associated metadata is also timeseries data. Therefore we can use aggregation, windowing, extrapolation, and interpolation techniques to create views (temporary and materialized).

From the ActionWebs meeting, the main take-away related to my work is the idea of the relationship between models and data. Up to this point our mindset has been on applying a model to the data, for example, running linear interpolation in time and surface-fitting in space to generate a 2D temperature-distribution map overlaid on a map of a space. But this kind of interpolation has major flaws. Especially when dealing with temperature data, linear interpolation has no bounding constraints. There is no notion of heat transfer, no notion of space separation, no physical model that can be fed the collected data so that observations are based on physical reality rather than pure mathematical modeling. Really, the model should set the context for the data and the data should then inform the model. Adjustments to the model should be made according to the data.

On the data model...

I am reluctant to use the word column or view since it implies a relational data model. The data model is object-oriented. We can, however, map data objects to tables. The concept of a column is an ordered vector of object attribute values for the same attribute label. The row corresponds to an object of attribute-value pairs. Views are comprised of a mapping of attributes in column form. Since all objects are timestamped, we can use a model (physical, mathematical, or otherwise) to fill in any missing values so that a merge/join can be performed.
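As a rough illustration of the merge/join idea, here is a minimal sketch in Python. It assumes each stream is a time-sorted list of (timestamp, value) pairs and uses plain linear interpolation as a stand-in for whatever model is chosen. The function names interpolate and time_join are made up for this example and are not part of IS4.

```python
from bisect import bisect_left

def interpolate(series, t):
    """Linearly interpolate a value at time t from a time-sorted list of (timestamp, value)."""
    times = [ts for ts, _ in series]
    i = bisect_left(times, t)
    if i < len(series) and times[i] == t:
        return series[i][1]          # an exact sample exists at t
    if i == 0 or i == len(series):
        return None                  # t is outside the observed range: do not extrapolate
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def time_join(left, right):
    """Join two series on the left series' timestamps, filling in right-hand values with the model."""
    rows = []
    for t, left_value in left:
        right_value = interpolate(right, t)
        if right_value is not None:
            rows.append((t, left_value, right_value))
    return rows

# Two streams sampled at different times, so no timestamps match exactly.
temperature = [(0, 20.0), (10, 22.0), (20, 21.0)]
humidity = [(5, 40.0), (15, 50.0), (25, 45.0)]
joined = time_join(temperature, humidity)
print(joined)  # [(10, 22.0, 45.0), (20, 21.0, 47.5)]
```

The row at time 0 is dropped because the humidity stream has no samples around it. A physical model could replace interpolate without changing time_join.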

Open challenges:

1. Real-time streaming queries
2. Actuation
3. Staged pub/sub system
4. Alerts and modeling
5. Query language

Real-time streaming queries
This problem has been addressed in the literature. Windowing is one of the most common techniques.
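For reference, a time-based sliding window can be sketched in a few lines of Python. This is only an illustration of the general technique, not IS4 code. The class name SlidingWindowMean and the choice of a running mean as the aggregate are assumptions made for the example.

```python
from collections import deque

class SlidingWindowMean:
    """Running mean over the samples whose timestamps fall in (now - width, now]."""

    def __init__(self, width):
        self.width = width
        self.samples = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def push(self, timestamp, value):
        """Add a sample, evict expired ones, and return the current window mean."""
        self.samples.append((timestamp, value))
        self.total += value
        # Drop samples that have aged out of the window.
        while self.samples[0][0] <= timestamp - self.width:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.total / len(self.samples)

window = SlidingWindowMean(width=10)
means = [window.push(t, v) for t, v in [(0, 1.0), (5, 3.0), (12, 5.0)]]
print(means)  # [1.0, 2.0, 4.0]
```

The sample at time 0 has expired by the time the sample at time 12 arrives, so the last mean covers only the two most recent values.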

Actuation
This is a problem in sensor networks that has not been addressed. Most papers mention the need for actuation, but very few actually do it.

Staged pub/sub system
Client systems of IS4 will have the ability to subscribe to various stages of the incoming data and processing stream. This implies the need to paint stream flows and forward those flows to all subscribers.
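One way this could look, as a minimal sketch: the pipeline below is an ordered list of named processing stages, and a subscriber attaches to any stage (or to the raw input) and receives every record that leaves it. Everything here is hypothetical, including the StagedPipeline class, the "raw" stage name, and the reading of "painting" a flow as tagging each forwarded record with its stage name.

```python
from collections import defaultdict

class StagedPipeline:
    """Runs records through named stages and forwards each stage's output to its subscribers."""

    def __init__(self, stages):
        self.stages = stages                   # ordered list of (stage_name, function)
        self.subscribers = defaultdict(list)   # stage_name -> list of callbacks

    def subscribe(self, stage_name, callback):
        known = ["raw"] + [name for name, _ in self.stages]
        if stage_name not in known:
            raise KeyError(stage_name)
        self.subscribers[stage_name].append(callback)

    def publish(self, record):
        """Push one record through every stage, forwarding after each one."""
        self._forward("raw", record)
        for name, function in self.stages:
            record = function(record)
            self._forward(name, record)
        return record

    def _forward(self, stage_name, record):
        # The stage name travels with the record so subscribers know which flow it came from.
        for callback in self.subscribers[stage_name]:
            callback(stage_name, record)

pipeline = StagedPipeline([
    ("celsius", lambda r: {**r, "value": (r["value"] - 32) * 5 / 9}),
    ("rounded", lambda r: {**r, "value": round(r["value"], 1)}),
])
seen = []
pipeline.subscribe("raw", lambda stage, r: seen.append((stage, r["value"])))
pipeline.subscribe("rounded", lambda stage, r: seen.append((stage, r["value"])))
pipeline.publish({"sensor": "t1", "time": 0, "value": 212.0})
print(seen)  # [('raw', 212.0), ('rounded', 100.0)]
```

A real system would forward over the network rather than call in-process callbacks, but the subscription-per-stage structure would be the same.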

Alerts and modeling
Models are applied to incoming data, and the data in turn informs adjustments to the model. Deviations of the data from the model beyond pre-defined thresholds should trigger either alerts to client processes or online model adjustment.
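A minimal sketch of the threshold idea follows. It assumes two thresholds: a small deviation is ignored, a moderate one nudges the model toward the data, and a large one raises an alert. The ConstantModel class is a deliberately trivial stand-in for a real physical model, and the two-threshold policy, the names, and the learning rate are assumptions made for illustration.

```python
class ConstantModel:
    """Stand-in model that predicts one expected value regardless of time."""

    def __init__(self, expected, learning_rate=0.5):
        self.expected = expected
        self.learning_rate = learning_rate

    def predict(self, timestamp):
        return self.expected

    def adjust(self, observed):
        # Move the expectation part of the way toward the observation.
        self.expected += self.learning_rate * (observed - self.expected)

def check(model, timestamp, observed, adjust_threshold, alert_threshold):
    """Compare an observation with the model and decide what to do about the deviation."""
    deviation = abs(observed - model.predict(timestamp))
    if deviation > alert_threshold:
        return "alert"           # too far off: notify client processes
    if deviation > adjust_threshold:
        model.adjust(observed)   # moderately off: let the data inform the model
        return "adjusted"
    return "ok"

model = ConstantModel(expected=20.0)
readings = [(0, 20.5), (1, 22.0), (2, 30.0)]
results = [check(model, t, v, adjust_threshold=1.0, alert_threshold=5.0)
           for t, v in readings]
print(results)         # ['ok', 'adjusted', 'alert']
print(model.expected)  # 21.0
```

Whether a large deviation should alert, adjust, or both is a policy decision. Here the alert case leaves the model untouched so that one bad reading does not distort it.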

Query language
How do we query the data? If end-clients are human users rather than processes, should we provide a SQL-like language that lets users express what they would like from the data without having to express how to get it?

[more later]