Continuous Bulk Processing

Government, medical, financial, and web-based services increasingly depend on the ability to rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows (e.g., weekly web crawls, nightly telescope dumps, or daily click and ad impression logs).

MapReduce and other "bulk" processing systems are popular because they make it simple to leverage commodity clusters to process large volumes of data. However, to keep today's analytics efficient, many programmers try to process data incrementally, which lets them update, rather than re-compute, answers as new data arrive. Similarly, many analytics, such as machine learning techniques, are iterative in nature, repeatedly refining data until some threshold is reached.
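To make the difference between updating and re-computing concrete, here is a minimal Python sketch. It is illustrative only and is not taken from the project's system: the function names `recompute` and `update` and the sample click logs are assumptions chosen for the example. It shows that folding a new batch into saved totals gives the same answer as rescanning all batches, while touching only the new data.

```python
from collections import Counter

def recompute(all_batches):
    """Baseline: rescan every batch seen so far (hypothetical helper)."""
    totals = Counter()
    for batch in all_batches:
        totals.update(batch)  # counts each element of the batch
    return totals

def update(prior_totals, new_batch):
    """Incremental: fold only the new batch into the saved totals."""
    totals = Counter(prior_totals)  # copy, so the prior state is not mutated
    totals.update(new_batch)
    return totals

# Made-up daily click logs, one ad identifier per click.
day1 = ["ad_a", "ad_b", "ad_a"]
day2 = ["ad_b", "ad_c"]

state = update(Counter(), day1)  # first generation of data
state = update(state, day2)      # second generation reads only day2

# Both strategies agree; the incremental one never rereads day1.
assert state == recompute([day1, day2])
```

The saved `Counter` plays the role of state here: it is a prior output that the next run re-uses instead of rebuilding.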

State, the ability to re-use prior outputs, is a fundamental requirement for these analytics. However, state is outside the purview of many bulk-processing systems, such as MapReduce or Hadoop. This not only forces programmers to manage state by hand, but also leaves many important system-level optimizations unrealized.

We are creating a generalized bulk-processing architecture that extends the core features of MapReduce-style computing. In particular, the programming model explicitly includes state, allowing the underlying system to make several important optimizations. We have expressed incremental web-oriented workflows and large-graph algorithms in this model, realizing a 50% improvement in running time over vanilla Hadoop. Moreover, the optimizations reduce data movement by similar fractions, easing the already high demands such systems place on data center networks.
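One way to picture a programming model that includes state explicitly is a reduce-style function that receives the prior state for a key alongside the new values. The Python sketch below is a single-machine illustration under that assumption. The names `stateful_step` and `merge_inlinks`, the function signature, and the sample crawl data are all hypothetical and do not describe the project's actual API.

```python
from collections import defaultdict

def stateful_step(reduce_fn, state, records):
    """Apply reduce_fn(key, new_values, prior_state) to each key with new input.

    Hypothetical interface: keys that receive no new records keep their
    old state untouched, so they need not be read, moved, or reprocessed.
    """
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)

    new_state = dict(state)
    for key, values in grouped.items():
        new_state[key] = reduce_fn(key, values, state.get(key))
    return new_state, set(grouped)

def merge_inlinks(url, new_sources, prior):
    """Merge newly crawled in-links for a page into the saved set."""
    known = set(prior) if prior is not None else set()
    known.update(new_sources)
    return frozenset(known)

# Made-up crawl output as (target_page, source_page) pairs.
crawl1 = [("b.example", "a.example"), ("c.example", "a.example")]
crawl2 = [("b.example", "c.example")]

state1, touched1 = stateful_step(merge_inlinks, {}, crawl1)
state2, touched2 = stateful_step(merge_inlinks, state1, crawl2)

# Only b.example received new input in the second crawl.
assert touched2 == {"b.example"}
assert state2["c.example"] == state1["c.example"]
```

Because the framework, not the programmer, holds the per-key state in this sketch, it can see which keys a new batch touches. That visibility is the kind of information a system could use to avoid reprocessing and moving unchanged data.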

SSL.UCSD: 2012