Data-intensive processing turns raw data into useful data products for a variety of applications, including web search, click analytics, and bioinformatics. Current data-intensive scalable computing (DISC) architectures, such as Hadoop, scale to thousands of nodes and create enormous pools of derived data. These systems execute complex, multi-stage dataflows that can consist of hundreds of operations. Because of the scale and complexity of these analytics, derived-data management, auditing, and debugging are fast becoming the next major operational bottlenecks in data-intensive computing.
We are designing scalable architectures that capture and use fine-grain data provenance to improve data management for data-intensive computing systems. Fine-grain provenance associates data outputs, e.g., records and files, with the data inputs used to create them, and our system, Newt, uses this information to debug and manage large-scale analytics. Specifically, this approach supports replaying portions of complex analytics to regenerate specific outputs, avoiding re-running entire dataflows. It enables a variety of data management, workflow debugging, and data auditing scenarios, as sketched below.
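To make the idea concrete, here is a minimal Python sketch of how record-level lineage capture and selective replay might work. The `ProvenanceStage`, `trace`, and `replay` names are illustrative assumptions for this sketch and do not correspond to Newt's actual API.

```python
# Conceptual sketch of fine-grain provenance capture and selective
# replay; the class and function names here are hypothetical, not
# Newt's real interface.

class ProvenanceStage:
    """Wraps one dataflow operator and records, for every output
    record, the set of input records used to produce it."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn          # fn: input record -> list of output records
        self.lineage = {}     # output record -> set of contributing inputs

    def run(self, inputs):
        outputs = []
        for rec in inputs:
            for out in self.fn(rec):
                self.lineage.setdefault(out, set()).add(rec)
                outputs.append(out)
        return outputs


def trace(stages, output):
    """Walk lineage backwards through the pipeline to find the
    original inputs that contributed to one output record."""
    frontier = {output}
    for stage in reversed(stages):
        frontier = set().union(
            *(stage.lineage.get(rec, set()) for rec in frontier))
    return frontier


def replay(stages, inputs, output):
    """Re-run the dataflow on only the inputs that produced the
    given output, instead of the entire input dataset."""
    needed = trace(stages, output)
    data = [rec for rec in inputs if rec in needed]
    for stage in stages:
        data = stage.run(data)
    return data


# Example: a two-stage dataflow (tokenize, then filter short words).
tokenize = ProvenanceStage("tokenize", lambda line: line.split())
keep_long = ProvenanceStage("filter", lambda w: [w] if len(w) > 3 else [])
lines = ["the quick fox", "lazy dogs sleep"]

out = keep_long.run(tokenize.run(lines))
print(trace([tokenize, keep_long], "quick"))    # {'the quick fox'}
print(replay([tokenize, keep_long], lines, "quick"))  # ['quick']
```

The key property the sketch illustrates is that replaying one output touches only its lineage: regenerating `"quick"` re-processes a single input line rather than the whole dataset, which is what makes replay-based debugging tractable at scale.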
Funding
- 2011 National Science Foundation award number CCF-1048296