In-situ MapReduce 

Log processing is an important component of many large web-based services. Log data is typically shipped to a centralized cluster, often running a framework such as MapReduce, Hadoop, or Dryad. However, simply collecting the data can take hours: the system may gather logs from thousands of machines, each producing 1-10 MB/s. At the same time, the first processing task often reduces the data volume (through filtering or aggregation) by over 50%.
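As a rough back-of-the-envelope illustration of these figures, the sketch below computes the aggregate collection rate and what remains after the first processing step. The machine count of 1,000 is an assumption chosen only to stand in for "thousands of machines"; the per-machine rates and the 50% reduction come from the text above.

```python
def aggregate_rate_mb_s(machines: int, per_machine_mb_s: float) -> float:
    """Total log volume arriving at the central cluster, in MB/s."""
    return machines * per_machine_mb_s


def after_reduction(rate_mb_s: float, reduction_fraction: float) -> float:
    """Volume left once filtering or aggregation removes a fraction of the data."""
    return rate_mb_s * (1.0 - reduction_fraction)


# Assumption: 1,000 machines, a hypothetical stand-in for "thousands".
MACHINES = 1000

low = aggregate_rate_mb_s(MACHINES, 1)    # every machine at 1 MB/s
high = aggregate_rate_mb_s(MACHINES, 10)  # every machine at 10 MB/s

print(f"collected: {low}-{high} MB/s")
print(f"kept after a 50% reduction: "
      f"{after_reduction(low, 0.5)}-{after_reduction(high, 0.5)} MB/s")
```

Under these assumptions, at least half of the bytes moved to the central cluster are dropped by the first task, which is the transfer cost that processing the logs where they are produced would avoid.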

This project uses an "in-situ" architecture that runs MapReduce jobs on the servers that generate the logs. We are porting map and reduce operators to a distributed stream processor, including support for continuous, windowed MapReduce operations. The research investigates how to provide high availability, minimize the impact on the end systems, and return meaningful results.
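To make the idea of a continuous, windowed MapReduce operation concrete, here is a minimal single-process sketch. It is not the project's implementation: the function names, the tumbling (non-overlapping) window policy, the in-order arrival of records, and the three-field log format are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Record = Tuple[float, str]  # (timestamp in seconds, raw log line)


def windowed_mapreduce(
    records: Iterable[Record],
    map_fn: Callable[[str], Iterable[Tuple[Hashable, int]]],
    reduce_fn: Callable[[Hashable, List[int]], int],
    window_s: int,
) -> Iterator[Tuple[int, Dict[Hashable, int]]]:
    """Yield (window_start, {key: reduced_value}) for each tumbling window.

    Assumes records arrive in timestamp order, so a window can be
    emitted as soon as a record from a later window shows up.
    """
    current = None
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for ts, line in records:
        start = int(ts // window_s) * window_s
        if current is not None and start != current:
            # The previous window is complete: reduce and emit it.
            yield current, {k: reduce_fn(k, vs) for k, vs in groups.items()}
            groups = defaultdict(list)
        current = start
        # Map phase: each log line yields zero or more (key, value) pairs.
        for key, value in map_fn(line):
            groups[key].append(value)
    if current is not None:
        # Flush the final, possibly partial, window.
        yield current, {k: reduce_fn(k, vs) for k, vs in groups.items()}


def status_map(line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical log format: '<method> <path> <status>'; others are filtered out."""
    parts = line.split()
    if len(parts) == 3:
        yield parts[2], 1


def count_reduce(key: Hashable, values: List[int]) -> int:
    """Count occurrences of a key within one window."""
    return sum(values)
```

A short usage example with made-up log lines and a 10-second window:

```python
logs = [
    (0.5, "GET / 200"),
    (3.0, "GET /a 404"),
    (9.9, "GET / 200"),
    (10.1, "GET /b 200"),
    (12.0, "bad line"),
]
for window_start, counts in windowed_mapreduce(logs, status_map, count_reduce, 10):
    print(window_start, counts)
```

Running the map and per-window reduce on the server that produced the lines means only the small per-window summaries would need to leave the machine, rather than the raw log stream.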


SSL.UCSD: 2012