Performance Prediction Engineering

Francine Berman and Rich Wolski

UC San Diego

DARPA Report '98

Project Objective

High-end distributed computing and communication systems (metacomputing systems) require a new generation of techniques for predicting the performance of distributed applications. Performance prediction must reflect hardware and software heterogeneity and adapt to dynamic variation in the performance that system resources can deliver to an application. In this project, we use a "structural modeling" approach to develop Performance Prediction Engineering systems (PPE systems) that address the performance prediction challenges of metacomputing systems. The idea is to develop models that reflect the dynamic interplay between deliverable metacomputing system performance and application resource requirements over a specific time frame, and that predict application behavior with quantifiable accuracy.

Approach

In this project, we introduce and develop Performance Prediction Engineering (PPE) systems to provide a framework for modeling the time-dependent performance of applications in dynamic metacomputing environments. A PPE system consists of the following components:

STRUCTURAL MODELS/PERFORMANCE GRAMMARS --- Performance predictions made by our system are generated from compositional models (called "structural models" or "performance grammars"). Structural models consist of components which represent the performance activities (e.g., computation or communication activities) of the application. Each component can be instantiated with dynamic, time-dependent parameters, expressed as distributions rather than single values where appropriate, at the level of "accuracy" appropriate for its use.

QUALITY OF INFORMATION ("QoIn") MEASURES --- Each prediction generated by a PPE system can be associated with a set of quantifiable attributes which characterize its qualitative value. Such attributes might include the "lifetime" of the prediction, its "accuracy", and the "overhead" of computing the prediction. QoIn attributes for the component models of a structural model can be combined to produce QoIn attributes for predictions of the overarching structural model. The quality of performance information critically affects how useful performance estimates are for scheduling. QoIn measures and values provide a way of quantifying qualitative information so that it can be used to improve application schedules and ultimately application performance.

PERFORMANCE FORECASTING --- The basis of all PPE predictions is a set of resource or application component performance forecasts. Dynamic and time-dependent information is critical for accurate modeling. We use the Network Weather Service (NWS) -- a facility which monitors, reports, and forecasts dynamic load and availability of system resources -- to provide performance forecasts which are based on statistical analysis of continually-updated resource performance histories. NWS forecasts are used to parameterize PPE application performance predictions. In this project, we extend the NWS to support PPE by calculating QoIn measures associated with performance predictions, and by enhancing the performance sensors themselves to better support PPE performance predictions.
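
As a rough illustration of how the three components above might fit together, the following Python sketch composes two component predictions into a prediction for a whole structural model. Everything in it is an assumption made for illustration and is not taken from the PPE implementation: the Prediction class, the in_sequence function, the representation of a distributional value as a mean and standard deviation, the independence assumption used to combine spreads, and the rules for combining QoIn attributes (shortest lifetime, summed overhead). The component values stand in for forecasts that a facility such as the NWS would supply.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """A distributional prediction with two illustrative QoIn attributes."""
    mean: float      # predicted time, in seconds
    std: float       # spread of the prediction, in seconds
    lifetime: float  # QoIn: seconds for which the prediction stays valid
    overhead: float  # QoIn: seconds spent producing the prediction


def in_sequence(a: Prediction, b: Prediction) -> Prediction:
    """Compose two activities that run one after the other.

    Assumed rules (hypothetical, for illustration only):
      - means add;
      - spreads add in quadrature, i.e. the components are independent;
      - the combined prediction is only valid as long as its
        shortest-lived component;
      - overheads add.
    """
    return Prediction(
        mean=a.mean + b.mean,
        std=math.hypot(a.std, b.std),
        lifetime=min(a.lifetime, b.lifetime),
        overhead=a.overhead + b.overhead,
    )


# Hypothetical component predictions for one iteration of an application.
computation = Prediction(mean=12.0, std=3.0, lifetime=60.0, overhead=0.25)
communication = Prediction(mean=4.0, std=4.0, lifetime=30.0, overhead=0.5)

iteration = in_sequence(computation, communication)
print(iteration)
```

A scheduler consuming such a prediction could, for example, discard it once its lifetime has expired, or prefer a cheaper and less accurate component model when the overhead attribute is too high for the decision at hand.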

Recent Accomplishments

Our first research accomplishment focused on the development of a PPE system for a representative application. We developed structural models for a distributed SOR (Successive Over-Relaxation) application. This involved analyzing the distributions associated with the parameters of the component models and developing component models which could use the enhanced (distributional) data. Distributional data, as opposed to single point values, provides additional information about the range of possible parameter values. Results for the SOR application in several settings indicate that the PPE methodology can accurately predict the range of behavior of applications running on dynamic multi-user metacomputing systems. In environments with well-behaved load conditions, distributional predictions captured 100% of the expected behavior of the SOR application, whereas point-valued predictions for the same environment would have incurred errors of up to 20%. In more chaotic systems, distributional predictions captured approximately 80% of the application behavior with a maximum error of only 14%, in contrast to errors of up to 40% for point-valued predictions. Preliminary experiments with additional applications under development indicate similar results.
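
The comparison between distributional and point-valued predictions can be made concrete with a small Python sketch. The metric definitions here are assumptions, since the report does not state them: a distributional prediction is taken to be an interval, "captured" means an observed execution time falls inside that interval, and error is measured relative to the observed time. The function names and the sample data are hypothetical and are not measurements from the SOR experiments.

```python
def coverage(observed, low, high):
    """Fraction of observed times that fall inside the predicted range."""
    inside = sum(1 for t in observed if low <= t <= high)
    return inside / len(observed)


def max_interval_error(observed, low, high):
    """Largest relative miss of the predicted range (0.0 if none miss)."""
    def miss(t):
        if t < low:
            return (low - t) / t
        if t > high:
            return (t - high) / t
        return 0.0
    return max(miss(t) for t in observed)


def max_point_error(observed, point):
    """Largest relative error of a single point-valued prediction."""
    return max(abs(t - point) / t for t in observed)


# Hypothetical execution times (seconds) for repeated runs.
observed = [10.0, 12.0, 8.0, 16.0, 11.0]

print(coverage(observed, 8.0, 12.0))            # share of runs captured
print(max_interval_error(observed, 8.0, 12.0))  # worst miss of the range
print(max_point_error(observed, 10.0))          # worst miss of the point
```

With this sample data the range prediction captures four of the five runs, and its worst miss is smaller than the worst miss of the point prediction, which is the qualitative pattern described above.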

Our second research accomplishment involved the development of QoIn accuracy measures for PPE models. To generate accuracy-based QoIn measures, we enhanced the NWS to decompose the accuracy of NWS forecasts into a measurement error component and a prediction error component. Measurement error is proportional to the intrusiveness of the performance measurement process. We developed adaptive performance monitors based on a combination of active resource probing and passive monitoring techniques that seek to reduce measurement error while at the same time introducing as little resource load as possible.

Our third research accomplishment continued the NWS PPE enhancement by focusing on reducing the prediction error in NWS forecasts. The NWS forecasting system monitors prediction error and adjusts forecasting model parameters dynamically in response to changing conditions. It then automatically chooses among different forecasting models based on a history of their individual accuracies. To improve forecast accuracy, we introduced both new forecasting techniques and different methods for dynamically identifying the best model for each resource.
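
The idea of choosing among forecasting models according to their past accuracy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NWS implementation: the three forecasters (last value, running mean, median of the last three measurements), the use of cumulative absolute error as the accuracy history, and the names FORECASTERS and adaptive_forecast are all hypothetical.

```python
from statistics import mean, median

# Hypothetical candidate forecasters; each maps a measurement history
# (a non-empty list of floats) to a forecast of the next measurement.
FORECASTERS = {
    "last_value": lambda h: h[-1],
    "running_mean": lambda h: mean(h),
    "median_of_3": lambda h: median(h[-3:]),
}


def adaptive_forecast(history):
    """Return (name, forecast) from the historically most accurate model.

    Each forecaster is replayed over the history: at every step it
    predicts the next measurement from the earlier ones, and its
    absolute error is accumulated. The forecaster with the lowest
    cumulative error produces the forecast for the next measurement.
    """
    if not history:
        raise ValueError("history must contain at least one measurement")
    errors = {name: 0.0 for name in FORECASTERS}
    for i in range(1, len(history)):
        past, actual = history[:i], history[i]
        for name, forecaster in FORECASTERS.items():
            errors[name] += abs(forecaster(past) - actual)
    best = min(errors, key=errors.get)
    return best, FORECASTERS[best](history)


# A resource whose availability steps from 0.5 to 1.0 and stays there.
print(adaptive_forecast([0.5, 0.5, 0.5, 1.0, 1.0, 1.0]))
# A resource whose availability oscillates around a stable average.
print(adaptive_forecast([1.0, 3.0, 1.0, 3.0, 1.0, 3.0]))
```

In the first series the last-value forecaster tracks the step change most closely, while in the second the running mean accumulates the least error, so the selection differs per resource. A production system would update the error totals incrementally rather than replaying the full history on every call.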

Results representative of all these accomplishments have been submitted for publication and presented at technical meetings.

Current Plan

Our current plan for this project is to focus on the following:

Technology Transfer

The structural modeling techniques developed in this project provide an effective framework for metacomputing scheduling. We are using the PPE project to explore modeling and quality of information issues in conjunction with the metacomputing application-level scheduling (AppLeS) project led by the project PIs. AppLeS provides a framework for using PPE models and QoIn measures. We are in the process of using these concepts to develop applications in collaboration with NPACI, NSF, DoD, Globus/Gusto and NASA researchers.

The NWS enhancements designed and prototyped on this project are being deployed as part of the National Partnership for Advanced Computational Infrastructure (NPACI) project funded by NSF. NPACI maintains a national-scale installation of the NWS. We are currently testing the effectiveness of the PPE enhancements to NWS developed in this project in various Internet settings. As their implementations mature, we are integrating them as part of the NPACI NWS effort.