SUN ACCESS

 

Characterizing and Managing Temperature for Increased System Reliability

 

  • Thermal hot spots and temperature variations degrade system reliability and performance.
  • Our focus: developing cost-effective temperature management techniques that increase system reliability.

 

 

Characterizing the Effect of Thermal Stress on Reliability

 

  • We use the Continuous System Telemetry Harness (CSTH) to quantify the thermal stress experienced by computer chips.
  • CSTH:
    • Efficient infrastructure for collecting and analyzing time series data
    • Advanced pattern recognition tools for reliability surveillance, e.g., the Multivariate State Estimation Technique (MSET)
  • Quantifying thermal stress can guide reliability testing at the manufacturing stage.
  • Using CSTH, we can obtain a real-time metric for evaluating the remaining useful life of components. See our innovative Length-of-Curve (LOC) Metric.
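
This page does not define the LOC metric, so the following is only a minimal sketch under one assumption: that a length-of-curve measure can be illustrated as the cumulative arc length of a sampled temperature trace, so that a component that cycles thermally accumulates a longer curve than one held at a steady temperature. The function name `length_of_curve` and the sample values are hypothetical, and the relative scaling of the time and temperature axes is arbitrary here.

```python
import math

def length_of_curve(times, temps):
    """Arc length of the piecewise-linear curve through (time, temperature) samples.

    Illustrative stand-in only; not necessarily the LOC metric referenced above.
    """
    if len(times) != len(temps):
        raise ValueError("times and temps must have the same length")
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]      # elapsed time between samples
        dtemp = temps[i] - temps[i - 1]   # temperature change between samples
        total += math.hypot(dt, dtemp)    # length of this line segment
    return total

# Hypothetical traces: one steady, one cycling by 4 degrees every sample.
steady = length_of_curve([0, 1, 2, 3], [60.0, 60.0, 60.0, 60.0])
cycling = length_of_curve([0, 1, 2, 3], [60.0, 64.0, 60.0, 64.0])
```

Under this assumption, the cycling trace yields the larger value, which matches the intuition that repeated temperature swings stress a component more than a flat profile does.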

 

 

Optimizing the Thermal Profile

 

  • Thermal management techniques typically sacrifice performance to lower temperature.
  • Systems currently on the market consider only the maximum temperature reached.
  • The frequency and magnitude of thermal hot spots and temperature gradients can be minimized cost-effectively by monitoring the system in real time (i.e., with CSTH) and adapting to changes in workload and cooling dynamics. See the Temperature Aware Scheduling technique we have developed.
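
To make the idea of adapting to real-time temperature readings concrete, here is a minimal sketch of one simple policy: dispatch the next job to the coolest idle core. This is an assumed illustration, not the Temperature Aware Scheduling technique itself, whose details are not given on this page. The function `pick_core` and the readings below are hypothetical.

```python
def pick_core(core_temps, busy):
    """Return the index of the coolest idle core, or None if every core is busy.

    core_temps: latest temperature reading per core (e.g. from telemetry).
    busy: one boolean per core, True if the core is already running a job.
    """
    idle = [i for i in range(len(core_temps)) if not busy[i]]
    if not idle:
        return None
    # Prefer the coolest idle core so that heat is spread across the die.
    return min(idle, key=lambda i: core_temps[i])

# Hypothetical readings in degrees Celsius for a four-core chip.
temps = [71.5, 64.0, 68.2, 59.9]
busy = [False, False, True, True]
chosen = pick_core(temps, busy)
```

With these readings, cores 2 and 3 are busy, so the job goes to core 1, the cooler of the two idle cores. A policy of this kind considers the whole temperature profile rather than only the peak value.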

 

 

Estimating On-Die Temperature Accurately

 

  • Estimates below the actual temperature delay the activation of thermal management, which increases packaging costs and degrades reliability.
  • Estimates above the actual temperature trigger thermal management too early, which degrades performance.
  • Our Accurate Temperature Estimation technique eliminates thermal sensor noise. It also provides temperature estimates at various locations on the chip, based on readings from a limited number of thermal sensors.
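
The two steps named above (noise handling, then estimation at locations without a sensor) can be sketched as follows. The methods are stand-ins chosen for brevity and are assumptions: an exponential moving average, which reduces noise but does not eliminate it, and inverse-distance weighting across sensor positions. Neither is claimed to be the Accurate Temperature Estimation technique. All names, coordinates, and readings are hypothetical.

```python
import math

def smooth(readings, alpha=0.3):
    """Exponential moving average over a sensor's readings; returns the latest estimate."""
    est = readings[0]
    for r in readings[1:]:
        est = alpha * r + (1 - alpha) * est
    return est

def estimate_at(point, sensors, power=2.0):
    """Inverse-distance-weighted temperature at `point`.

    sensors: list of (x, y, temperature) tuples for the available sensors.
    """
    num = 0.0
    den = 0.0
    for x, y, temp in sensors:
        d = math.hypot(point[0] - x, point[1] - y)
        if d == 0.0:
            return temp  # the point coincides with a sensor
        w = 1.0 / d ** power  # nearer sensors get larger weights
        num += w * temp
        den += w
    return num / den

# Two hypothetical sensors 2 units apart, each with a short noisy history.
sensors = [
    (0.0, 0.0, smooth([70.2, 69.8, 70.1])),
    (2.0, 0.0, smooth([80.3, 79.9, 80.0])),
]
midpoint = estimate_at((1.0, 0.0), sensors)
```

At the midpoint both sensors are equally distant, so the estimate is the average of the two smoothed readings. A realistic estimator would also account for the chip's power distribution and thermal model, which this sketch ignores.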

 

 

CSTH and MSET can be combined to perform real-time reliability monitoring and to build an autonomous closed-loop control system that optimizes reliability.

 

* This work was funded by Sun Microsystems and by University of California MICRO grant 06-198.