Our group explores a broad range of computer architecture and system topics that bridge system software and hardware design. Our recent research efforts can be categorized into the following themes:
Persistent memory, enabled by nonvolatile memories (NVRAMs), is envisioned as a new data storage component in future computing systems. Recognizing its value, both system software and hardware suppliers have recently begun to adopt persistent memory in their next-generation designs. Though promising, this technical transition fundamentally changes current memory and storage system design assumptions and introduces critical design challenges. Our goal in this research is to enhance the performance, energy efficiency, and reliability of memory and storage systems ranging from data centers to end-user devices. Our approach includes persistent-memory-aware fault tolerance support, logging acceleration, persistent caching in system libraries and userspace, NV-DIMM system profiling and characterization, and load balancing techniques among critical system resources.
The organization of computer systems is becoming increasingly complex and heterogeneous in both software and hardware. Machine learning has been shown to outperform traditional heuristics in various aspects of computer system design and optimization, such as compilers, programming languages, microarchitecture, and data center scheduling. At the same time, machine learning itself places rapidly growing demands on computation and data storage resources as dataset and model sizes skyrocket. To continue scaling both computer systems and machine learning, new design approaches are needed. To address these challenges, we are exploring machine learning and systems co-design for programming languages, compilers, and heterogeneous systems, including neural program translation, neural decompilation, neural architecture search in heterogeneous systems, and software/hardware co-design for deep learning acceleration.
The combination of smart technology and edge computing has paved the way for smart homes, autonomous vehicles, and long-term service robots. These smart applications introduce critical architecture and system design challenges. One key component of these applications is deep learning. The growing number of model parameters and the growing size of data sets impose substantial system bottlenecks in memory and storage capacity, bandwidth, and energy efficiency. To tackle these bottlenecks, we are exploring near-data processing to accelerate deep learning. Moreover, the mission-critical functions of many emerging edge infrastructures will require unparalleled reliability and fault tolerance guarantees. For example, certain applications, such as autonomous driving, require fast recovery when systems fail. We are investigating efficient system management and fault tolerance schemes suitable for smart applications.
The volume of data has skyrocketed over the last decade, driven by the expanding working set sizes of modern applications. Unfortunately, with current DDRx-based architectures, memory capacity scaling falls far behind the pace of application demand. One promising solution is memory networking, in which the memory system consists of multiple memory nodes interconnected by a high-speed network. The interconnected memory nodes form a disaggregated memory pool shared by processors from different CPU sockets in the server. Ideally, the memory network can enable more scalable performance and capacity than traditional memory systems.
Accelerating general-purpose applications with GPUs has recently drawn wide interest. In this project, we explore hardware designs and hardware/software interfaces that continue to exploit the energy efficiency of specialization while broadening the range of applications that can benefit from it. We have designed effective GPU architectures and data management techniques that optimize system energy efficiency and memory bandwidth by exploiting the memory access patterns of general-purpose applications. Our ongoing research investigates architectural support for (1) reducing the performance constraints posed by specialization (e.g., thread synchronization overheads of GPUs), and (2) increasing the programmability of special-purpose accelerators by developing software interfaces that enable efficient mapping of general-purpose applications to special-purpose hardware, without the need to reprogram applications for each type of accelerator.
Cost is always an important factor that influences the adoption of new technologies. We analyze and model the cost of designing and fabricating circuits and systems built with new technologies. With these comprehensive cost models, we intend to help hardware designers choose the most cost-effective design strategy at the early stages of the design flow. Our cost analysis across various design options demonstrates that, with a properly configured processor organization, new technologies can reduce fabrication cost compared to their traditional counterparts.
In the area of hybrid memory designs, our studies aim to help architects determine the best system organization when multiple memory technologies are available. In particular, we proposed a bandwidth-aware reconfigurable cache architecture consisting of a hybrid cache hierarchy, a reconfiguration mechanism, and a statistical prediction engine. Our design dynamically adapts the cache capacity of each level based on the predicted bandwidth demands of different applications. We also developed an analytical performance model that estimates the performance of a throughput computing workload running on various memory hierarchy configurations.
3D integration technology allows us to integrate two or more dies of a chip multiprocessor (CMP) vertically. Compared to traditional 2D CMPs, 3D CMPs promise reduced circuit delay, high memory bandwidth, and a condensed form factor. However, the continuous increase in power and energy budgets for CMP designs can introduce critical system design problems, such as power supply rail design, system reliability, and thermal issues. We address the high energy consumption and thermal issues that arise when 3D stacking technology is used in developing CMPs. We propose to reduce the energy consumption of 3D-stacked CMPs by tuning the supply voltage and frequency of processor cores and caches at fine temporal and spatial granularity. Our tuning technique is implemented by integrating an array of on-chip voltage regulators into the original processor.