Publications
- 66. H Chen, Y Ni, A Zakeri, Z Zou, S Yun, F Wen, B Khaleghi, N Srinivasa, H Latapie, M Imani, "HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph Reasoning", arXiv preprint, 2024.
- 65. M Imani, Y Kim, B Khaleghi, J Morris, H Alimohamadi, F Imani, H Latapie, "Hierarchical, Distributed and Brain-Inspired Learning for Internet of Things Systems", ICDCS, 2023.
- 64. T Zhang, S Salamat, B Khaleghi, J Morris, B Aksanli, TS Rosing, "HD2FPGA: Automated Framework for Accelerating Hyperdimensional Computing on FPGAs", ISQED, 2023.
- 63. T Zhang, J Morris, K Stewart, H W Lui, B Khaleghi, A Thomas, T G Marback, B Aksanli, E Neftci, TS Rosing, "HyperSpikeASIC: Accelerating Event-based Workloads with HyperDimensional Computing and Spiking Neural Networks", IEEE TCAD, 2023.
- 62. D Jones, J Allen, X Zhang, B Khaleghi, J Kang, W Xu, N Moshiri, TS Rosing, "HD-bind: Encoding of molecular structure with low precision, hyperdimensional binary representations", arXiv preprint, 2023.
- 61. R Fielding-Miller, S Karthikeyan, T Gaines, et al., "Safer at school early alert: an observational study of wastewater and surface monitoring to detect COVID-19 in elementary schools", The Lancet Regional Health-Americas, 2023.
- 60. B Khaleghi, T Zhang, C Martino, G Armstrong, A Akel, K Curewitz, J Eno, S Eilert, R Knight, N Moshiri, TS Rosing, "SALIENT: Ultra-Fast FPGA-based Short Read Alignment", FPT, 2022. (Best Paper Nomination)
- 59. A Thomas, B Khaleghi, G K Jha, N Himayat, R Iyer, N Jain, TS Rosing, "Streaming Encoding Algorithms for Scalable Hyperdimensional Computing", arXiv preprint, 2022.
- 58. U Mallappa, P Gangwar, B Khaleghi, H Yang, TS Rosing, "TermiNETor: Early Convolution Termination for Efficient Deep Neural Networks", ICCD, 2022.
- 57. B Khaleghi, T Zhang, N Shao, A Akel, K Curewitz, J Eno, S Eilert, N Moshiri, TS Rosing, "FAST: FPGA-based Acceleration of Genomic Sequence Trimming", BioCAS, 2022.
- 56. J Kang, B Khaleghi, Y Kim, TS Rosing, "OpenHD: A GPU-Powered Framework for Hyperdimensional Computing", IEEE TC, 2022.
- 55. B Khaleghi, J Kang, H Xu, J Morris, TS Rosing, "GENERIC: Highly Efficient Learning Engine on Edge using Hyperdimensional Computing", SRC TECHCON, 2022.
- 54. M Imani, Y Kim, B Khaleghi, N Moshiri, S Gupta, V Kumar, TS Rosing, "Methods, Circuits, and Articles of Manufacture for Searching within a Genomic Reference Sequence for Queried Target Sequence using Hyper-Dimensional Computing Techniques", US Patent App, 2022.
- 53. J Morris, M Imani, Y Kim, J Messerly, Y Guo, B Khaleghi, S Gupta, S Salamat, J Sim, TS Rosing, "Circuits, Methods, and Articles of Manufacture for Hyper-Dimensional Computing Systems and Related Applications", US Patent App, 2022.
- 52. S Salamat, M Imani, B Khaleghi, TS Rosing, "Methods and Systems Configured to Specify Resources for Hyperdimensional Computing Implemented in Programmable Devices using a Parameterized Template for Hyperdimensional Computing", US Patent App, 2022.
- 51. J Morris, Y Hao, S Gupta, B Khaleghi, B Aksanli, TS Rosing, "Stochastic-HD: Leveraging Stochastic Computing on the Hyper-Dimensional Computing Pipeline", Frontiers in Neuroscience, 2022.
- 50. A Dutta, S Gupta, B Khaleghi, R Chandrasekaran, W Xu, TS Rosing, "HDnn-PIM: Efficient in Memory Design of Hyperdimensional Computing with Feature Extraction", GLSVLSI, 2022.
- 49. J Morris, K Ergun, B Khaleghi, M Imani, B Aksanli, TS Rosing, "HyDREA: Utilizing Hyperdimensional Computing for a More Robust and Efficient Machine Learning System", ACM TECS, 2022.
- 48. G Armstrong, C Martino, J Morris, B Khaleghi, et al., "Swapping Metagenomics Preprocessing Pipeline Components Offers Speed and Sensitivity Increases", mSystems, 2022.
- 47. B Khaleghi*, U Mallappa*, D Yaldiz, H Yang, M Shah, J Kang, TS Rosing, "PatterNet: Explore and Exploit Filter Patterns for Efficient Deep Neural Networks", DAC, 2022.
- 46. B Khaleghi, J Kang, H Xu, J Morris, TS Rosing, "GENERIC: Highly Efficient Learning Engine on Edge using Hyperdimensional Computing", DAC, 2022.
- 45. S Gupta, B Khaleghi, S Salamat, J Morris, R Ramkumar, J Yu, A Tiwari, J Kang, M Imani, B Aksanli, TS Rosing, "Store-n-Learn: Classification and Clustering with Hyperdimensional Computing across Flash Hierarchy", ACM TECS, 2022.
- 44. J Morris, H Lui, K Stewart, B Khaleghi, A Thomas, T Marback, B Aksanli, E Neftci, TS Rosing, "HyperSpike: HyperDimensional Computing for More Efficient and Robust Spiking Neural Networks", DATE, 2022.
- 43. J Kang, B Khaleghi, Y Kim, TS Rosing, "XCelHD: An Efficient GPU-Powered Hyperdimensional Computing with Parallelized Training", ASP-DAC, 2022.
- 42. B Khaleghi, M Imani, S Salamat, TS Rosing, "Methods of Providing Trained Hyperdimensional Machine Learning Models Having Classes with Reduced Elements and Related Computing Systems", US Patent App, 2021.
- 41. R Fielding-Miller, S Karthikeyan, T Gaines, et al., "Wastewater and surface monitoring to detect COVID-19 in elementary school settings: The Safer at School Early Alert project", medRxiv, 2021.
- 40. A Paul, G Hota, B Khaleghi, Y Xu, TS Rosing, G Cauwenberghs, "Attention State Classification with In-Ear EEG", BioCAS, 2021.
- 39. Y Hao, S Gupta, J Morris, B Khaleghi, B Aksanli, TS Rosing, "Stochastic-HD: Leveraging Stochastic Computing on Hyper-Dimensional Computing", ICCD, 2021.
- 38. S Salamat, A H Aboutalebi, B Khaleghi, J H Lee, Y S Ki, TS Rosing, "NASCENT: Near-Storage Acceleration of Database Sort on SmartSSD", FPGA, 2021.
  Abstract: As the amount of data generated every day grows dramatically, the computational bottleneck of computer systems has shifted toward storage devices. The interface between storage and computational platforms has become the main limitation, as it provides limited bandwidth that does not scale as the number of storage devices increases. Interconnect networks do not provide simultaneous access to all storage devices and thus limit system performance when independent operations run on different storage devices. Offloading computation to the storage devices removes this data-transfer burden from the interconnects. Emerging as a nascent computing trend, near-storage computing offloads a portion of computation to the storage devices to accelerate big-data applications. In this paper, we propose NASCENT, a near-storage accelerator for database sort that utilizes the Samsung SmartSSD, an NVMe flash drive with an on-board FPGA chip that processes data in situ. We propose, to the best of our knowledge, the first near-storage database sort based on bitonic sort, which considers the specifications of the storage devices to increase the scalability of computer systems as the number of storage devices grows. NASCENT improves both performance and energy efficiency as the number of storage devices increases. With 12 SmartSSDs, NASCENT is 7.6x (147.2x) faster and 5.6x (131.4x) more energy efficient than the FPGA (CPU) baseline.
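The bitonic network that NASCENT builds on is attractive for FPGAs because its compare-exchange pattern is fixed and data-independent, so it can be fully pipelined. A minimal software sketch of the algorithm (the textbook network, not NASCENT's hardware design; the function name and power-of-two restriction are illustrative):

```python
def bitonic_sort(data):
    """Iterative bitonic sort for power-of-two lengths. The
    compare-exchange pattern below is fixed and data independent,
    which is what makes the network easy to pipeline in hardware."""
    a = list(data)
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                  # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:               # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if ascending == (a[i] > a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Note the swap decision depends only on the indices `i`, `j`, `k`, never on earlier comparison outcomes, which is why the same network sorts any input of that length.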
- 37. B Khaleghi, H Xu, J Morris, TS Rosing, "tiny-HD: Ultra-Efficient Hyperdimensional Computing Engine for IoT Applications", DATE, 2021.
- 36. J Morris, K Ergun, B Khaleghi, M Imani, B Aksanli, TS Rosing, "HyDREA: Towards More Robust and Efficient Machine Learning Systems with Hyperdimensional Computing", DATE, 2021.
  Abstract: Today's systems, especially in the age of federated learning, rely on sending all the data to the cloud and then using complex algorithms, such as Deep Neural Networks, which require billions of parameters and many hours to train a model. In contrast, the human brain can do much of this learning effortlessly. Hyperdimensional (HD) computing aims to mimic the behavior of the human brain by utilizing high-dimensional representations. This leads to various desirable properties that other Machine Learning (ML) algorithms lack, such as robustness to noise and simple, highly parallel operations. In this paper, we propose HyDREA, an HD computing system that is Robust, Efficient, and Accurate. To evaluate the feasibility of HyDREA in a federated learning environment with wireless communication noise, we utilize NS-3, a popular network simulator that models a real-world environment with wireless communication noise. We found that HyDREA is 48x more robust to noise than comparable ML algorithms. We additionally propose a Processing-in-Memory (PIM) architecture that adaptively changes the bitwidth of the model based on the signal-to-noise ratio (SNR) of the incoming sample, maintaining the robustness of the HD model while achieving high accuracy and energy efficiency. Our results indicate that our proposed system loses less than 1% classification accuracy even in scenarios with an SNR of 6.64. Our PIM architecture also achieves 255x better energy efficiency and a 28x speedup in execution time compared to the baseline PIM architecture.
- 35. R Garcia, F Asgarinejad, B Khaleghi, TS Rosing, M Imani, "TruLook: A Framework for Configurable GPU Approximation", DATE, 2021.
- 34. B Khaleghi, S Salamat, TS Rosing, "Revisiting FPGA Routing under Varying Operating Conditions", FPT, 2020.
  Abstract: FPGA devices are continually integrating more resources to satisfy emerging applications' performance requirements, which has pushed the power consumption of cutting-edge devices beyond that of CPUs. Consequently, more aggressive power-reduction techniques have been explored recently, including voltage scaling, which has been adopted by several commercial FPGA families. In this paper, we investigate FPGA routing (both the interconnection network and the routing algorithm) under varying voltage, temperature, and degradation. We first examine routing switch boxes (SBs) and point out that SBs with different wire segment lengths have different sensitivity and tolerance to varying operating conditions. Accordingly, we show how architectures with similar overall efficiency in the nominal condition can perform differently at scaled voltage or temperature. Finally, we show that, unlike the current FPGA flow, which first completes placement and routing and performs multi-condition timing analysis afterward, bringing the timing information of the actual operating condition into the placement and routing steps helps the underlying algorithms utilize the resources with better relative efficiency in the target condition, leading to higher performance.
- 33. S Salamat, S Shubhi, B Khaleghi, TS Rosing, "Residue-Net: Multiplication-free Neural Network by In-situ, No-loss Migration to Residue Number Systems", ASP-DAC, 2021.
  Abstract: Deep neural networks are widely deployed on embedded devices to solve a wide range of problems, from edge sensing to autonomous driving. The accuracy of these networks is usually proportional to their complexity. Quantization of model parameters (i.e., weights) and/or activations is a popular and powerful technique for alleviating this complexity while preserving accuracy. Nonetheless, previous studies have shown that the quantization level is limited, as network accuracy eventually degrades. We propose Residue-Net, a multiplication-free accelerator for neural networks that uses the Residue Number System (RNS) to achieve substantial energy reduction. RNS breaks operations down into several smaller operations that are simpler to implement. Moreover, Residue-Net replaces the copious costly multiplications with simple, energy-efficient shift and add operations to further reduce the computational complexity of neural networks. To evaluate the efficiency of our proposed accelerator, we compared the performance of Residue-Net with a baseline FPGA implementation of four widely used networks, viz., LeNet, AlexNet, VGG16, and ResNet-50. When delivering the same performance as the baseline, Residue-Net reduces the area and power (hence energy) by 36% and 23%, respectively, on average, with no accuracy loss. Leveraging the saved area to accelerate the quantized RNS network through parallelism, Residue-Net improves its throughput by 2.8x and energy efficiency by 2.7x.
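The RNS idea behind Residue-Net can be sketched in a few lines: a number is represented by its residues modulo a set of pairwise-coprime moduli, and addition and multiplication act independently on each small residue channel, which is what shrinks the arithmetic. A hedged sketch using the classic low-cost moduli set {2^k-1, 2^k, 2^k+1} with k=3 (the moduli choice and function names are illustrative, not the paper's configuration):

```python
from math import prod

MODULI = (7, 8, 9)  # pairwise coprime; dynamic range M = 504

def to_rns(x):
    """Forward conversion: one small residue per modulus."""
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    # addition happens channel-wise on small residues
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_mul(a, b):
    # multiplication also stays channel-wise, so no wide multiplier is needed
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    """Reverse conversion via the Chinese Remainder Theorem."""
    M = prod(MODULI)
    total = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        total += ri * Mi * pow(Mi, -1, mi)  # modular inverse (Python 3.8+)
    return total % M
```

Results are exact as long as values stay within the dynamic range M; moduli of the form 2^k-1/2^k/2^k+1 are popular in hardware because their modular reductions reduce to shifts and adds.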
- 32. B Khaleghi, M Imani, TS Rosing, "Prive-HD: Privacy-Preserved Hyperdimensional Computing", SRC TECHCON Conference, 2020.
- 31. B Khaleghi, S Salamat, A Thomas, F Asgarinejad, Y Kim, TS Rosing, "SHEARer: Highly-Efficient Hyperdimensional Computing by Software-Hardware Enabled Multifold AppRoximation", ISLPED, 2020.
  Abstract: Hyperdimensional computing (HD) is an emerging paradigm for machine learning based on evidence that the brain computes on high-dimensional, distributed representations of data. The main operation of HD is encoding, which transfers the input data to hyperspace by mapping each input feature to a hypervector, accompanied by a so-called bundling procedure that simply adds up the hypervectors to produce the encoding hypervector. Although the operations of HD are highly parallelizable, their sheer number hampers the efficiency of HD in the embedded domain. In this paper, we propose SHEARer, an algorithm-hardware co-optimization to improve the performance and energy consumption of HD computing. We gain insight from a prudent scheme of approximating the hypervectors that, thanks to the inherent error resiliency of HD, has minimal impact on accuracy while offering great potential for hardware optimization. In contrast to previous works that generate the encoding hypervectors in full precision and quantize them afterward, we compute the encoding hypervectors in an approximate manner that saves a significant amount of resources yet affords high accuracy. We also propose a novel FPGA implementation that achieves striking performance through massive parallelism with low power consumption. Moreover, we develop a software framework that enables training HD models by emulating the proposed approximate encodings. The FPGA implementation of SHEARer achieves an average throughput boost of 104,904x (15.7x) and energy savings of up to 56,044x (301x) compared to state-of-the-art encoding methods implemented on a Raspberry Pi 3 (GeForce GTX 1080 Ti) using practical machine learning datasets.
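The encoding step described in the abstract above — map each feature to a hypervector, then bundle by addition — can be sketched as follows. This shows the exact (full-precision) baseline encoding that SHEARer approximates in hardware; the dimensionality, record-based codebook scheme, and all names are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

D = 10_000        # hypervector dimensionality (typical order of magnitude)
NUM_LEVELS = 16   # quantization levels for feature values
rng = np.random.default_rng(0)

def encode(features, position_hvs, level_hvs):
    """Record-based HD encoding: bind each feature's position
    hypervector with the hypervector of its quantized value
    (elementwise multiply for bipolar vectors), then bundle all
    bound vectors by addition into one encoding hypervector."""
    levels = np.clip((features * NUM_LEVELS).astype(int), 0, NUM_LEVELS - 1)
    bound = position_hvs * level_hvs[levels]  # one bound row per feature
    return bound.sum(axis=0)                  # bundling

n_features = 64
# random bipolar codebooks: who-is-where (position) and what-value (level)
position_hvs = rng.choice((-1, 1), size=(n_features, D)).astype(np.int8)
level_hvs = rng.choice((-1, 1), size=(NUM_LEVELS, D)).astype(np.int8)

x = rng.random(n_features)   # a sample, features assumed normalized to [0, 1)
hv = encode(x, position_hvs, level_hvs)
```

Classification then reduces to comparing `hv` against per-class hypervectors with a similarity metric such as cosine or dot product.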
- 30. B Khaleghi, M Imani, TS Rosing, "Prive-HD: Privacy-Preserved Hyperdimensional Computing", DAC, 2020.
  Abstract: The privacy of data is a major challenge in machine learning, as a trained model may expose sensitive information of the enclosed dataset. In addition, the limited computation capability and capacity of edge devices have made cloud-hosted inference inevitable. Sending private information to remote servers makes the privacy of inference vulnerable as well, because of susceptible communication channels or even untrustworthy hosts. In this paper, we target privacy-preserving training and inference of brain-inspired Hyperdimensional (HD) computing, a new learning algorithm that is gaining traction due to its lightweight computation and robustness, which are particularly appealing for edge devices with tight constraints. Indeed, despite its promising attributes, HD computing has virtually no privacy due to its reversible computation. We present an accuracy-privacy trade-off method through meticulous quantization and pruning of hypervectors, the building blocks of HD, to realize a differentially private model and to obfuscate the information sent for cloud-hosted inference. Finally, we show how the proposed techniques can also be leveraged for efficient hardware implementation.
- 29. B Khaleghi, S Salamat, M Imani, TS Rosing, "FPGA Energy Efficiency by Leveraging Thermal Margin", ICCD, 2019.
  Abstract: FPGA devices are continuously evolving to meet the high computation and performance demands of emerging applications. As a result, cutting-edge FPGAs are not as energy efficient as conventionally presumed, and aggressive power-saving techniques have therefore become imperative. The clock rate of an FPGA-mapped design is set based on worst-case conditions to ensure reliable operation under all circumstances. This usually leaves a considerable timing margin that can be exploited to reduce power consumption by scaling voltage without lowering the clock frequency. There are hurdles to such opportunistic voltage scaling in FPGAs because (a) critical paths change with designs, making timing evaluation difficult as voltage changes, (b) each FPGA resource has a particular power-delay trade-off with voltage, and (c) data corruption of configuration cells and memory blocks further hampers voltage scaling. In this paper, we propose a systematic approach to leverage the available thermal headroom of FPGA-mapped designs for power and energy improvement. By comprehensively analyzing the timing and power consumption of FPGA building blocks under varying temperatures and voltages, we propose a thermal-aware voltage scaling flow that effectively utilizes the thermal margin to reduce power consumption without degrading performance. We show the proposed flow can be employed for energy optimization as well, whereby power consumption and delay are traded off to accomplish tasks with minimum energy. Lastly, we propose a simulation framework to examine the efficiency of the proposed method for applications that are inherently tolerant to a certain amount of error, granting further power-saving opportunities. Experimental results over a set of industrial benchmarks indicate up to 36% power reduction with the same performance, and 66% total energy saving when energy is the optimization target.
- 28. S Salamat, B Khaleghi, M Imani, TS Rosing, "Workload-Aware Opportunistic Energy Efficiency in Multi-FPGA Platforms", ICCAD, 2019.
  Abstract: The continuous growth of big-data applications with high computational and scalability demands has made cloud computing increasingly popular. Optimizing the performance and power consumption of cloud resources is therefore crucial to reducing the costs of data centers. In recent years, multi-FPGA platforms have gained traction in data centers as low-cost yet high-performance solutions, particularly as acceleration engines, thanks to the high degree of parallelism they provide. Nonetheless, data center workloads vary in size during service time, leading to significant underutilization of computing resources that still consume a large amount of power, a key factor in data center inefficiency regardless of the underlying hardware structure. In this paper, we propose an efficient framework to throttle the power consumption of multi-FPGA platforms by dynamically scaling the voltage, and thereby the frequency, at runtime, predicting and adjusting to the workload level while maintaining the desired Quality of Service (QoS). This is in contrast to, and more efficient than, conventional approaches that merely scale (i.e., power-gate) the computing nodes or the frequency. The proposed framework carefully exploits a pre-characterized library of delay-voltage and power-voltage information of FPGA resources, which we show is indispensable for finding the efficient operating point due to the different sensitivity of resources to voltage scaling, particularly considering the multiple power rails residing in these devices. Our evaluations, implementing state-of-the-art deep neural network accelerators, revealed that the proposed framework provides an average power reduction of 4.0x, surpassing previous works by 33.6% (up to 83%).
- 27. S Gupta, M Imani, B Khaleghi, V Kumar, TS Rosing, "RAPID: A ReRAM Processing in Memory Architecture for DNA Sequence Alignment", ISLPED, 2019.
  Abstract: Sequence alignment is a core component of many biological applications. As advances in sequencing technologies produce tremendous amounts of data on an hourly basis, alignment is becoming the critical bottleneck in bioinformatics analysis. Even though large clusters and highly parallel processing nodes can carry out sequence alignment, beyond their exacerbated power consumption, they cannot concurrently process the massive amount of data generated by sequencing machines. In this paper, we propose RAPID, a novel processing-in-memory (PIM) architecture suited to DNA sequence alignment. We revise the state-of-the-art alignment algorithm to make it compatible with in-memory parallel computation, and process DNA data entirely inside memory without requiring additional processing units. The main advantage of RAPID over other alignment accelerators is a dramatic reduction in internal data movement while maintaining the remarkable degree of parallelism provided by PIM. The proposed architecture is also highly scalable, facilitating precise alignment of lengthy sequences. We evaluated the efficiency of the proposed architecture by aligning chromosome sequences from the human and chimpanzee genomes. The results show that RAPID is at least 2x faster and 7x more power efficient than BioSEAL, the best prior DNA sequence alignment accelerator.
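For reference, the dynamic-programming alignment that in-memory accelerators like RAPID parallelize can be sketched with the classic Smith-Waterman recurrence. This is the textbook baseline, not RAPID's revised in-memory formulation, and the scoring parameters are illustrative:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local-alignment score with linear gap penalty.
    Each cell depends only on its left, upper, and upper-left
    neighbors, which is why anti-diagonals (and, in PIM designs,
    whole rows) can be computed in parallel."""
    prev = [0] * (len(b) + 1)   # previous DP row
    best = 0
    for ca in a:
        cur = [0]               # first column is 0 (local alignment)
        for j, cb in enumerate(b, 1):
            score = max(
                0,                                              # restart alignment
                prev[j - 1] + (match if ca == cb else mismatch),  # diagonal
                prev[j] + gap,                                  # gap in b
                cur[j - 1] + gap,                               # gap in a
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```

The quadratic cell count of this recurrence, multiplied by sequencer output rates, is exactly the data-movement bottleneck the abstract describes.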
- 26. M Imani, S Salamat, B Khaleghi, M Samragh, F Koushanfar, TS Rosing, "SparseHD: Algorithm-Hardware Co-Optimization for Efficient High-Dimensional Computing", FCCM, 2019.
  Abstract: Hyperdimensional (HD) computing is gaining traction as a lightweight alternative machine learning approach for cognition tasks. Inspired by the neural activity patterns of the brain, HD computing performs cognition tasks by exploiting very long vectors, namely hypervectors, rather than working with scalar numbers as in conventional computing. Since a hypervector is represented by thousands of dimensions (elements), the majority of prior work assumes binary elements to simplify the computation and alleviate the processing cost. In this paper, we first demonstrate that the dimensions need more than one bit to provide acceptable accuracy and make HD computing applicable to real-world cognitive tasks. Increasing the bit-width, however, sacrifices energy efficiency and performance, even when using low-bit integers as the hypervector elements. To address this issue, we propose a framework for HD acceleration, dubbed SparseHD, that leverages sparsity to improve the efficiency of HD computing. Essentially, SparseHD takes account of the statistical properties of a trained HD model and drops the least effective elements of the model, augmented by iterative retraining to compensate for the quality loss introduced by sparsity. Thanks to the bit-level manipulability and abundant parallelism granted by FPGAs, we also propose a novel FPGA-based accelerator to effectively exploit sparsity in HD computation. We evaluate the efficiency of our framework on practical classification problems. We observe that SparseHD makes the HD model up to 90% sparse while incurring minimal quality loss (less than 1%) compared to the non-sparse baseline model. Our evaluation shows that, on average, SparseHD provides 48.5x lower energy consumption and 15.0x faster execution compared to an AMD R390 GPU implementation.
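The model-sparsification step can be illustrated by pruning the least effective dimensions of a trained class-hypervector matrix. This is a simplified reading: the importance metric below (per-dimension variance across classes) and the function name are illustrative stand-ins for the paper's own statistical criterion, and the iterative retraining that recovers accuracy is omitted:

```python
import numpy as np

def sparsify_model(class_hvs, sparsity=0.9):
    """Dimension-wise pruning of a trained HD model.

    class_hvs: (num_classes, D) matrix of trained class hypervectors.
    A dimension whose values barely differ across classes contributes
    little to the similarity comparison, so we zero out the `sparsity`
    fraction of dimensions with the lowest cross-class variance
    (an illustrative importance proxy, not SparseHD's exact metric).
    """
    D = class_hvs.shape[1]
    importance = class_hvs.var(axis=0)               # cross-class spread per dimension
    drop = np.argsort(importance)[: int(sparsity * D)]
    pruned = class_hvs.copy()
    pruned[:, drop] = 0                              # sparse model: skip these at inference
    return pruned
```

In hardware, the zeroed dimensions are simply never fetched or accumulated, which is where the energy and speed gains come from.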
- 25. J Sim, M Kim, Y Kim, S Gupta, B Khaleghi, TS Rosing, "Multi-bit Parallelized Sensing for Processing in Non-volatile Memory", NVMW, 2019.
- 24. J Sim, M Kim, Y Kim, S Gupta, B Khaleghi, TS Rosing, "MAPIM: Mat Parallelism for High Performance Processing in Non-volatile Memory Architecture", ISQED, 2019.
  Abstract: In the Internet of Things (IoT) era, data movement between processing units and memory is a critical factor in overall system performance. Processing-in-Memory (PIM) is a promising solution to this bandwidth bottleneck, performing a portion of the computation inside the memory. Many prior studies have enabled various PIM operations on non-volatile memory (NVM) by modifying sense amplifiers (SAs). They exploit a single sense amplifier to handle multiple bitlines through a multiplexer (MUX), since a single SA circuit occupies a much larger area than a 1-bit NVM cell. This limits the potential parallelism that PIM techniques can ideally achieve. In this paper, we propose MAPIM, a mat-parallel architecture for high-performance processing in non-volatile memory. Our design carries out multiple bit-line (BL) requests under a MUX in parallel with two novel design components: a multi-column/row latch (MCRL) and shared SA routing (SSR). The MCRL allows the address decoder to activate multiple addresses in both the column and row directions by buffering the consecutively requested addresses. The activated bits are simultaneously sensed by multiple SAs across a MUX based on the SSR technique. The experimental results show that MAPIM is up to 339x faster and 221x more energy efficient than a GPGPU. Compared to state-of-the-art PIM designs, our design is 16x faster and 1.8x more energy efficient, with insignificant area overhead.
- 23. S Salamat, M Imani, B Khaleghi, TS Rosing, "F5-HD: Fast Flexible FPGA-based Framework for Refreshing Hyperdimensional Computing", FPGA, 2019.
  Abstract: Hyperdimensional (HD) computing is a novel computational paradigm that emulates brain functionality in performing cognitive tasks. The underlying computation of HD involves a substantial number of element-wise operations (e.g., additions and multiplications) on ultra-wide hypervectors, at granularities as small as a single bit, which can be effectively parallelized and pipelined. In addition, though different HD applications might vary in the number of input features and output classes (labels), they generally follow the same computation flow. These characteristics of HD computing uniquely match the intrinsic capabilities of FPGAs, making these devices a natural solution for accelerating such applications. In this paper, we propose F5-HD, a fast and flexible FPGA-based framework for refreshing the performance of HD computing. F5-HD eliminates the arduous task of hand-crafting hardware accelerators by automatically generating an FPGA implementation of an HD accelerator from a template of optimized processing elements, according to the application's specification and the user's constraints. Our evaluations using different classification benchmarks revealed that F5-HD provides 6.9x higher energy efficiency and 7.8x faster training (11.9x and 1.7x, respectively, for inference) compared to an optimized implementation of HD on an AMD R9 390 GPU.
- 22. B Khaleghi, TS Rosing, "Thermal-Aware Design and Flow for FPGA Performance Improvement", DATE, 2019.
  Abstract: To ensure reliable operation of circuits under elevated temperatures, designers are obliged to apply a pessimistic timing margin proportional to the worst-case temperature Tworst, which incurs significant performance overhead. The problem is exacerbated in deep-CMOS technologies with increased leakage power, particularly in Field-Programmable Gate Arrays (FPGAs), which comprise an abundance of leaky resources. We propose a two-fold approach to tackle the problem in FPGAs. To this end, we first obtain the performance and power characteristics of FPGA resources over a temperature range. Having the temperature-performance correlation of resources, together with the estimated thermal distribution of applications, makes it feasible to apply a minimal, yet sufficient, timing margin. Second, we show how optimizing an FPGA device for a specific thermal corner affects its performance across the operating temperature range. This emphasizes the need to optimize the device for the target (range of) temperature. Building on this observation, we propose thermal-aware optimization of the FPGA architecture for foreknown field conditions. We performed a comprehensive set of experiments to implement and examine the proposed techniques. The experimental results reveal that thermal-aware timing on FPGAs yields up to 36.5% performance improvement. Optimizing the architecture further boosts performance by 6.7%.
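The first half of the idea — sizing the timing margin from the temperature range the design will actually see, rather than the device-wide worst case — can be sketched with a characterized delay-vs-temperature curve. All numbers and names below are hypothetical, for illustration only:

```python
import bisect

# Hypothetical characterization of a mapped design's critical-path
# delay (ns) at a few temperature points (C), monotonically increasing.
temps = [0, 25, 50, 75, 100, 125]
delays = [9.1, 9.4, 9.8, 10.3, 10.9, 11.6]

def delay_at(t):
    """Linearly interpolate the characterized delay-vs-temperature curve."""
    i = bisect.bisect_left(temps, t)
    if i == 0:
        return delays[0]
    if i == len(temps):
        return delays[-1]
    t0, t1 = temps[i - 1], temps[i]
    d0, d1 = delays[i - 1], delays[i]
    return d0 + (d1 - d0) * (t - t0) / (t1 - t0)

def clock_period(t_field_max, guard=0.03):
    # Margin sized for the hottest temperature expected in the field,
    # not the device's absolute worst-case corner (e.g., 125 C).
    return delay_at(t_field_max) * (1 + guard)
```

A deployment known to stay below, say, 60 C can then run at a shorter clock period than one margined for the 125 C corner, which is the performance the paper recovers.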
- 21. B Khaleghi, B Omidi, H Amrouch, J Henkel, H Asadi, "Estimating and Mitigating Aging Effects in Routing Network of FPGAs", IEEE TVLSI, 2018.
  Abstract: In this paper, we present a comprehensive analysis of the impact of aging on the interconnection network of Field-Programmable Gate Arrays (FPGAs) and propose novel approaches to mitigate aging effects on the routing network. We first show the insignificant impact of aging on the data integrity of FPGAs, i.e., the Static Noise Margin (SNM) and Soft Error Rate (SER) of the configuration cells, as well as the negligible impact of these degradations on FPGA performance. We therefore focus on the performance degradation of datapath transistors. In this regard, we propose a routing algorithm, accompanied by a placement algorithm, that prevents constant stress on transistors by evenly distributing the stress across the interconnection resources. Observing the impact of signal probability on the aging of routing buffers, we enhance the synthesis flow and augment the proposed routing algorithm to converge the signal probabilities toward aging-friendly values. Experimental results over a set of industrial benchmarks and a commercial-like FPGA architecture indicate the effectiveness of the proposed method, with a 64.3% reduction in stress duration in multiplexers and up to a 45.2% reduction in buffer degradation. Altogether, the proposed method reduces the timing guardband by 14.1% to 31.7%, depending on the FPGA routing architecture.
- 20. Z Seifoori, Z Ebrahimi, B Khaleghi, H Asadi, "Dark Silicon and Future On-chip Systems: Emerging SRAM-based FPGA Architectures in Dark Silicon Era", Book Chapter in Advances in Computers, 2018.
  Abstract: The increased leakage power of deep-nano technologies on the one hand, and the exponential growth in the number of transistors per die, particularly in Field-Programmable Gate Arrays (FPGAs), on the other, have intensified static power dissipation and power density. This ever-increasing static power consumption acts as a power wall to further transistor integration and has caused the breakdown of Dennard scaling. To meet the available power budget and preclude the reliability challenges associated with high power density, designers are obligated to restrict the active fraction of the chip by powering off a selected fraction of the silicon die, referred to as Dark Silicon. Several promising architectures have been proposed to enhance static power and energy efficiency in FPGAs. The main approach in the majority of the suggested architectures is applying power gating to unused logic and routing resources and/or designing power-efficient logic and routing elements, such as Reconfigurable Hard Logics as an alternative to conventional Look-up Tables. This study presents a survey of the evolution of SRAM-based FPGA architectures in the dark silicon era.
-
Abstract
The maximum delay of a processor is subject to both temperature (T) and voltage (V) jointly. Therefore, there is the prospect of reducing V at runtime whenever T is below the worst-case temperature without violating the predetermined timing constraints, instead of employing throughout a conservative V that corresponds to Tworst. Hence, the efficiency can be maximized, while still accounting for temperature variation at runtime, without sacrificing reliability. In this work, we focus on the key challenge of how the small, yet sufficient voltage can be accurately, at design time, obtained in which timing constraints at runtime are assuredly fulfilled. To achieve that, we model the delay of processor under the joint effects of T and V through creating cell libraries that contain the detailed delay information of cells at a wide variety of T and V cases. Libraries are compatible with commercial EDA tools. Hence, standard tool flows, even though they were not designed for that purpose, can be employed to seamlessly obtain the correlation between T and V in any circuit regardless its complexity. After modeling of the correlation between T and V from the physical level to circuit level, we implement at the system level a temperature-guided voltage adaptation technique that tunes the voltage at runtime following temperature variations leading to a considerable reduction in power.
- 19. H Amrouch, B Khaleghi, J Henkel, "Voltage Adaptation under Temperature Variation", SMACD, 2018.
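To make the idea of temperature-guided voltage adaptation concrete, here is a minimal runtime sketch. The table values and function name are hypothetical illustrations, not taken from the paper; in practice, the design-time characterization described in the abstract would populate such a table.

```python
# Hypothetical design-time table: each entry maps a temperature upper bound
# (in C) to the smallest Vdd (in V) that still meets timing up to that
# temperature. The numbers below are made up for illustration.
SAFE_VDD = [
    (40, 0.90),
    (60, 0.95),
    (80, 1.00),
    (125, 1.05),  # worst-case temperature -> conservative voltage
]

def select_vdd(temp_c):
    """Pick the smallest characterized Vdd whose temperature bound covers temp_c."""
    for bound, vdd in SAFE_VDD:
        if temp_c <= bound:
            return vdd
    raise ValueError("temperature outside characterized range")

print(select_vdd(35))   # cool chip: run at the reduced voltage -> 0.9
print(select_vdd(110))  # hot chip: fall back toward the worst-case voltage -> 1.05
```

The point of the technique is that the conservative last entry is used only when the chip is actually near Tworst; at typical temperatures a lower, still-safe voltage is applied.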
-
Abstract
Generative Adversarial Networks (GANs) are a frontier in deep learning. GANs consist of two models: generative and discriminative. While the discriminative model uses the conventional convolution, the generative model depends on a fundamentally different operator, called transposed convolution. This operator first inserts a large number of zeros in its input and then slides a window over this expanded input. The zero-insertion step leads to a large number of ineffectual operations and creates distinct patterns of computation across the sliding windows. The ineffectual operations, along with the variation in computation patterns, lead to significant resource underutilization when using conventional convolution hardware. To alleviate these sources of inefficiency, this paper devises FlexiGAN, an end-to-end solution that generates an optimized synthesizable FPGA accelerator from a high-level GAN specification. FlexiGAN is coupled with a novel template architecture that aims to harness the benefits of both MIMD and SIMD execution models to avoid ineffectual operations. To this end, the proposed architecture separates data retrieval and data processing units at the finest granularity of each compute engine. Leveraging this separation enables the architecture to use a succinct set of operations to cope with the irregularities of transposed convolution. At the same time, it significantly reduces on-chip memory usage, which is generally limited in FPGAs. We evaluate our end-to-end solution by generating FPGA accelerators for a variety of GANs. These generated accelerators provide 2.4× higher performance than an optimized conventional convolution design. In addition, FlexiGAN, on average, yields 2.8× (up to 3.7×) improvements in Performance-per-Watt over a Titan X GPU.
- 18. A Yazdanbakhsh, M Brzozowski, B Khaleghi, S Ghodrati, K Samadi, N S Kim, H Esmaeilzadeh, "FlexiGAN: An End-to-End Solution for FPGA Acceleration of Generative Adversarial Networks", FCCM, 2018.
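The zero-insertion view of transposed convolution described in this abstract can be illustrated in a few lines. This is a 1-D sketch with made-up helper names, not FlexiGAN's actual interface; it shows where the ineffectual multiplications and the distinct per-window computation patterns come from.

```python
# 1-D sketch of transposed convolution via zero insertion (illustrative only).

def zero_insert(x, stride):
    """Insert stride-1 zeros between consecutive input elements."""
    out = []
    for i, v in enumerate(x):
        out.append(v)
        if i != len(x) - 1:
            out.extend([0] * (stride - 1))
    return out

def conv1d_valid(x, k):
    """Plain 'valid' sliding-window correlation."""
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]

x = [1, 2, 3, 4]
k = [1, 1, 1]
expanded = zero_insert(x, stride=2)   # [1, 0, 2, 0, 3, 0, 4]
y = conv1d_valid(expanded, k)         # the transposed-convolution output

# Multiplications whose operand is an inserted zero are the "ineffectual
# operations" the abstract refers to.
ineffectual = sum(
    1 for i in range(len(expanded) - len(k) + 1)
    for j in range(len(k)) if expanded[i + j] == 0
)
print(y)
print(ineffectual)
```

Note that consecutive windows here alternate between one and two zero operands; this per-window variation is the kind of irregular computation pattern that underutilizes conventional convolution hardware.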
-
Abstract
Novel algorithmic advances have paved the way for robotics to transform the dynamics of many social and enterprise applications. To achieve true autonomy, robots need to continuously process and interact with their environment through computationally-intensive motion planning and control algorithms under a low power budget. Specialized architectures offer a potent choice to provide low-power, high-performance accelerators for these algorithms. Instead of taking a traditional route which profiles and maps hot code regions to accelerators, this paper delves into the algorithmic characteristics of the application domain. We observe that many motion planning and control algorithms are formulated as constrained optimization problems solved online through Model Predictive Control (MPC). While models and objective functions differ between robotic systems and tasks, the structure of the optimization problem and solver remains fixed. Using this theoretical insight, we create RoboX, an end-to-end solution which exposes a high-level domain-specific language to roboticists. This interface allows roboticists to express the physics of the robot and its task in a form close to its concise mathematical expressions. The RoboX backend then automatically maps this high-level specification to a novel programmable architecture, which harbors a programmable memory access engine and compute-enabled interconnects. Hops in the interconnect are augmented with simple functional units that either operate on in-flight data or are bypassed according to a micro-program. Evaluations with six different robotic systems and tasks show that RoboX provides a 29.4× (7.3×) speedup and 22.1× (79.4×) performance-per-watt improvement over an ARM Cortex A57 (Intel Xeon E3). Compared to GPUs, RoboX attains 7.8×, 65.5×, and 71.8× higher Performance-per-Watt than the Tegra X2, GTX 650 Ti, and Tesla K40, respectively, with a power envelope of only 3.4 Watts at 45 nm.
- 17. J Sacks, D Mahajan, R Lawson, B Khaleghi, H Esmaeilzadeh, "RoboX: An End-to-End Solution to Accelerate Autonomous Control in Robotics", ISCA, 2018.
-
Abstract
A programmable logic unit (PLU). The PLU includes a plurality of four-input reconfigurable hard logics (RHLs), a three-input look-up-table (LUT), and a plurality of reconfigurable inverters. The plurality of RHLs include a first RHL, a second RHL, and a third RHL. The plurality of reconfigurable inverters are associated with the plurality of RHLs.
- 16. H Asadi, Z Ebrahimi, B Khaleghi, "Programmable Logic Design", US Patent, 2018.
-
Abstract
Nowadays, embedded processors are widely used in a wide range of domains, from low-power to safety-critical applications. By providing prominent features such as support for various peripherals and flexibility for partial or major design modifications, Field-Programmable Gate Arrays (FPGAs) are commonly used to implement either an entire embedded system or a Hardware Description Language (HDL)-based processor, known as a soft-core processor. FPGA-based designs, however, suffer from high power consumption, large die area, and low performance, which hinders the common use of soft-core processors in low-power embedded systems. In this paper, we present an efficient reconfigurable architecture to implement soft-core embedded processors in SRAM-based FPGAs by exploiting characteristics such as the low utilization and fragmented accessibility of their constituent units. To this end, we integrate the low-utilized functional units into efficiently designed Look-Up Table (LUT)-based Reconfigurable Units (RUs). To further improve the efficiency of the proposed architecture, we use a set of efficient Configurable Hard Logics (CHLs) that implement frequent Boolean functions, while the remaining functions are still implemented by LUTs. We have evaluated the effectiveness of the proposed architecture by implementing the Berkeley RISC-V processor and running MiBench benchmarks. We have also examined the applicability of the proposed architecture to an alternative open-source processor (i.e., LEON2) and a Digital Signal Processing (DSP) core. Experimental results show that, compared to conventional LUT-based soft-core processors, the proposed architecture improves area footprint, static power, energy consumption, and total execution time by 30.7%, 32.5%, 36.9%, and 6.3%, respectively.
- 15. S Tamimi, Z Ebrahimi, B Khaleghi, H Asadi, "An Efficient SRAM-based Reconfigurable Architecture for Embedded Processors", IEEE TCAD, 2018.
-
Abstract
Despite the considerable effort that has been put into the application of Non-Volatile Memories (NVMs) in Field-Programmable Gate Arrays (FPGAs), previously suggested designs are not mature enough to substitute the state-of-the-art SRAM-based counterparts, mainly due to inefficient building blocks and/or the overhead of the programming structure, which can impair their potential benefits. In this paper, we present a Resistive Random Access Memory (RRAM)-based FPGA architecture employing efficient Switch Box (SB) and Look-Up Table (LUT) designs with programming circuitry integrated into both, which yields area- and power-efficient programmable components while precluding performance overhead in these blocks. In addition, we present an efficient scheme to load the configuration bitstream into the memory elements, which makes the configuration time comparable to that of SRAM-based FPGAs. We also investigate the correct functionality and reliability of the programming structure subject to fluctuations in the attributes of RRAM cells. Evaluations using the Versatile Place and Route (VTR) tool with the obtained characteristics of the proposed blocks demonstrate that the average area and delay of the proposed FPGA architecture are 59.4% and 20.1% less than those of conventional SRAM-based FPGAs. Compared with a recent RRAM-based architecture, the proposed architecture improves area and power by 49.7% and 33.8% while keeping the delay intact.
- 14. B Khaleghi, H Asadi, "A Resistive RAM-Based FPGA Architecture Equipped with Efficient Programming Circuitry", IEEE TCAS-I, 2018.
-
Abstract
In recent technology nodes, wide guardbands are needed to overcome reliability degradations due to aging. Such guardbands manifest as reduced efficiency and performance. Existing approaches to reduce guardbands trade off aging impact for increased circuit overhead. By contrast, the goal of this work is to completely remove guardbands by exploring, for the first time, the application of approximate computing principles in the context of aging. As a result of naively narrowing or removing guardbands, timing errors start to appear as transistors age. We demonstrate that even in circuits that may tolerate errors, aging can be catastrophic due to unacceptable quality loss. Furthermore, quantifying such aging-induced quality loss necessitates expensive (often infeasible) gate-level simulations of the complete design. We show how nondeterministic aging-induced timing errors can be converted into deterministic and controlled approximations instead. We first translate the required guardband over time into an equivalent reduction in precision for individual RTL components. We then demonstrate how, based on pre-characterization of RTL components, we can quantify aging-induced approximation at the whole-microarchitecture level without the need for further gate-level simulations. Results show that a 3-bit reduction in precision is sufficient to sustain 10 years of operation under worst-case aging in the context of an image processing circuit. This corresponds to an acceptable PSNR reduction of merely 8 dB, while at the same time increasing area and energy efficiency by 13%.
- 13. H Amrouch, B Khaleghi, A Gerstlauer, J Henkel, "Towards Aging-induced Approximations", DAC, 2017. (Best Paper Nomination)
-
Abstract
Continuous downscaling of CMOS technology in recent years has resulted in an exponential increase in static power consumption, which acts as a power wall for further transistor integration. One promising approach to throttle the substantial static power of Field-Programmable Gate Arrays (FPGAs) is to power off unused routing resources such as switch boxes, known as dark silicon. In this paper, we present a Power gating Switch Box Architecture (PESA) for the routing network of SRAM-based FPGAs to overcome this obstacle to further device integration. In the proposed architecture, by exploring various patterns of used multiplexers in switch boxes, we employ a configurable controller to turn off unused resources in the routing network. Our study shows that, due to the significant percentage of unused switches in the routing network, PESA is able to considerably improve power efficiency in SRAM-based FPGAs. Experimental results carried out on different benchmarks using the VPR toolset show that PESA decreases the power consumption of the routing network by up to 75% compared to conventional architectures while keeping performance intact.
- 12. Z Seifoori, B Khaleghi, H Asadi, "A Power Gating Switch Box Architecture in Routing Network of SRAM-Based FPGAs in Dark Silicon Era", DATE, 2017.
-
Abstract
We introduce the first temperature guardband optimization based on thermal-aware logic synthesis and thermal-aware timing analysis. The optimized guardbands are obtained solely by using our so-called thermal-aware cell libraries together with existing tool flows, and not by sacrificing timing constraints (i.e., no trade-offs). We demonstrate that temperature guardbands can be optimized at design time through thermal-aware logic synthesis, which yields circuits that are more resilient against worst-case temperatures. Our static guardband optimization leads to 18% smaller guardbands on average. We also demonstrate that thermal-aware timing analysis enables designers to accurately estimate the required guardbands for a wide range of temperatures without over- or under-estimation. Therefore, temperature guardbands can be optimized at operation time by employing the small yet sufficient guardband that corresponds to the current temperature, rather than employing throughout a conservative guardband that corresponds to the worst-case temperature. Our adaptive guardband optimization results, on average, in 22% higher performance along with 9.2% less energy. Neither thermal-aware logic synthesis nor thermal-aware timing analysis would be possible without our thermal-aware cell libraries. They are compatible with existing commercial tools. Hence, they allow designers, for the first time, to automatically consider thermal concerns within their design tool flows, even if those flows were not designed for that purpose.
- 11. H Amrouch, B Khaleghi, J Henkel, "Optimizing Temperature Guardbands", DATE, 2017. (Best Paper Nomination)
-
Abstract
The significant increase of static power in the nano-CMOS era and, subsequently, the end of Dennard scaling have put a power wall in the way of further integration of CMOS technology in Field-Programmable Gate Arrays (FPGAs). An efficient solution to cope with this obstacle is power gating inactive fractions of a single die, resulting in Dark Silicon. Previous studies employing power gating on SRAM-based FPGAs have primarily focused on using large-input Look-up Tables (LUTs). The architectures proposed in such studies inherently suffer from poor logic utilization, which limits the benefits of power gating techniques. This paper proposes a Power-Efficient Architecture for FPGAs (PEAF) based on a combination of Reconfigurable Hard Logics (RHLs) and a small-input LUT. In the proposed architecture, we selectively turn off unused RHLs and/or LUTs within each logic block by employing a reconfigurable controller. By mapping the majority of logic functions to simple-design RHLs, PEAF is able to significantly improve power efficiency without deteriorating performance. Experimental results over a comprehensive set of benchmarks (MCNC, IWLS'05, and VTR) demonstrate that, compared with the baseline four-LUT architecture, PEAF reduces the total static power and Power-Delay-Product (PDP), on average, by 24.5% and 21.7%, respectively, while the overall system performance is also improved by 1.8%. PEAF increases total area by 18.9%; however, it still occupies 22.1% less area footprint than the six-LUT architecture, with a 31.5% improvement in PDP.
- 10. Z Ebrahimi, B Khaleghi, H Asadi, "PEAF: A Power-Efficient Architecture for SRAM-Based FPGAs Using Reconfigurable Hard Logic Design in Dark Silicon Era", IEEE TC, 2016.
-
Abstract
Nanoelectromechanical (NEM) relays are a promising emerging technology that has gained widespread research attention due to its zero leakage current, sharp ON-OFF transitions, and complementary metal-oxide-semiconductor compatibility. As a result, NEM relays have been significantly investigated as highly energy-efficient design solutions. A major shortcoming of NEMs preventing their widespread use is their limited switching endurance. Hence, in order to utilize the low-power advantages of NEM relays, further device, circuit, and architectural techniques are required. In this paper, we introduce the concept of shadow NEM relays, which is a circuit-level technique to leverage the energy efficiency of the NEM relays despite their low switching endurance. This technique creates two virtual ground nodes in a block to allow: 1) a low power mode with functional NEM relays and 2) a normal mode with failed NEM relays. To demonstrate the applicability of this concept, we have applied it to a six-transistor SRAM cell as an illustrative example. We also investigate the applicability of this SRAM cell in field-programmable gate arrays and on-chip caches. Experimental results reveal that shadow NEM relays can reduce the power consumption of SRAM cells by up to 80% while addressing the limited switching endurance of NEM relays.
- 9. S Yazdanshenas, B Khaleghi, P Ienne, H Asadi, "Designing Low Power and Durable Digital Blocks Using Shadow Nano-Electromechanical Relays", IEEE TVLSI, 2016.
-
Abstract
With advances in technology and the shrinking of transistor size down to the nanoscale, static power may become the dominant power component in Networks-on-Chip (NoCs). Power gating is an efficient technique to reduce the static power of under-utilized resources in different types of circuits. In NoCs, routers are promising candidates for power gating, since they exhibit high idle time. However, routers in a NoC are not usually idle for long consecutive cycles, even at low network utilization, due to the distribution of resources in the NoC and its communication-based nature. Therefore, power gating loses its efficiency due to the performance and power overhead of packets that encounter powered-off routers. In this paper, we propose Turn-on on Turn (TooT), which reduces the number of wake-ups by leveraging the characteristics of deterministic routing algorithms and the mesh topology. In the proposed method, we avoid powering a router on when it forwards a straight packet or ejects a packet; i.e., a router is powered on only when either a packet turns through it or its associated node injects a packet. Experimental results on PARSEC benchmarks demonstrate that, compared with conventional power gating, the proposed method improves static power and performance by 57.9% and 35.3%, respectively, at the cost of a negligible area overhead.
- 8. H Farrokhbakht, M Taram, B Khaleghi, S Hessabi, "TooT: An Efficient and Scalable Power-Gating Method for NoC Routers", NOCS, 2016.
-
Abstract
Continuous shrinking of transistor size to provide high computation capability along with low power consumption has been accompanied by reliability degradations due to, e.g., the aging phenomenon. In this regard, with their huge number of configuration bits, Field-Programmable Gate Arrays (FPGAs) are more susceptible to aging, since aging not only degrades performance but may additionally corrupt the configuration cells and thus cause permanent circuit malfunction. While several works have investigated aging effects in Look-Up Tables (LUTs), the routing fabric of these devices is seldom studied, even though it contributes the majority of FPGA resources and configuration bits, and there is a high prospect that errors in its state will propagate to the device outputs. In this paper, we first investigate aging effects in the routing fabric of FPGAs with respect to performance and reliability degradations. Based on this investigation, we enhance the conventional routing algorithm to mitigate the impact of aging by increasing the recovery time (i.e., the mechanism used to heal aging-induced defects) of transistors used in the routing resources. We evaluate our proposed method in terms of the reduction in stress time and in the guardband required to protect against aging in the routing fabric, as well as the improvement in the FPGA's lifetime. Our experiments show that the proposed method reduces the average stress time and aging-induced delay of routing resources by 41% and 18.3%, respectively. This, in turn, improves the device lifetime by 130% compared to baseline routing. The proposed method can be applied by simply amending conventional routing algorithms and thus incurs negligible delay overhead.
- 7. B Khaleghi, B Omidi, H Amrouch, J Henkel, H Asadi, "Stress-Aware Routing to Mitigate Aging Effects in SRAM-based FPGAs", FPL, 2016.
-
Abstract
Due to aging, circuit reliability has become extraordinarily challenging. Reliability-aware circuit design flows virtually do not exist, and even research is in its infancy. In this paper, we propose to bring aging awareness to EDA tool flows based on so-called degradation-aware cell libraries. These libraries include detailed delay information of gates/cells under the impact that aging has on both the threshold voltage (Vth) and carrier mobility (μ) of transistors. This is unlike the state of the art, which considers Vth only. We show how ignoring μ degradation leads to underestimating guardbands by 19% on average. Our investigation revealed that the impact of aging is strongly dependent on the operating conditions of gates (i.e., input signal slew and output load capacitance), and not solely on the duty cycle of transistors. Neglecting this fact results in employing insufficient guardbands and thus not sustaining reliability during the lifetime. We demonstrate that degradation-aware libraries and tool flows are indispensable not only for accurately estimating guardbands, but also for efficiently containing them. By considering aging degradations during logic synthesis, significantly more resilient circuits can be obtained. We further quantify the impact of aging on the degradation of image processing circuits. This goes far beyond investigating aging with respect to path delays alone. We show that in a standard design without any guardbanding, aging leads to unacceptable image quality after just one year. By contrast, if the synthesis tool is provided with the degradation-aware cell library, high image quality is sustained for 10 years (even under worst-case aging and without a guardband). Hence, using our approach, aging can be effectively suppressed.
- 6. H Amrouch, B Khaleghi, A Gerstlauer, J Henkel, "Reliability-Aware Design to Suppress Aging Effects", DAC, 2016.
-
Abstract
The generous flexibility of Look-Up Tables (LUTs) in implementing arbitrary functions comes with significant performance and area overheads compared with their Application-Specific Integrated Circuit (ASIC) equivalents. One approach to alleviate such overheads is to use less flexible logic elements capable of implementing the majority of logic functions. In this paper, we first investigate the most frequently used functions in standard benchmarks and then design a set of less flexible but area-efficient logic cells, called Hard Logics (HL). Since higher-input functions have diverse classes, we leverage Shannon decomposition to break them into smaller ones, either to reduce the HL design space complexity or to attain asymmetric low-input functions. A heterogeneous LUT-HL architecture and a mapping scheme are also proposed to attain maximum logic resource usage. Experimental results on MCNC benchmarks demonstrate that the proposed architecture reduces the area-delay product by 13% and 36% as compared to LUT4- and LUT6-based FPGAs, respectively. Considering the same area budget, our proposed architecture improves performance by 17% and 2% as compared to LUT4- and LUT6-based FPGAs.
- 5. I Ahmadpour, B Khaleghi, H Asadi, "An Efficient Reconfigurable Architecture by Characterizing Most Frequent Logic Functions", FPL, 2015.
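Shannon decomposition, which the abstract above uses to break higher-input functions into smaller ones, can be sketched in a few lines. The helper names are hypothetical; the underlying identity is f = (¬x · f|x=0) + (x · f|x=1), where the two cofactors are functions of one fewer variable.

```python
# Minimal sketch of Shannon decomposition on Boolean functions
# (hypothetical helper names, for illustration only).

def shannon_cofactors(f, var=0):
    """Return the two cofactors of f with respect to the given variable index."""
    def cof(value):
        def g(*args):  # takes one fewer argument than f
            full = list(args[:var]) + [value] + list(args[var:])
            return f(*full)
        return g
    return cof(0), cof(1)

# Example: 3-input majority, decomposed on its first input.
maj = lambda a, b, c: (a & b) | (a & c) | (b & c)
f0, f1 = shannon_cofactors(maj, var=0)
# f0(b, c) = b & c ;  f1(b, c) = b | c  -> two smaller 2-input functions

# Verify f(a, b, c) == (~a & f0) | (a & f1) over all inputs.
for b in (0, 1):
    for c in (0, 1):
        assert maj(0, b, c) == f0(b, c)
        assert maj(1, b, c) == f1(b, c)
print("decomposition verified")
```

In the architecture above, such cofactors are what allow a wide function to be covered by two smaller Hard Logic cells (or a Hard Logic cell plus a small LUT) instead of one large LUT.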
-
Abstract
With the continual scaling of feature size, system failure due to soft errors is getting more frequent in CMOS technology. Soft errors have particularly severe effects in static random-access memory (SRAM)-based reconfigurable devices (SRDs), since an error in SRD configuration bits can permanently change the functionality of the system. Since interconnect resources are the dominant contributor to the overall configuration memory upsets in SRD-based designs, the system failure rate can be significantly reduced by mitigating soft errors in the routing fabric. This paper first presents a comprehensive analysis of SRD switch box susceptibility to short and open faults. Based on this analysis, we present a dependable routing fabric that efficiently employs asymmetric SRAM cells in the configuration memory of SRDs. The proposed scheme is highly scalable and capable of achieving any desired level of dependability. We also present a fault masking mechanism to mitigate the effect of soft errors in the routing circuitry, along with a routing algorithm that takes advantage of the proposed routing fabric. Experimental results over the Microelectronics Center of North Carolina benchmarks show that the proposed scheme can mitigate both single and multiple event upsets in the routing fabric and can reduce the system failure rate by orders of magnitude as compared with conventional protection techniques.
- 4. S Yazdanshenas, H Asadi, B Khaleghi, "A Scalable Dependability Scheme for Routing Fabric of SRAM-based Reconfigurable Devices", IEEE TVLSI, 2015.
-
Abstract
Hardware trojan horses (HTHs) have recently emerged as a major security threat for field-programmable gate arrays (FPGAs). Previous studies to protect FPGAs against HTHs may still leave a considerable amount of logic resources that can be misused by malicious attacks. This letter presents a low-level HTH protection scheme for FPGAs that fills the unused resources with the proposed dummy logic. In the proposed scheme, we identify the unused resources at the device layout level and offer dummy logic cells for different resources. The proposed HTH protection scheme has been applied to Xilinx Virtex devices implementing a set of IWLS benchmarks. The results show that by employing the proposed HTH protection scheme, the chance of logic abuse can be significantly reduced. Experimental results also show that, compared to non-protected designs, the proposed scheme imposes no performance or power penalties.
- 3. B Khaleghi, A Ahari, H Asadi, S Bayat-Sarmadi, "FPGA-based Protection Scheme Against Hardware Trojan Horse Insertion Using Dummy Logic", IEEE ESL, 2015.
-
Abstract
While transistor density continues to grow exponentially in Field-Programmable Gate Arrays (FPGAs), the increased leakage current of CMOS transistors acts as a power wall for the aggressive integration of transistors in a single die. One recent trend to alleviate the power wall in FPGAs is to turn off inactive regions of the silicon die, referred to as dark silicon. This paper presents a reconfigurable architecture to enable effective fine-grained power gating of unused Logic Blocks (LBs) in FPGAs. In the proposed architecture, the traditional soft logic is replaced with Mega Cells (MCs), each consisting of a set of complementary Generic Reconfigurable Hard Logic (GRHL) cells and a conventional Look-Up Table (LUT). Both the GRHL cells and the LUTs can be power gated and turned off by controlling configuration bits. In the proposed MC, only one cell is active and the others are turned off. Experimental results on the MCNC benchmark suite reveal that the proposed architecture reduces the critical path delay, power, and Power Delay Product (PDP) of LBs by up to 5.3%, 30.4%, and 28.8%, respectively, as compared to the equivalent LUT-based architecture.
- 2. A Ahari, H Asadi, B Khaleghi, Z Ebrahimi, M B Tahoori, "Towards Dark Silicon Era in FPGAs Using Complementary Hard Logic Design", FPL, 2014.
-
Abstract
The promising advantages offered by resistive Non-Volatile Memories (NVMs) have drawn great attention to replacing existing volatile memory technologies. While NVMs have primarily been studied for use in the memory hierarchy, they can also provide benefits in Field-Programmable Gate Arrays (FPGAs). One major limitation of employing NVMs in FPGAs is the significant power and area overheads imposed by the Peripheral Circuitry (PC) of NVM configuration bits. In this paper, we investigate the applicability of different NVM technologies for the configuration bits of FPGAs and propose a power-efficient reconfigurable architecture based on Phase Change Memory (PCM). The proposed PCM-based architecture has been evaluated using different technology nodes and compared to the SRAM-based FPGA architecture. Power and Power Delay Product (PDP) estimations of the proposed architecture show up to 37.7% and 35.7% improvements over SRAM-based FPGAs, respectively, with less than 3.2% performance overhead.
- 1. A Ahari, H Asadi, B Khaleghi, M B Tahoori, "A Power Efficient Reconfigurable Architecture Using PCM Configuration Technology", DATE, 2014.