Mobilizing the Micro-Ops: Exploiting Context Sensitive Decoding for Security and Energy Efficiency

ABSTRACT
Modern instruction set decoders feature translation of native instructions into internal micro-ops to simplify CPU design and improve instruction-level parallelism. However, this translation is static in most known instances. This work proposes context-sensitive decoding, a technique that enables customization of the micro-op translation at the microsecond or faster granularity, based on the current execution context and/or preset hardware events. While there are many potential applications, this work demonstrates its effectiveness with two use cases: 1) as a novel security defense to thwart instruction/data cache-based side-channel attacks, as demonstrated on commercial implementations of RSA and AES and 2) as a power management technique that performs selective devectorization to enable efficient unit-level power gating.

By allowing execution to transition rapidly between different translation modes, this architecture defends against a variety of attacks, completely obfuscating code-dependent cache accesses while sacrificing only 5% in steady-state performance – orders of magnitude less than prior art. By selectively disabling the vector units without disabling vector arithmetic, context-sensitive decoding reduces energy by 12.9% with minimal loss in performance. Both optimizations work with no significant changes to the pipeline or the ISA.

1. INTRODUCTION
The post-Dennard scaling era has witnessed an upsurge in the adoption of specialized processing elements [1, 2, 3, 4, 5] to improve the execution efficiency of domain-specific workloads. While general-purpose processors continue to gradually add domain-specific instructions every CPU generation, the technical challenges and market risks associated with legacy software have significantly limited innovation in the ISA design space. This work exploits an underutilized feature of modern instruction set decoders to show that even general-purpose processors can be customized, and in fact that customization can be seamlessly configured dynamically at an extremely fine granularity.

The key to this change is the fact that most modern processors employ translated ISAs: Intel and AMD x86 processors and many ARM processors typically feature translation from the native instruction set into internal micro-ops that enter the pipeline for execution [6, 7, 8]. These architectures enjoy the dual benefits of a versatile backward-compatible CISC front-end and a simple cost-effective RISC back-end. Moreover, the additional level of indirection enables seamless optimization of the internal micro-op ISA, under the covers, without any changes to the programmer interface. However, for those architectures the translation is static, changing once per generation. Instead, we propose that translation be dynamic, potentially changing frequently within the execution of a single program.

In this paper, we unlock the full potential of translated ISAs via context-sensitive decoding (CSD), a technique that allows native instructions to be decoded/translated into a different set of custom micro-ops based on their current execution context. This presents operating systems, runtime systems, and antivirus programs with the unique opportunity of triggering different custom translation modes, at microsecond or finer granularity, by simply configuring a set of model-specific registers (MSRs). In this way, for example, an insecure executable can instantly become a secure executable, or performance-optimized code can become energy-optimized, without recompilation or binary translation.

By leveraging existing native-to-microcode translation functionality in the decoder and exploiting an already well-established microcode update procedure outlined by Intel [6], we further empower runtime systems and virtual machines (that operate at a certain privilege level) to push custom translation updates written in native x86 code into the processor. At the decode stage of the pipeline, the CSD framework intercepts such custom microcode updates, auto-translates and optimizes them into a compact set of micro-ops, and further pushes them into the microcode engine. These custom updates could potentially enable instrumentation for profiling and performance monitoring, profile-guided optimizations, and API-hooks for security updates, among other applications.

The CSD framework we describe allows custom translation modes to be triggered by hotspot detection [9, 10], unit-criticality predictors [11], thread-criticality predictor [12], protection-domain crossings [13], interception of a tainted input [14, 15, 16, 17], a watchdog timer event, changes in power or energy availability, or thermal events – all with no significant changes to the pipeline or the ISA. In fact, a major contribution of this work is a set of microarchitectural techniques that enable the seamless integration of the context-sensitive decoding framework into Intel’s legacy decode pipeline and micro-op cache design.

Due to its low performance overhead and non-intrusive nature, context-sensitive decoding has potential applications in areas such as malware detection and prevention [18, 19, 20, 21], dynamic information flow tracking [14, 22, 23, 17], runtime profiling and performance programming [24, 25], on-demand type-safety [26, 27, 28], program verification and debugging [29, 30, 31], and runtime phase tracking and code specialization [32]. This paper showcases two diverse applications of context-sensitive decoding – an obfuscation-based security defense against cache-based side channel attacks, and criticality-aware power gating to improve energy efficiency.

Side-channel attacks have been used to leak secret information by exploiting the micro-architectural and physical characteristics of a cryptosystem. Many types of side-channel attacks have been described in the literature to subvert prominent cryptographic algorithms such as RSA, DES, and AES. These attacks hinge on a spy program running side-by-side with a victim program that leaks timing and other execution characteristics via shared micro-architectural structures.

By leveraging custom translation modes offered by context-sensitive decoding, we provide a low-cost, high-performance, and reconfigurable alternative to existing side-channel mitigations [33, 34, 35]. Owing to its unfettered access to on-chip microarchitectural structures and an array of hardware control signals, context-sensitive decoding allows us to inject decoy micro-ops into execution that give the attacker an illusion of a modified architectural state, by obfuscating micro-architectural characteristics alone. These decoy micro-ops are unreadable from both user and kernel modes as they exist within the processor outside any addressable memory. As a result, they remain invulnerable to spyware, rootkits, and other rogue programs, even if they are able to execute with the highest privileges. This paper shows that by causing micro-architectural perturbations at the decoder level, we can be more performance-efficient than a software-based obfuscation technique and less intrusive than a system that causes anomalies at the gate level.

Aside from security, we further showcase the potential of context-sensitive decoding in efficiently emulating infrequently used feature sets on alternative functional units. This enables aggressive power gating, even for units that are infrequently but regularly used. In this case, we scalarize vector instructions via micro-op translation onto the scalar units, enabling the system to make more global decisions about when to turn on the vector units, rather than always responding to instruction demand. In summary, the framework we describe in this paper offers the following unique capabilities:

- Fine-grained dynamic instruction stream customization of legacy binaries without recompilation, and without the full overhead of binary translation.
- Seamless integration into a state-of-the-art Intel microcode engine with no significant changes to the pipeline.
- A flexible auto-translated microcode update procedure that allows runtime systems to inject custom translation modes into the microcode engine.
- An energy-optimization mechanism that scalarizes vector instructions in order to power gate vector units during phases of minimal vector activity to save an average of 12.9% in overall energy.

2. BACKGROUND AND RELATED WORK

Translated Instruction Sets. Modern ISAs such as x86 and ARM typically translate complex native instructions into simpler internal micro-ops [6, 8]. While this was originally intended to simplify CPU design and allow complex long-latency instructions to be pipelineable, it has been instrumental in enabling several ISA and micro-architectural optimizations [36, 37, 38, 39, 40] that improve the front-end throughput and overall instruction-level parallelism [5]. Noteworthy optimizations include the micro-op cache, micro-op fusion, macro-op fusion, loop stream detection, and early load address resolution [41, 42]. However, due to their static translation scheme, x86 decoders lack the ability to customize instruction translation based on the current execution context, a feature that most dynamic binary translator (DBT) systems enjoy. This paper explores techniques to enable custom translations at the decoder level and studies their synergy/interference with existing optimizations.

Multi-ISA Architectures. Aside from translation to micro-ops, many modern decoders are equipped with multiple decode units to translate instructions from different feature sets, ISA extensions, and sometimes completely different ISAs. A classic example of this is the ARM architecture which supports three major instruction sets (A32, T32, and A64) and several other feature sets/extensions such as Jazelle and NEON. The multi-ISA ARM decoder not only provides backward compatibility, but enables the seamless integration of a completely re-architected AArch64 ISA. It also allows developers and compilers to take advantage of the ability to switch between the different instruction sets at exception boundaries (A64 to A32) or by simply executing a branch and exchange instruction (A32 to T32). Furthermore, multi-ISA heterogeneous chip multiprocessor architectures [43, 44, 45, 46] allow applications to migrate back and forth between different ISAs at basic block boundaries. While this work offers similar capabilities in terms of seamlessly switching execution between different custom translations, it does so at a much finer granularity, and requires no recompilation and no significant changes to the architecture.

Binary Translators and Code-Morphing Machines. In the software world, binary translators have long been used to port/emulate legacy binaries on new architectures [9, 47]. Furthermore, managed runtimes and browsers employ dynamic binary translation to perform profile-guided optimization [48] of hot code regions, program shepherding, and JIT hardening. On the hardware front, several binary-translation-driven processor designs have been proposed. These are typically equipped with a code-morphing software binary translation layer that feeds translated instructions into the processor’s decoder.

IBM’s DAISY [49], Transmeta’s Crusoe and Efficeon [50], and more recently Nvidia’s Denver processors [51] have sparked further innovation in this space. Clark, et al. [52] describe a hybrid compiler-microarchitecture approach to instruction set customization that involves statically identifying code regions to offload and dynamically replacing jumps to such regions by complex custom instructions that trigger an accelerator. Owing to their dynamic translation model, as opposed to x86’s static translation model, these machines have the ability to customize translation and/or perform continuous profile-guided optimization.

The work most closely related to this research is DISE (Dynamic Instruction Stream Editing) [53, 54, 55, 56], a macro-engine that exposes an API allowing programmers to dynamically reconfigure a stream of instructions in order to perform bounds-checking, debugging, and prefetching. However, this work differs in many important ways. First, DISE requires complex pattern-matching and user-defined production rules to be integrated into its decoder framework, whereas context-sensitive decoding can be triggered by mere reconfiguration of a set of model-specific registers, by a pipeline event, or by a thermal or energy event. Second, while this work is easily integrated, exploiting existing features of modern processors, DISE adds significant new complexity to the pipeline. Third, this work fully explores performance, power, and area implications of incorporating the decoding framework into existing modern designs, including those that sport a wide variety of micro-op optimizations such as micro-op cache and micro-op fusion [6]. Finally, our research builds on works like DISE by introducing new applications – better protection against several new attack models that have become more sophisticated over the years, and energy-efficient management of vector computation.

**ISA-enabled Security.** In a computing world that is plagued with vulnerabilities, hardware design with a security focus is critical [57, 58, 59]. Intel’s Trusted Execution Technology (TXT), Intel’s Software Guard Extensions (SGX) [60, 61], and ARM’s TrustZone extension [62] provide authentication mechanisms and isolation guarantees to operating systems and other specialized software, in order to defend against software-based side-channel and key guessing attacks. While these rely on hardware mechanisms that embed encryption and integrity verification into the microarchitecture, the CHERI capability model [63] leverages the load-compute-store feature of RISC ISAs to provide fine-grained memory protection. Furthermore, Venkat, et al. [45] exploit multi-ISA architectures to defend against Return-Oriented Programming attacks. This work has the potential to offer similar capabilities by run-time customization of micro-op translation.

**Side-channel attacks.** Side-channel attacks typically steal secret information from cryptosystems and other sensitive data from a co-located user on the cloud [64, 65]. Numerous spy programs have demonstrated the full/partial reconstruction of a victim’s execution behavior by observing its instruction/data cache access patterns [66, 67, 68, 69, 70, 71], branch access patterns [72], differential power consumption characteristics [73, 74, 75], subnormal floating-point timing [76], electromagnetic radiation [77], acoustics [78], and fault behavior [79].

Several cache-based side channel attacks have been proposed in the literature [80, 70, 81]. Prominent ones include PRIME+PROBE and FLUSH+RELOAD attacks that can be performed either on a shared private data/instruction cache or on last-level caches. Notable mitigations to these attacks include secure cache partitioning [33], compiler-based obfuscation [34, 35], and run-time software diversity [82]. In this paper, we leverage context-sensitive decoding to provide stealth-mode custom translations, as a novel security defense that mitigates such side channel attacks. Context-sensitive decoding-based security enhancement has no software performance cost when not in use, minimal software overhead when used, no additional vulnerability, and minimal hardware/power cost.

**Unit-level Power Management.** Many power management techniques have been proposed in prior work ranging from unit-level [83, 84] to coarse-grained core-level power management [85, 86], with the central theme of powering off idle blocks to reduce overall static leakage. Vector processing units (VPUs) are promising candidates for power gating since they are typically not in use during most scalar phases, and yet account for a significant portion of a core’s peak power. However, phases of intermittent vector activity create small idle intervals that are below the break-even time needed to compensate for the power gating overhead. Dynamic devectorization [87, 88] achieves significant energy savings by using a translation optimization layer to profile and de-vectorize non-critical vector instructions while the VPUs are power-gated. Similarly, PowerChop [11] proposes a binary translation-driven approach that uses a unit-criticality predictor to assist power-gating of multiple units in the processor (including VPUs). While binary translation can be an effective tool, it is not ideal for adaptive energy optimizations – in many scenarios we can hide the considerable startup cost (in performance and energy) of binary translation; however, when we trigger a new optimization due to an energy event or emergency, we bear the entire brunt of the startup cost at the worst possible time.
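The break-even argument above can be made concrete with a small sketch. The function name `gateable_cycles`, the cycle counts, and the break-even value of 150 cycles are illustrative assumptions rather than figures from this paper; the model simply counts idle cycles that fall in intervals longer than the break-even time.

```python
def gateable_cycles(vector_cycles, total_cycles, break_even):
    """Count idle VPU cycles that lie in intervals longer than break_even.

    vector_cycles: cycle numbers at which a vector instruction needs the VPU.
    total_cycles:  length of the observed window, in cycles.
    break_even:    assumed minimum idle length for gating to pay off.
    """
    boundaries = [-1] + sorted(vector_cycles) + [total_cycles]
    gated = 0
    for prev, nxt in zip(boundaries, boundaries[1:]):
        idle = nxt - prev - 1          # idle cycles strictly between two uses
        if idle > break_even:
            gated += idle              # long enough to power-gate profitably
    return gated


# Intermittent vector activity: one vector instruction every 100 cycles.
sparse = range(100, 1000, 100)
print(gateable_cycles(sparse, 1000, break_even=150))   # no interval qualifies
# If those isolated instructions are scalarized, the VPU is never needed.
print(gateable_cycles([], 1000, break_even=150))
```

Under these assumed numbers, sparse vector instructions leave no interval worth gating, while devectorizing them turns the whole window into one gateable interval.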

3. CONTEXT-SENSITIVE DECODING

In this section, we provide a brief overview of the x86 front-end, describe techniques to enable context-sensitive decoding in the x86 architecture, and discuss potential applications.

3.1 Overview of the x86 Front End

The x86 front end in Figure 1 contains two major components: (a) the legacy decode pipeline that translates native instructions into micro-ops, and (b) a micro-op cache that delivers already translated micro-ops into the instruction queue.

The legacy decode pipeline includes an instruction-length decoder that feeds from a 16-byte fetch buffer and decodes variable-length x86 instructions byte-by-byte. The decoded instructions are inserted into an 18-entry macro-op queue, from which they are dispatched up to four macro-ops at a time. These macro-ops then feed into one of the four decoders that translate them into simpler micro-ops. These decoders use a static table-driven approach for the micro-op translation. In fact, only one of the decoders can translate an instruction to more than one micro-op, with the other three performing a simple one-to-one mapping operation. Complex instructions that decompose into more than four micro-ops are microsequenced by a microcode ROM.
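As a rough illustration of the routing rule just described, the following sketch picks a translation resource from the number of micro-ops a macro-op expands to. The `MacroOp` type, the `route` function, and the string labels are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MacroOp:
    mnemonic: str   # illustrative label only
    num_uops: int   # micro-ops this macro-op expands to (assumed known)


def route(macro_op: MacroOp) -> str:
    """Pick the translation resource for one macro-op."""
    if macro_op.num_uops < 1:
        raise ValueError("a macro-op must expand to at least one micro-op")
    if macro_op.num_uops == 1:
        return "simple"    # any of the three one-to-one decoders suffices
    if macro_op.num_uops <= 4:
        return "complex"   # the single decoder that emits several micro-ops
    return "msrom"         # more than four micro-ops: microcode ROM


for op in (MacroOp("add", 1), MacroOp("push", 2), MacroOp("rep movs", 12)):
    print(op.mnemonic, "->", route(op))
```

The micro-op counts in the usage loop are placeholders chosen to exercise each branch, not measured expansions.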

The micro-ops translated by the legacy decode pipeline are cached in an 8-way set associative micro-op cache that can hold up to 1536 micro-ops. When the micro-op cache is active, the legacy decode pipeline is disabled to conserve power. The front-end then streams micro-ops from the micro-op cache into the instruction (micro-op) queue until a miss occurs, at which point it switches back to the legacy decode pipeline. The front-end also sports a number of optimization features such as stack-pointer tracking (a flavor of early load address resolution), micro-op fusion, macro-op fusion, and loop stream detection.

3.2 Integration with the x86 Front End

This section describes techniques to integrate our architecture into the x86 front end with the legacy decode pipeline and micro-op cache designs, and further study its synergy/interference with existing front-end optimizations such as micro-op fusion. Figure 1 highlights necessary hardware components required to enable context-sensitive decoding in the x86 front-end.

Integration with the Legacy Decode Pipeline. To enable context-sensitive decoding, we provision the legacy decode pipeline with one or more custom decoders. These decoders continue to employ a simple static table-driven translation model, like the four native x86 decoders. However, they can generate more sophisticated micro-op flows by delegating to the microcode ROM.

Furthermore, when context-sensitive decoding is turned on, we update the macro-op dispatch logic to redirect macro-ops that require custom translation to the custom decoder. In our implementation, this logic can be triggered in three different scenarios. First, software programs such as the operating system can trigger this logic by configuring a set of model-specific registers (MSRs) [6]. We leverage the already existing register-tracking optimization in the decoder to track updates to the MSRs and consequently trigger context-sensitive decoding. Second, we allow a translation context switch to be triggered by hardware events such as the interception of a tainted input by information-flow tracking or a power-gating decision by the unit criticality predictor. Finally, we allow a hardware watchdog timer to periodically trigger a translation mode switch.
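The dispatch behavior described above can be sketched as a small state machine. Everything here is a simplified model under stated assumptions: the class `DispatchLogic`, the `Trigger` enum, the convention that mode 0 means native translation, and the use of mnemonics to identify macro-ops are all hypothetical.

```python
from enum import Enum, auto


class Trigger(Enum):
    MSR_WRITE = auto()   # software configured a model-specific register
    HW_EVENT = auto()    # e.g. tainted input or a power-gating decision
    WATCHDOG = auto()    # periodic hardware watchdog timer


class DispatchLogic:
    def __init__(self, custom_ops_by_mode):
        # mode id -> set of mnemonics needing custom translation in that mode
        self.custom_ops_by_mode = custom_ops_by_mode
        self.mode = 0              # assumed convention: 0 = native translation
        self.last_trigger = None

    def switch_mode(self, mode, trigger):
        if mode != 0 and mode not in self.custom_ops_by_mode:
            raise ValueError(f"unknown translation mode {mode}")
        self.mode = mode
        self.last_trigger = trigger

    def dispatch(self, mnemonic):
        """Return which decoder class should translate this macro-op."""
        if self.mode != 0 and mnemonic in self.custom_ops_by_mode[self.mode]:
            return "custom"
        return "native"


logic = DispatchLogic({1: {"load", "branch"}})
logic.switch_mode(1, Trigger.MSR_WRITE)
print(logic.dispatch("load"), logic.dispatch("add"))
```

Only macro-ops that the active mode marks for customization are redirected; all others keep flowing through the native decoders.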

**Interactions with the Micro-Op Cache.** The micro-op cache is an important performance and energy optimization that allows certain hot code regions to be completely serviced from the micro-op cache. The Intel Optimization manual recommends that software be carefully optimized, since frequent switching between the micro-op cache and the legacy decode pipeline could cause more performance degradation than running without the micro-op cache [41]. This particularly interferes with one of the major goals of context-sensitive decoding – the ability to frequently switch translation context at a low performance overhead.

Flushing the micro-op cache on every translation mode switch could have a major performance impact. We instead choose to extend the tag bits of the micro-op cache with an additional set of context bits (one bit per custom translation mode) that associate a particular micro-op way with the decoder that translated it. While this could potentially create artificial conflict misses, it allows us to improve the micro-op cache utilization by co-locating micro-op translations from different custom decoders.
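A toy model may help show why context bits avoid a flush. This is a sketch under assumptions: the class `UopCache`, the 32-set geometry (inferred from 1536 micro-ops, 8 ways, and 6 micro-ops per way), indexing by 32-byte region, and LRU replacement are illustrative choices, not details confirmed by the text.

```python
class UopCache:
    """Set-associative micro-op cache whose tag match includes context bits."""

    def __init__(self, num_sets=32, num_ways=8):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # Each set is a list of (tag, ctx, uops), ordered from LRU to MRU.
        self.sets = [[] for _ in range(num_sets)]

    def _index_tag(self, addr):
        region = addr // 32                      # 32-byte code region
        return region % self.num_sets, region // self.num_sets

    def lookup(self, addr, ctx):
        idx, tag = self._index_tag(addr)
        ways = self.sets[idx]
        for i, (t, c, uops) in enumerate(ways):
            if t == tag and c == ctx:            # hit needs tag AND context
                ways.append(ways.pop(i))         # move to MRU position
                return uops
        return None

    def fill(self, addr, ctx, uops):
        idx, tag = self._index_tag(addr)
        ways = self.sets[idx]
        ways[:] = [w for w in ways if not (w[0] == tag and w[1] == ctx)]
        if len(ways) >= self.num_ways:
            ways.pop(0)                          # evict the LRU way
        ways.append((tag, ctx, list(uops)))


cache = UopCache()
cache.fill(0x1000, 0b00, ["load"])                 # native translation
cache.fill(0x1000, 0b01, ["load", "decoy_load"])   # custom mode, bit 0 set
print(cache.lookup(0x1000, 0b00), cache.lookup(0x1000, 0b01))
```

Both translations of the same code region coexist in one set, so switching modes changes only which way hits. The extra ways they occupy are the source of the artificial conflict misses mentioned above.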

Finally, customization could involve injecting multiple micro-ops at a time. This not only clutters the execution stream, but could pollute the micro-op cache. The x86 micro-op cache design has a check that does not allow 32-byte code regions to occupy more than 3 ways (amounting to 18 micro-ops) in the micro-op cache. This is because, unlike a regular cache, the micro-op cache simply allows the front-end engine to stream instructions from it, to avoid expensive indexing and tag comparison. Furthermore, it does not allow instructions longer than six fused micro-ops to be cached. Although we can imagine several options that would allow the architecture to remove that constraint, to be conservative we assume that it still holds in this paper, which does impact many of our translated micro-op sequences.
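The caching constraint can be checked with simple arithmetic: 3 ways of 6 micro-ops each give the 18-micro-op limit per 32-byte region. The sketch below is a simplified model that ignores how micro-ops are actually packed into ways; the function names and the example micro-op counts are assumptions for illustration.

```python
import math

UOPS_PER_WAY = 6          # 18 micro-ops / 3 ways, per the limits above
MAX_WAYS_PER_REGION = 3


def ways_needed(uops_per_instruction):
    """Ways occupied by one 32-byte region, assuming dense packing."""
    return math.ceil(sum(uops_per_instruction) / UOPS_PER_WAY)


def region_is_cacheable(uops_per_instruction):
    # An instruction longer than six micro-ops is never cached.
    if any(n > UOPS_PER_WAY for n in uops_per_instruction):
        return False
    return ways_needed(uops_per_instruction) <= MAX_WAYS_PER_REGION


native = [1, 1, 2, 1]                     # micro-ops per native instruction
customized = [n + 4 for n in native]      # four injected micro-ops each
print(region_is_cacheable(native), region_is_cacheable(customized))
```

In this example the native region fits in one way, while the customized region needs four ways and falls back to the legacy decode pipeline, which is why compact custom sequences matter.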

3.3 Micro-Code Update and Auto-Translation

CSD exploits the already existing microcode update (MCU) procedure of Intel processors [6] to empower the runtime system with the ability to inject custom translations into the processor’s microcode engine, with the API provided to the runtime being the entire x86 instruction set. The CSD framework further auto-translates such microcode updates by exploiting Intel’s existing front-end translation and optimization infrastructure. While this offers significant flexibility to software agents such as the OS and the runtime system, the chip designer exerts more control over the microcode engine, potentially allowing custom translations that include non-user-visible features such as a micro-op that can change the state of the branch predictor or the hardware return address stack. We also note that custom translations injected via microcode updates should not alter architectural register and memory state, unless explicitly specified in the MCU header.

Figure 2 shows the MCU procedure in more detail. Since microcode update is performed via a privileged instruction or system call, only trusted entities [57, 89, 62] within the OS/runtime system should have the ability to successfully inject microcode updates into the processor. The microcode update system call invokes Intel’s microcode driver [90] that performs sanity and integrity checks, and further invokes the processor’s microcode update feature via an MSR update [6]. The MCU itself is provisioned with a descriptive header prepended to the data containing custom translations injected by the runtime. When the header contains a reserved field that indicates context-sensitive decoding, the microcode update is assumed to contain only native x86 instructions, and is further marked for auto-translation. On the processor end, the MCU header is again verified for sanity and integrity, before extracting the data part. In the event that the MCU is marked for auto-translation, the native instructions in the data part of the MCU are further translated into internal micro-ops by leveraging the existing translation capabilities in the decoder. The translated micro-ops are further optimized into more compact micro-ops using existing front-end optimizations such as macro/micro-op fusion, adhering to certain performance guidelines described below. We further note that virtually all of the building blocks we use to provide this feature are already well-established mechanisms that appear in mainstream Debian Linux kernel releases [90, 91].
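The processor-side handling of an update can be sketched as follows. The header layout here is entirely hypothetical and is not Intel's microcode update format: the magic value, the flag bit `FLAG_CSD_AUTO_TRANSLATE`, the CRC32 integrity check, and the function names are assumptions made only to illustrate the verify-then-dispatch flow.

```python
import struct
import zlib

# Hypothetical header: magic, flags, payload length, CRC32 of the payload.
HEADER = struct.Struct("<IIII")
MAGIC = 0x4353444D                 # arbitrary illustrative value
FLAG_CSD_AUTO_TRANSLATE = 0x1      # stands in for the reserved CSD field


def build_mcu(payload: bytes, auto_translate: bool) -> bytes:
    flags = FLAG_CSD_AUTO_TRANSLATE if auto_translate else 0
    header = HEADER.pack(MAGIC, flags, len(payload), zlib.crc32(payload))
    return header + payload


def process_mcu(blob: bytes):
    """Verify the header, then decide how the data part is handled."""
    if len(blob) < HEADER.size:
        raise ValueError("MCU shorter than its header")
    magic, flags, length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC or length != len(payload) or zlib.crc32(payload) != crc:
        raise ValueError("MCU failed sanity/integrity check")
    if flags & FLAG_CSD_AUTO_TRANSLATE:
        # Data part holds native x86 bytes to be run through the decoder.
        return "auto_translate", payload
    return "install_as_is", payload


blob = build_mcu(b"\x90\x90", auto_translate=True)   # two x86 NOP bytes
print(process_mcu(blob))
```

A real implementation would rely on the vendor's authenticated update mechanism rather than a plain checksum; the CRC here only marks where the integrity check sits in the flow.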

3.4 Performance Guidelines and Optimizations

The micro-op expansion due to customization could potentially degrade performance; the following guidelines and optimizations limit that cost.

Micro-op fusion. To take advantage of the micro-op fusion optimization, we use load-op and load+br combinations as much as we can in our custom micro-op sequences. By doing so, we gain 1.6% in performance and eliminate bottlenecks in the front-end stages and the micro-op cache.

Micro-loop Specialization. Furthermore, we recommend creating short and tight loops that benefit from the improved micro-op cache utilization, and take advantage of the loop cache when present. The Intel optimization manual [41] recommends loop fission [92] in case of longer loops. If the custom decoder decides to employ loop fission on micro-loops, we recommend that these loops don’t occur too close together in the pipeline as they can potentially knock native instructions in a 32-byte region out of the micro-op cache, causing more degradation in performance.

Register Tracking. Finally, customization may potentially involve reusing micro-registers (e.g., as a loop induction variable). By extending the stack pointer tracker to perform full register tracking, more compact instances of custom micro-op translations could be created.

3.5 Potential Applications

Later sections of this paper focus on particular applications of context-sensitive decoding including (1) adding security on-demand and (2) selectively moving vector computation on and off the vector unit to minimize energy. However, other potential applications abound.

Programming Languages – To reduce costly time spent on finding and fixing bugs, developers are increasingly encouraged to employ software practices that ensure software fault isolation, type safety, and formal verification [26, 27, 28]. While most type checkers and proof assistants rely on static verification, the dynamic nature of JavaScript and other JITed code has made it increasingly hard to statically infer and reason about types. Typed assembly languages [93] and Google’s Portable Native Client [26] ensure deep sandboxing and type safety of inherently native and/or JITed code, but at a prohibitively high performance cost for many workloads. CSD provides us with the unique opportunity of on-demand dynamic type checking at the decoder level, especially for sensitive code regions where the coverage offered by static verification is insufficient.

Debugging – Aside from type checking, breakpoints and watchpoints are indispensable tools that software developers use to find and fix evasive memory errors. Modern ISAs with hardware debugging support reserve a small number of monitor/debug registers to encode breakpoint/watchpoint rules [29, 31]. However, most debuggers and runtime analyses typically run out of debug registers and resort to software breakpoints and watchpoints which are extremely inefficient [30]. In translated ISAs, a context-sensitive decoder can microsequence a performance-efficient watchpoint implementation since it has direct access to microarchitectural structures such as the address translation unit.

Performance Counters – Modern processors implement a variety of performance counters, but with several limitations. Only a few can be used at once, and the actual counters typically change from generation to generation, often dependent on where the designer had room for a counter and where they did not. However, we can add many counters in the decoder, with no limit to the number of counters active at once, and providing compatibility across generations despite different layouts and space availability.

Profiling – Modern systems typically rely on instrumentation to profile code. However, instrumentation alters the code, potentially resulting in heisenbugs. That is, instruction cache, data cache, and even memory interference behavior is altered by the instrumentation. With CSD, we can add profiling with no change whatsoever to code layout or data layout.

4. CASE STUDY I: SIDE-CHANNEL DEFENSE

In this section, we demonstrate the security potential of context-sensitive decoding. We first lay out our assumptions and threat model, then describe the stealth-mode translation feature of context-sensitive decoding. Finally, we leverage this feature to secure commercial implementations of RSA and AES against the two major data and instruction cache side channel attacks.

4.1 Assumptions and Threat Model

Trusted Computing Base. We assume that the micro-op engine, which includes both the legacy decode pipeline and the micro-op cache, is tamper-proof and is a part of the Trusted Computing Base (TCB) [58, 89]. We further extend the TCB to include all hardware or software mechanisms that can potentially trigger context-sensitive decoding. These include register tracking, Dynamic Information Flow Tracking [14], hardware watchdog timers, and anti-virus-driven configuration (e.g., an Intel/McAfee security solution). Moreover, we assume that such hardware security mechanisms, including any microcode that enables security, are formally verified [94]. Finally, since the API exposed to software only consists of macro-ops, we continue to assume that the translated micro-ops in the micro-op cache (both native and custom-translated) can neither be read by software nor be probed via hardware side-channels.

Attacker Environment. We assume an active attacker who can effortlessly probe, flush, or evict a co-located victim’s cache lines, but does not have direct access to the contents of the cache. We also assume that the attacker has the ability to make precise timing measurements and has unlimited access to hardware performance counters. This allows the attacker to make inferences about the software algorithm being run by the victim by observing the micro-architectural characteristics alone. Furthermore, we assume that the attacker has the ability to exploit other digital and physical side channels that exist outside the TCB. While the major focus of the defense we describe in this paper is to defend against cache-based side channel attacks, we further lay out strategies to leverage context-sensitive decoding to potentially mitigate other side-channel attacks in the final part of this section.

4.2 Stealth-Mode Translation

Cache-based side channel attacks typically involve probing one or more cache lines of a co-located victim in order to capture its memory access patterns that could potentially reveal secret information. For example, an attacker who intends to break a cryptographic algorithm could compute one or more bits of a secret key by capturing access patterns of key-dependent loads and branches. The goal of the stealth-mode translation is to provide an illusion of a modified architectural state by obfuscating the micro-architectural characteristics alone. In this specific implementation, we obfuscate a victim’s control path and/or access to sensitive data structures in an attacker-oblivious way. In particular, we use decoy micro-ops that load data into the caches that would be touched on all data-dependent paths. These include all cache blocks that contain T-tables of AES and the multiply functions of RSA. While we intend stealth-mode to be a security feature to be deployed by the chip manufacturer, trusted software entities [89] with the right privileges can achieve similar effects by leveraging the auto-translated microcode update feature.

Figure 3 shows context-sensitive decoding with stealth-mode translation in action. Stealth-mode translation is primarily triggered by updates to register-tracked decoy address-range registers, similar to the already existing Memory Type Range Registers (MTRR) [6] in x86 that allow system software to control cache policies for specific address ranges (e.g., write-back vs write-through). The decoy address range registers, on the other hand, allow anti-virus and other hardware/software trusted entities to mark specific data and instruction address ranges in a program’s address space as sensitive. As soon as the stealth-mode translation is triggered, these decoy address ranges are copied to the context-sensitive decoder’s internal registers; after that, the macro-op dispatcher starts redirecting all loads and branches within the PC range for custom translation.

CSD injects decoy micro-ops into all instructions that use memory operands and/or attempt control transfer (i.e., all instructions that get translated to load/store/branch micro-ops) during stealth-mode translation. Figure 4, as an example, shows stealth-mode translation of the MOV instruction for cache-based side-channel prevention. In this example, CSD injects a micro-loop into the micro-op stream. The micro-loop effectively obfuscates the architectural state by loading all sensitive cache blocks whose addresses have been specified by a software agent (e.g., antivirus) in the MSRs.

We implement two schemes of translation – one for a software anti-virus-driven stealth-mode configuration where the tainted program counter (PC) values are known a priori with the help of binary analysis and configured in specific MSRs, and one for architectures that implement full information-flow tracking in hardware where the taint-checking is performed dynamically. In both instances, the decoy micro-ops execute only for tainted instructions – for the DIFT-enhanced architectures, this decision is made dynamically at run-time.
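The decoy-injection flow can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `StealthDecoder`, its fields, the 64-byte line size, and the tuple encoding of micro-ops are all hypothetical names chosen for illustration, not the paper's hardware interface. It models the anti-virus-driven scheme, where the tainted PC range is known ahead of time, and it turns itself off once every sensitive line has been loaded.

```python
from dataclasses import dataclass, field

CACHE_LINE = 64  # bytes; assumed cache-line size


@dataclass
class StealthDecoder:
    """Toy model of a decoder with decoy address-range registers (hypothetical)."""
    tainted_pcs: range          # PC range marked as sensitive by a trusted agent
    decoy_ranges: list          # (start, end) byte ranges whose lines must be preloaded
    pending: list = field(default_factory=list)

    def trigger(self):
        # Copy the range registers into the decoder's internal list of
        # cache-line addresses that still have to be touched.
        self.pending = [
            addr
            for start, end in self.decoy_ranges
            for addr in range(start - start % CACHE_LINE, end, CACHE_LINE)
        ]

    def decode(self, pc, macro_op):
        """Return the micro-op list emitted for one macro-op."""
        uops = [("native", macro_op)]
        touches_memory_or_control = macro_op in ("load", "store", "branch")
        if self.pending and touches_memory_or_control and pc in self.tainted_pcs:
            # Micro-loop: one decoy load per sensitive cache line.
            uops = [("decoy_load", addr) for addr in self.pending] + uops
            # All lines are loaded, so stealth mode turns itself off.
            self.pending = []
        return uops
```

In this model a trusted agent would call `trigger()` (for instance when a watchdog timer fires), and the next tainted load, store, or branch pays the cost of the decoy loads; all later instructions decode natively until the next trigger.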

We do not need to load the decoy structures constantly, since they will stay in the cache for a time, or until the attacker removes them. Thus, stealth-mode translation automatically turns itself off once all the address ranges in the context-sensitive decoder’s internal copy of the decoy address-range MSRs have been emptied out (all blocks specified by the range registers are loaded) by the decoy micro-ops. However,
by observing a victim’s key-dependent data access patterns. In this section, we describe a well-known data cache-based side channel in OpenSSL’s implementation of the AES algorithm and discuss the potential of stealth-mode translation to thwart these attacks.

**The AES Cryptographic Algorithm.** AES is a substitution-permutation block cipher that performs several rounds of simple substitution and permutation during encryption. Several software implementations, including OpenSSL, employ lookup tables called T-tables in order to speed up the substitution-permutation rounds, which then consist of several simple table lookup and xor operations. The index computation for the T-table lookup involves an xor operation between the key bits and the plaintext bits, thereby entailing a key-dependent load.

**D-Cache Side-Channel Attack on AES.** The OpenSSL implementation of AES employs four 256-entry T-tables, which amounts to sixty-four 64-byte cache blocks in the data cache. A spy process that monitors the access patterns of these blocks during encryption can significantly reduce the possible key space, and potentially reconstruct the entire key, by using a large number of carefully chosen plaintexts [68]. As with I-cache attacks, a PRIME+PROBE attack fills up the D-cache sets for one or more of these T-table blocks in the prime phase, and probes for them after a certain interval to check if the victim made an access to any of the primed sets. A FLUSH+RELOAD D-cache attack exploits de-duplication in order to flush one or more of the T-table blocks, and reload them after a carefully chosen probe interval.
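The cache-block count and the leak itself can be checked with a few lines of arithmetic. The sketch below assumes 4-byte T-table entries (the size implied by four 256-entry tables occupying sixty-four 64-byte blocks) and assumes each table starts on a cache-line boundary; the function names are illustrative only.

```python
ENTRY_BYTES = 4    # assumed: each T-table entry is a 32-bit word
ENTRIES = 256      # entries per T-table
TABLES = 4
LINE_BYTES = 64    # cache block size used in the text

entries_per_line = LINE_BYTES // ENTRY_BYTES    # 16 entries share one block
lines_per_table = ENTRIES // entries_per_line   # 16 blocks per table
total_lines = TABLES * lines_per_table          # 64 blocks overall


def touched_line(plaintext_byte, key_byte):
    """Block (within one line-aligned table) hit by a first-round lookup."""
    index = plaintext_byte ^ key_byte           # key-dependent table index
    return index // entries_per_line            # equals the upper 4 bits of index


def leaked_key_nibble(plaintext_byte, observed_line):
    """Upper 4 key bits recovered from a known plaintext byte and the observed block."""
    return observed_line ^ (plaintext_byte >> 4)
```

Because sixteen entries share a block, an observer learns only the upper four bits of each index, which is why each probe in such an attack yields four key bits rather than a full byte.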

**Effect of Stealth-Mode Translation.** Similar to the use of stealth-mode translation to defend against I-cache attacks, by configuring the data decoy address range MSRs with the appropriate address range of the T-tables, we can successfully obfuscate the key-dependent data access patterns. Furthermore, by carefully choosing a watchdog timeout period to enable periodic decoy load injection, we can also defend against brute-force key extraction attacks [95].

4.5 Securing other Side-Channels

In addition to data and instruction cache-based side-channel attacks, the stealth-mode translation feature of context-sensitive decoding has further potential to defend against other timing and physical attacks. For example, the decoy micro-ops could alter the branch predictor tables and BTB entries to confuse branch prediction analysis attacks, or potentially introduce a random stream of NOPs (and different types of NOPs) to skew timing analysis. Furthermore, owing to its ability to microsequence instructions using the MSROM, it can add noise to the sensed power by non-deterministically microsequencing instructions that cause switching activity across different microarchitectural structures.

5. CASE STUDY II: UNIT-LEVEL POWER GATING

In this section we present another use case of context-sensitive decoding, selective devectorization for unit-level power-gating. While similar in functionality to software-based devectorization approaches proposed in prior work [11, 87, 88], we eliminate binary translation costs and related cache effects, add the ability to switch modes at a finer granularity, allow cheaper and more effective monitoring, and provide more direct control over power gating and ungating.
When devectorization is enabled, the microcode engine replaces a vector instruction with equivalent scalar micro-ops, which also hides the power-on delay, since execution continues in scalar mode until the unit is ready.

Figure 5 shows our hardware support (beyond the decoder) for dynamic devectorization. We employ nothing more than a simple counter that tracks a window of instructions, counting up one for simple vector instructions and more than one for more complex vector instructions (higher micro-op count). When it goes below a threshold, it turns on devectorization and powers off the entire vector unit, and when it goes above a (higher) threshold, it turns the vector unit back on. It also includes a cycle counter to continue devectorization until the unit is fully powered.
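The counter-with-hysteresis policy can be sketched as a small state machine. Everything concrete here is an assumption made for illustration: the class name `VpuGateController`, the window size, both threshold values, and the simplification that one instruction is processed per cycle (so the wake-up timer can be decremented once per step). The default of 30 power-on cycles follows the estimate the paper adopts for the VPU.

```python
from collections import deque


class VpuGateController:
    """Toy model of the activity counter that drives devectorization (hypothetical)."""

    def __init__(self, window=1000, low=8, high=32, power_on_cycles=30):
        assert low < high                      # hysteresis needs two distinct thresholds
        self.weights = deque(maxlen=window)    # per-instruction vector weights in the window
        self.activity = 0                      # running sum over the window
        self.low, self.high = low, high
        self.power_on_cycles = power_on_cycles
        self.state = "ON"                      # "ON", "GATED", or "POWERING_ON"
        self.wake_timer = 0

    def step(self, vector_uops):
        """Advance by one instruction.

        vector_uops is 0 for a scalar instruction, otherwise the vector
        micro-op count. Returns True if the instruction must be devectorized.
        """
        if len(self.weights) == self.weights.maxlen:
            self.activity -= self.weights[0]   # oldest entry leaves the window
        self.weights.append(vector_uops)
        self.activity += vector_uops

        if self.state == "ON" and self.activity < self.low:
            self.state = "GATED"               # power off the vector unit
        elif self.state == "GATED" and self.activity > self.high:
            self.state = "POWERING_ON"         # start waking the unit
            self.wake_timer = self.power_on_cycles
        elif self.state == "POWERING_ON":
            self.wake_timer -= 1               # keep devectorizing while waking
            if self.wake_timer == 0:
                self.state = "ON"
        return self.state != "ON"
```

Using two thresholds rather than one keeps the unit from toggling when activity hovers near a single cut-off, which matters because every power cycle carries an energy overhead.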

When devectorization is enabled, the microcode engine translates the vector instructions to an equivalent set of scalar micro-ops. As an example, Figure 6a shows the pseudocode that devectorizes the SSE PADDB instruction which performs integer addition on packed bytes. This code is further compiled and optimized into a set of native x86 instructions by a runtime system which performs the actual microcode update. Figure 6b shows the equivalent auto-translated micro-op version of such an update. While it is possible to use a simpler translation with a loop, we reckon that it is more efficient to unroll the loop in this case. This is because, by employing suitable masks, the computation itself can be optimized in a way that allows us to just perform four adds and accumulate the results. While this optimization holds true for this particular example, the decision to use micro-loops and other optimizations purely depends upon the nature and purpose of the custom translation.
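One way to read the "four adds with suitable masks" remark is the following sketch, which is an interpretation rather than the translation shown in Figure 6. It assumes 64-bit scalar registers and models a 128-bit operand as a Python integer; the names `paddb64`, `paddb128`, and `paddb_reference` are hypothetical. Masking even and odd bytes separately leaves a zero byte above each lane, so carries never reach a neighboring byte.

```python
MASK64 = (1 << 64) - 1
EVEN = 0x00FF00FF00FF00FF   # selects bytes 0, 2, 4, 6 of a 64-bit word
ODD = 0xFF00FF00FF00FF00    # selects bytes 1, 3, 5, 7 of a 64-bit word


def paddb64(a, b):
    """Wrap-around byte-wise add of two 64-bit words using two scalar adds."""
    even = ((a & EVEN) + (b & EVEN)) & EVEN   # carries fall into the masked-off odd bytes
    odd = ((a & ODD) + (b & ODD)) & ODD       # carries fall into even bytes or past bit 63
    return even | odd


def paddb128(a, b):
    """Byte-wise add of two 128-bit values: two halves, four scalar adds in total."""
    lo = paddb64(a & MASK64, b & MASK64)
    hi = paddb64(a >> 64, b >> 64)
    return (hi << 64) | lo


def paddb_reference(a, b):
    """Straightforward per-byte loop, used only to check the masked version."""
    xs = a.to_bytes(16, "little")
    ys = b.to_bytes(16, "little")
    return int.from_bytes(bytes((x + y) & 0xFF for x, y in zip(xs, ys)), "little")
```

The per-byte loop needs sixteen adds, while the masked form needs four adds plus a handful of logical operations, which is the kind of saving that makes unrolling attractive for this instruction.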

Power Modeling and Power Gating Overheads. To power a unit on and off, power gating uses a header transistor that connects or disconnects the power source of the unit. A sleep signal is applied to the gate of the header device to control its operation. Switching the unit on and off comes at the cost of asserting and de-asserting the sleep signal plus switching the header device on and off. These costs are responsible for the non-negligible timing and energy overhead of power gating. We use Equation 1, from a model proposed by Hu et al. [96], to account for the energy overhead of power gating.

\[ E_{\text{Overhead}} \approx 2 W_H \frac{E_{\text{cycle}}}{\alpha} \qquad (1) \]

where \( W_H \) is the ratio of the area of the sleep transistor to the area of the unit and \( E_{\text{cycle}} / \alpha \) is the switching energy of the unit for one cycle when the switching factor \( \alpha = 1 \). We use a conservative value of 0.20 for \( W_H \), as the literature uses an estimated range of 0.05 to 0.20 [11, 96, 97, 98], and for \( E_{\text{cycle}} / \alpha \) we use McPAT estimates [99]. Power gating cycles should be made long enough to compensate for \( E_{\text{Overhead}} \). The break-even time is defined as the number of cycles a unit should stay in the power-gated state so that the aggregate energy savings of power gating (\( E_{\text{gated}} \)) match the energy of switching the unit on and then off (\( E_{\text{Overhead}} \)). We model the leakage current of the header transistor itself using McPAT. We use Laurenzano et al.'s [11] estimate of 30 cycles for powering on the VPU.
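The break-even definition can be turned into a short calculation. This is a sketch under an added assumption: the energy saved while gated is treated as a constant amount per cycle, so the break-even time is simply the overhead divided by that per-cycle saving. The function names and the parameter `leakage_saved_per_cycle` are hypothetical, and no energy value below comes from the paper.

```python
import math


def gating_overhead(w_h, e_cycle_per_alpha):
    """Energy overhead of one power-gating cycle, following Equation 1.

    w_h is the sleep-transistor-to-unit area ratio; e_cycle_per_alpha is the
    unit's one-cycle switching energy at a switching factor of 1.
    """
    return 2.0 * w_h * e_cycle_per_alpha


def break_even_cycles(w_h, e_cycle_per_alpha, leakage_saved_per_cycle):
    """Smallest whole number of gated cycles whose savings cover the overhead.

    Assumes a constant energy saving per gated cycle (an illustrative
    simplification), in the same energy unit as e_cycle_per_alpha.
    """
    overhead = gating_overhead(w_h, e_cycle_per_alpha)
    return math.ceil(overhead / leakage_saved_per_cycle)
```

A call such as `break_even_cycles(0.20, e_cycle, e_saved)` would use the conservative area ratio from the text, with `e_cycle` and `e_saved` supplied by a power model such as McPAT.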

6. METHODOLOGY

This section details the evaluation methods we used for our two case study applications of context-sensitive decoding.

6.2 Security Evaluation

To evaluate the effectiveness of our security defense, we subject AES and RSA running on our architecture to the FLUSH+RELOAD variant of cache-based side-channel attacks, modeled after the AES T-table attacks by Gruss et al. [80]. As a further demonstration, we try the PRIME+PROBE attack on AES and RSA. Our attack models exploit the I-cache side-channel for RSA and the D-cache side-channel for AES, demonstrating defense against both side channels. Since we model our stealth-mode translation on a cycle-accurate simulator, we allow our attack models to benefit from precise counters; they therefore do not require a calibration phase to set the thresholds that distinguish a hit from a miss and to determine probe intervals.

7. RESULTS

In this section, we will evaluate each of our instantiations of CSD, starting first with our side-channel defense mechanism and following that with selective devectorization.

7.1 Stealth Mode

Security Evaluation. Figure 7a shows the results of a well-known PRIME+PROBE attack on the AES algorithm [107]. The attack repeatedly triggers encryptions with carefully chosen plaintexts while probing 16 different addresses of the AES T-tables. For each probe, only one plaintext has a 100% hit rate (steep dips in the curve), revealing 4 bits of the key. When stealth-mode translation is not enabled, 64 bits of the 128-bit key are compromised in 64000 attempts, as shown in the figure. These bits are sufficient to reverse-engineer the rest of the key using well-established cryptanalysis techniques. However, when stealth-mode translation is enabled, the attacker-perceived data access patterns are completely obfuscated and always result in a hit for every probe, almost mimicking the behavior of a constant-time defense [108, 95].

Figure 7b shows the results of a FLUSH+RELOAD attack on the RSA algorithm [70]. In the absence of stealth-mode translation, the attacker can almost always detect when a multiply function has been invoked by measuring hits and misses (shown as dips and spikes in the figure) to the corresponding reloaded cache line. However, when stealth-mode translation is in effect, the attacker-perceived instruction access pattern is completely obfuscated, resulting in a perceived I-cache hit at the end of every probe interval. The PRIME+PROBE attack on RSA (not shown) is also defeated, recording a miss on the attacker's end after every probe interval.

Performance. The potential performance overheads of CSD include micro-op expansion and related side effects (micro-op queue pressure, micro-op cache pressure) and possible cache effects due to increased cache pressure from decoy loads. Careful construction of the secure-mode micro-op translation allows us to minimize many of the side effects of micro-op expansion.

Figure 8 compares the execution time of our pipeline without any optimization (NoOpt) and with both the micro-op cache and micro-op fusion enabled (Opt). In these results, we see performance loss consistently below 10% and averaging 5.6% when secure mode is enabled. This compares favorably with the current state-of-the-art obfuscation techniques, which rely on the compiler and incur slowdowns on the order of 20X [34], orders of magnitude higher.

To break down the performance overhead of context-sensitive decoding, we first study micro-op expansion relative to unaltered execution. As shown in Figure 9, context-sensitive decoding causes a micro-op expansion of 8.0% on average. Comparing these results with Figure 8 seems to indicate that the primary cost of context-sensitive decoding is in fact the micro-op expansion. This is somewhat surprising, because we expect additional overheads from the higher incidence of loads and greater memory activity.

Figure 7: Effect of the cache attacks on AES and RSA with stealth-mode translation enabled.

Figure 8: The execution time impact of context-sensitive decoding when implementing secure cache obfuscation normalized to insecure execution mode. The watchdog timer is set so that secure mode is re-entered every 500 microseconds.

Figure 9: Micro-op expansion due to CSD.

We investigate this further with several experiments. First, we measure performance in cycles/micro-op, and find that in fact this figure does not increase (and in some cases decreases), despite the fact that the percentage of load micro-ops has increased. Second, we see in Figure 10 that the number of cache misses per kilo instruction (MPKI) stays about the same on average. This indicates the vast majority of additional injected loads are hits. In yet another experiment where we discounted the cost of micro-op expansion, we actually saw an overall performance increase on average – this was due to a prefetching effect from the added micro-ops. Thus, the cache prefetching effect of the decoy loads is actually muting some of the performance cost of micro-op expansion.

Another negative side effect of context-sensitive decoding is on the micro-op cache. Because we introduce translations not allowed in the micro-op cache, or in other cases expand loops so they no longer fit, we do lose some of the effectiveness of that cache. However, that effect is small, especially when modeled with micro-op fusion enabled. Without fusion, the micro-op cache hit rate, on average, drops from 44% to 39% when we introduce CSD; in the presence of micro-op fusion, which shortens some of the code sequences, the effect is smaller.

In all of the above experiments, the watchdog timer is set to 1000 cycles (500 microseconds), so the decoy loads are deployed at the first decoded tainted load or branch encountered, then decoding returns to normal mode until the timer fires again. While re-injecting decoy micro-ops every 1000 cycles provides almost perfect obfuscation against these cache-based side-channel attacks, one can tune this parameter based on system characteristics (e.g., cache miss/hit delays) and the targeted attacks to increase the performance of the defense. Figure 11 shows the normalized execution time of our defense, sweeping the watchdog timer from 1000 to 10000 cycles. The decrease in execution time is caused by fewer extra micro-ops and fewer micro-op cache conflicts.

Overall, we find that context-sensitive decoding enables secure obfuscation of secure-data-dependent microarchitectural state at almost no performance cost, particularly in comparison with prior techniques.

7.2 Selective Devectorization

Figure 13: Execution time for different power gating policies, normalized to always on policy

Figure 14: Micro-op expansion due to context sensitive decoding normalized to native mode

Figure 15: Percentage of time that CSD power gates VPU cores.

Figure 16: Breakdown of vector activity

gated for each benchmark. On average, context-sensitive decoding can keep the VPU power-gated more than 70% of the execution time. For the benchmarks astar, gcc, gobmk, and sjeng, which have low (but not nonexistent) vector activity, we are able to keep the vector unit turned off just about all the time, not having to turn it on for occasional outliers.

Figure 16 shows the breakdown of the number of SSE instructions for each benchmark. We group instructions into three categories: 1) instructions that are executed on the VPU (Powered On), 2) instructions that are devectorized and executed on scalar units because the VPU was in the process of powering on (Powering On), and 3) instructions that are executed on scalar units because the VPU was in the power-gated state (Power-Gated). We find that bwaves and milc are frequently forced to execute scalar instructions while waiting for the vector unit to power on. They still come out slightly ahead in energy, though, due to the performance advantage of not having to stall while the VPU turns on. Namd executes the largest number of vector instructions in gated mode, and is in fact gated 20% of the time despite having a large amount of vector activity. This implies that the threshold that performed best overall was too aggressive for namd, and that a more dynamic threshold or usage predictor would work better. Omnetpp, on the other hand, has a reasonable number of scalar operations but executes nearly all of them with the vector unit disabled, resulting in a significant gain in energy. Gamess is able to judiciously enable power gating, as it is gated nearly half the time, yet only about 20% of its vector instructions are affected.

Overall, for context-sensitive decoding enabled selective devectorization, we find that we are able to power gate the vector unit for longer, unbroken periods, resulting in good energy savings with small performance cost.

8. CONCLUSION

This paper presents Context-Sensitive Decoding, which enables the decoder to dynamically alter the decoding of programmer-visible ISA instructions. This allows the system to change the functionality of the software without programmer or compiler intervention. In this paper, we use the technique to enter stealth mode, where the decoder injects instructions that completely obfuscate the effect of secure-data-dependent branches and data accesses, defending against multiple variants of cache-based side channel attacks at just 5% performance degradation. This hides the microarchitectural footprint of the secure code from an attacker. Additionally, we show that context-sensitive decoding can be used to enable selective devectorization, saving 12.9% in energy while simultaneously achieving a speedup of 3.4% over conventional power gating.

9. REFERENCES

[34] A. Rane, C. Lin, and M. Tiwari, “Raccoon: Closing digital side-channels through obfuscated execution,”


“Intel-microcode for Debian.”

“Debian microcode update.”


