
Osiris: A Low-Cost Mechanism to Enable Restoration of Secure Non-Volatile Memories

Mao Ye
University of Central Florida
Orlando, USA
mye@knights.ucf.edu

Clayton Hughes
Sandia National Laboratories
Albuquerque, USA
[email protected]

Amro Awad
University of Central Florida
Orlando, USA
[email protected]

Abstract—With Non-Volatile Memories (NVMs) beginning to enter the mainstream computing market, it is time to consider how to secure NVM-equipped computing systems. Recent Meltdown and Spectre attacks are evidence that security must be intrinsic to computing systems and not added as an afterthought. Processor vendors are taking the first steps and are beginning to build security primitives into commodity processors. One security primitive that is associated with the use of emerging NVMs is memory encryption. Memory encryption, while necessary, is very challenging when used with NVMs because it exacerbates the write endurance problem.

Secure architectures use cryptographic metadata that must be persisted and restored to allow secure recovery of data in the event of power loss. Specifically, encryption counters must be persistent to enable secure and functional recovery of an interrupted system. However, the cost of ensuring and maintaining persistence for these counters can be significant. In this paper, we propose a novel scheme to maintain encryption counters without the need for frequent updates. Our new memory controller design, Osiris, repurposes memory Error-Correction Codes (ECCs) to enable fast restoration and recovery of encryption counters. To evaluate our design, we use Gem5 to run eight memory-intensive workloads selected from SPEC2006 and U.S. Department of Energy (DoE) proxy applications. Compared to a write-through counter-cache scheme, on average, Osiris can reduce 48.7% of the memory writes (increase lifetime by 1.95x), and reduce the performance overhead from 51.5% (for write-through) to only 5.8%. Furthermore, without the need for a backup battery or extra power-supply hold-up time, Osiris performs better than a battery-backed write-back (5.8% vs. 6.6% overhead) and has less write-traffic (2.6% vs. 5.9% overhead).

Index Terms—Secure Architecture, Non-volatile Memory, ECC, Computer Architecture

I. INTRODUCTION

Emerging Non-Volatile Memories (NVMs) are a promising advancement in memory and storage systems. For the first time in the modern computing era, memory and storage systems have the opportunity to converge into a single system. With latencies close to those of DRAM and the ability to retain data during power loss events, NVMs represent the perfect building block for novel architectures that have files stored in media that can be accessed similar to the way we access the memory system, i.e., through conventional load/store operations. Furthermore, given the near-zero idle power of emerging NVMs, they can bring significant power savings by eliminating the need for the frequent refresh operations needed for DRAM. NVMs also promise large capacities and much better scalability compared to DRAM [1], [2], which means more options for running workloads with large memory footprints.

While NVMs are certainly promising, they still face some challenges that can limit their wide adoption. One major challenge is the limited number of writes that NVMs can endure; most promising technologies, e.g., Phase-Change Memory (PCM), can only endure tens of millions of writes for each cell [3], [4]. They also face security challenges — NVMs can facilitate data remanence attacks [4]–[6]. To address such vulnerabilities, processor vendors, e.g., Intel and AMD, have started to support memory encryption. Unfortunately, memory encryption exacerbates the write endurance problem due to its avalanche effect [4], [6]. Thus, it is very important to reduce the number of writes in the presence of encryption. In fact, we observe that a significant number of writes can occur due to persisting encryption metadata. For security and performance reasons, counter-mode encryption has been used for state-of-the-art memory encryption schemes [4], [6]–[9]. For counter-mode encryption, encryption counters/IVs are needed and they are typically organized and grouped to fit in cache blocks, e.g., the Split-Counter scheme (64 counters in a 64B block) [10] and SGX (8 counters in a 64B block) [7]. One major reason behind that is, for cache fill and eviction operations, most DDR memory systems process read/write requests on a 64B block granularity, e.g., DDR4 8n-prefetch mode. Moreover, packing multiple counters in a single cache line can exploit spatial locality to increase the counter cache hit rate.

Most prior research work has assumed that systems have sufficient residual or backup power to flush encryption metadata, i.e., encryption counters, in the event of power loss [4], [9]. Although feasible in theory, in practice, reasonably long-lasting Uninterruptible Power Supplies (UPS) are expensive and occupy large areas. Admittedly, the Asynchronous DRAM Refresh (ADR) feature has been a requirement for many NVM form factors, e.g., NVDIMM [11], but major processor vendors limit persistent domains in processors to only tens of entries in the write pending queue (WPQ) in the memory controller [12]. The main reason behind this is the high cost of guaranteeing a long-lasting sustainable power supply in case of power loss, coupled with the significant power needed for writing to some NVM memories. However, since encryption metadata can be on the order of kilobytes or even megabytes, the system-level ADR feature would most likely fail to guarantee its persistence as a whole. In many real-life scenarios, affording sufficient battery backup or expensive power supplies is infeasible, either due to area, environmental or cost limitations. Therefore, battery-free solutions are always in demand.


Persisting encryption counters is not only critical for system restoration but is also a security requirement. Reusing encryption counters, i.e., those not persisted during a crash, can result in meaningless data after decryption. Even worse, losing the most-recent counters invites a variety of attacks that rely on reusing encryption pads (counter-mode encryption security strictly depends on the uniqueness of the encryption counters used for each encryption). Recent work [13] has proposed atomic persistence of encryption counters, in addition to exploiting application-level persistence requirements to relax atomicity. Unfortunately, exposing application semantics (persistent data) to hardware requires application modifications. Moreover, if applications have a large percentage of persistency-required data, persisting encryption metadata will incur a significant number of extra writes to NVM. Finally, the selective counter persistence scheme can cause encryption pad reuse for non-persistent memory locations, if a state-of-the-art split counter is used, which enables known-plaintext attackers to observe the lost/new encryption pads that will be reused when the stale counters are incremented after crash recovery. Note that such reuse can happen for different users or applications.

In this paper, we propose a lightweight solution, motivated by a discussion of the problem of persisting NVM encryption counters, and discuss the different design options for persisting encryption counters. In contrast with prior work, our solution does not require any software modifications and can be orthogonally augmented with prior work [13]. Furthermore, our solution chooses to provide consistency guarantees for all memory blocks rather than limit these guarantees to a subset of memory blocks, e.g., persistent data structures. To this end, our solution, Osiris, repurposes the Error-Correction Code (ECC) bits of the data to provide a sanity-check for the encryption counter used to perform the decryption. By doing so, Osiris can reason about the correct encryption counter even in the event that the most-recent value is lost. Thus it can relax the strict atomic persistence requirement while providing secure and fast recovery of lost encryption counters. In this paper, we discuss our solution and evaluate its impact on write-endurance and performance. To evaluate our design, we use Gem5 to run a selection of eight memory-intensive workloads pulled from SPEC2006 and U.S. Department of Energy (DoE) proxy applications. Compared to a write-through counter-cache scheme, on average, Osiris can reduce 48.7% of the memory writes (increase lifetime by 1.95x), and reduce the performance overhead from 51.5% (for write-through) to only 5.8%. Furthermore, without the need for a backup battery or extra power-supply hold-up time, Osiris performs better than a battery-backed Write-Back (5.8% vs. 6.6% overhead) and has less write-traffic (2.6% vs. 5.9% overhead). In summary, the major contributions of our paper are the following:

• We propose Osiris, a novel scheme that provides crash consistency for encryption counters similar to strict counter persistence schemes but with a significant reduction in performance overhead and number of NVM writes, without the need for an external/internal backup battery. Its optimized version, Osiris-Plus, further reduces the number of writes by eliminating premature counter evictions from the counter cache.
• We discuss several design options for Osiris that provide trade-offs between hardware complexity and performance.
• We discuss how our scheme can work with and be integrated with state-of-the-art data and counter integrity verification schemes.

The rest of the paper is organized as follows. In Section II, we briefly discuss the main concepts and background relevant to our paper. In Section III, we discuss our solution, its design options, trade-offs and security impact. Section IV covers our evaluation methods and analysis for Osiris. Later, in Section V, we discuss prior and related work. Finally, we conclude our work in Section VI.

II. BACKGROUND AND MOTIVATION

In this section, we discuss the main concepts that are relevant to our proposed solution, followed by the motivation for our work.

A. Background

In this part of the paper, we will discuss emerging NVMs and state-of-the-art memory encryption implementations.

1) Emerging NVMs: Emerging NVM technologies, such as Phase-Change Memory (PCM) and Memristor, are promising candidates to be the main building blocks of future memory systems. Vendors are already commercializing these technologies due to their many benefits. NVMs' read latencies are comparable to DRAM while promising high densities and the potential for scaling better than DRAM. Furthermore, they enable persistent applications. On the other hand, emerging NVMs have limited, slow and power-consuming writes. NVMs also have limited write endurance. For example, PCM's write endurance is between 10-100 million writes [3]. Moreover, emerging NVMs suffer from a serious security vulnerability: they keep their content even when the system is powered off. Accordingly, NVM devices are often paired with memory encryption.

2) Memory Encryption and Data Integrity Verification: There are different encryption modes that can be used to encrypt the main memory. The first one is direct encryption (a.k.a. ECB mode), where an encryption algorithm, such as AES or DES, is used to encrypt each cache block when it is written back to memory and decrypt it when it enters the processor chip again. The main drawback of direct encryption is system performance degradation due to adding the encryption latency to the memory access latency (the encryption algorithm takes the memory data as its input). The second mode, which is commonly used in secure processors, is counter-mode encryption. In counter-mode encryption, the encryption algorithm (AES or DES) uses an initialization vector (IV) as its input to generate a one-time pad (OTP), as depicted in Figure 1. Once the data arrives, a simple bitwise XOR with the pad is needed to complete the decryption. Thus, the decryption latency is overlapped with the memory access latency. In state-of-the-art designs [4], [10], each IV consists of a unique ID of a page (to distinguish between swap space and main memory space), a page offset (to guarantee different blocks in a page will get different IVs), a per-block minor counter (to make the same value encrypted differently when written again to the same address), and a per-page major counter (to guarantee uniqueness of the IV when minor counters overflow).

Fig. 1: State-of-the-art memory encryption using counter mode [4].

Similar to prior work [4], [6], [10], [13], [14], we assume counter-mode processor-side encryption. In addition to hiding the encryption latency when used for memory encryption, it also provides strong security defenses against a wide range of attacks. Specifically, counter-mode encryption prevents snooping attacks, dictionary-based attacks, known-plaintext attacks and replay attacks. Typically, the encryption counters are organized as major counters, shared between cache blocks of the same page, and minor counters that are specific to each cache block [10]. This organization of counters can fit 64 cache blocks' counters in a 64B block: 64 7-bit minor counters and a 64-bit major counter. The major counter is only incremented when one of its relevant minor counters overflows, in which case the minor counters are reset and the whole page is re-encrypted using the new major counter for building the IV of each cache block of the page [10]. When the major counter of a page overflows (64-bit counter), the key must be changed and the whole memory must be re-encrypted with the new key. This scheme provides a significant reduction of the re-encryption rate and minimizes the storage overhead of encryption counters when compared to other schemes such as a monolithic counter scheme or using independent counters for each cache block. Additionally, a split-counter scheme allows for better exploitation of the spatial locality of encryption counters, achieving a higher counter cache hit rate. Similar to state-of-the-art work [4], [8]–[10], we use a split-counter scheme to organize the encryption counters.
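To make the counter organization concrete, the following is a minimal Python sketch of a split-counter block and the IV construction described above. The 64-bit major / 7-bit minor split follows the text; the exact bit packing, the remaining IV field widths, and the helper names (pack_counter_block, build_iv) are illustrative assumptions rather than the paper's specification.

```python
# Illustrative layout of one split-counter block and its IV (hypothetical helpers;
# field widths beyond the 64-bit major / 7-bit minor split are assumptions).
MINORS_PER_PAGE = 64           # one 7-bit minor counter per 64B block of a 4KB page
MINOR_BITS = 7

def pack_counter_block(major: int, minors: list) -> bytes:
    """Pack a page's counters into one 64B block: 8B major + 64 x 7-bit minors (56B)."""
    assert len(minors) == MINORS_PER_PAGE
    packed = 0
    for i, m in enumerate(minors):
        packed |= (m & ((1 << MINOR_BITS) - 1)) << (i * MINOR_BITS)
    return major.to_bytes(8, "little") + packed.to_bytes(56, "little")

def build_iv(page_id: int, page_offset: int, major: int, minor: int) -> bytes:
    """IV = {page ID, page offset, major counter, minor counter}; widths chosen to total 128 bits."""
    return (page_id.to_bytes(6, "little") + page_offset.to_bytes(1, "little") +
            major.to_bytes(8, "little") + minor.to_bytes(1, "little"))

# Example: the IV for block 3 of page 0x42 after its 5th write to that block.
iv = build_iv(page_id=0x42, page_offset=3, major=0, minor=5)
```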

Fig. 2: An example Merkle Tree for integrity verification.

Data integrity is typically verified through a Merkle Tree — a tree of hash values with the root maintained in a secure region. However, encryption counter integrity must also be maintained. As such, state-of-the-art designs combine both data integrity and encryption counter integrity using a single Merkle Tree (Bonsai Merkle Tree [14]). As shown in Figure 2, the Bonsai Merkle Tree is built around the encryption counters. Data blocks are protected by a MAC value that is calculated over the counter and the data itself. Note that only the root of the tree needs to be kept in the secure region; other parts of the Merkle Tree are cached on-chip to improve performance. For the rest of the paper, we assume a Bonsai Merkle Tree.
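As a companion to Figure 2, here is a minimal sketch of building such a tree over counter blocks and keeping only the root in the trusted region. SHA-256 stands in for the (unspecified) hash/MAC function, the binary fan-out mirrors the figure rather than a real wider-ary tree, and merkle_root is a hypothetical helper name.

```python
import hashlib

def _h(*parts: bytes) -> bytes:
    """Stand-in node hash; the actual hash/MAC function is not specified here."""
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_root(counter_blocks: list) -> bytes:
    """Build a binary Merkle tree over counter blocks (leaves) and return the root,
    mirroring Fig. 2; intermediate nodes could be cached on-chip for speed."""
    level = [_h(c) for c in counter_blocks]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last node on odd levels
            level.append(level[-1])
        level = [_h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# On every counter update the affected path up to the root is recomputed; only the
# root must stay inside the secure processor, e.g.:
# trusted_root = merkle_root(all_counter_blocks)
```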

B. Encryption Metadata Crash Consistency

While crash consistency of encryption metadata has been overlooked in most memory encryption work, it becomes essential in persistent memory systems. If a crash happens, the system is expected to recover and restore its encrypted memory data. Figure 3 depicts the steps needed to ensure crash consistency.

Fig. 3: Steps for write operations to ensure crash consistency.

As shown in Figure 3, when there is a write operation to NVM, first we need to update the root of the Merkle Tree (as shown in Step 1) and any cached intermediate nodes inside the processor. Note that only the root of the Merkle Tree needs to be kept in the secure region. In fact, maintaining intermediate nodes of the Merkle Tree can speed up the integrity verification. Persisting updates of intermediate nodes into memory is optional, as it is feasible to reconstruct them from the leaf nodes (counters), then generate the root and verify it through comparison with that kept inside the processor. We stress that the root of the Merkle Tree must persist safely across system failures, e.g., through internal processor NVM registers. Persisting updates to intermediate nodes of the Merkle Tree after each access might speed up recovery time by reducing the time needed to rebuild the whole Merkle Tree after a crash. However, the overheads of such a scheme and the infrequency of crashes make rebuilding the tree a more reasonable option.

In Step 2 of Figure 3, the updated counter block will be written back to memory as it gets updated in the counter cache. Unlike Merkle Tree intermediate nodes, counters are critical to keep and persist; otherwise the security of the counter-mode encryption is compromised. Moreover, losing the counter values can result in the inability to restore encrypted memory data. As noted by Liu et al. [13], it is possible to just persist the counters of persistent data structures (or a subset of them) to enable consistent recovery. However, this is not sufficient from a security point of view; losing counters' values, even for non-persistent memory locations, can cause reuse of an encryption counter with the same key, which can compromise the security of the counter-mode encryption. Furthermore, legacy applications may rely on OS-level or periodic application-oblivious checkpointing, making it challenging to expose their persistent ranges to the memory controller. Accordingly, a secure scheme that persists counter updates and does not require software alteration is needed. Note that even for non-persistent applications, counters must be persisted on each update or the encryption key must be changed and all of the memory must be re-encrypted with a new key. Moreover, if the persistent region in memory is large, which is likely in future NVM-based systems, most memory writes will be accompanied by an operation to persist the corresponding encryption counters, making Step 2 a common event.

Finally, the written block will be sent to the NVM, as shown in Step 3. Some portions of Step 1 and Step 2 are crucial for correct and secure restoration of secure NVMs. Also note that updating the root of the Merkle Tree on the chip, updating the counter and writing the data are assumed to happen atomically, either using three internal NVM registers to save them before trying to update them persistently or using hold-up power that is sufficient to complete three writes. To avoid incurring the high latency of updating NVM registers for each memory write, a hybrid approach can be used where three volatile registers are backed with hold-up power to write them to the slower NVM registers inside the processor. Ensuring write atomicity is beyond the scope of this paper; our focus is to avoid frequent persistence of updates to counter values in memory while using the fast volatile counter cache and ensuring safe and secure recoverability.

C. Motivation

As mentioned earlier, counter blocks must be persisted to allow safe and secure recovery of the encrypted memory. Also, the Merkle Tree root cannot be written to NVM and must be saved in the secure processor. The intermediate nodes of the tree can be reconstructed after recovery from a crash, hence there is no need to persist updates to the affected nodes after each write operation. In other words, no extra work should be done to the tree beyond what occurs in conventional secure processors without persistent memory. As the performance overhead of Bonsai Merkle Tree integrity verification is negligible (less than 2% [14]), our focus in this paper is to reduce the overhead of persisting encryption counters. To persist the encryption counters, we employ two methods: a Write-Through counter cache (WT) and a Battery-Backed Write-Back counter cache (WB). In the WT approach, whenever there is a write operation issued to NVM, the corresponding counter block is updated in the counter cache and persisted in NVM before the data blocks are encrypted and updated in NVM to complete the write operation. However, the WT approach causes significant write-traffic and will severely degrade performance. In the WB approach, the updated counter block is marked as dirty in the counter cache and is only written back to NVM when it is evicted. The main issue with the WB counter-cache is that it requires a sufficient and reliable power source after a crash to enable flushing the counter-cache contents to NVM, which increases the system cost and typically requires a large area. Most processor vendors, e.g., Intel, define the in-processor persistent domain to only include the write-pending queue (WPQ) in the memory controller. Note that using a UPS to hold up the system until important data is backed up is a common practice in critical systems. However, we do not expect commodity or low-power processors to shoulder the costs and area requirements of UPS systems.
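The two baseline policies can be summarized with a toy model, shown below; CounterStore and its methods are hypothetical and only meant to show where the extra NVM write of the WT scheme comes from.

```python
# Toy model contrasting the WT and WB counter-cache policies described above
# (data encryption itself is elided; we only count NVM writes).
class CounterStore:
    def __init__(self):
        self.cache = {}          # addr -> [counter, dirty]
        self.nvm_counters = {}   # counter blocks persisted in NVM
        self.nvm_writes = 0

    def on_data_write(self, addr: int, policy: str) -> None:
        ctr, _ = self.cache.get(addr, [self.nvm_counters.get(addr, 0), False])
        ctr += 1
        if policy == "WT":                     # persist the counter with every data write
            self.nvm_counters[addr] = ctr
            self.nvm_writes += 1
            self.cache[addr] = [ctr, False]
        else:                                  # "WB": dirty in cache, flushed on eviction
            self.cache[addr] = [ctr, True]
        self.nvm_writes += 1                   # the data block write itself

wt, wb = CounterStore(), CounterStore()
for _ in range(100):
    wt.on_data_write(0x40, "WT")
    wb.on_data_write(0x40, "WB")
print(wt.nvm_writes, wb.nvm_writes)            # 200 vs. 100 NVM writes for the same traffic
```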

Fig. 4: Impact of counter-cache persistence on performance (execution time normalized to no-encryption for LBM, LIBQUANTUM, CACTUS, PENNANT, LULESH, SIMPLEMOC, MINIFE, MCF and their average; schemes: No Encryption, WT, WB).

As shown in Figure 4, the WT counter-cache entails significant performance degradation for most of the benchmarks compared to the no-encryption and WB counter-cache schemes. On average, the WT scheme degrades performance by 51.5% whereas the WB scheme only degrades performance by 6.6% compared to a no-encryption (unprotected) scheme. In applications that are write-intensive, e.g., the libquantum benchmark, the performance overhead of WT can reach 198% normalized to that of the no-encryption scheme.

Fig. 5: Impact of counter-cache persistence on number of writes (memory writes normalized to no-encryption for the same benchmarks and schemes).

Another key metric relevant to NVM is the write traffic. NVMs have limited write bandwidth due to the limited write-drivers and the large power consumption of each write operation. More critically, NVMs have limited endurance for writes. In the WT scheme, besides the data block write, each write operation requires an additional in-memory persistence of the counter value, thus incurring twice the write traffic of the no-encryption scheme. In contrast, the WB scheme only writes the updated counter blocks to the NVM when blocks are evicted from the counter cache. As shown in Figure 5, WB incurs an extra 5.92% writes on average when compared to the no-encryption scheme, whereas WT, as expected, incurs twice the write traffic.

Note that the atomicity of the WT scheme covers both the data write and the counter write. Crashes (or power loss) can happen between transactions or in the middle of a transaction. There are several simple ways to guarantee atomicity: (1) placing a small NVM storage (e.g., 128B) inside the processor chip to log both the to-be-written data block and counter block before trying to commit them to memory — if a failure occurs, once the system restores, it begins by redoing the stored transactions; (2) guaranteeing enough power hold-up time (through reasonably small capacitors) to complete at least 2 write operations (data and counter). However, solutions that aim to guarantee memory transaction-level atomicity are beyond the scope of this paper.

In this paper, we propose a new scheme that has the advantages of the WT scheme — no need for a battery or long power hold-up time — while having performance and write-traffic overheads as small as those of the WB scheme.

III. OSIRIS

In this section, we first discuss our assumed threat model. Then, we discuss the design of Osiris and the possible design options and trade-offs.

A. Threat Model

Our assumed threat model is similar to state-of-the-art work on secure processors [4], [6], [8]. The trust base includes the processor and all of its internal structures. Our threat model assumes that an attacker can snoop the memory bus, scan the memory content, tamper with the memory content (including rowhammer) and replay old packets. Differential power attacks and electromagnetic inference attacks are beyond the scope of this paper. Furthermore, attacks that try to exploit processor bugs in speculative execution, such as Meltdown and Spectre, are beyond the scope of this paper. Such attacks can be mitigated through more aggressive memory fencing around critical code to prevent speculative execution. Finally, our proposed solution does not preclude secure enclaves and hence can operate in untrusted Operating System (OS) environments.

Attack on Reusing Counter Values for Non-Persistent Data: While state-of-the-art work [13] relaxes persisting counters for non-persistent data, if a split counter like that in [15] is used, this actually introduces serious security vulnerabilities. For example, assume an attacker application uses known plaintext and writes it to memory. If the memory location is non-persistent, the encrypted data will be written to memory but not the counter. Thus, by observing the memory bus, the attacker can discover the encryption pad by XORing the observed ciphertext, (E_key(IV_new) ⊕ Plaintext), with the Plaintext. Note that it is also easy to predict the plaintext for some accesses, for instance, zeroing at first access. By now, the attacker knows the value of E_key(IV_new). Later, after a crash, the memory controller will read IV_old and increment it, which generates IV_new, and then encrypts the new application data written to that location to become E_key(IV_new) ⊕ Plaintext2. The attacker can discover the value of Plaintext2 by XORing the ciphertext with the previously observed E_key(IV_new). Note that the stale counter could have already been incremented multiple times before the crash; multiple writes of the new application can reuse counters with known encryption pads. Note that such an attack only needs a malicious application to run (or just predictable initial plaintext of an application) and a physical attacker or bus snooper.
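A toy illustration of this pad-reuse attack is sketched below, assuming the attacker can observe ciphertexts on the bus; the random 64-byte pad merely stands in for E_key(IV_new).

```python
import os

# Toy illustration of the counter-reuse attack described above. The 64B pad stands
# in for E_key(IV_new); the point is only that the same (key, IV) pad is XORed twice.
pad_for_iv_new = os.urandom(64)                       # unknown to the attacker

known_plaintext = b"A" * 64                           # attacker-controlled / predictable data
ct1 = bytes(a ^ b for a, b in zip(pad_for_iv_new, known_plaintext))
leaked_pad = bytes(a ^ b for a, b in zip(ct1, known_plaintext))   # ciphertext XOR plaintext = pad

# After the crash the stale counter is incremented again, so IV_new (and its pad) is
# reused for the victim's new data:
victim_plaintext2 = b"secret data written after the crash".ljust(64, b"\0")
ct2 = bytes(a ^ b for a, b in zip(pad_for_iv_new, victim_plaintext2))
recovered = bytes(a ^ b for a, b in zip(ct2, leaked_pad))
assert recovered == victim_plaintext2                 # attacker recovers Plaintext2
```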

B. Design Options

Before delving deep into the details of Osiris, let's first discuss the challenges of securely recovering the encryption counters after a crash. Without a mechanism to persist encryption counters, once a crash occurs, it is only guaranteed that the root (kept in the processor) of the Merkle Tree is updated and reflects the most recent counter values written to memory; any write operation, before being sent to NVM, will update the affected parts/branches of the Merkle Tree up to the root. Note that it is likely that the affected branches of the Merkle Tree will be cached on the processor, and there is no need to persist them as long as the processor can maintain the value of the root after a crash. Later, once the power is back and we want to restore the system, we may have stale counter values in the NVM and stale intermediate values of the Merkle Tree.

Once the system is powered back, any access to a memory location needs two main steps: (1) obtain the corresponding most-recent encryption counter from memory, and (2) verify the integrity of the data through the MAC and the integrity of the used counter value through the Merkle Tree. It is possible that Step 1 results in a stale counter, in which case Step 2 will fail due to a Merkle Tree mismatch. Remember that the root of the Merkle Tree was updated before the crash, thus using a stale counter will be detected. As soon as the error is detected, the recovery process stops. One naïve approach would be to try all possible counter values and use the Merkle Tree to verify each value. Unfortunately, such a brute-force approach is impractical for several reasons. First, finding out the actual value requires trying all possible values for a counter paired with calculating the corresponding hash values to verify integrity, which is impractical for typical encryption counter schemes where there could be 2^64 possible values for each counter. Second, it is unlikely that only one counter value is stale; many updated counters in the counter cache will be lost. Thus, reconstructing the Merkle Tree will be impractical if there are multiple stale counters. For example, if the counters of blocks X and Y are lost, then we need to reconstruct the Merkle Tree with all possible combinations/values of X and Y and then compare the resulting root with the one safely stored in the processor. For simplicity we only mention losing two counters; in a real system, where a counter cache is hundreds of kilobytes, there will likely be thousands of stale blocks.

Observation 1: Losing encryption counter values makes reconstructing the Merkle Tree nearly impossible. Approaches such as a brute-force trial of all possibly lost counter values to reconstruct the tree will take an impractical amount of time, especially when multiple counter values have been lost.

One possible way to reduce reconstruction time is to employ stop-loss mechanisms that limit the number of possible counter values to verify for each counter after recovery. Unfortunately, since there is no way to pinpoint exactly which counters have lost their values, an aggressive search mechanism is still needed. If we limit the number of writes to each counter block before persisting it to only N, then we need to try up to N^S combinations for reconstruction, where S is the number of data blocks. For instance, let's assume we have a 128GB NVM memory and 64B cache blocks; then we have 2G blocks. If we only set N to 2, then we need up to 2^(2^31) = 2^2147483648 trials. Accordingly, a stop-loss mechanism could reduce the time to reconstruct the Merkle Tree. However, it is still impractical.

Obviously, a more explicit confirmation is needed before proceeding with an arbitrary counter value to reconstruct the Merkle Tree. In other words, we need a hint on what the most recent counter value was for each counter block. For instance, if the previously discussed stop-loss mechanism is used along with an approach to bookkeep the phase within the N trials before writing the whole counter block, then we can start with a more educated guess. Specifically, each time we update a counter block N times in the counter cache, we need to persist its N-th update in memory, which means that we need log2(N) bits (i.e., a phase) for each counter block to be written atomically with the data block. Later, when the system starts recovery, it knows the exact difference between the most recent counter value and the one used to encrypt the data through the phase value.
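A small sketch of this hypothetical stop-loss-plus-phase bookkeeping (not Osiris itself) is shown below; the exact encoding of the phase and the assumption that the persisted counter is a multiple of N are illustrative.

```python
# Sketch of the hypothetical stop-loss + phase-bits scheme above (not Osiris itself).
# The counter block is persisted only on every N-th update, and log2(N) "phase" bits
# are written atomically with each data block.
N = 4                                    # stop-loss limit -> log2(4) = 2 phase bits

def phase_bits(counter: int) -> int:
    return counter % N                   # stored next to the data block

def recover_counter(persisted: int, phase: int) -> int:
    # 'persisted' is the last value written to NVM (a multiple of N under this policy),
    # so the counter actually used for encryption is persisted + phase.
    return persisted + phase

assert recover_counter(8, 3) == 11       # three updates happened after the last persist
```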

Observation 2: Co-locating the data blocks with a few bits that reflect the most-recent counter value used for encryption can enable fast recovery of the counter value used for encryption. Note that if an attacker tries to replay old data along with their phase bits, then the Merkle Tree verification will detect the tampering due to a mismatch in the resulting root of the Merkle Tree.

Although stop-loss along with phase storage can make the recovery time practical, adding more bits in memory for each cache line is tricky for several reasons. First, as discussed in [13], increasing the bus width requires adding more pins to the processor. Even avoiding extra pins by adding an extra burst in addition to the 8 bursts of 64-bit bus width for each 64B block is expensive and requires support from DIMMs, in addition to underutilization of the data bus (only a few bits are written on the 64-bit wide memory bus in the last burst). Second, major memory organization changes are needed, e.g., row-buffer size, memory controller timing and DIMM support. Additionally, cache blocks are no longer 64B aligned, which can cause complexity in addressing. Finally, extra bit writes are needed for each cache line to reflect the counter phase, which can map to a different memory bank, hence occupying an additional bank for the duration of the write latency.

To retain the advantages of stop-loss paired with phase bookkeeping but without extra bits, Osiris repurposes already existing ECC bits as a fast counter recovery mechanism. The following subsection discusses in detail how Osiris can elegantly employ the ECC bits of the data to find out the counter used for encryption.

C. Design

Osiris mainly relies on inferring the correctness of an encryption counter by calculating the ECC associated with the decrypted text and comparing it with that encrypted and stored with the encrypted cacheline. In conventional (not encrypted) systems, when the memory controller writes a cache block to memory it also calculates its ECC, e.g., a Hamming code, and stores it with the cacheline. In other words, the tuple that will be written to memory when writing cache block X is {X, ECC(X)}. In contrast, in encrypted memory systems, there are two options to calculate the ECC: (1) using the plaintext, then encrypting it with the cacheline before writing both to memory, or (2) encrypting the plaintext, then calculating the ECC over the encrypted block before both are written to memory. Although approach (2) allows overlapping decryption and ECC verification, most ECC implementations used in memory controllers, e.g., the SEC-DED Hsiao Code [16], take less than a nanosecond to complete [17]–[19], which is negligible compared to cache or memory access latencies [20]. Additionally, pipelining the arrival of bursts with decryption and ECC decoding will completely hide the latency. However, we observe that calculating the ECC bits over the plaintext and encrypting them along with the cacheline can provide a low-cost and fast way of verifying the correctness of the encryption/decryption operation. Specifically, in counter-mode encryption, the data is decrypted as follows: {X, Z} = E_key(IV_X) ⊕ Y, where Y is supposedly/potentially the encryption of X along with its ECC, and Z is potentially equal to ECC(X). In conventional systems, if ECC(X) ≠ Z, then the reason is definitely an error (or tampering) that occurred on X or Z. However, when counter-mode encryption is used, the reason could be that an error (or tampering) occurred on X or Z, or that the wrong IV was used, i.e., the decryption was not done successfully.
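The following sketch shows the idea under simple stand-ins: an 8-bit toy checksum per 64-bit word plays the role of the SEC-DED code, and a hash-derived pad plays the role of E_key(IV); the function names are hypothetical. The point is only that decrypting with the wrong counter makes the recomputed ECC disagree with the stored Z.

```python
import hashlib

def toy_ecc(word8: bytes) -> int:
    """Toy per-word code: 8 'ECC' bits per 64-bit word (stands in for SEC-DED Hsiao)."""
    return hashlib.sha256(word8).digest()[0]

def ecc_of_block(data64: bytes) -> bytes:
    return bytes(toy_ecc(data64[i:i + 8]) for i in range(0, 64, 8))

def pad(key: bytes, iv: bytes) -> bytes:
    """Stand-in for E_key(IV): a 72B one-time pad derived from (key, IV)."""
    out = b"".join(hashlib.sha256(key + iv + bytes([i])).digest() for i in range(3))
    return out[:72]

def encrypt_block(key: bytes, iv: bytes, data64: bytes) -> bytes:
    plain = data64 + ecc_of_block(data64)                 # {X, ECC(X)}, 72 bytes
    return bytes(a ^ b for a, b in zip(plain, pad(key, iv)))

def decrypt_and_check(key: bytes, iv: bytes, stored72: bytes):
    plain = bytes(a ^ b for a, b in zip(stored72, pad(key, iv)))
    data, z = plain[:64], plain[64:]
    return data, ecc_of_block(data) == z                  # ECC(X) == Z <=> counter looks right

key, data = b"k" * 16, bytes(range(64))
ct = encrypt_block(key, b"IV=5", data)
assert decrypt_and_check(key, b"IV=5", ct)[1] is True     # correct counter passes
assert decrypt_and_check(key, b"IV=6", ct)[1] is False    # stale/wrong counter detected
```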

Observation 3: When the ECC function is applied over the plaintext and the resulting ECC bits are encrypted along with the data, the ECC bits can provide a sanity-check for the encryption counter. Any tampering with the counter value will be detected by a clear mismatch of the ECC bits that results from the invalid decryption; the results of E_key(IV_old) and E_key(IV_new) are very different and independent. Note that in a Bonsai Merkle Tree, data integrity is protected through MAC values that are calculated over each data block and its corresponding counter. While we rely on ECC for sanity-checking, the ECC bits cannot provide guarantees as strong as cryptographic MAC values. Accordingly, we adopt Bonsai Merkle Trees to add additional protection for data integrity. However, ECC bits, when combined with counter-verification mechanisms, can provide tamper-resistance as strong as the error detection of the used ECC algorithm.

Important Note: Stream ciphers, e.g., CTR and OFB modes, do not propagate errors, i.e., an error in the i-th bit of the encrypted data will result in an error in the i-th bit of the decrypted data, hence reliability is not affected. In encryption modes where an error in the encrypted data can result in completely unrelated decrypted text, e.g., block cipher modes, careful consideration is required, as encrypting the ECC bits can render them useless when there is an error. For our scheme, we focus on state-of-the-art memory encryption, which commonly uses CTR mode for security and performance reasons.

Now the question is how to proceed when there is an error detected due to a mismatch between the expected ECC and the stored ECC (obtained after decryption). As the reader would expect, the first step is to find out if the error is correctable using the ECC code. Table I lists some common ECC errors and their associated fixes.

TABLE I: Common Sources of ECC Mismatch

Error Type            Common Fix
Error on stored data  Can be fixed if the error is correctable, e.g., a single-bit failure
Error on ECC          Typically unrecoverable
Stale/Wrong IV        Speculate the correct IV and verify

As shown in Table I, one reason for such an ECC mismatch is using an old IV value. To better understand how this can happen, recall the counter-cache persistence issue. If a cache block is updated in memory, it is also necessary to update and persist its encryption counter, for both security and correctness reasons. Given the ability to detect the use of a stale counter/IV, we can implicitly reason about the likelihood of having lost the most-recent counter value due to a sudden power loss. To that extent, in theory, we can try to decrypt the data with all possible IVs and stop when an IV successfully decrypts the block, i.e., the resulting ECC matches the expected one (ECC(X) = Z). At that point, there is a high chance that the IV was actually the one used to encrypt the block, but it was either lost due to the inability to persist the new counter value after persisting the data, or dropped due to a relaxed scheme. Osiris builds upon the latter possibility; it uses a relaxed counter persistence scheme and employs the ECC bits to verify the correctness of the counter used for encryption. As discussed earlier, it is impractical to try all possible IVs to infer the IV used to encrypt the block, thus Osiris deploys the stop-loss mechanism to limit the possibilities to only N updates of a counter, i.e., the correct IV should be within [IV+1, IV+N], where IV is the most recent IV that was stored/persisted in memory. Note that once the speculated/chosen IV passes the first check through the ECC sanity-check, it also needs to be verified through the Merkle Tree.

We propose two flavors of Osiris: baseline Osiris and Osiris-Plus. In baseline Osiris, during normal operation, all counters being read from memory reflect their most-recent values; the most-recent value is either updated in the cache or has been evicted/written back to memory. Thus, inconsistency between counters in memory and counters in cache can only happen due to a crash or tampering with counters in memory. In contrast, Osiris-Plus strives to even outperform the battery-backed write-back counter-cache scheme by purposely skipping counter updates and recovering their most-recent values when reading them back; hence inconsistency can happen during normal operation. Below is a further discussion of both baseline Osiris and Osiris-Plus.

Normal Operation Stage: During normal operation, Osiris adopts a write-back mechanism, updating memory counters only when they are evicted from the counter cache. Thus, Osiris can always find the most-recent counter value either in the cache or by fetching it from memory in the case of a miss. Accordingly, in normal operation, Osiris is similar to conventional memory encryption except that a counter is persisted once on each N-th update — acting like a write-through for the N-th update of each counter. In contrast, Osiris-Plus allows occasional dropping of the most-recent values of encryption counters by relying on a run-time mechanism to recover the most-recent values of counters. In other words, Osiris-Plus relies on trying multiple possible values on each counter miss to recover the most-recent one before verifying it through the Merkle Tree; baseline Osiris only does this at system recovery time.

System Recovery Stage: During system recovery, both Osiris and Osiris-Plus must reconstruct the Merkle Tree at the time of restoration. Furthermore, both need to use the most-recent values of counters to reconstruct a tree whose root matches the root stored in the secure processor. In both Osiris and Osiris-Plus, system recovery will start by traversing all memory locations (cache lines) in an integrity-verified region (all memory in secure NVM). For each cache line location, i.e., 64B address, the memory controller uses the ECC value after decryption as a sanity check of the counter retrieved from memory. However, if the counter value fails the check, all possible N values must be checked to find the most-recently used counter value. Later, the correct value will overwrite the current (stale) counter in memory. After all memory locations are checked, the Merkle Tree will be reconstructed with the recovered counter values, eventually building up all intermediate nodes and the root. In the final step, the resulting root will be compared with that saved and kept in the processor. If a mismatch occurs, the data integrity of the memory cannot be verified and it is very likely that an attacker has replayed both the counter and the corresponding encrypted data+ECC blocks.
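A minimal sketch of this recovery scan is given below, reusing decrypt_and_check and merkle_root from the earlier sketches; the IV derivation, the stop-loss window handling, and the way counters are grouped under the Merkle Tree are simplified assumptions.

```python
# Recovery scan sketch (builds on decrypt_and_check/merkle_root from the sketches above).
# For every cache line, try the persisted counter and up to N successors; the counter
# whose post-decryption ECC matches is taken as the most-recent one, then the whole
# Merkle Tree is rebuilt and its root compared with the root kept in the processor.
def recover_counters(key, nvm_data, nvm_counters, trusted_root, N=4):
    recovered = {}
    for addr, stored72 in nvm_data.items():
        base = nvm_counters[addr]
        for ctr in range(base, base + N + 1):              # candidate window [base, base+N]
            iv = addr.to_bytes(8, "little") + ctr.to_bytes(8, "little")  # simplified IV
            _, ok = decrypt_and_check(key, iv, stored72)
            if ok:
                recovered[addr] = ctr
                break
        else:
            raise RuntimeError(f"no candidate counter passed the ECC check for {addr:#x}")
    counter_blocks = [recovered[a].to_bytes(8, "little") for a in sorted(recovered)]
    if merkle_root(counter_blocks) != trusted_root:
        raise RuntimeError("Merkle Tree root mismatch: possible tampering/replay")
    return recovered
```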

Next, we will discuss the design of Osiris and Osiris-Plus by guiding the reader through the steps of read and write operations in both schemes.

1) Osiris Read and Write Operations: As shown in Figure 6, during a write operation, once a cache block is evicted from the LLC, the memory controller will calculate the ECC of the data in the block, as shown in Step 1. Note that write operations happen in the background and are typically buffered in the write-pending queue. In Step 2, the memory controller obtains the corresponding counter (fetching it from memory in the event of a miss) and evicts/writes back the victim counter block, if dirty. The counter value obtained from Step 2 will be used to proactively generate the encryption pad, as shown in Step 3. In Step 4, the counter value will be verified (in case of a miss) and the counter and the affected Merkle Tree nodes (including the root node) are updated in the Merkle Tree cache. Unlike a typical write-back counter-cache, if the new counter value is a multiple of N or 0, then the new counter value is also persisted before proceeding. Finally, in Step 5, the data+ECC is encrypted using the encryption pad and written to memory.

Fig. 6: Osiris Write Operation
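A compact sketch of this write path, under the same stand-ins as before (encrypt_block from the earlier sketch; ToyNVM and osiris_write are hypothetical), is shown below; the Merkle Tree update of Step 4 is omitted.

```python
class ToyNVM:
    """Hypothetical NVM back-end used only to make the sketch runnable."""
    def __init__(self):
        self.counters, self.data = {}, {}
    def read_counter(self, addr):        return self.counters.get(addr, 0)
    def write_counter(self, addr, ctr):  self.counters[addr] = ctr
    def write_data(self, addr, blob):    self.data[addr] = blob

def osiris_write(addr, data64, ctr_cache, nvm, key, N=4):
    # Step 2: obtain the counter (from the write-back counter cache or NVM) and bump it.
    ctr = ctr_cache.get(addr, nvm.read_counter(addr)) + 1
    ctr_cache[addr] = ctr
    # Persist the counter only when it reaches a multiple of N (write-through for
    # every N-th update); at most N-1 increments can be lost across a crash.
    if ctr % N == 0:
        nvm.write_counter(addr, ctr)
    # Steps 1, 3, 5: compute ECC over the plaintext, generate the pad, encrypt, write.
    iv = addr.to_bytes(8, "little") + ctr.to_bytes(8, "little")
    nvm.write_data(addr, encrypt_block(key, iv, data64))

nvm, ctr_cache = ToyNVM(), {}
osiris_write(0x1000, bytes(64), ctr_cache, nvm, key=b"k" * 16, N=4)
```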

Fig. 7: Osiris Read Operation

During a read operation, as shown in Figure 7, the memory controller will obtain the corresponding counter value from the counter cache (in the case of a hit) or memory (in the case of a miss) and evict the victim block, if dirty, as shown in Steps 1 and 2. The counter value will be used to generate the encryption pad, as shown in Step 3. In Step 4, the actual data block is read from memory and decrypted using the pad generated in Step 3. In Step 5, traditional ECC checking occurs. If the counter block was fetched from memory (miss), the integrity of the counter value is also verified, as shown in Step 6. Finally, as shown in Step 7, the memory controller receives the decrypted data, which is then forwarded to the cache hierarchy. Note that many of the steps can be overlapped without any conflicts, for instance, Step 6 with Steps 4 and 5.

2) Osiris-Plus Read and Write Operations: The write operation in Osiris-Plus is very similar to the write operation in baseline Osiris; the difference is that it does not write back evicted dirty blocks from the counter cache, as could happen in Step 2 of Figure 6. Instead, Osiris-Plus recovers the most-recent value of the counter each time it is read from memory and only updates it in memory on each N-th update. Figure 8 depicts the read operation of Osiris-Plus.

Fig. 8: Osiris-Plus Read Operation

The main difference between Osiris and Osiris-Plus is shown in Steps 5 and 6 of Figure 8. Osiris-Plus utilizes additional encryption engines and ECC units to allow fast recovery of the most-recent counter value. Note that, given that most AES encryption engines are pipelined and the candidate counter values are sequential, using a single encryption engine instead of N engines adds only N extra cycles. Once a counter value is detected, through post-decryption ECC verification, to be the most-recent one, it will be verified through the Merkle Tree similar to baseline Osiris. Once the counter is verified, the data resulting from decrypting the ciphertext with the most-recent counter value will be returned to the memory controller. Note that the recovered counter value is updated in the counter cache, as shown in Step 7.

D. Reliability and Recovery from Crash

As mentioned earlier, to recover from a crash, Osiris and Osiris-Plus must first recover the most-recent counter values, utilizing the post-decryption ECC bits to find the correct counters. The Merkle Tree will then be constructed and the resulting root will be verified and compared with the root kept in the processor. While the process seems reasonably simple, it can get complicated in the presence of errors. Specifically, uncorrectable errors in the data or encryption counters make generating a matching Merkle Tree root impossible, which means that the system is unable to verify the integrity of memory. Note that such uncorrectable errors will have the same effect on integrity-verified memories even without deploying Osiris.

Uncorrectable errors in encryption counters protected by a Merkle Tree can potentially fail the recovery process. Specifically, when the system attempts to reconstruct the Merkle Tree, it will fail to generate a matching root. Furthermore, it is not feasible to know which part of the Merkle Tree causes the problem; only the root is maintained, so all other parts of the tree should generate the same root or none should be trusted.

One way to mitigate this single point of failure is to keep other parts of the Merkle Tree in the processor and guarantee they are never lost. For instance, for an 8-ary Merkle Tree, the 8 immediate children of the root are also saved. In a more capable system, the root, its 8 immediate children and their children are kept — a total of 73 MAC values. If the recovery process fails to produce the root of the Merkle Tree, we can look at which children are mismatched and declare that part of the tree un-verifiable and warn the user/OS. While we provide insight into this problem, we assume the system architects choose to only save the root. However, having more NVM registers inside the processor to save more levels of the Merkle Tree can be implemented for high-error systems. We leave reconstructing a Merkle Tree in the presence of uncorrectable errors as future work.

To formally describe the success rate of our recovery process, we can look at two cases. In the first case, when no errors occur in the data, since each 64B memory block has 8B of ECC (64 bits), the probability that a wrong counter results in correct (indicating no errors) ECC bits for the whole block is only $\frac{1}{2^{64}}$, i.e., a probability similar to guessing a 64-bit key correctly, which is next to impossible. Note that each 64B block is practically divided into 8 64-bit words [21], each with its own 8-bit ECC, i.e., each bus burst carries 64 data bits and 8 ECC bits (72 bits in total). This is why most ECC-enabled memory systems have 9 chips per rank instead of 8, where each chip provides 8 bits; the 9th chip provides the 8 ECC bits for the burst/word. Also note that the 64B block is read/written as 8 bursts on a 72-bit memory bus.

Bearing in mind the organization of ECC codewords for each 64B memory block, let's now discuss the case where there actually is an error in the data. For an incorrect counter, the probability that an 8-bit ECC codeword indicates that there is no error is $P_{no\text{-}error} = \frac{1}{2^8}$, and the probability of not indicating that there is no error, i.e., of indicating an error, is $P_{error} = 1 - \frac{1}{2^8}$. The probability that exactly $k$ of the 8 ECC codewords indicate an error can be represented by a Bernoulli trial as $P_k = \binom{8}{k} \times (1 - \frac{1}{2^8})^{k} \times (\frac{1}{2^8})^{8-k}$. For example, $P_2$ represents the probability of having 2 of the 8 ECC codewords flagging an error for wrongly decrypted data and ECC (semi-randomly generated). Accordingly, if we use our metric to filter encryption counters that have 4 or more ECC codewords indicating an error, then the probability of success in detecting wrong counters can be given by $P_{k\geq4} = 1 - (P_0 + P_1 + P_2 + P_3)$. We find that $P_{k\geq4}$ is nearly 100% (improbability less than $5.04\times10^{-11}$). Thus, by filtering out counters that cause 4 or more incorrect ECC codewords, we can detect any wrong counter with a probability of almost 100%. However, if the data has 4 or more (out of 8) words with at least one error each, then the correct counter will also be filtered out, and Merkle Tree verification will be used with each of the possible counter values to identify the correct counter before correcting the data. In other words, Osiris reliably filters out wrong counters immediately except in the rare case of having at least 4 words, each with at least one error, in the same data block. However, since with Osiris only a few thousand blocks (those that were in the cache) are stale at recovery time, the probability that one of them corresponds to a data block with 4 or more faulty words is very small; such a block merely requires an additional step to find the correct counter. Note that Osiris does not affect reliability, but rather incurs an additional step for identifying the correct counter when the original data has a large number of faulty words.
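To make the above arithmetic concrete, the short Python snippet below (purely illustrative and not part of the Osiris hardware) evaluates the Bernoulli-trial expression for 8-bit ECC codewords and confirms the quoted bound: the probability that a wrong counter produces fewer than 4 mismatching codewords is below $5.04\times10^{-11}$.

    from math import comb

    # A wrong counter yields (semi-)randomly scrambled data+ECC, so a single
    # 8-bit ECC codeword falsely reports "no error" with probability 1/2^8.
    p_no_error = 1 / 2**8
    p_error = 1 - p_no_error

    def p_k(k, words=8):
        # Probability that exactly k of the 8 per-word ECC codewords flag an error.
        return comb(words, k) * p_error**k * p_no_error**(words - k)

    # Probability that a wrong counter is NOT caught by the
    # "4 or more mismatching codewords" filter (i.e., k <= 3).
    p_miss = sum(p_k(k) for k in range(4))
    print(f"P(k <= 3) = {p_miss:.3e}")  # ~5.04e-11, so P(k >= 4) is essentially 100%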

It is also important to note that many PCM prototypes use a similar ECC organization but with a larger number of ECC bits per 64-bit word, e.g., 16-bit ECC for each 64-bit word [22], which makes our detection success rate even closer to perfect. Specifically, using a Bernoulli trial for 16-bit ECC per word, the probability that a wrong counter causes 5 or more mismatching ECC codewords is almost 100% (improbability less than $2\times10^{-13}$), and thus only blocks in which at least 5 (out of 8) words are faulty would require additional Merkle Tree verification to recognize the correct counter. Finally, in the rare case that a wrong counter is falsely not filtered out, Osiris will rely on the Merkle Tree to identify the correct counter among those not filtered out.

E. Systems without ECC

Instead of ECC bits, some systems employ MAC values that can be co-located with data [8]. For instance, the ECC chips in the DIMM can be used to store the MAC values of the Bonsai Merkle Tree, allowing the system to obtain the data integrity-verification MACs along with the data. While our description of Osiris and Osiris-Plus focuses on using ECC, MAC values can also be used for the same purpose, namely as a sanity check for the decryption counter. The only difference arises when an error is detected. With ECC, the number of mismatched bits can be used to guess the counter value. This is not true for MAC values, which tend to differ significantly in the presence of an error. To solve this problem, MAC values that are aggregations of multiple MACs-per-word are used, e.g., a 64-bit MAC made of 8 bits for each 64 data bits. The difference between the generated MACs and the stored MACs for different counter values can then be used as a selection criterion for the candidate counter value. Our proposed schemes also work with ECC and MAC co-designs such as Synergy (parity+MAC) [8].
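As a rough software model of this selection step (illustrative only; decrypt_block and mac8 are hypothetical stand-ins for the controller's counter-mode decryption and per-word MAC logic), each candidate counter value is scored by how many of its 8 per-word MACs disagree with the stored ones, and the least-mismatching candidate is forwarded to Merkle Tree verification:

    # Hypothetical model of candidate-counter selection using per-word MACs.
    def mismatch_count(candidate_ctr, cipher_words, stored_macs, decrypt_block, mac8):
        # Decrypt the 8 words with this candidate counter and count MAC disagreements.
        plain_words = decrypt_block(cipher_words, candidate_ctr)
        return sum(mac8(w) != m for w, m in zip(plain_words, stored_macs))

    def select_counter(last_persisted, limit, cipher_words, stored_macs,
                       decrypt_block, mac8):
        # The true counter lies within <limit> increments of the last persisted value.
        candidates = range(last_persisted, last_persisted + limit)
        best = min(candidates,
                   key=lambda c: mismatch_count(c, cipher_words, stored_macs,
                                                decrypt_block, mac8))
        return best  # still confirmed afterwards via Merkle Tree verification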

F. Security of Osiris and Osiris-Plus

Since Osiris and Osiris-Plus use Merkle Trees for final verification, they have security guarantees and tamper resistance similar to any scheme that uses a Bonsai Merkle Tree, such as state-of-the-art secure memory encryption schemes [4], [6], [8].

IV. EVALUATION

In this section, we evaluate the performance of Osiris against other state-of-the-art persisting cache designs. Section IV-A describes the methodology we use for the experiments. Section IV-B discusses the impact of the limit value on Osiris/Osiris-Plus performance and compares them with the other cache designs.


TABLE II: Configuration of the simulated system.

Processor
  CPU: 4-core, 1GHz, out-of-order, x86-64
  L1 Cache: private, 2 cycles, 32KB, 2-way, 64B block
  L2 Cache: private, 20 cycles, 512KB, 8-way, 64B block
  L3 Cache: shared, 32 cycles, 8MB, 64-way, 64B block

DDR-based PCM Main Memory
  Capacity: 16 GB
  PCM Latencies: 60ns read, 150ns write [23]
  Organization: 2 ranks/channel, 8 banks/rank, 1KB row buffer, Open Adaptive page policy, RoRaBaChCo address mapping
  DDR Timing: tRCD 55ns, tXAW 50ns, tBURST 5ns, tWR 150ns, tRFC 5ns, tCL 12.5ns [13], [23]; 64-bit bus width, 1200 MHz clock

Encryption Parameters
  Counter Cache: 256KB, 16-way, 64B block

A. Methodology

We model Osiris in Gem5 [24] with the system configuration presented in Table II. We implement a 256KB, 16-way set-associative counter cache with a total of 4K counter blocks. To stress-test our design, we select memory-intensive applications. Specifically, we select eight memory-intensive representative benchmarks from the SPEC2006 suite [25] and from proxy applications provided by the U.S. Department of Energy (DoE) [26]. The goal is to evaluate the performance and write-traffic overheads of our design. In all experiments, the applications are fast-forwarded to skip the initialization phase, followed by the simulation of 500 million instructions. Similar to prior work [9], we assume the overall AES encryption latency to be 24 cycles, and we overlap fetching data with encryption-pad generation. Below are the schemes we use in our evaluation:

No-encryption Scheme: The baseline NVM system without memory encryption.

Osiris Scheme: Our base solution, which eliminates the need for a battery while minimizing the performance and write-traffic overheads.

Osiris-Plus Scheme: An optimized version of the Osiris scheme that also eliminates the need to persist dirty counter blocks on eviction, at the cost of extra online checking to recover the most recent counter value.

Write-through (WT) Scheme: A strict counter-atomicity scheme that, for each write operation, writes both the counter and the data block to NVM. Note that this scheme does not require any battery or extra power-supply hold-up time.

Write-back (WB) Scheme: A battery-backed counter-cache scheme. The WB scheme only writes dirty evictions from the counter cache to memory. However, the WB scheme assumes a battery sufficient to flush all dirty blocks in the counter cache on a power failure.

B. Analysis

As discussed in Section III-C, the purpose of Osiris/Osiris-Plus is to persist the encryption counters in NVM to enable recovery from a system crash, but with reduced performance overhead and write traffic. Therefore, the selection of the update-interval value N (also discussed in Section III-C) is critical in determining the performance improvement. As such, in this section we study the impact of choosing different values of N (the limit) on Osiris/Osiris-Plus performance. Next, we present the performance analysis of multiple benchmarks under the different persistence schemes discussed and compared in this paper. Our evaluation results are consistent with the design goals of Osiris and Osiris-Plus.

1) Impact of Osiris Limit: To understand the impact of the Osiris limit (also called N) on performance and write traffic, we vary the limit in multiples of two and observe the corresponding performance and write traffic. Figure 9 shows the performance of Osiris and Osiris-Plus normalized to no-encryption while varying the limit. The figure also shows WB and WT performance normalized to no-encryption to facilitate the comparison.
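The role of the limit can be illustrated with a minimal software model of the write path (a sketch only; persist_to_nvm and the cache-entry fields are assumptions, not the controller implementation): every counter update increments a small per-block count, and only every Nth update forces the counter block to be written to NVM, so after a crash the stale NVM copy is at most N-1 updates behind.

    # Minimal model of the Osiris write path with update limit N (sketch only).
    class CounterCacheEntry:
        def __init__(self, counter_block):
            self.counter_block = counter_block    # cached 64B block of counters
            self.updates_since_persist = 0        # writes absorbed since last persist

    def on_data_write(entry, limit, persist_to_nvm):
        # Called whenever a data write bumps a counter in this cached block.
        entry.updates_since_persist += 1
        if entry.updates_since_persist >= limit:
            persist_to_nvm(entry.counter_block)   # every Nth update reaches NVM
            entry.updates_since_persist = 0

    def on_eviction(entry, persist_to_nvm, osiris_plus=False):
        # Baseline Osiris still persists a dirty block on eviction; Osiris-Plus
        # skips this and instead recovers the exact value after a crash.
        if not osiris_plus and entry.updates_since_persist > 0:
            persist_to_nvm(entry.counter_block)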

Fig. 9: The impact of Osiris limit on performance. (Execution time normalized to no-encryption for limits of 2, 4, 8, 16, and 32; curves for Osiris, Osiris-Plus, WB, and WT.)

From Figure 9, we can observe that both Osiris and Osiris-Plus clearly benefit from increasing the limit. However, as discussed earlier, large limit values can increase recovery time and potentially require a large number of encryption engines and ECC units (in the case of Osiris-Plus). Accordingly, it is important to find the point beyond which increasing the limit no longer brings justifiable gains. From Figure 9, we can observe that Osiris at limit 4 has an average performance overhead of 8.9%, compared to 6.6% and 51.5% for WB and WT, respectively. In contrast, also at limit 4, Osiris-Plus has only 5.8% performance overhead, which is even better than the WB scheme. Accordingly, at limit 4, both Osiris and Osiris-Plus perform close to or better than the WB scheme even though they do not need any battery or hold-up power. At large limit values, e.g., 32, Osiris performs similarly to write-back; dirty counter blocks are typically evicted before absorbing the limit number of writes, hence a counter block is rarely persisted before eviction. In contrast, Osiris-Plus brings the performance overhead down to only 3.4% (compared to 6.6% for WB), but again at the cost of a large number of encryption engines and ECC units.

A similar pattern can be observed in Figure 10, which shows the write traffic of both schemes. At limit 4, Osiris has a write-traffic overhead of 8.5%, whereas WB and WT have 5.9% and 100%, respectively. In contrast, at limit 4, Osiris-Plus has only 2.6% write-traffic overhead. With larger limit values, e.g., 32, Osiris-Plus has nearly zero extra writes, whereas Osiris has write-traffic overhead similar to WB. Based on this observation, we use limit 4 as a reasonable trade-off between the performance overhead and the additional stages or extra checking required at recovery time.


Fig. 10: The impact of Osiris limit on number of writes. (Number of writes normalized to no-encryption for limits of 2, 4, 8, 16, and 32; curves for Osiris, Osiris-Plus, WB, and WT.)


2) Impact of Osiris and Osiris-Plus persistency on multiple benchmarks: Now that we understand the overall impact of the Osiris limit, i.e., N, on performance and write traffic, we zoom in on the performance and write traffic of individual benchmarks when using limit 4, as suggested in Section IV-B1.

Fig. 11: The impact of Osiris and Osiris-Plus persistence on performance. (Execution time normalized to no-encryption for LBM, CACTUS, LIBQUANTUM, MCF, PENNANT, LULESH, SIMPLEMOC, MINIFE, and the average; bars for WT, Osiris, Osiris-Plus, and WB.)

As we can observe from Figure 11 and Figure 12, for most of the benchmarks, Osiris-Plus outperforms all other schemes in both performance and reduction in the number of writes. Meanwhile, for most benchmarks, Osiris performs close to the WB scheme. As noted earlier, Osiris performance and write traffic are bounded by the WB scheme; if the updated counters rarely get persisted before eviction from the counter cache, i.e., they are updated fewer than N times, then Osiris performs similarly to WB but without the need for a battery. Note that it is not common for the same counter to be updated many times before eviction from the counter cache; once a data cache block gets evicted from the cache hierarchy, it is unlikely to be evicted again very soon. However, since Osiris-Plus can additionally eliminate premature (before the Nth write) counter evictions, it can actually outperform WB. The only exception we observe is Libquantum, which is mainly due to its repeated streaming over a small (4MB) array.

Fig. 12: The impact of Osiris and Osiris-Plus persistence on number of writes. (Number of writes normalized to no-encryption for LBM, CACTUS, LIBQUANTUM, MCF, PENNANT, LULESH, SIMPLEMOC, MINIFE, and the average; bars for WT, Osiris, Osiris-Plus, and WB.)

Libquantum's streaming pattern causes many cache-hierarchy evictions due to conflicts; however, since each counter cache block covers the counters of a 4KB page, many evictions/writes of the same data block hit in the counter cache. Accordingly, the WB scheme performs better because there are few evictions (write-backs) from the counter cache compared to the writes incurred by persisting counter values at every Nth write in Osiris and Osiris-Plus. However, we observe that with a larger limit value, such as 16, Osiris-Plus clearly outperforms WB (7.8% vs. 14% overhead), whereas Osiris performs similarly to WB. In summary, for all benchmarks, Osiris and Osiris-Plus perform better than strict counter persistence (WT) and very close to, or even better than, the battery-backed WB scheme. In addition, we also tested more aggressive configurations, such as a 300ns read latency and a 1000ns write latency. On average, Osiris-Plus outperforms WB in both execution-time overhead and number of writes; Osiris, although slightly degraded, remains very close to the WB scheme's performance.

3) Sensitivity to Counter Cache Size: While it is impractical to assume a very large battery-backed write-back cache, we verify the robustness of Osiris and Osiris-Plus when larger counter caches are used. Even at a 1MB cache size, Osiris and Osiris-Plus have performance overheads of only 3.8% and 3%, whereas WT and WB have 48.9% and 1.4%, respectively. Similarly, Osiris and Osiris-Plus have write-traffic overheads of only 3.8% and 2.6%, whereas WT and WB have 100% and 1.2%, respectively. Note that WB is only slightly better due to the long tenure of cache blocks before eviction. Due to limited space, we do not include cache sizes between 256KB and 1MB; however, the trend is as expected and always in favor of Osiris and Osiris-Plus.

4) Recovery Time: At recovery time in a conventional NVM system, the Merkle Tree must first be reconstructed, as early as the first memory access. The process builds up the intermediate nodes and the root and compares the root with the copy kept inside the processor. To do so, it also needs to verify the integrity of the counters and the data by comparing against the MACs calculated over the counters and data, as sketched below.
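A minimal sketch of that rebuild step follows (SHA-256 from hashlib stands in for the keyed MACs of a Bonsai Merkle Tree, and the 8-ary fan-out matches the tree organization assumed earlier; this is an illustration, not the hardware flow): the intermediate nodes are recomputed bottom-up from the counter blocks read out of NVM, and the resulting root is compared against the copy kept inside the processor.

    import hashlib

    ARITY = 8  # 8-ary Merkle Tree, as assumed earlier

    def node_digest(children):
        # Digest over concatenated child digests (SHA-256 stands in for the keyed MAC).
        return hashlib.sha256(b"".join(children)).digest()

    def rebuild_root(leaf_digests):
        # Recompute the intermediate levels bottom-up until a single root remains.
        level = leaf_digests
        while len(level) > 1:
            level = [node_digest(level[i:i + ARITY])
                     for i in range(0, len(level), ARITY)]
        return level[0]

    def verify_after_crash(counter_blocks, on_chip_root):
        leaves = [hashlib.sha256(blk).digest() for blk in counter_blocks]
        return rebuild_root(leaves) == on_chip_root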

In the worst-case crash scenario, baseline Osiris will likely have lost all of the updated counter cache blocks, i.e., 2048 64B counter blocks for a 128KB counter cache. During recovery, the memory controller will iterate over all encryption counters and build the Merkle Tree. However, for counters that have a mismatched ECC, an average of 4 trials of counter values (when N equals 8) will be needed before finding the correct value. Thus, for a 16GB NVM memory, there will be 256M data blocks and 4M counter blocks. Assuming that reading a data block and verifying its counter takes 100ns, a conventional system needs 100ns × 256M, which is roughly 25.6 seconds.

With Osiris, only 131072 counters (corresponding to the lost 2048 counter blocks) require additional checking. In the worst case, each such counter has been updated 8 times, which requires 131072 × 8 × 100ns of extra delay, extending the recovery time to 25.7 seconds, only an additional 0.4%. Also note that the Osiris overhead depends on the counter cache size. A larger NVM memory capacity significantly reduces the overhead percentage, since the baseline recovery time itself grows.
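These back-of-the-envelope numbers can be re-derived as follows (a sketch that simply re-evaluates the estimate above; the 100ns per-block figure and the block counts are taken directly from the text):

    PER_BLOCK_NS = 100        # assumed time to read one data block and verify its counter
    DATA_BLOCKS  = 256e6      # ~256M 64B data blocks in a 16GB NVM
    STALE_CTRS   = 131072     # counters covered by the 2048 lost counter blocks
    TRIALS       = 8          # worst case: all 8 candidate values tried per stale counter

    baseline_s = DATA_BLOCKS * PER_BLOCK_NS * 1e-9           # ~25.6 s
    extra_s    = STALE_CTRS * TRIALS * PER_BLOCK_NS * 1e-9   # ~0.10 s
    print(f"baseline {baseline_s:.1f}s, with Osiris {baseline_s + extra_s:.1f}s "
          f"(+{100 * extra_s / baseline_s:.1f}%)")           # ~25.6s -> 25.7s, +0.4%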

In contrast, as mentioned earlier, Osiris-Plus deploys parallel engines that check all possible counter values in parallel, which means that even for wrong values the detection and verification latency is similar to that of the correct counters, and the correct counter values are used to build the upper intermediate levels. Accordingly, there is no additional recovery-time overhead for Osiris-Plus. However, this requires extra parallel ECC and AES engines. Note that most modern processors have multiple memory channels and a high degree of parallelism in memory modules, which makes the assumed cost of 100ns to sequentially access and verify each block conservative. We therefore expect recovery to be even faster for both conventional systems and Osiris.

V. RELATED WORK

In this section, we review and discuss relevant work in NVM security, mostly with respect to encryption and memory persistence.

NVM Security: Most research advocates counter-mode encryption (CME) to provide strong security while benefiting from short latency due to the overlap of encryption-pad generation with fetching data from memory [4], [6], [13], [27]–[30]. Further optimizations build on this fundamental model. DEUCE [6] proposes a dual-counter CME scheme that only re-encrypts the modified words on each write. SECRET [27] integrates zero-based partial writes with XOR-based energy masking to reduce both overhead and power consumption for encryption. ASSURE [28] further eliminates cell writes of SECRET by enabling partial MAC computation and constructing an efficient Merkle Tree. Distinctively, Silent Shredder [4] repurposes the IVs in CME to eliminate data-shredding writes. More recently, Synergy [8] described a security-reliability co-design that co-locates MAC and data to reconstruct data after a single-chip error and to reduce the overall memory traffic. In contrast with prior work, our work is the first to provide a security- and reliability-guaranteed solution for persisting encryption counters for all memory blocks.

Memory Persistence: As a new technology possessing the properties of both main memory and storage, persistent memory has promising recovery capabilities and the potential to replace main memory. Thus, it attracts much attention from both the hardware and software research communities [31]–[34]. Pelley et al. [35] define memory persistency as constraints on write ordering with respect to failure and propose several persistency models, either strict or relaxed. DPO [36] and WHISPER [37] are persistence frameworks that use strict and relaxed ordering, respectively, at different granularities. In this paper, we assume a conventional persistence model, i.e., cacheline flushing ordered by store fencing, e.g., CLWB followed by SFENCE, to persist memory locations [38], and we focus on persistence from a hardware perspective.

The most related work to Osiris is counter-atomicity [13]. Counter-atomicity introduces a counter write queue and a ready bit to enforce that the data and counter of the same request are persisted to NVM together. To reduce the write overhead, counter-atomicity provides APIs that let users selectively persist self-defined critical data along with their counters, relaxing the need for all-counter persistence. Osiris tackles the same problem, but by persisting counters periodically in encrypted NVMs; Osiris relies on ECC bits to provide a sanity check for the encryption counter in use and to help recover it. While our solution does not require any software modification and is completely transparent to users, it is orthogonal to, and can be augmented with, the selective counter-atomicity scheme. Moreover, Osiris and Osiris-Plus can be used in addition to selective counter-atomicity to cover all memory locations at low cost, hence avoiding the known-plaintext attacks discussed earlier in Section III-A. The work discussed earlier in this paper considers failure as a broad concept [35], whereas some work specifically handles power failure, such as WSP [39] and i-NVMM [5]. By using residual energy or relying on a battery supply, WSP and i-NVMM can persist data and encryption state to NVMM. In our case, we do not require any external battery backup for counter persistence; rather, we rely on an ECC-based sanity check for counter recovery and on application-level persistence of data.

VI. CONCLUSION

We propose a novel scheme called Osiris that persists encryption counters to NVM, similar to strict counter-persistence schemes, but with a significant reduction in performance overhead and NVM writes. We also present an optimized version, Osiris-Plus, that improves performance by eliminating premature counter evictions. We have shown how Osiris works and how it can be integrated with state-of-the-art data and counter integrity verification (ECC and Merkle Tree) for rapid failure/counter recovery. We compare Osiris with other memory-persistency schemes and show the trade-offs in terms of hardware complexity and performance.

Our evaluation on eight representative benchmarks shows low performance and NVM-write overheads when employing Osiris/Osiris-Plus in comparison with other persistency schemes. For future work, we will study different ways to persist intermediate nodes of a Merkle Tree for faster recovery.

ACKNOWLEDGMENT

We would like to thank Dr. Zakhia Abichar and Dr. Simon Hammond for their feedback and discussion.


REFERENCES

[1] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger, “Phase-change technology and the future of main memory,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2010.

[2] Z. Li, R. Zhou, and T. Li, “Exploring high-performance and energy proportional interface for phase change memory systems,” IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2013.

[3] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, “Enhancing lifetime and security of pcm-based main memory with start-gap wear leveling,” in 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2009.

[4] A. Awad, P. Manadhata, S. Haber, Y. Solihin, and W. Horne, “Silent shredder: Zero-cost shredding for secure non-volatile main memory controllers,” in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’16. New York, NY, USA: ACM, 2016. [Online]. Available: http://doi.acm.org/10.1145/2872362.2872377

[5] S. Chhabra and Y. Solihin, “i-nvmm: A secure non-volatile main memory system with incremental encryption,” in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000086

[6] V. Young, P. J. Nair, and M. K. Qureshi, “Deuce: Write-efficient encryption for non-volatile memories,” in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’15. New York, NY, USA: ACM, 2015. [Online]. Available: http://doi.acm.org/10.1145/2694344.2694387

[7] V. Costan and S. Devadas, “Intel sgx explained,” IACR Cryptology ePrint Archive, vol. 2016, 2016.

[8] G. Saileshwar, P. J. Nair, P. Ramrakhyani, W. Elsasser, and M. K. Qureshi, “Synergy: Rethinking secure-memory design for error-correcting memories,” in The 24th International Symposium on High-Performance Computer Architecture (HPCA), 2018.

[9] A. Awad, Y. Wang, D. Shands, and Y. Solihin, “Obfusmem: A low-overhead access obfuscation for trusted memories,” in Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017.

[10] C. Yan, D. Englender, M. Prvulovic, B. Rogers, and Y. Solihin, “Improving cost, performance, and security of memory encryption and authentication,” in ACM SIGARCH Computer Architecture News, vol. 34, no. 2. IEEE Computer Society, 2006.

[11] J. Chang and A. Sainio, “Nvdimm-n cookbook: A soup-to-nuts primer on using nvdimm-ns to improve your storage performance.”

[12] A. M. Rudoff, “Deprecating the pcommit instruction,” 2016. [Online]. Available: https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

[13] S. Liu, A. Kolli, J. Ren, and S. Khan, “Crash consistency in encrypted non-volatile main memory systems,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2018.

[14] B. Rogers, S. Chhabra, M. Prvulovic, and Y. Solihin, “Using address independent seed encryption and bonsai merkle trees to make secure processors os- and performance-friendly,” in 2007 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2007.

[15] P. Zuo and Y. Hua, “Secpm: a secure and persistent memory system for non-volatile memory,” in 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18). Boston, MA: USENIX Association, 2018. [Online]. Available: https://www.usenix.org/conference/hotstorage18/presentation/zuo

[16] M.-Y. Hsiao, “A class of optimal minimum odd-weight-column sec-ded codes,” IBM Journal of Research and Development, vol. 14, 1970.

[17] S. Cha and H. Yoon, “Efficient implementation of single error correction and double error detection code with check bit pre-computation for memories,” JSTS: Journal of Semiconductor Technology and Science, vol. 12, 2012.

[18] R. Naseer and J. Draper, “Dec ecc design to improve memory reliability in sub-100nm technologies,” in 15th IEEE International Conference on Electronics, Circuits and Systems, 2008. ICECS 2008. IEEE, 2008.

[19] R. Naseer and J. Draper, “Parallel double error correcting code design to mitigate multi-bit upsets in srams,” in 34th European Solid-State Circuits Conference, 2008. ESSCIRC 2008. IEEE, 2008.

[20] T. N. Miller, R. Thomas, J. Dinan, B. Adcock, and R. Teodorescu, “Parichute: Generalized turbocode-based error correction for near-threshold caches,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2010.

[21] T. Instruments, “Discriminating Between Soft Errors and Hard Errors in RAM.” [Online]. Available: http://www.ti.com/lit/wp/spna109/spna109.pdf

[22] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and S. Swanson, “Onyx: A protoype phase change memory storage array,” in Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems, ser. HotStorage’11. Berkeley, CA, USA: USENIX Association, 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=2002218.2002220

[23] B. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change memory as a scalable dram alternative,” in International Symposium on Computer Architecture (ISCA), 2009.

[24] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit. News, vol. 39, Aug. 2011. [Online]. Available: http://doi.acm.org/10.1145/2024716.2024718

[25] J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, Sep. 2006. [Online]. Available: http://doi.acm.org/10.1145/1186736.1186737

[26] M. Heroux, R. Neely, and S. Swaminarayan, “Asc co-design proxy app strategy.” [Online]. Available: http://www.lanl.gov/projects/codesign/proxy-apps/assets/docs/proxyapps-strategy.pdf

[27] S. Swami, J. Rakshit, and K. Mohanram, “Secret: Smartly encrypted energy efficient non-volatile memories,” in Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016. IEEE, 2016.

[28] J. Rakshit and K. Mohanram, “Assure: Authentication scheme for secure energy efficient non-volatile memories,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), June 2017.

[29] S. Swami and K. Mohanram, “Covert: counter overflow reduction for efficient encryption of non-volatile memories,” in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2017.

[30] X. Zhang, C. Zhang, G. Sun, J. Di, and T. Zhang, “An efficient run-time encryption scheme for non-volatile main memory,” in Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, ser. CASES ’13. Piscataway, NJ, USA: IEEE Press, 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2555729.2555753

[31] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson, “Nv-heaps: making persistent objects fast and safe with next-generation, non-volatile memories,” ACM Sigplan Notices, vol. 46, 2011.

[32] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “Thynvm: Enabling software-transparent crash consistency in persistent memory systems,” in 2015 48th International Symposium on Microarchitecture (MICRO). IEEE, 2015.

[33] A. Kolli, V. Gogte, A. Saidi, S. Diestelhorst, P. M. Chen, S. Narayanasamy, and T. F. Wenisch, “Language-level persistency,” in Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017.

[34] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, “Kiln: Closing the performance gap between systems with and without persistence support,” in 2013 46th Annual International Symposium on Microarchitecture (MICRO). IEEE, 2013.

[35] S. Pelley, P. M. Chen, and T. F. Wenisch, “Memory persistency,” in ACM SIGARCH Computer Architecture News, vol. 42, no. 3. IEEE Press, 2014.

[36] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and T. F. Wenisch, “Delegated persist ordering,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016.

[37] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton, “An analysis of persistent memory use with whisper,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2017.

[38] Intel, “Intel 64 and IA-32 Architectures Software Developer’s Manual.” [Online]. Available: https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

[39] D. Narayanan and O. Hodson, “Whole-system persistence,” ACM SIGARCH Computer Architecture News, vol. 40, 2012.