MS Thesis Defense
“Improving Performance, Power, and Security of Multicore Systems using Cache Organization”
By
Tania Jareen
CoE EECS Department
April 21, 2014
Jareen 2
About Me
Tania Jareen
MS in Electrical Engineering with Thesis
GTA for Routing and Switching–II
Publications:
“An Effective Locking-Free Caching Technique for Power-Aware Multicore Computing Systems,” accepted in the IEEE ICIEV-2014 conference.
“A Novel Level-1 Cache Mapping Approach to Improve System Security without Compromising Performance to Power Ratio,” in preparation.
Committee Members
Dr. Abu Asaduzzaman, EECS Dept.
Dr. Ramazan Asmatulu, ME Dept.
Dr. Zheng Chen, EECS Dept.
“Improving Performance, Power, and Security of Multicore Systems using Cache Organization”
Outline ►
Introduction
Problem Statement
Some Important Terms
Previous Work
Proposal
Simulation
Simulation Results
Conclusions
Future Work
Q U E S T I O N S ? Any time, please.
Introduction
Multicore System
A multicore system is a collection of parallel or concurrent processing units; it divides a large, complex problem into many small tasks
Main goal: to solve a complex problem faster
Dual-core System
Problem Statement
Challenges for Multicore System
High Average Memory Latency
High Total Power Consumption
Cache Side Channel Security Attack
Contributions
Propose a multicore system design to reduce the average memory latency
Propose a multicore system design to reduce the total power consumption
Propose a multicore system design to provide hardware level security
Some Important Terms
■ Cache
A small buffer that stores recently used information
Helps mitigate the speed gap between the processor and main memory
Significantly increases the overall performance of the system
Logically, the cache is placed between the CPU and main memory
Cache and Main Memory (Computer Desktop Encyclopedia)
Some Important Terms
■ Cache Organization
Cache Hit – the requested data is present in the cache
Cache Miss – the requested data is not present in the cache
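The hit/miss distinction can be illustrated with a minimal direct-mapped cache sketch (a hypothetical class for illustration, not from the thesis): each memory block maps to exactly one line, and a lookup hits only if that line currently holds the requested block.

```python
class DirectMappedCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = [None] * num_lines  # stored block address per line

    def access(self, block_addr):
        """Return 'hit' or 'miss'; on a miss, fill the line."""
        index = block_addr % self.num_lines
        if self.lines[index] == block_addr:
            return "hit"
        self.lines[index] = block_addr  # fetch the block into the cache
        return "miss"

cache = DirectMappedCache(num_lines=4)
print(cache.access(10))  # miss (cold cache)
print(cache.access(10))  # hit  (same block)
print(cache.access(14))  # miss (14 % 4 == 2, evicts block 10)
```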
Cache Organization
Some Important Terms
■ Cache Replacement Policy
Because cache memory size is limited, some blocks must be replaced to make room for new blocks
Replacement should be done so that the miss ratio stays low
Some well-known cache replacement policies: Least Recently Used (LRU), Random, Most Recently Used (MRU), First In First Out (FIFO), etc.
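LRU, the policy used later in the simulation, can be sketched as follows (a minimal illustration; the class and method names are hypothetical): when the cache is full, the block untouched for the longest time is evicted.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_addr -> data, oldest first

    def access(self, block_addr):
        if block_addr in self.blocks:
            self.blocks.move_to_end(block_addr)  # mark as most recently used
            return "hit"
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)      # evict least recently used
        self.blocks[block_addr] = object()
        return "miss"

cache = LRUCache(capacity=2)
print(cache.access("A"))  # miss
print(cache.access("B"))  # miss
print(cache.access("A"))  # hit
print(cache.access("C"))  # miss - evicts "B", the least recently used
print(cache.access("B"))  # miss
```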
Cache Replacement Policy (Aaron Toponce)
Some Important Terms
■ Memory Update Policy
A combination of the read policy and the write policy
Read Policy – indicates how a word is read
Write Policy – indicates how a write to a memory block is handled. Examples: Write-Through, Write-Back
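The difference between the two write policies can be sketched with a small counting helper (hypothetical, for illustration only): write-through sends every store to main memory, while write-back only marks cached blocks dirty and writes each dirty block once, at eviction.

```python
def count_memory_writes(writes, policy):
    """writes: list of block addresses written; returns number of main-memory writes."""
    if policy == "write-through":
        return len(writes)        # every store goes straight to memory
    dirty = set(writes)           # write-back: each written block becomes dirty once
    return len(dirty)             # ...and is written to memory at eviction time

print(count_memory_writes([7, 7, 7, 9], "write-through"))  # 4
print(count_memory_writes([7, 7, 7, 9], "write-back"))     # 2
```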
Some Important Terms
■ Cache Locking
Locks the most frequently used data for future accesses
Locked blocks are not evicted during replacement
Increases the hit ratio and performance
Reduces average memory access time and power consumption
Problems: hard to predict which blocks to lock; not all processor configurations support it; reduces the effective cache size
Locked Cache
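The eviction rule behind cache locking can be sketched as follows (an illustrative helper, not the thesis implementation; it assumes an LRU ordering of the set is already available): locked blocks are simply skipped when picking a victim.

```python
def choose_victim(lru_order, locked):
    """lru_order: block addresses, least recently used first.
    Return the first evictable (unlocked) block, or None if all are locked."""
    for block in lru_order:
        if block not in locked:
            return block
    return None

print(choose_victim([5, 9, 2], locked={5}))  # 9 (block 5 is locked, so it survives)
```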
Some Important Terms
■ Victim Cache
One of the oldest and most popular techniques to improve performance
Placed between CL1 and CL2
Holds the victim blocks evicted during cache replacement
Reduces average memory latency and total power consumption
Victim cache Organization
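The victim-cache behavior can be sketched as a tiny fully associative buffer (a minimal illustration with hypothetical names): blocks evicted from CL1 are parked here, and a CL1 miss that hits the victim cache avoids the longer trip to CL2.

```python
from collections import deque

class VictimCache:
    def __init__(self, capacity):
        self.blocks = deque(maxlen=capacity)  # oldest victim dropped when full

    def park(self, victim_block):
        """Called when CL1 evicts a block."""
        self.blocks.append(victim_block)

    def lookup(self, block_addr):
        """On a CL1 miss: hit here means the block returns to CL1 cheaply."""
        if block_addr in self.blocks:
            self.blocks.remove(block_addr)    # the block moves back into CL1
            return True
        return False

vc = VictimCache(capacity=2)
vc.park(10)           # block 10 evicted from CL1
print(vc.lookup(10))  # True  - served from the victim cache, not CL2
print(vc.lookup(10))  # False - it already moved back to CL1
```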
Some Important Terms
■ Stream Buffering
On a cache miss, the required blocks along with some additional blocks are brought from main memory to CL2 and then copied to CL1
The additional blocks are kept in the stream buffer
Helps reduce average memory latency and total power consumption
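The fetch pattern above can be sketched as follows (illustrative only; the function name and the prefetch depth of 3 are assumptions): on a miss, the next sequential blocks are fetched alongside the required one and parked in the stream buffer.

```python
def fetch_with_stream_buffer(miss_block, depth=3):
    """Return (block sent to CL1, extra blocks parked in the stream buffer)."""
    extra = [miss_block + i for i in range(1, depth + 1)]  # next sequential blocks
    return miss_block, extra

to_cl1, stream_buffer = fetch_with_stream_buffer(40)
print(to_cl1)         # 40
print(stream_buffer)  # [41, 42, 43] - later sequential misses hit the buffer
```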
Some Important Terms
■ Cache Side Channel Attack
A hardware attack, mainly on the cache
Extracts sensitive information from the cache by passive monitoring
Exploits physical properties (examples: timing variation, power consumption, acoustic emissions, heat production) [1,2,3,4]
A silent attack, but among the most dangerous
Some Important Terms
■ Asymmetric Encryption
Step 1: The receiver generates a private/public key pair and shares the public key with the sender.
Step 2: The sender encrypts the information using the public key.
Step 3: The sender sends the encrypted information to the receiver.
Step 4: The receiver decrypts the information using its own private key.
Asymmetric Encryption
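The four steps can be traced with a toy RSA example (textbook-sized numbers, NOT secure; the primes, exponents, and message are illustrative assumptions, not from the thesis):

```python
# Step 1: the receiver builds the key pair.
p, q = 61, 53                  # secret primes
n = p * q                      # 3233, part of the public key (e, n)
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent
d = pow(e, -1, phi)            # 2753, private exponent (modular inverse, Python 3.8+)

message = 65
ciphertext = pow(message, e, n)    # Steps 2-3: sender encrypts with the public key
plaintext = pow(ciphertext, d, n)  # Step 4: receiver decrypts with the private key
print(ciphertext)  # 2790
print(plaintext)   # 65
```

Only the receiver knows `d`, so anyone may encrypt but only the receiver can decrypt.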
Previous Work
■ To Improve Average Memory Latency and Total Power Consumption:
Victim cache between CL1 and main memory, plus stream buffering [6]
Problem – no guarantee that the victim blocks are the ones with the highest miss counts
Selective Victim Caching [7]
Problem – may pollute the cache; requires prediction
Previous Work
Selective Pre-Fetching [8]
Problem – requires a history of references
Cache Locking [9]
Problem – hard to predict the blocks with high miss counts; not all processor configurations support it
■ To Improve Cache-Level Security:
Partitioned Cache [1]
Problem – cache underutilization; depends on software support
Dynamic Memory-to-Cache Remapping [5]
Proposed Mechanism
■ Smart Victim Cache (SVC)
MCB = Miss Cache Block
VCB = Victim Cache Block
SBB = Stream Buffering Block
BACMI = Block Address and Cache Miss Information
SLLC = Shared Last Level Cache
Proposed Cache Organization with SVC
Work Flow Diagram
Proposed Mechanism
Block size = 128 Bytes; main memory = 4 GB
SVC Size (KB) | Num. of Blocks | SVC1: MCB (blocks) | SVC2: VCB+SBB (blocks) | Max. Num. of BACMIs (MCB*16)
2             | 16             | 8                  | 5 + 3                  | 128
2             | 16             | 5                  | 8 + 3                  | 80
4             | 32             | 8                  | 21 + 3                 | 128
4             | 32             | 16                 | 13 + 3                 | 256
8             | 64             | 8                  | 53 + 3                 | 128
8             | 64             | 48                 | 13 + 3                 | 768
16            | 128            | 8                  | 117 + 3                | 128
16            | 128            | 112                | 13 + 3                 | 1792
32            | 256            | 8                  | 245 + 3                | 128
32            | 256            | 240                | 13 + 3                 | 3840
Maximum Number of BACMI entries for a given SVC with various MCB
MCB = Miss Cache Block
VCB = Victim Cache Block
SBB = Stream Buffering Block
BACMI = Block Address and Cache Miss Information
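The table's arithmetic can be sketched as follows (assuming, per the slide, 128-byte blocks, 3 SBB blocks, and 16 BACMI entries per MCB block; the function name is illustrative): the SVC's blocks split into MCB and VCB+SBB, and BACMI capacity is MCB times 16.

```python
BLOCK_SIZE = 128          # bytes per block (from the slide)
ENTRIES_PER_MCB = 16      # BACMI entries per MCB block (from the slide)

def svc_row(svc_kb, mcb_blocks, sbb_blocks=3):
    """Return (total blocks, VCB blocks, max BACMI entries) for one table row."""
    total_blocks = svc_kb * 1024 // BLOCK_SIZE
    vcb_blocks = total_blocks - mcb_blocks - sbb_blocks
    max_bacmi = mcb_blocks * ENTRIES_PER_MCB
    return total_blocks, vcb_blocks, max_bacmi

print(svc_row(2, mcb_blocks=8))     # (16, 5, 128)    - matches the 2 KB row
print(svc_row(32, mcb_blocks=240))  # (256, 13, 3840) - matches the 32 KB row
```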
Simulation
■ Assumptions
SVC can be enabled and disabled
All cores equally share SVC
LRU replacement policy is used
Write-Back update policy is used
Simulation
■ Workload
Moving Picture Experts Group-4 (MPEG-4)
Advanced Video Coding (H.264/AVC)
Matrix Inversion (MI)
Fast Fourier Transform (FFT)
H.264/AVC behaves similarly to MPEG-4; MI behaves similarly to FFT
Simulation
■ Input Parameters
Number of cores = 4
SVC size = 2, 4, 8, 16, 32 KB
I1/D1 size of CL1 = 8/8, 16/16, 32/32, 64/64, 128/128 KB
CL2 size = 256, 512, 1024, 2048, 4096 KB
Line size = 16, 32, 64, 128, 256 B
Associativity level = 1-, 2-, 4-, 8-, 16-way
Simulation
■ Assumption for Delay Penalty
Number of cycles for any load or store operation = 100
Number of cycles for any branch operation = 150
Satisfy Any Instruction at | Number of Cycles
ALU         | 1
Private CL1 | 3
Shared CL2  | 10
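A hedged sketch of how such cycle counts combine into an average memory latency (a standard AMAT-style estimate, not the thesis simulator; it assumes the 100-cycle load/store figure is the main-memory penalty, and the miss rates are made-up):

```python
def avg_memory_latency(l1_cycles, l2_cycles, mem_cycles, l1_miss, l2_miss):
    """Hit latencies in cycles; miss rates are fractions of accesses reaching each level."""
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)

# CL1 = 3 cycles, CL2 = 10 cycles (from the table); memory penalty assumed 100 cycles
print(avg_memory_latency(3, 10, 100, l1_miss=0.25, l2_miss=0.5))  # 18.0
```

Anything that lowers the miss rates (SVC, stream buffering) lowers this latency directly.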
Simulation
■ Assumption for Power Consumption
Component        | Power Consumption (mWatts/Operation)
CPU              | 3.6
I1               | 2.7
Other Components | 2.1
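A hedged sketch of a total-power estimate from these per-operation costs (illustrative bookkeeping, not the thesis power model; the operation counts are made-up):

```python
POWER_MW_PER_OP = {"CPU": 3.6, "I1": 2.7, "Other Components": 2.1}

def total_power_mw(op_counts):
    """op_counts: number of operations performed by each component."""
    return sum(POWER_MW_PER_OP[name] * n for name, n in op_counts.items())

total = total_power_mw({"CPU": 100, "I1": 80, "Other Components": 50})
print(round(total, 1))  # 681.0
```

Fewer cache misses means fewer operations overall, which is how SVC reduces total power.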
Simulation Results
■ Impact of SVC Size
Simulation Results
■ Impact of SVC and CL1 Size
For MPEG-4, both latency and total power consumption decrease as the cache size increases. Latency and power decrease the most with SVC and no locking.
Impact of SVC and CL1 Size on Memory Latency and Total Power Consumption
Simulation Results
■ Impact of SVC and Line Size
For MPEG-4, latency and power consumption decrease as the line size increases. Both decrease the most with SVC and no locking.
Impact of SVC and Line Size on Memory Latency and Total Power Consumption
Simulation Results
■ Impact of SVC and Associativity Level
For MPEG-4, latency and power consumption decrease as the associativity level increases. Both decrease the most with SVC and no locking.
Impact of SVC and Associativity Level on Memory Latency and Total Power Consumption
Simulation Results
■ Impact of SVC and CL2/SLLC Size
For MPEG-4, as CL2 size increases, latency becomes stable but power consumption increases. Both latency and power consumption decrease the most when SVC is used with no locking.
Impact of SVC and CL2/SLLC Size on Memory Latency and Total Power Consumption
Simulation Results
■ Comparison of SVC and Cache Line Locking
Average memory latency and total power consumption decrease as CL2 locking increases from 0% to 25%. Both decrease further with SVC and no locking, compared to using locking or using neither SVC nor locking.
Comparison of SVC and Cache Line Locking
Proposed Solution for Security Improvement
■ Randomized Cache Mapping Between D1X and CL1 (Solution-1)
Randomized Cache Mapped Between D1X and CL1
Proposed Solution for Security Improvement
■ Problem with Solution-1
Requires extra hardware to implement D1X
Increases memory latency
Increases total power consumption by about 17%
Proposed Modified Solution for Security Improvement
■ Randomized Cache Mapping Between Main Memory and CL1 (Solution-2)
Randomized Cache Mapped between CL1 and Main Memory
It is expected that the probability of a cache side-channel attack decreases by about 40K to 1 for 16 blocks of CL1
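The randomized mapping idea behind Solution-2 can be sketched as follows (an illustrative assumption of how such a mapping could work, not the thesis design): a secret random permutation replaces the fixed modulo index, so an attacker cannot predict which CL1 set a given address occupies.

```python
import random

def make_random_mapping(num_sets, seed=None):
    """Return a function mapping a block address to a randomized CL1 set."""
    rng = random.Random(seed)
    permutation = list(range(num_sets))
    rng.shuffle(permutation)                  # secret; e.g., regenerated per boot
    return lambda block_addr: permutation[block_addr % num_sets]

map_to_set = make_random_mapping(num_sets=16, seed=42)
# Conventional mapping would put block 5 in set 5; here the set is unpredictable,
# but the mapping is still a bijection, so every set is used exactly once:
print(sorted(map_to_set(b) for b in range(16)) == list(range(16)))  # True
```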
Conclusions
Using several levels of cache in multicore systems causes serious performance and power issues
Caches shared among cores in a multicore system create hardware-level security threats
The proposed SVC significantly increases system performance by reducing memory latency and power consumption
The proposed cache randomization technique between main memory and CL1 significantly reduces the probability of a cache attack
Conclusions
With SVC, average memory latency is reduced by 17% compared to CL2 cache locking
With SVC, total power consumption is reduced by 21% compared to CL2 cache locking
According to our estimates, the probability of a cache side-channel attack decreases by about 40K to 1 for 16 blocks of CL1
Future Work
Explore the impact of SVC on average memory latency and total power consumption for real-time embedded systems and handheld computers
Explore the randomized cache mapping technique between CL1 and main memory on real-time embedded systems and handheld computers
QUESTION
“Improving Performance, Power, and Security of Multicore Systems using Cache Organization”
Thank You
Contact:
Full Name: Tania Jareen
Telephone: (316) 516-8516
E-mail: [email protected]
“Improving Performance, Power, and Security of Multicore Systems using Cache Organization”
References
1. D. Page, “Partitioned Cache Architecture as a Side-Channel Defense Mechanism,” in Cryptology ePrint Archive, Report 2005/280, 2005.
2. O. Aciicmez, “Yet another Micro Architectural Attack: exploiting I-Cache,” in CSAW ’07 Proceedings of the 2007 ACM workshop on Computer security architecture, pp. 11-18, DOI: 10.1145/1314466.1314469, 2007.
3. P.C. Kocher, “Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems,” Springer Berlin Heidelberg, pp. 104-113, DOI: 10.1007/3-540-68697-5_9, 1996.
4. P. Kocher, et al., "Differential Power Analysis," in Proceedings of the 19th Annual International Cryptology Conference on Advances in Cryptology, 1999.
References
5. Z. Wang and R.B. Lee, "A novel cache architecture with enhanced performance and security," in Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, 2008.
6. N.P. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” Western Research Laboratory (WRL), Digital Equipment Corporation, URL:https://www.cis.upenn.edu/~cis501/papers/joupp victim.pdf, 1990.
7. D. Stiliadis and A. Varma, “Selective Victim Caching: A Method to Improve the Performance of Direct-Mapped Caches,” in IEEE Transactions on Computers, Vol. 46, No. 5, pp. 603-610, DOI: 10.1109/12.589235, 1997.
References
8. R. Pendse and H. Katta, “Selective Prefetching: Prefetching when only required,” in the 42nd Midwest Symposium on Circuits and Systems, Vol. 2. pp. 866-869, DOI: 10.1109/MWSCAS.1999.867772, 1999.
9. A. Asaduzzaman, F.N. Sibai, and M. Rani, “Improving cache locking performance of modern embedded systems via the addition of a miss table at the L2 cache level,” in the EUROMICRO Journal of Systems Architecture, Vol. 56, Issue 4-6. pp 151-162, 2010.