Practical Data Compression for Modern Memory Hierarchies Gennady Pekhimenko Thesis Oral Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison Doug Burger, Microsoft Michael Kozuch, Intel Presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy


Practical Data Compression for

Modern Memory Hierarchies

Gennady Pekhimenko

Thesis Oral

Committee: Todd Mowry (Co-chair), Onur Mutlu (Co-chair), Kayvon Fatahalian, David Wood (University of Wisconsin-Madison), Doug Burger (Microsoft), Michael Kozuch (Intel)

Presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy


Performance and Energy Efficiency


Energy efficiency

Applications today are data-intensive

Memory Caching, Databases, Graphics


Computation vs. Communication

Data movement is very costly
– Integer operation: ~1 pJ
– Floating-point operation: ~20 pJ
– Low-power memory access: ~1200 pJ

Implications
– Half the bandwidth of modern mobile phone memory exceeds the power budget
– Transfer less or keep data near processing units

Modern memory systems are bandwidth constrained


Data Compression across the System

Processor, Cache, Memory, Disk, Network


Software vs. Hardware Compression

Layer                 | Disk             | Cache/Memory
Latency               | milliseconds     | nanoseconds
Software vs. Hardware | Software         | Hardware
Algorithms            | Dictionary-based | Arithmetic

Existing dictionary-based algorithms are too slow for main memory hierarchies


Key Challenges for Compression in Memory Hierarchy

• Fast Access Latency

• Practical Implementation and Low Cost

• High Compression Ratio



Thesis Statement

It is possible to develop a new set of designs for data compression within modern memory hierarchies that is fast, simple, and effective in saving storage space and consumed bandwidth, so that the resulting improvements in performance, cost, and energy efficiency will make it attractive to implement in future systems.


Contributions of This Dissertation

• Base-Delta-Immediate (BDI) Compression algorithm with low latency and high compression ratio

• Compression-Aware Management Policies (CAMP) that incorporate compressed block size into cache management decisions

• Linearly Compressed Pages (LCP) framework for efficient main memory compression

• Toggle-Aware Bandwidth compression mechanisms for energy-efficient bandwidth compression



Practical Data Compression in Memory

Processor, Cache, Memory, Disk

1. Cache Compression

2. Compression and Cache Replacement

3. Memory Compression

4. Bandwidth Compression


1. Cache Compression

PACT 2012


Background on Cache Compression

• Key requirement:
– Low decompression latency

[Figure: CPU, L1 Cache (uncompressed data, hit in 3-4 cycles), and L2 Cache (compressed data, hit in ~15 cycles), with decompression between L2 and L1]


Key Data Patterns in Real Applications

Zero Values: initialization, sparse matrices, NULL pointers
0x00000000 0x00000000 0x00000000 0x00000000 …

Repeated Values: common initial values, adjacent pixels
0x000000C0 0x000000C0 0x000000C0 0x000000C0 …

Narrow Values: small values stored in a big data type
0x000000C0 0x000000C8 0x000000D0 0x000000D8 …

Other Patterns: pointers to the same memory region
0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 …
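As a rough illustration of how these categories could be told apart, the sketch below classifies a cache line viewed as 32-bit words. The enum names, the function name `classify_line`, and the one-byte threshold for "narrow" are assumptions made for this example, not definitions from the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Pattern classes from the slide; the enum names are illustrative. */
typedef enum { PAT_ZERO, PAT_REPEATED, PAT_NARROW, PAT_OTHER } pattern_t;

/* Classify a cache line viewed as n 32-bit words (n >= 1).
 * "Narrow" is taken here to mean that every word fits in one byte,
 * which is an assumption of this sketch. */
pattern_t classify_line(const uint32_t *w, size_t n)
{
    int all_zero = 1, all_same = 1, all_narrow = 1;
    for (size_t i = 0; i < n; i++) {
        if (w[i] != 0)    all_zero = 0;
        if (w[i] != w[0]) all_same = 0;
        if (w[i] > 0xFFu) all_narrow = 0;
    }
    if (all_zero)   return PAT_ZERO;
    if (all_same)   return PAT_REPEATED;
    if (all_narrow) return PAT_NARROW;
    return PAT_OTHER;
}
```

The checks run from most to least specific, so a line of zeros is reported as Zero rather than Repeated or Narrow.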


How Common Are These Patterns?

[Figure: cache coverage (%), 0% to 100%, split into Zero, Repeated Values, and Other Patterns, for libquantum, lbm, mcf, tpch17, sjeng, omnetpp, tpch2, sphinx3, xalancbmk, bzip2, tpch6, leslie3d, apache, gromacs, astar, gobmk, soplex, gcc, hmmer, wrf, h264ref, zeusmp, cactusADM, GemsFDTD, and the arithmetic mean]

SPEC2006, databases, web workloads, 2MB L2 cache. “Other Patterns” include Narrow Values.

43% of the cache lines belong to key patterns


Key Data Patterns in Real Applications

Low Dynamic Range: Differences between values are significantly smaller than the values themselves

• Low Latency Decompressor

• Low Cost and Complexity Compressor

• Compressed Cache Organization


Key Idea: Base+Delta (B+Δ) Encoding

32-byte Uncompressed Cache Line (4-byte values):
0xC04039C0 0xC04039C8 0xC04039D0 … 0xC04039F8

12-byte Compressed Cache Line:
Base (4 bytes): 0xC04039C0
Deltas (1 byte each): 0x00 0x08 0x10 … 0x38

20 bytes saved

Effective: good compression ratio
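A minimal software sketch of this encoding follows, assuming eight 4-byte values per 32-byte line and the first value as the base. The struct layout, the names `bdelta_line_t` and `bdelta_compress`, and the choice of signed 1-byte deltas are illustrative assumptions; the talk describes a hardware design, not this code.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8   /* 32-byte line of 4-byte values, as on the slide */

/* Hypothetical layout: 4-byte base + eight 1-byte deltas = 12 bytes. */
typedef struct {
    uint32_t base;
    uint8_t  delta[WORDS_PER_LINE];  /* signed deltas kept as raw bytes */
} bdelta_line_t;

/* True if d, read as a signed 32-bit difference, lies in [-128, 127]. */
static int fits_in_byte(uint32_t d)
{
    return d + 0x80u <= 0xFFu;
}

/* Returns 1 and fills *out if every value is within a 1-byte delta of the
 * first value; returns 0 if the line cannot be compressed this way. */
int bdelta_compress(const uint32_t line[WORDS_PER_LINE], bdelta_line_t *out)
{
    uint32_t base = line[0];
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        uint32_t d = line[i] - base;      /* wraps modulo 2^32 */
        if (!fits_in_byte(d))
            return 0;
        out->delta[i] = (uint8_t)d;       /* keep the low byte */
    }
    out->base = base;
    return 1;
}
```

For the slide's example line the deltas come out as 0x00, 0x08, ..., 0x38, matching the 12-byte compressed form.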


B+Δ Decompressor Design

Compressed Cache Line: B0 Δ0 Δ1 Δ2 Δ3
Vector addition: Vi = B0 + Δi, all additions in parallel
Uncompressed Cache Line: V0 V1 V2 V3

Fast Decompression: 1-2 cycles
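The vector addition can be sketched in software as one independent add per element. The layout below (a 4-byte base and eight signed 1-byte deltas stored as raw bytes) and the names `bdelta_line_t` and `bdelta_decompress` are assumptions for illustration; the hardware performs the additions in parallel rather than in a loop.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8

/* Hypothetical compressed layout: one base and one byte-wide delta per value. */
typedef struct {
    uint32_t base;
    uint8_t  delta[WORDS_PER_LINE];  /* signed 1-byte deltas, stored as raw bytes */
} bdelta_line_t;

/* Rebuild every value as base + sign-extended delta. Each iteration is
 * independent of the others, which is what allows a parallel adder array. */
void bdelta_decompress(const bdelta_line_t *in, uint32_t out[WORDS_PER_LINE])
{
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        /* Sign-extend the byte to 32 bits using unsigned arithmetic only. */
        uint32_t ext = ((uint32_t)in->delta[i] ^ 0x80u) - 0x80u;
        out[i] = in->base + ext;     /* wraps modulo 2^32 */
    }
}
```

Because no element depends on another, the loop body maps directly onto the per-element adders shown on the slide.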


Can We Get Higher Compression Ratio?

• Uncompressible cache line (with a single base):
0x00000000 0x09A40178 0x0000000B 0x09A4A838 …
struct A { int* next; int count; };

• Key idea - use more bases
– More cache lines can be compressed
– Unclear how to find these bases efficiently
– Higher overhead (due to additional bases)


B+Δ with Multiple Arbitrary Bases

[Figure: compression ratio (y-axis, 1 to 1.8) vs. number of bases (x-axis: 1, 2, 3, 4, 8, 10, 16)]

2 bases – empirically the best option

# of bases is fixed


How to Find Two Bases Efficiently?

1. First base - first element in the cache line (Base+Delta part)

2. Second base - implicit base of 0 (Immediate part)

Base-Delta-Immediate (BΔI) Compression
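The two-base rule can be sketched as follows: each value is encoded as a small delta from either the implicit base 0 or the first element of the line, with one flag bit per value recording which base was used. The struct, the flag bitmask, the function names, and the fixed 4-byte value / 1-byte delta sizes are assumptions of this sketch, which follows the slide's wording literally.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8

/* Hypothetical compressed form for 4-byte values with 1-byte deltas. */
typedef struct {
    uint32_t base;                    /* first base: first element of the line */
    uint8_t  from_zero;               /* bit i set: value i uses the implicit base 0 */
    uint8_t  delta[WORDS_PER_LINE];   /* signed deltas kept as raw bytes */
} bdi_line_t;

/* True if d, read as a signed 32-bit difference, lies in [-128, 127]. */
static int fits_in_byte(uint32_t d)
{
    return d + 0x80u <= 0xFFu;
}

/* Returns 1 on success, 0 if some value is close to neither base. */
int bdi_compress(const uint32_t line[WORDS_PER_LINE], bdi_line_t *out)
{
    uint32_t base = line[0];
    uint8_t mask = 0;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        uint32_t d_zero = line[i];          /* delta from implicit base 0 */
        uint32_t d_base = line[i] - base;   /* delta from the first element */
        if (fits_in_byte(d_zero)) {
            mask |= (uint8_t)(1u << i);
            out->delta[i] = (uint8_t)d_zero;
        } else if (fits_in_byte(d_base)) {
            out->delta[i] = (uint8_t)d_base;
        } else {
            return 0;
        }
    }
    out->base = base;
    out->from_zero = mask;
    return 1;
}
```

A line that interleaves nearby pointers with small counters, like the `struct A` layout on the earlier slide, compresses under this rule when the pointers fall within one byte of the first element.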


BΔI Cache Organization

Conventional 2-way cache with 32-byte cache lines:
– Tag Storage: Tag0 and Tag1 per set (Way0, Way1)
– Data Storage: Data0 and Data1 per set (Way0, Way1), 32 bytes each

BΔI: 4-way cache with 8-byte segmented data:
– Tag Storage: Tag0 to Tag3 per set (Way0 to Way3), each with compression encoding bits (C)
– Twice as many tags
– Data Storage: 8-byte segments S0 to S7 per set
– Tags map to multiple adjacent segments


Methodology

• Simulator
– x86 event-driven simulator (MemSim [Seshadri+, PACT’12])

• Workloads
– SPEC2006 benchmarks, TPC, Apache web server
– 1 – 4 core simulations for 1 billion representative instructions

• System Parameters
– L1/L2/L3 cache latencies from CACTI
– BDI (1-cycle decompression)
– 4GHz, x86 in-order core, cache size (1MB - 16MB)


Comparison Summary

              | Prior Work   | BΔI
Comp. Ratio   | 1.51         | 1.53
Decompression | 5-9 cycles   | 1-2 cycles
Compression   | 3-10+ cycles | 1-9 cycles

Average performance of a twice larger cache


2. Compression and Cache Replacement


HPCA 2015


Cache Management Background

• Not only about size

– Cache management policies are important

– Insertion, promotion and eviction

• Belady’s OPT Replacement

– Optimal for fixed-size caches with fixed-size blocks

– Assumes perfect future knowledge

[Figure: cache holding blocks X, Y, and Z; access stream: X A Y B C]


Block Size Can Indicate Reuse

• Sometimes there is a relation between the compressed block size and reuse distance

• This relation can be detected through the compressed block size

• Minimal overhead to track this relation (compressed block information is a part of design)

[Figure: the data structure links compressed block size and reuse distance]


Code Example to Support Intuition

int A[N];       // small indices: compressible, long reuse
double B[16];   // FP coefficients: incompressible, short reuse
double sum = 0.0;

for (int i = 0; i < N; i++) {
    int idx = A[i];                 // each A[i] is read once per outer iteration
    for (int j = 0; j < N; j++) {
        sum += B[(idx + j) % 16];   // B is reread on every inner iteration
    }
}

Compressed size can be an indicator of reuse distance


Block Size Can Indicate Reuse

[Figure: reuse distance (# of memory accesses; y-axis 0 to 6000) vs. block size in bytes (1, 8, 16, 20, 24, 34, 36, 40, 64)]

Different sizes have different dominant reuse distances


Compression-Aware Management Policies (CAMP)

CAMP consists of:
– MVE: Minimal-Value Eviction, where Value = Probability of reuse / Compressed block size
– SIP: Size-based Insertion Policy, based on the relation between compressed block size, data structure, and reuse distance

The benefits are on par with cache compression - 2X additional increase in capacity
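The value function can be illustrated with a small victim-selection sketch. How the probability of reuse is estimated is not shown in this transcript, so it is simply an input here; `block_info_t` and `mve_pick_victim` are hypothetical names for this example.

```c
#include <stddef.h>

/* Per-block inputs for the sketch. */
typedef struct {
    double   reuse_prob;   /* estimated probability of reuse (assumed given) */
    unsigned size_bytes;   /* compressed block size, must be > 0 */
} block_info_t;

/* Value = probability of reuse / compressed block size.
 * Returns the index of the block with the smallest value (n >= 1). */
size_t mve_pick_victim(const block_info_t *blocks, size_t n)
{
    size_t victim = 0;
    double min_value = blocks[0].reuse_prob / blocks[0].size_bytes;
    for (size_t i = 1; i < n; i++) {
        double v = blocks[i].reuse_prob / blocks[i].size_bytes;
        if (v < min_value) {
            min_value = v;
            victim = i;
        }
    }
    return victim;
}
```

With equal reuse probabilities the larger block has the lower value and is evicted first, which frees more space per eviction.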


3. Main Memory Compression


MICRO 2013


Challenges in Main Memory Compression


1. Address Computation

2. Mapping and Fragmentation


Address Computation

Uncompressed Page: cache lines L0, L1, L2, ..., LN-1 (64B each) at address offsets 0, 64, 128, ..., (N-1)*64

Compressed Page: cache lines L0, L1, L2, ..., LN-1 at address offsets 0, ?, ?, ..., ?


Mapping and Fragmentation

[Figure: a Virtual Page (4KB) at a Virtual Address maps to a Physical Page (? KB) at a Physical Address; Fragmentation]

Shortcomings of Prior Work

Compared on: Compression Ratio, Address Comp. Latency, Decompression Latency, Complexity and Cost

Compression Mechanisms:
– IBM MXT [IBM J.R.D. ’01]: 64 cycles
– Robust Main Memory Compression [ISCA’05]: 5 cycles
– Linearly Compressed Pages: Our Proposal

Linearly Compressed Pages (LCP): Key Idea

Uncompressed Page (4KB: 64*64B): 64B 64B 64B 64B . . . 64B

4:1 Compression

Compressed Data (1KB)

LCP effectively solves challenge 1: address computation

LCP sacrifices some compression ratio in favor of design simplicity
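The address computation can be sketched as below, assuming every cache line in a page is compressed to the same size, so that the 4:1 case on the slide gives 16 bytes per line (1KB / 64 lines). The function and macro names are illustrative, not taken from the talk.

```c
#include <stdint.h>

#define LINE_SIZE      64u   /* uncompressed cache line size in bytes */
#define LINES_PER_PAGE 64u   /* 4KB page = 64 * 64B */

/* Offset of cache line `index` within an uncompressed page. */
uint32_t uncompressed_line_offset(uint32_t index)
{
    return index * LINE_SIZE;
}

/* Offset of cache line `index` within a linearly compressed page in which
 * every line occupies `comp_line_size` bytes (16 for the 4:1 example). */
uint32_t lcp_line_offset(uint32_t index, uint32_t comp_line_size)
{
    return index * comp_line_size;
}
```

Both offsets come from a single multiplication (a shift when the size is a power of two), so locating a line needs no per-line table lookup.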


LCP: Key Idea (2)

Uncompressed Page (4KB: 64*64B): 64B 64B 64B 64B . . . 64B

4:1 Compression

Compressed page layout: Compressed Data (1KB), then Metadata (M, 64B): ? (compressible), then Exception Storage (E)


LCP Framework Overview

• Page Table entry (PTE) extension
  – compression type and size

• OS support for multiple page sizes
  – 4 memory pools (512B, 1KB, 2KB, 4KB)

• Handling uncompressible data

• Hardware support
  – memory controller logic
  – metadata (MD) cache
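One way to picture the four memory pools is a helper that picks the smallest pool able to hold a compressed page. The function name, the return convention, and the idea that the caller passes the total bytes needed are assumptions of this sketch; the transcript does not describe the OS allocation logic.

```c
#include <stddef.h>

/* Pool sizes listed on the slide, smallest first. */
static const size_t lcp_pool_sizes[] = { 512, 1024, 2048, 4096 };

/* Returns the smallest pool size that holds `needed` bytes, or 0 if the
 * page does not fit even in 4KB (a hypothetical convention for this sketch). */
size_t lcp_pick_pool(size_t needed)
{
    size_t count = sizeof lcp_pool_sizes / sizeof lcp_pool_sizes[0];
    for (size_t i = 0; i < count; i++) {
        if (needed <= lcp_pool_sizes[i])
            return lcp_pool_sizes[i];
    }
    return 0;
}
```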


Physical Memory Layout

[Figure: physical memory divided into 4KB, 2KB, 1KB, and 512B pages; the Page Table holds physical addresses such as PA0 and PA1, with an offset selecting locations such as PA1 + 512]


LCP Optimizations

• Metadata cache
  – Avoids additional requests to metadata

• Memory bandwidth reduction:
  – 64B 64B 64B 64B: 1 transfer instead of 4

• Zero pages and zero cache lines
  – Handled separately in TLB (1-bit) and in metadata (1-bit per cache line)


Summary of the Results

                   | Prior Work | LCP
Comp. Ratio        | 1.59       | 1.62
Performance        | -4%        | +14%
Energy Consumption | ↑6%        | ↓5%


4. Energy-Efficient Bandwidth Compression


HPCA 2016

CAL 2015


Energy Efficiency: Bit Toggles

How energy is spent in data transfers:

Previous data: 0011
New data: 0101
Bit Toggles: 2 (the two middle bits change)

Energy = C*V^2

Energy of data transfers (e.g., NoC, DRAM) is proportional to the bit toggle count
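Counting toggles amounts to XOR-ing consecutive transfers and counting the set bits. The sketch below assumes 32-bit transfers on the same wires; the width and the function names are illustrative choices.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions that differ between two consecutive transfers. */
unsigned toggles_between(uint32_t prev, uint32_t next)
{
    uint32_t diff = prev ^ next;
    unsigned count = 0;
    while (diff) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Total toggles over a sequence of n transfers sent on the same wires. */
unsigned long total_toggles(const uint32_t *flits, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 1; i < n; i++)
        total += toggles_between(flits[i - 1], flits[i]);
    return total;
}
```

For the slide's example, `toggles_between(0x3, 0x5)` compares 0011 with 0101 and returns 2.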


Excessive Number of Bit Toggles

Uncompressed Cache Line:
0x00003A00 0x8001D000 0x00003A01 0x8001D008 …
Flit 0 XOR Flit 1 = 000000010….00001
# Toggles = 2

Compressed Cache Line:
0x5 0x3A00 0x7 8001D000 0x5 0x3A01 0x7 8001D008 …
Flit 0: 5 3A00 7 8001D000 5 1D
Flit 1: 1 01 7 8001D008 5 3A02 1
Flit 0 XOR Flit 1 = 001001111 … 110100011000
# Toggles = 31


Effect of Compression on Bit Toggles

[Figure: normalized bit toggle count (y-axis 0.8 to 2.2) for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack on Discrete, Mobile, and Open-Source workloads]

Compression significantly increases bit toggle count


Energy Control

• Bit toggle count: compressed vs. uncompressed

• Use a heuristic (Energy × Delay or Energy × Delay² metric) to estimate the trade-off

• Take bandwidth utilization into account

• Throttle compression when it is not beneficial
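The decision these bullets describe might look like the sketch below. It is only an illustration under stated assumptions: toggle count stands in for energy, flit count stands in for delay, and a single bandwidth-utilization threshold decides whether shorter transfers matter. The function name, its parameters, and the threshold are hypothetical; the exact heuristic used in the thesis is not given in this transcript.

```c
#include <stdint.h>

/* Returns 1 if the block should be sent compressed, 0 otherwise.
 *   toggles_*    : bit toggles for each form (energy proxy)
 *   flits_*      : transfers needed for each form (delay proxy)
 *   bw_util      : current bandwidth utilization in [0, 1]
 *   bw_threshold : utilization above which delay is taken into account
 * All of these inputs are assumptions of this sketch. */
int ec_send_compressed(unsigned toggles_comp, unsigned flits_comp,
                       unsigned toggles_uncomp, unsigned flits_uncomp,
                       double bw_util, double bw_threshold)
{
    if (bw_util < bw_threshold) {
        /* Link is not a bottleneck: compare the energy proxy only. */
        return toggles_comp < toggles_uncomp;
    }
    /* Link is busy: compare an Energy x Delay style product. */
    uint64_t ed_comp   = (uint64_t)toggles_comp * flits_comp;
    uint64_t ed_uncomp = (uint64_t)toggles_uncomp * flits_uncomp;
    return ed_comp < ed_uncomp;
}
```

Under this rule compression is throttled whenever the extra toggles outweigh the shorter transfer, and more readily so when the link has spare bandwidth.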



Methodology

• Simulator: GPGPU-Sim 3.2.x and in-house simulator

• Workloads:

– NVIDIA apps (discrete and mobile): 221 apps

– Open-source (Lonestar, Rodinia, MapReduce): 21 apps

• System parameters (Fermi):

– 15 SMs, 32 threads/warp

– 48 warps/SM, 32768 registers, 32KB Shared Memory

– Core: 1.4GHz, GTO scheduler, 2 schedulers/SM

– Memory: 177.4GB/s BW, GDDR5

– Cache: L1 - 16KB; L2 - 768KB



Effect of EC on Bit Toggle Count

[Figure: normalized bit toggle count (y-axis 0.8 to 2.2), Without EC vs. With EC, for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack on Discrete, Mobile, and Open-Source workloads]

EC significantly reduces the bit toggle count. Works for different compression algorithms.


Effect of EC on Compression Ratio

[Figure: compression ratio (y-axis 0.8 to 1.8), Without EC vs. With EC, for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack on Discrete, Mobile, and Open-Source workloads]

EC preserves most of the benefits of compression


Acknowledgments

• Todd Mowry and Onur Mutlu

• Phil Gibbons and Mike Kozuch

• Kayvon Fatahalian, David Wood, and Doug Burger

• SAFARI and LBA group members

• Collaborators at CMU, MSR, NVIDIA and GaTech

• CALCM and PDL

• Deb Cavlovich

• Family and friends



Conclusion

• Data stored in memory hierarchies has significant redundancy
– Inefficient usage of existing limited resources

• Simple and efficient mechanisms for hardware-based data compression
– On-chip caches
– Main memory
– On-chip/off-chip interconnects

• Our mechanisms improve performance, cost and energy efficiency
