1
Controlled Resource and Data Sharing in Multi-Core Platforms*
Sandhya Dwarkadas
Department of Computer Science
University of Rochester
*Joint work with Arrvindh Shriraman, Hemayet Hossain, Xiao Zhang, Hongzhou Zhao, Rongrong Zhong, Michael L. Scott, Michael Huang, Kai Shen
2
The University of Rochester
• Small private research university
• 4400 undergraduates
• 2800 graduate students
• Set on the Genesee River in Western NY State, near the south shore of Lake Ontario
• 250km by road from Toronto; 590km from New York City
3
4
The Computer Science Dept.
• 15 tenure-track faculty; 45 Ph.D. students
• Specializing in AI, theory, and parallel and distributed systems
• Among the best small departments in the US
5
The Hardware-Software Interface
[Figure: overview of research projects spanning the hardware-software interface. Distributed shared memory: TreadMarks, Cashmere-2L, InterWeave (an InterWeave server and IW libraries with caches linking a Java handheld device, a Fortran/C cluster, and a C/C++ desktop over the Internet). Concurrency (coherence, synchronization, consistency): RTM, FlexTM. Memory systems: ARMCO, DDCache, LADLE. Power-aware computing: CAP, MCD. Multi-cores: FCS. Resource-aware OS scheduling: DT-CMT. Distributed systems and peer-to-peer systems: Willow. Operating systems. Protection support: Sentry. RPPT]
6
The Implications of Technology Scaling
• Many more transistors for compute power
• Energy constraints
• Large volumes of data
• High-speed communication
• Concurrency (parallel or distributed)
• Need support for
– Scalable sharing
– Reliability
– Protection and security
– Performance isolation
7
Multi-Core Challenges
• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing
8
Current Projects
• CoSyn: Communication and Synchronization Mechanisms for Emerging Multi-Core Processors
– Collaboration with Professors Michael Scott and Michael Huang
– Arrvindh Shriraman, Hemayet Hossain, Hongzhou Zhao
• Operating System-Level Resource Management in the Multi-Core Era
– Collaboration with Professor Kai Shen
– Xiao Zhang and Rongrong Zhong
See http://www.cs.rochester.edu/research/cosyn
and http://www.cs.rochester.edu/~sandhya
9
Multi-Core Challenges
• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing
10
Performance Isolation
11
Resource Sharing is (and will be) Ubiquitous!
• Floating point, integer, state, and cache with multiple threads on a core
• Second-level cache with multiple cores on a chip
• Interconnect bandwidth on multiprocessors
Sun UltraSparc T1, … Intel's 6-core (12-thread), … AMD's 12-core, …
12
Resource Sharing on Multicore Chip
• Memory bandwidth and the last-level cache are commonly shared by sibling cores sitting on the same chip
[Die photo: http://download.intel.com/pressroom/kits/45nm/Penryn Die Photo_300.jpg]
-
4
13
Resource Management To Date
• Capitalistic: generation of more requests results in more resource usage
– Performance: resource contention can result in significantly reduced overall performance
– Fairness: an equal time slice does not necessarily guarantee equal progress
14
Poor Performance Due to Uncontrolled Resource Contention
Experiments were conducted on a 3GHz Intel Core 2 Duo processor with a shared 4MB L2 cache
Win-win situation
15
Fluctuating Performance Due to Uncontrolled Resource Contention
Performance of art when co-running with different applications on an Intel dual-core processor with a 4MB shared L2 cache
16
Fairness and Security Concerns
• Priority inversion
• Poor fairness among competing applications
• Information leakage at the chip level
• Denial-of-service attacks at the chip level
17
Big Picture
[Figure: applications A, B, C, and D; first select which applications run together (resource-aware scheduling [USENIX'10]), then control the resource usage of co-running applications (page coloring [Eurosys'09] or hardware throttling [USENIX'09])]
18
Existing Mechanism (I): Software-Based Page Coloring
[Figure: threads A and B share a cache (Way-1 through Way-n); thread A's footprint (memory pages A1 through A5) maps to its assigned colors]
• Classic technique to reduce cache misses, now used by the OS to manage cache partitioning
• Partitions the cache at coarse granularity
• No need for hardware support
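To make the mechanism concrete, here is a minimal sketch of how a page's color can be derived from its physical address, assuming illustrative parameters (a 4MB, 16-way set-associative L2 with 64-byte lines and 4KB pages) rather than the exact platform from the talk:

```c
/* Minimal page-coloring sketch. Cache sets are indexed by physical
 * address bits above the line offset; a page's color is the overlap
 * between the set-index bits and the page frame number, so pages of
 * different colors can never conflict in the cache. Parameters are
 * illustrative, not taken from the talk's hardware. */
#include <stdio.h>

#define CACHE_SIZE    (4 * 1024 * 1024)  /* 4MB shared L2 */
#define WAYS          16                 /* associativity */
#define LINE_SIZE     64                 /* bytes per line */
#define PAGE_SIZE     4096               /* 4KB pages */

#define NUM_SETS      (CACHE_SIZE / (WAYS * LINE_SIZE))
#define SETS_PER_PAGE (PAGE_SIZE / LINE_SIZE)
#define NUM_COLORS    (NUM_SETS / SETS_PER_PAGE)  /* 64 colors here */

static unsigned page_color(unsigned long phys_addr)
{
    /* Page frame number modulo the number of colors. */
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    printf("%d colors; page at 0x40e000 has color %u\n",
           NUM_COLORS, page_color(0x40e000UL));
    return 0;
}
```

Pages that share a color compete for the same cache sets, so assigning threads disjoint colors partitions the cache without any hardware support.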
19
Drawbacks of Page Coloring • Expensive re-coloring cost
– Prohibitive in a dynamic environment where
frequent re-coloring may be necessary
• Complex memory management
– Introduces artificial memory pressure
Thread A
Thread B
Shared Cache
Way-1 Way-n …………
Memory page
A1
A2
A3
A4
A5
Thread A’s footprint
20
Toward Practical Page Coloring
• Hotness-based page coloring
– Efficiently find a small group of hot pages
– Restrict page coloring or re-coloring to hot pages
– Pay less re-coloring overhead while achieving most of the cache-partitioning benefit (separate competing applications' most frequently accessed pages)
• Key challenge
– An efficient way to track page hotness
21
Methods to Track Page Hotness
• Using page protection
– Capture page accesses by triggering page faults
– Microseconds of overhead per page fault
• Using access bits
– A single bit stored in each Page Table Entry (PTE)
– Generally available on x86; automatically set by hardware upon page access
– Tens of cycles per page-table-entry check
– Recycle spare bits in the PTE as a hotness counter; the counter is aged to reflect recency and frequency
22
Sampling of Access Bits
• Decouple the sampling frequency and window
– Hotness-sampling accuracy is determined by the sampling time window T
– Hotness-sampling overhead is determined by the sampling interval N
[Figure: timeline; all access bits are cleared at times 0, N, 2N, 3N, 4N, … and checked at T, N+T, 2N+T, 3N+T, 4N+T, …]
In our experiments, T = 2 milliseconds and N = 100 or 10 milliseconds
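A minimal sketch of the clear/check cycle, assuming a simplified flat page table (a pte[] array with an accessed flag) rather than real x86 paging structures; the aging step shown is one plausible way to blend recency and frequency into the recycled PTE bits:

```c
/* Access-bit sampling with an aged hotness counter (sketch). Every
 * period N the bits are cleared; T milliseconds later they are
 * checked and each page's counter is aged, so recent accesses
 * dominate while older ones decay. */
#include <stdint.h>

#define NPAGES   1024
#define ACCESSED 0x20u        /* accessed bit, as in an x86 PTE */

static uint32_t pte[NPAGES];      /* toy page table */
static uint8_t  hotness[NPAGES];  /* aged per-page counter */

static void clear_access_bits(void)   /* run at times 0, N, 2N, ... */
{
    for (int i = 0; i < NPAGES; i++)
        pte[i] &= ~ACCESSED;
}

static void check_access_bits(void)   /* run at T, N+T, 2N+T, ... */
{
    for (int i = 0; i < NPAGES; i++) {
        /* Exponential aging: halve the old value, then add the
         * newly observed access. */
        hotness[i] = (uint8_t)((hotness[i] >> 1) +
                               ((pte[i] & ACCESSED) ? 128 : 0));
    }
}
```

With T = 2ms and N = 10ms or 100ms, each PTE is touched only briefly per period, which is where the low overhead comes from.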
23
Miss-Ratio-Curve Driven Cache Partition Policy
[Figure: miss-ratio curves for threads A and B plotted against cache allocation from 0 to 4MB (the curves fall from miss ratios of 0.5 and 0.7 toward 0.2 and 0.3 as the allocation grows), with the optimal partition point marked. Constraint: Cache Size = Σ(A,B) Cache Allocation; the system optimization metric is given in the figure]
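As a sketch of how such a policy can pick the partition point, the loop below scans complementary allocations of a fixed cache budget; the cost function here (the sum of the two miss ratios) is only a stand-in for whatever system optimization metric the figure actually uses:

```c
/* Choose a partition point from two miss-ratio curves (sketch).
 * mrc_a[i] / mrc_b[i] give each thread's miss ratio when granted
 * i cache units out of TOTAL; allocations are complementary. */
#define TOTAL 64   /* cache allocation units, e.g. 64KB granules */

static int best_partition(const double mrc_a[TOTAL + 1],
                          const double mrc_b[TOTAL + 1])
{
    int best = 0;
    double best_cost = mrc_a[0] + mrc_b[TOTAL];

    /* A gets i units, B gets the rest; keep the cheapest split. */
    for (int i = 1; i <= TOTAL; i++) {
        double cost = mrc_a[i] + mrc_b[TOTAL - i];
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;   /* A's share; B receives TOTAL - best */
}
```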
24
Hot Page Coloring
• Budget control of page re-coloring overhead
– A percentage of the time slice, e.g. 5%
• Recolor from the hottest page until the budget is reached
– Maintain a set of hotness bins during sampling
• bin[i][j] = number of pages in color i with normalized hotness in range [j, j+1]
– Given a budget K, the K-th hottest page's hotness value is estimated in constant time by searching the hotness bins
– Make sure hot pages are uniformly distributed among colors
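A minimal sketch of the constant-time threshold estimate, assuming NCOLORS colors and NBINS hotness buckets; the cost depends only on the bin dimensions, not on the number of pages:

```c
/* Estimate the hotness threshold of the K-th hottest page from the
 * hotness bins (sketch). bin[i][j] counts pages of color i whose
 * normalized hotness falls in bucket j. */
#define NCOLORS 64
#define NBINS   16

static unsigned bin[NCOLORS][NBINS];

/* Returns the bucket whose pages include the K-th hottest page;
 * only pages at or above this hotness are candidates for
 * re-coloring under the budget. */
static int hotness_threshold(unsigned k)
{
    for (int j = NBINS - 1; j >= 0; j--) {
        unsigned count = 0;
        for (int i = 0; i < NCOLORS; i++)
            count += bin[i][j];
        if (count >= k)
            return j;       /* budget exhausted in this bucket */
        k -= count;
    }
    return 0;               /* budget covers every page */
}
```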
25
Re-coloring Procedure
[Figure: pages of each color (red, blue, green, gray) sorted by hotness counter value (from 0 up to 100); with a re-coloring budget of 3 pages and a decreasing cache share, only the hottest pages are re-colored]
26
Performance Comparison
4 SPECcpu2k benchmarks (art, equake, mcf, and twolf) running on 2 sibling cores (Intel Core 2 Duo) that share a 4MB L2 cache.
27
Additional Benefit of Hotness-Based Page Coloring
• Page coloring introduces artificial memory pressure
– The app's footprint is larger than its entitled memory color pages, but the system still has an abundance of memory pages
• Allow the app to "steal" other apps' colors, but have it preferentially copy cold pages into those memory colors
[Figure: thread A's footprint (pages A1 through A5) overflowing its entitled colors into other memory colors]
28
Big Picture
[Figure: applications A, B, C, and D; select which applications run together (resource-aware scheduling), then control the resource usage of co-running applications (page coloring or hardware throttling)]
29
Hardware Execution Throttling
• Instead of directly controlling resource allocation, throttle the execution speed of applications that overuse resources
• Available throttling knobs
– Duty-cycle modulation
– Frequency/voltage scaling
– Cache prefetchers
30
Comparing Hardware Execution Throttling to Page Coloring
• Kernel code modification complexity
– Code length: 40 lines in a single file; as a reference, our page-coloring implementation takes 700+ lines of code across 10+ files
• Runtime overhead of configuration
– Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds
31
Existing Mechanism (II): Scheduling Quantum Adjustment
• Shorten the time slice of an app that overuses the cache
• May leave a core idle if no other active thread is available
[Figure: timelines of Core 0 and Core 1; thread B runs continuously on Core 1 while thread A on Core 0 alternates between running and sitting idle]
32
Drawback of Scheduling Quantum Adjustment
Coarse-grained control at scheduling-quantum granularity may result in fluctuating service delays for individual transactions
33
New Mechanism: Hardware Execution Throttling [USENIX'09]
• Throttle the execution speed of an app that overuses the cache
– Duty-cycle modulation
• The CPU works only during duty cycles and stalls during non-duty cycles
• Different from dynamic voltage/frequency scaling
– Per-core vs. per-processor control
– Thermal vs. power management
– Enable/disable cache prefetchers
• L1 prefetchers
– IP: keeps track of the instruction pointer for load history
– DCU: when detecting multiple loads from the same line within a time limit, prefetches the next line
• L2 prefetchers
– Adjacent line: prefetches the line adjacent to the requested data
– Stream: looks at streams of data for regular patterns
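The microsecond-scale configuration cost quoted earlier comes from a single MSR write. Below is a hedged user-space sketch using the IA32_CLOCK_MODULATION MSR (0x19A) through Linux's /dev/cpu/N/msr device; the 3-bit duty-level encoding shown (bit 4 enables modulation, bits 3:1 select the level out of 8) matches Core 2-era parts, and an in-kernel implementation would write the MSR directly rather than going through the msr device:

```c
/* Per-core duty-cycle modulation via IA32_CLOCK_MODULATION (sketch).
 * Assumes the Linux msr module is loaded; the msr device maps the
 * file offset to the MSR number. Illustrative only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A

static int set_duty_level(int cpu, unsigned level /* 1..8 */)
{
    char path[64];
    uint64_t val = 0;

    if (level < 8)                       /* level 8 = full speed */
        val = (1u << 4) | ((level & 7) << 1);

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = pwrite(fd, &val, sizeof val, IA32_CLOCK_MODULATION)
             == sizeof val;
    close(fd);
    return ok ? 0 : -1;
}
```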
34
Comparison of Hardware Execution Throttling to the Other Two Mechanisms
• Comparison to page coloring
– Adds little complexity to the kernel
• Code length: 40 lines in a single file; as a reference, our page-coloring implementation takes 700+ lines of code across 10+ files
– Lightweight to configure
• Read plus write of a register: duty cycle 265 + 350 cycles, prefetcher 298 + 2065 cycles
• Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds
• Comparison to scheduling quantum adjustment
– More fine-grained control
[Figure: Core 0/Core 1 timelines contrasting quantum adjustment (thread A idles for whole quanta) with hardware execution throttling (thread A runs continuously at reduced speed)]
35
Fairness Comparison
• On average, all three mechanisms are effective in improving fairness
• The case {swim, SPECweb} illustrates a limitation of page coloring
• Unfairness factor: the coefficient of variation (deviation-to-mean ratio, σ / μ) of co-running apps' normalized performances (the normalization base is the execution time/throughput when the application monopolizes the whole chip)
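For reference, the unfairness factor reduces to a few lines; norm_perf[] holds each app's performance normalized to running alone on the chip:

```c
/* Unfairness factor: coefficient of variation (sigma/mu) of the
 * co-running apps' normalized performances (sketch). */
#include <math.h>

static double unfairness(const double norm_perf[], int n)
{
    double mean = 0.0, var = 0.0;

    for (int i = 0; i < n; i++)
        mean += norm_perf[i];
    mean /= n;

    for (int i = 0; i < n; i++)
        var += (norm_perf[i] - mean) * (norm_perf[i] - mean);
    var /= n;

    return sqrt(var) / mean;   /* 0 means perfectly fair */
}
```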
36
Performance Comparison
• System efficiency: the geometric mean of co-running apps' normalized performances
• On average, all three mechanisms achieve system efficiency comparable to default sharing
• Cases with severe inter-thread cache conflicts favor segregation, e.g. {swim, mcf}
• Cases with well-interleaved cache accesses favor sharing, e.g. {mcf, mcf}
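The system efficiency metric is equally small; summing logs avoids overflow when many apps are involved:

```c
/* System efficiency: geometric mean of the co-running apps'
 * normalized performances (sketch). */
#include <math.h>

static double system_efficiency(const double norm_perf[], int n)
{
    double log_sum = 0.0;

    for (int i = 0; i < n; i++)
        log_sum += log(norm_perf[i]);

    return exp(log_sum / n);   /* 1.0 = no slowdown vs. running alone */
}
```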
37
Policies for Hardware-Throttling-Based Multicore Management
• User-defined service-level agreements (SLAs)
– Proportional progress among competing threads
• Unfairness metric: coefficient of variation of threads' performance
– Quality-of-service guarantee for high-priority application(s)
• Key challenge
– The throttling configuration space grows exponentially as the number of cores increases
– Quickly determining optimal or close-to-optimal throttling configurations is challenging
38
Model-Driven Iterative Framework
• Customizable performance estimation model
• Reference configuration set and linear approximation
• Currently incorporates duty-cycle modulation and frequency/voltage scaling
• Iterative refinement
• Prediction accuracy improves over time as more configurations are added to the reference set
39
Iterative Refinement Patterns
40
Online Deployment: Hill-Climbing Search Acceleration
• For an m-throttling-level, n-core system, exhaustively predicting the best configuration would require m^n model evaluations
• Hill climbing searches along the best child rather than all children
• Prunes the computation space to (m-1)n² (see the sketch after the figure)
[Figure: search tree over throttling configurations; from (X,Y,Z,U), each child decrements one core's level, e.g. (X-1,Y,Z,U), (X,Y-1,Z,U), (X,Y,Z-1,U), (X,Y,Z,U-1), and the search descends only through the best child at each step]
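A minimal sketch of that search; predict() is a hypothetical callback standing in for the model-driven performance estimate (not the paper's actual model), and cfg[i] is core i's duty-cycle level:

```c
/* Hill-climbing search over throttling configurations (sketch).
 * Each step lowers the one core whose decrement the model scores
 * best, so at most (m-1)*n^2 predictions are made instead of m^n. */
#define NCORES    4
#define MAX_LEVEL 8
#define MIN_LEVEL 4

extern double predict(const int cfg[NCORES]);  /* model estimate */

static void hill_climb(int cfg[NCORES])
{
    for (int i = 0; i < NCORES; i++)
        cfg[i] = MAX_LEVEL;              /* start at full speed */

    for (;;) {
        double best_score = predict(cfg);
        int best_core = -1;

        /* Try decrementing each core's level; keep the best child. */
        for (int i = 0; i < NCORES; i++) {
            if (cfg[i] == MIN_LEVEL)
                continue;
            cfg[i]--;
            double s = predict(cfg);
            cfg[i]++;
            if (s > best_score) {
                best_score = s;
                best_core = i;
            }
        }
        if (best_core < 0)
            break;                        /* local optimum reached */
        cfg[best_core]--;
    }
}
```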
41
Accuracy Evaluation
• Test platform
– A quad-core Nehalem processor with an 8MB shared L3 cache
– Search space from full CPU speed (duty-cycle level 8) to half CPU speed (duty-cycle level 4), giving 369 configurations per test
• Benchmarks: SPECCPU2k
– Set-1: {mesa, art, mcf, equake}
– Set-2: {swim, mgrid, mcf, equake}
– Set-3: {swim, art, equake, twolf}
– Set-4: {swim, applu, equake, twolf}
– Set-5: {swim, mgrid, art, equake}
42
Capability of Satisfying SLAs
• Service-level agreements (SLAs)
– Fairness-oriented: keep the unfairness below a threshold
– QoS-oriented: keep the QoS core's performance above a QoS threshold
• 4 different unfairness/QoS thresholds for the 5 sets
• Optimization goal: satisfy SLAs while optimizing performance or power efficiency

         # passing tests   Avg. num of samples   Avg. performance of picked configs that pass tests
Oracle   39/40             0                     100%
Model    39/40             4.1                   99.4%
Random   25/40             15                    91.1%

Recall that the search space has 369 configurations
43
Accuracy of Performance Estimation
Error Rate = |Prediction - Actual| / Actual
44
Big Picture
[Figure: applications A, B, C, and D; select which applications run together (resource-aware scheduling), then control the resource usage of co-running applications (page coloring or hardware throttling)]
45
Resource-aware Scheduling
• Scheduling decisions can significantly affect performance
46
Similarity Grouping Scheduling
• Group applications with similar cache miss ratios on the same chip
– Separate high- and low-miss-ratio apps onto different chips
• Benefits
– Mitigates cache thrashing
– Avoids over-saturating memory bandwidth
– Engages per-chip DVFS-based power savings
• A single voltage setting applies to all sibling cores on existing multicore chips
• The high-miss-ratio chip runs at low frequency while the low-miss-ratio chip runs at high frequency
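A minimal sketch of the grouping step, with illustrative types (app_t, miss_ratio): sort by last-level-cache miss ratio and pack neighbors onto the same chip, so each chip's frequency can then be set to match its group:

```c
/* Similarity grouping (sketch): apps with similar miss ratios land
 * on the same chip; the high-miss chip can then run at a low
 * frequency and the low-miss chip at a high one. */
#include <stdlib.h>

typedef struct {
    int    id;
    double miss_ratio;   /* last-level-cache misses per access */
} app_t;

static int by_miss_ratio(const void *a, const void *b)
{
    double d = ((const app_t *)a)->miss_ratio -
               ((const app_t *)b)->miss_ratio;
    return (d > 0) - (d < 0);
}

/* Assign napps apps (ids 0..napps-1) to chips of `cores` cores
 * each; chip_of[i] receives app i's chip. */
static void similarity_group(app_t apps[], int napps,
                             int cores, int chip_of[])
{
    qsort(apps, napps, sizeof *apps, by_miss_ratio);
    for (int i = 0; i < napps; i++)
        chip_of[apps[i].id] = i / cores;
}
```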
47
Frequency-to-Performance Model
• Objective: explore power savings with bounded performance loss
• Assumptions
– An application's performance is linearly determined by cache and memory access latencies
– Frequency scaling only affects on-chip accesses
– The miss ratio does not vary across frequencies
Normalized performance at frequency f = T(F) / T(f), where F is the full frequency and T(·) is execution time
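Under those assumptions the model is tiny: execution time splits into an on-chip component that scales with 1/f and a memory component that does not. The parameterization below (cycles_onchip, t_mem) is illustrative, not the paper's exact formulation:

```c
/* Frequency-to-performance model (sketch). */

/* Predicted execution time at frequency f (Hz). */
double exec_time(double cycles_onchip, double t_mem, double f)
{
    return cycles_onchip / f + t_mem;
}

/* Normalized performance at f relative to full frequency F:
 * perf(f) = T(F) / T(f). */
double norm_perf(double cycles_onchip, double t_mem,
                 double F, double f)
{
    return exec_time(cycles_onchip, t_mem, F) /
           exec_time(cycles_onchip, t_mem, f);
}

/* Worked example: 2e9 on-chip cycles and 0.5s of memory stalls,
 * F = 3.0e9 Hz dropping to f = 2.0e9 Hz:
 *   T(F) = 2e9/3e9 + 0.5 ≈ 1.167s,  T(f) = 1.0 + 0.5 = 1.5s,
 *   perf ≈ 0.78, within a 25% degradation bound but not a 10% one. */
```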
48
Model Accuracy
Error Rate = (Prediction - Actual) / Actual
49
Model-based Dynamic Frequency Setting
• Dynamically adjust the CPU frequency based on the currently running application's behavior
– Collect the cache miss ratio every 10 milliseconds
– Calculate an appropriate frequency setting from the performance estimation model
• Guided by a performance degradation threshold (e.g. 10%)
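Combining the two bullets, a sketch of the per-interval decision: walk the available frequency steps (hypothetical values below) from fastest to slowest and keep the slowest one whose predicted normalized performance stays above the threshold, reusing norm_perf() from the model sketch above:

```c
/* Pick the lowest frequency meeting the degradation bound (sketch). */
double norm_perf(double cycles_onchip, double t_mem,
                 double F, double f);   /* from the earlier sketch */

static double pick_frequency(double cycles_onchip, double t_mem,
                             double threshold /* e.g. 0.90 */)
{
    /* Hypothetical frequency steps, descending order. */
    static const double freqs[] = { 3.0e9, 2.67e9, 2.33e9, 2.0e9 };
    const int n = (int)(sizeof freqs / sizeof freqs[0]);
    const double F = freqs[0];
    double best = F;

    /* Keep the slowest setting that still meets the bound. */
    for (int i = 1; i < n; i++) {
        if (norm_perf(cycles_onchip, t_mem, F, freqs[i]) >= threshold)
            best = freqs[i];
        else
            break;
    }
    return best;
}
```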
50
Model-based Dynamic Frequency Setting
51
Hardware Counter-based Power Containers: An OS Resource
• Cross-core activity influence
• Online calibration with actual measurement
• Application-transparent online request-context tracking
52
Power Conditioning Using Power Containers
53
Power Conditioning Achieved Using Targeted Throttling
54
Ongoing Work
• Variation-directed information and management
– Using behavior fluctuation to trigger monitoring
– Supporting fine-grain resource accounting
– Developing policies to reshape behavior for high dependability and low jitter
– Request-level power attribution, modeling, and management
55
See
http://www.cs.rochester.edu/research/cosyn
http://www.cs.rochester.edu/~sandhya
56
Arch/App: Shared Memory ++: DIMM [TRANSACT'06, ISCA'07, ISCA'08, ICS'09]
• Data Isolation (DI)
– Provide control over the propagation of writes
– Buffer writes and allow group undo or propagation
Applications: sandboxing, transactional programming, speculation
• Memory Monitoring (MM)
– Monitor memory at the summary or individual cache-line level
Applications: synchronization/event notification, reliability, security, watchpoints/debugging
http://www.cs.rochester.edu/research/cosyn
57
Arch/App/OS: Protection: Separation of Privileges
Reality: today's programs often consist of multiple modules written by different programmers
Reliability and composability require developing access and interface conventions
58
Sentry: Light-Weight Auxiliary Memory Access Control [ISCA'10]
• Access checks on an L1 miss
– Saves 90x energy
– Simplifies implementation
• A metadata cache (M-cache) is accessed in parallel with the L2 to speed up the check
59
[Figure: grids of permission bits illustrating Sentry's access-control metadata]
60
The Indirection Problem [PACT 2008]
[Figure: P0's Load A goes to A's home node, which forwards it (DG A) to the core holding A in M state; the data returns via the home to P0, leaving A in S state at the sharers. The numbered hops (1, 2, 3) show that longer distance means longer latency]
61
Fine-Grain Data Sharing
Simultaneous access to the same data by more than one core while the data still resides in some L1 cache
Key idea: fine-grain sharing can be leveraged to localize communication
62
Goal: Localize Shared Data Communication
[Figure: P0's Load A is satisfied directly by nearby P4 rather than via A's home node (steps 1, 2). Data availability at P0: 2 vs. 10 physical hops (whether P4 holds A in M or S)]
63
Summary
• Harnessing 50B transistors requires a fresh look at conventional hardware-software boundaries and interfaces, with support for
– Scalable coherence design
– Controlled data sharing, via architectural support for
• Memory monitoring, isolation, and protection
– Controlled resource sharing, via operating-system-level policies for
• Performance isolation
• We have examined coherence protocol additions to allow
– Fast event-based communication
– Fine-grain access control
– Programmable support for isolation
– Low-latency access for fine-grain data sharing
– Software to determine policy decisions in a flexible manner
A combined hardware/software approach to supporting concurrency with improved performance and scalability