1
Controlled Resource and Data Sharing in Multi-Core Platforms*
Sandhya Dwarkadas
Department of Computer Science
University of Rochester
*Joint work with Arrvindh Shriraman, Hemayet Hossain, Xiao Zhang, Hongzhou Zhao, Rongrong Zhong, Michael L. Scott, Michael Huang, Kai Shen
2
The University of Rochester
• Small private research university
• 4400 undergraduates
• 2800 graduate students
• Set on the Genesee River in Western NY State, near the south shore of Lake Ontario
• 250km by road from Toronto; 590km from New York City
3
4
The Computer Science Dept.
• 15 tenure-track faculty; 45 Ph.D. students
• Specializing in AI, theory, and parallel and distributed systems
• Among the best small departments in the US
5
The Hardware-Software Interface
[Figure: overview of research projects spanning the hardware-software interface. Distributed shared memory: TreadMarks, Cashmere-2L, InterWeave (an InterWeave server and IW libraries with caches linking a Java handheld device, a Fortran/C cluster, and a C/C++ desktop over the Internet). Concurrency (coherence, synchronization, consistency): RTM, FlexTM. Memory systems: ARMCO, DDCache, LADLE. Power-aware computing: CAP, MCD. Multi-cores: FCS. Resource-aware OS scheduling: DT-CMT. Distributed systems and peer-to-peer systems: Willow. Operating systems. Protection support: Sentry. RPPT]
6
The Implications of Technology Scaling
• Many more transistors for compute power
• Energy constraints
• Large volumes of data
• High-speed communication
• Concurrency (parallel or distributed)
• Need support for
– Scalable sharing
– Reliability
– Protection and security
– Performance isolation
7
Multi-Core Challenges
• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing
8
Current Projects
• CoSyn: Communication and Synchronization Mechanisms for Emerging Multi-Core Processors
– Collaboration with Professors Michael Scott and Michael Huang
– Arrvindh Shriraman, Hemayet Hossain, Hongzhou Zhao
• Operating System-Level Resource Management in the Multi-Core Era
– Collaboration with Professor Kai Shen
– Xiao Zhang and Rongrong Zhong
See http://www.cs.rochester.edu/research/cosyn
and http://www.cs.rochester.edu/~sandhya
9
Multi-Core Challenges
• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing
10
Performance Isolation
11
Resource Sharing is (and will be) Ubiquitous!
• Floating point, integer, state, and cache with multiple threads on a core
• Second-level cache with multiple cores on a chip
• Interconnect bandwidth on multiprocessors
Sun UltraSparc T1, … Intel's 6-core (12-thread), … AMD's 12-core, …
12
Resource Sharing on Multicore Chip
• Memory bandwidth and the last-level cache are commonly shared by sibling cores sitting on the same chip
[Die photo: http://download.intel.com/pressroom/kits/45nm/Penryn Die Photo_300.jpg]
-
4
13
Resource Management To Date
• Capitalistic: generation of more requests results in more resource usage
– Performance: resource contention can result in significantly reduced overall performance
– Fairness: an equal time slice does not necessarily guarantee equal progress
14
Poor Performance Due to Uncontrolled Resource Contention
Experiments were conducted on a 3GHz Intel Core 2 Duo processor with a shared 4MB L2 cache
Win-win situation
15
Fluctuating Performance Due to Uncontrolled Resource Contention
Performance of art when co-running with different applications on an Intel dual-core processor with a 4MB shared L2 cache
16
Fairness and Security Concerns
• Priority inversion
• Poor fairness among competing applications
• Information leakage at the chip level
• Denial-of-service attacks at the chip level
17
Big Picture
[Figure: applications A, B, C, and D; first select which applications run together (resource-aware scheduling [USENIX'10]), then control the resource usage of co-running applications (page coloring [Eurosys'09] or hardware throttling [USENIX'09])]
18
Existing Mechanism (I): Software-Based Page Coloring
[Figure: threads A and B share a cache (Way-1 through Way-n); thread A's footprint (memory pages A1 through A5) maps to its assigned colors]
• Classic technique to reduce cache misses, now used by the OS to manage cache partitioning
• Partitions the cache at coarse granularity
• No need for hardware support
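To make the mechanism concrete, here is a minimal sketch of how a page's color can be derived from its physical address, assuming illustrative parameters (a 4MB, 16-way set-associative L2 with 64-byte lines and 4KB pages) rather than the exact platform from the talk:

```c
/* Minimal page-coloring sketch. Cache sets are indexed by physical
 * address bits above the line offset; a page's color is the overlap
 * between the set-index bits and the page frame number, so pages of
 * different colors can never conflict in the cache. Parameters are
 * illustrative, not taken from the talk's hardware. */
#include <stdio.h>

#define CACHE_SIZE    (4 * 1024 * 1024)  /* 4MB shared L2 */
#define WAYS          16                 /* associativity */
#define LINE_SIZE     64                 /* bytes per line */
#define PAGE_SIZE     4096               /* 4KB pages */

#define NUM_SETS      (CACHE_SIZE / (WAYS * LINE_SIZE))
#define SETS_PER_PAGE (PAGE_SIZE / LINE_SIZE)
#define NUM_COLORS    (NUM_SETS / SETS_PER_PAGE)  /* 64 colors here */

static unsigned page_color(unsigned long phys_addr)
{
    /* Page frame number modulo the number of colors. */
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    printf("%d colors; page at 0x40e000 has color %u\n",
           NUM_COLORS, page_color(0x40e000UL));
    return 0;
}
```

Pages that share a color compete for the same cache sets, so assigning threads disjoint colors partitions the cache without any hardware support.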
19
Drawbacks of Page Coloring • Expensive re-coloring cost
– Prohibitive in a dynamic environment where
frequent re-coloring may be necessary
• Complex memory management
– Introduces artificial memory pressure
Thread A
Thread B
Shared Cache
Way-1 Way-n …………
Memory page
A1
A2
A3
A4
A5
Thread A’s footprint
20
Toward Practical Page Coloring
• Hotness-based page coloring
– Efficiently find a small group of hot pages
– Restrict page coloring or re-coloring to hot pages
– Pay less re-coloring overhead while achieving most of the cache-partitioning benefit (separate competing applications' most frequently accessed pages)
• Key challenge
– An efficient way to track page hotness
21
Methods to Track Page Hotness
• Using page protection
– Capture page accesses by triggering page faults
– Microseconds of overhead per page fault
• Using access bits
– A single bit stored in each Page Table Entry (PTE)
– Generally available on x86; automatically set by hardware upon page access
– Tens of cycles per page-table-entry check
– Recycle spare bits in the PTE as a hotness counter; the counter is aged to reflect recency and frequency
22
Sampling of Access Bits
• Decouple the sampling frequency and window
– Hotness-sampling accuracy is determined by the sampling time window T
– Hotness-sampling overhead is determined by the sampling interval N
[Figure: timeline; all access bits are cleared at times 0, N, 2N, 3N, 4N, … and checked at T, N+T, 2N+T, 3N+T, 4N+T, …]
In our experiments, T = 2 milliseconds and N = 100 or 10 milliseconds
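A minimal sketch of the clear/check cycle, assuming a simplified flat page table (a pte[] array with an accessed flag) rather than real x86 paging structures; the aging step shown is one plausible way to blend recency and frequency into the recycled PTE bits:

```c
/* Access-bit sampling with an aged hotness counter (sketch). Every
 * period N the bits are cleared; T milliseconds later they are
 * checked and each page's counter is aged, so recent accesses
 * dominate while older ones decay. */
#include <stdint.h>

#define NPAGES   1024
#define ACCESSED 0x20u        /* accessed bit, as in an x86 PTE */

static uint32_t pte[NPAGES];      /* toy page table */
static uint8_t  hotness[NPAGES];  /* aged per-page counter */

static void clear_access_bits(void)   /* run at times 0, N, 2N, ... */
{
    for (int i = 0; i < NPAGES; i++)
        pte[i] &= ~ACCESSED;
}

static void check_access_bits(void)   /* run at T, N+T, 2N+T, ... */
{
    for (int i = 0; i < NPAGES; i++) {
        /* Exponential aging: halve the old value, then add the
         * newly observed access. */
        hotness[i] = (uint8_t)((hotness[i] >> 1) +
                               ((pte[i] & ACCESSED) ? 128 : 0));
    }
}
```

With T = 2ms and N = 10ms or 100ms, each PTE is touched only briefly per period, which is where the low overhead comes from.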
23
Miss-Ratio-Curve Driven Cache Partition Policy
[Figure: miss-ratio curves for threads A and B plotted against cache allocation from 0 to 4MB (the curves fall from miss ratios of 0.5 and 0.7 toward 0.2 and 0.3 as the allocation grows), with the optimal partition point marked. Constraint: Cache Size = Σ(A,B) Cache Allocation; the system optimization metric is given in the figure]
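As a sketch of how such a policy can pick the partition point, the loop below scans complementary allocations of a fixed cache budget; the cost function here (the sum of the two miss ratios) is only a stand-in for whatever system optimization metric the figure actually uses:

```c
/* Choose a partition point from two miss-ratio curves (sketch).
 * mrc_a[i] / mrc_b[i] give each thread's miss ratio when granted
 * i cache units out of TOTAL; allocations are complementary. */
#define TOTAL 64   /* cache allocation units, e.g. 64KB granules */

static int best_partition(const double mrc_a[TOTAL + 1],
                          const double mrc_b[TOTAL + 1])
{
    int best = 0;
    double best_cost = mrc_a[0] + mrc_b[TOTAL];

    /* A gets i units, B gets the rest; keep the cheapest split. */
    for (int i = 1; i <= TOTAL; i++) {
        double cost = mrc_a[i] + mrc_b[TOTAL - i];
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;   /* A's share; B receives TOTAL - best */
}
```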
24
Hot Page Coloring
• Budget control of page re-coloring overhead
– A percentage of the time slice, e.g. 5%
• Recolor from the hottest page until the budget is reached
– Maintain a set of hotness bins during sampling
• bin[i][j] = number of pages in color i with normalized hotness in range [j, j+1]
– Given a budget K, the K-th hottest page's hotness value is estimated in constant time by searching the hotness bins
– Make sure hot pages are uniformly distributed among colors
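A minimal sketch of the constant-time threshold estimate, assuming NCOLORS colors and NBINS hotness buckets; the cost depends only on the bin dimensions, not on the number of pages:

```c
/* Estimate the hotness threshold of the K-th hottest page from the
 * hotness bins (sketch). bin[i][j] counts pages of color i whose
 * normalized hotness falls in bucket j. */
#define NCOLORS 64
#define NBINS   16

static unsigned bin[NCOLORS][NBINS];

/* Returns the bucket whose pages include the K-th hottest page;
 * only pages at or above this hotness are candidates for
 * re-coloring under the budget. */
static int hotness_threshold(unsigned k)
{
    for (int j = NBINS - 1; j >= 0; j--) {
        unsigned count = 0;
        for (int i = 0; i < NCOLORS; i++)
            count += bin[i][j];
        if (count >= k)
            return j;       /* budget exhausted in this bucket */
        k -= count;
    }
    return 0;               /* budget covers every page */
}
```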
25
Re-coloring Procedure
[Figure: pages of each color (red, blue, green, gray) sorted by hotness counter value (from 0 up to 100); with a re-coloring budget of 3 pages and a decreasing cache share, only the hottest pages are re-colored]
26
Performance Comparison
4 SPECcpu2k benchmarks (art, equake, mcf, and twolf) running on 2 sibling cores (Intel Core 2 Duo) that share a 4MB L2 cache.
27
Additional Benefit of Hotness-Based Page Coloring
• Page coloring introduces artificial memory pressure
– The app's footprint is larger than its entitled memory color pages, but the system still has an abundance of memory pages
• Allow the app to "steal" other apps' colors, but have it preferentially copy cold pages into those memory colors
[Figure: thread A's footprint (pages A1 through A5) overflowing its entitled colors into other memory colors]
28
Big Picture
[Figure: applications A, B, C, and D; select which applications run together (resource-aware scheduling), then control the resource usage of co-running applications (page coloring or hardware throttling)]
29
Hardware Execution Throttling
• Instead of directly controlling resource allocation, throttle the execution speed of applications that overuse resources
• Available throttling knobs
– Duty-cycle modulation
– Frequency/voltage scaling
– Cache prefetchers
30
Comparing Hardware Execution Throttling to Page Coloring
• Kernel code modification complexity
– Code length: 40 lines in a single file; as a reference, our page-coloring implementation takes 700+ lines of code across 10+ files
• Runtime overhead of configuration
– Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds
31
Existing Mechanism (II): Scheduling Quantum Adjustment
• Shorten the time slice of an app that overuses the cache
• May leave a core idle if no other active thread is available
[Figure: timelines of Core 0 and Core 1; thread B runs continuously on Core 1 while thread A on Core 0 alternates between running and sitting idle]
32
Drawback of Scheduling Quantum Adjustment
Coarse-grained control at scheduling-quantum granularity may result in fluctuating service delays for individual transactions
33
New Mechanism: Hardware Execution Throttling [USENIX'09]
• Throttle the execution speed of an app that overuses the cache
– Duty-cycle modulation
• The CPU works only during duty cycles and stalls during non-duty cycles
• Different from dynamic voltage/frequency scaling
– Per-core vs. per-processor control
– Thermal vs. power management
– Enable/disable cache prefetchers
• L1 prefetchers
– IP: keeps track of the instruction pointer for load history
– DCU: when detecting multiple loads from the same line within a time limit, prefetches the next line
• L2 prefetchers
– Adjacent line: prefetches the line adjacent to the requested data
– Stream: looks at streams of data for regular patterns
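The microsecond-scale configuration cost quoted earlier comes from a single MSR write. Below is a hedged user-space sketch using the IA32_CLOCK_MODULATION MSR (0x19A) through Linux's /dev/cpu/N/msr device; the 3-bit duty-level encoding shown (bit 4 enables modulation, bits 3:1 select the level out of 8) matches Core 2-era parts, and an in-kernel implementation would write the MSR directly rather than going through the msr device:

```c
/* Per-core duty-cycle modulation via IA32_CLOCK_MODULATION (sketch).
 * Assumes the Linux msr module is loaded; the msr device maps the
 * file offset to the MSR number. Illustrative only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A

static int set_duty_level(int cpu, unsigned level /* 1..8 */)
{
    char path[64];
    uint64_t val = 0;

    if (level < 8)                       /* level 8 = full speed */
        val = (1u << 4) | ((level & 7) << 1);

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = pwrite(fd, &val, sizeof val, IA32_CLOCK_MODULATION)
             == sizeof val;
    close(fd);
    return ok ? 0 : -1;
}
```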
34
Comparison of Hardware Execution Throttling to the Other Two Mechanisms
• Comparison to page coloring
– Adds little complexity to the kernel
• Code length: 40 lines in a single file; as a reference, our page-coloring implementation takes 700+ lines of code across 10+ files
– Lightweight to configure
• Read plus write of a register: duty cycle 265 + 350 cycles, prefetcher 298 + 2065 cycles
• Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds
• Comparison to scheduling quantum adjustment
– More fine-grained control
[Figure: Core 0/Core 1 timelines contrasting quantum adjustment (thread A idles for whole quanta) with hardware execution throttling (thread A runs continuously at reduced speed)]
35
Fairness Comparison
• On average, all three mechanisms are effective in improving fairness
• The case {swim, SPECweb} illustrates a limitation of page coloring
• Unfairness factor: the coefficient of variation (deviation-to-mean ratio, σ / μ) of co-running apps' normalized performances (the normalization base is the execution time/throughput when the application monopolizes the whole chip)
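For reference, the unfairness factor reduces to a few lines; norm_perf[] holds each app's performance normalized to running alone on the chip:

```c
/* Unfairness factor: coefficient of variation (sigma/mu) of the
 * co-running apps' normalized performances (sketch). */
#include <math.h>

static double unfairness(const double norm_perf[], int n)
{
    double mean = 0.0, var = 0.0;

    for (int i = 0; i < n; i++)
        mean += norm_perf[i];
    mean /= n;

    for (int i = 0; i < n; i++)
        var += (norm_perf[i] - mean) * (norm_perf[i] - mean);
    var /= n;

    return sqrt(var) / mean;   /* 0 means perfectly fair */
}
```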
36
Performance Comparison
• System efficiency: the geometric mean of co-running apps' normalized performances
• On average, all three mechanisms achieve system efficiency comparable to default sharing
• Cases with severe inter-thread cache conflicts favor segregation, e.g. {swim, mcf}
• Cases with well-interleaved cache accesses favor sharing, e.g. {mcf, mcf}
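The system efficiency metric is equally small; summing logs avoids overflow when many apps are involved:

```c
/* System efficiency: geometric mean of the co-running apps'
 * normalized performances (sketch). */
#include <math.h>

static double system_efficiency(const double norm_perf[], int n)
{
    double log_sum = 0.0;

    for (int i = 0; i < n; i++)
        log_sum += log(norm_perf[i]);

    return exp(log_sum / n);   /* 1.0 = no slowdown vs. running alone */
}
```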
37
Policies for Hardware-Throttling-Based Multicore Management
• User-defined service-level agreements (SLAs)
– Proportional progress among competing threads
• Unfairness metric: coefficient of variation of threads' performance
– Quality-of-service guarantee for high-priority application(s)
• Key challenge
– The throttling configuration space grows exponentially as the number of cores increases
– Quickly determining optimal or close-to-optimal throttling configurations is challenging
38
Model-Driven Iterative Framework
• Customizable performance estimation model
• Reference configuration set and linear approximation
• Currently incorporates duty-cycle modulation and frequency/voltage scaling
• Iterative refinement
• Prediction accuracy improves over time as more configurations are added to the reference set
39
Iterative Refinement Patterns
40
Online Deployment: Hill-Climbing Search Acceleration
• For an m-throttling-level, n-core system, exhaustively predicting the best configuration would require m^n model evaluations
• Hill climbing searches along the best child rather than all children
• Prunes the computation space to (m-1)n² (see the sketch after the figure)
[Figure: search tree over throttling configurations; from (X,Y,Z,U), each child decrements one core's level, e.g. (X-1,Y,Z,U), (X,Y-1,Z,U), (X,Y,Z-1,U), (X,Y,Z,U-1), and the search descends only through the best child at each step]
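A minimal sketch of that search; predict() is a hypothetical callback standing in for the model-driven performance estimate (not the paper's actual model), and cfg[i] is core i's duty-cycle level:

```c
/* Hill-climbing search over throttling configurations (sketch).
 * Each step lowers the one core whose decrement the model scores
 * best, so at most (m-1)*n^2 predictions are made instead of m^n. */
#define NCORES    4
#define MAX_LEVEL 8
#define MIN_LEVEL 4

extern double predict(const int cfg[NCORES]);  /* model estimate */

static void hill_climb(int cfg[NCORES])
{
    for (int i = 0; i < NCORES; i++)
        cfg[i] = MAX_LEVEL;              /* start at full speed */

    for (;;) {
        double best_score = predict(cfg);
        int best_core = -1;

        /* Try decrementing each core's level; keep the best child. */
        for (int i = 0; i < NCORES; i++) {
            if (cfg[i] == MIN_LEVEL)
                continue;
            cfg[i]--;
            double s = predict(cfg);
            cfg[i]++;
            if (s > best_score) {
                best_score = s;
                best_core = i;
            }
        }
        if (best_core < 0)
            break;                        /* local optimum reached */
        cfg[best_core]--;
    }
}
```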
41
Accuracy Evaluation
• Test platform
– A quad-core Nehalem processor with an 8MB shared L3 cache
– Search space from full CPU speed (duty-cycle level 8) to half CPU speed (duty-cycle level 4), giving 369 configurations per test
• Benchmarks: SPECCPU2k
– Set-1: {mesa, art, mcf, equake}
– Set-2: {swim, mgrid, mcf, equake}
– Set-3: {swim, art, equake, twolf}
– Set-4: {swim, applu, equake, twolf}
– Set-5: {swim, mgrid, art, equake}
42
Capability of Satisfying SLAs
• Service-level agreements (SLAs)
– Fairness-oriented: keep the unfairness below a threshold
– QoS-oriented: keep the QoS core's performance above a QoS threshold
• 4 different unfairness/QoS thresholds for the 5 sets
• Optimization goal: satisfy SLAs while optimizing performance or power efficiency

         # passing tests   Avg. num of samples   Avg. performance of picked configs that pass tests
Oracle   39/40             0                     100%
Model    39/40             4.1                   99.4%
Random   25/40             15                    91.1%

Recall that the search space has 369 configurations
43
Accuracy of Performance Estimation
Error Rate = |Prediction - Actual| / Actual
44
Big Picture
[Figure: applications A, B, C, and D; select which applications run together (resource-aware scheduling), then control the resource usage of co-running applications (page coloring or hardware throttling)]
45
Resource-aware Scheduling
• Scheduling decisions can significantly affect performance
46
Similarity Grouping Scheduling
• Group applications with similar cache miss ratios on the same chip
– Separate high- and low-miss-ratio apps onto different chips
• Benefits
– Mitigates cache thrashing
– Avoids over-saturating memory bandwidth
– Engages per-chip DVFS-based power savings
• A single voltage setting applies to all sibling cores on existing multicore chips
• The high-miss-ratio chip runs at low frequency while the low-miss-ratio chip runs at high frequency
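A minimal sketch of the grouping step, with illustrative types (app_t, miss_ratio): sort by last-level-cache miss ratio and pack neighbors onto the same chip, so each chip's frequency can then be set to match its group:

```c
/* Similarity grouping (sketch): apps with similar miss ratios land
 * on the same chip; the high-miss chip can then run at a low
 * frequency and the low-miss chip at a high one. */
#include <stdlib.h>

typedef struct {
    int    id;
    double miss_ratio;   /* last-level-cache misses per access */
} app_t;

static int by_miss_ratio(const void *a, const void *b)
{
    double d = ((const app_t *)a)->miss_ratio -
               ((const app_t *)b)->miss_ratio;
    return (d > 0) - (d < 0);
}

/* Assign napps apps (ids 0..napps-1) to chips of `cores` cores
 * each; chip_of[i] receives app i's chip. */
static void similarity_group(app_t apps[], int napps,
                             int cores, int chip_of[])
{
    qsort(apps, napps, sizeof *apps, by_miss_ratio);
    for (int i = 0; i < napps; i++)
        chip_of[apps[i].id] = i / cores;
}
```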
47
Frequency-to-Performance Model
• Objective: explore power savings with bounded performance loss
• Assumptions
– An application's performance is linearly determined by cache and memory access latencies
– Frequency scaling only affects on-chip accesses
– The miss ratio does not vary across frequencies
Normalized performance at frequency f = T(F) / T(f), where F is the full frequency and T(·) is execution time
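Under those assumptions the model is tiny: execution time splits into an on-chip component that scales with 1/f and a memory component that does not. The parameterization below (cycles_onchip, t_mem) is illustrative, not the paper's exact formulation:

```c
/* Frequency-to-performance model (sketch). */

/* Predicted execution time at frequency f (Hz). */
double exec_time(double cycles_onchip, double t_mem, double f)
{
    return cycles_onchip / f + t_mem;
}

/* Normalized performance at f relative to full frequency F:
 * perf(f) = T(F) / T(f). */
double norm_perf(double cycles_onchip, double t_mem,
                 double F, double f)
{
    return exec_time(cycles_onchip, t_mem, F) /
           exec_time(cycles_onchip, t_mem, f);
}

/* Worked example: 2e9 on-chip cycles and 0.5s of memory stalls,
 * F = 3.0e9 Hz dropping to f = 2.0e9 Hz:
 *   T(F) = 2e9/3e9 + 0.5 ≈ 1.167s,  T(f) = 1.0 + 0.5 = 1.5s,
 *   perf ≈ 0.78, within a 25% degradation bound but not a 10% one. */
```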
48
Model Accuracy
Error Rate = (Prediction - Actual) / Actual
49
Model-based Dynamic Frequency Setting
• Dynamically adjust the CPU frequency based on the currently running application's behavior
– Collect the cache miss ratio every 10 milliseconds
– Calculate an appropriate frequency setting from the performance estimation model
• Guided by a performance degradation threshold (e.g. 10%)
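Combining the two bullets, a sketch of the per-interval decision: walk the available frequency steps (hypothetical values below) from fastest to slowest and keep the slowest one whose predicted normalized performance stays above the threshold, reusing norm_perf() from the model sketch above:

```c
/* Pick the lowest frequency meeting the degradation bound (sketch). */
double norm_perf(double cycles_onchip, double t_mem,
                 double F, double f);   /* from the earlier sketch */

static double pick_frequency(double cycles_onchip, double t_mem,
                             double threshold /* e.g. 0.90 */)
{
    /* Hypothetical frequency steps, descending order. */
    static const double freqs[] = { 3.0e9, 2.67e9, 2.33e9, 2.0e9 };
    const int n = (int)(sizeof freqs / sizeof freqs[0]);
    const double F = freqs[0];
    double best = F;

    /* Keep the slowest setting that still meets the bound. */
    for (int i = 1; i < n; i++) {
        if (norm_perf(cycles_onchip, t_mem, F, freqs[i]) >= threshold)
            best = freqs[i];
        else
            break;
    }
    return best;
}
```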
50
Model-based Dynamic Frequency Setting
51
Hardware Counter-based Power Containers: An OS Resource
• Cross-core activity influence
• Online calibration with actual measurement
• Application-transparent online request-context tracking
52
Power Conditioning Using Power Containers
53
Power Conditioning Achieved Using Targeted Throttling
54
Ongoing Work
• Variation-directed information and management
– Using behavior fluctuation to trigger monitoring
– Supporting fine-grain resource accounting
– Developing policies to reshape behavior for high dependability and low jitter
– Request-level power attribution, modeling, and management
55
See
http://www.cs.rochester.edu/research/cosyn
http://www.cs.rochester.edu/~sandhya
56
Arch/App: Shared Memory ++: DIMM [TRANSACT'06, ISCA'07, ISCA'08, ICS'09]
• Data Isolation (DI)
– Provide control over the propagation of writes
– Buffer writes and allow group undo or propagation
Applications: sandboxing, transactional programming, speculation
• Memory Monitoring (MM)
– Monitor memory at the summary or individual cache-line level
Applications: synchronization/event notification, reliability, security, watchpoints/debugging
http://www.cs.rochester.edu/research/cosyn
57
Arch/App/OS: Protection: Separation of Privileges
Reality: today's programs often consist of multiple modules written by different programmers
Reliability and composability require developing access and interface conventions
58
Sentry: Light-Weight Auxiliary Memory Access Control [ISCA'10]
• Access checks on an L1 miss
– Saves 90x energy
– Simplifies implementation
• A metadata cache (M-cache) is accessed in parallel with the L2 to speed up the check
59
[Figure: grids of permission bits illustrating Sentry's access-control metadata]
60
The Indirection Problem [PACT 2008]
[Figure: P0's Load A goes to A's home node, which forwards it (DG A) to the core holding A in M state; the data returns via the home to P0, leaving A in S state at the sharers. The numbered hops (1, 2, 3) show that longer distance means longer latency]
61
Fine-Grain Data Sharing
Simultaneous access to the same data by more than one core while the data still resides in some L1 cache
Key idea: fine-grain sharing can be leveraged to localize communication
62
Goal: Localize Shared Data Communication
[Figure: P0's Load A is satisfied directly by nearby P4 rather than via A's home node (steps 1, 2). Data availability at P0: 2 vs. 10 physical hops (whether P4 holds A in M or S)]
63
Summary
• Harnessing 50B transistors requires a fresh look at conventional hardware-software boundaries and interfaces, with support for
– Scalable coherence design
– Controlled data sharing, via architectural support for
• Memory monitoring, isolation, and protection
– Controlled resource sharing, via operating-system-level policies for
• Performance isolation
• We have examined coherence protocol additions to allow
– Fast event-based communication
– Fine-grain access control
– Programmable support for isolation
– Low-latency access for fine-grain data sharing
– Software to determine policy decisions in a flexible manner
A combined hardware/software approach to supporting concurrency with improved performance and scalability