TRANSCRIPT
Accelerating Asynchronous Programs through Event Sneak Peek
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
17 June 2015
University of Michigan, Electrical Engineering and Computer Science
Asynchronous programs are ubiquitous: Mobile Web, Internet-of-Things, Servers (node.js), Sensor networks
Asynchronous programming hides I/O latency
[Figure: a synchronous sequential model runs Task 1, Task 2, and Task 3 back-to-back, waiting for I/O between them; the asynchronous model overlaps computation with the I/O waits, yielding a speedup]
Asynchronous programming is well-suited to handle a wide array of asynchronous inputs
• Computation is driven by events
• The Hollywood Principle (“Don’t call us, we’ll call you”)
Illustration: Asynchronous Programming Model
[Figure: a looper thread waits on the event queue and pops one event at a time for execution, e.g., onClick, getLocation, onImageLoad]
Conventional architecture is not optimized for asynchronous programs
• Short events execute varied tasks
• Large instruction footprint destroys cache locality
• Little hot code causes poor branch prediction
Large performance improvement potential in asynchronous programs

                                Web Apps   SPECint 2006   PARSEC
L1-I MPKI                          24          2.3          0.7
L1-D miss rate (%)                 4.4         5.5          1.3
Branch misprediction rate (%)      9.8         8.4          6.3

Maximum performance improvement for Web Apps (%): 52 (I-cache), 69 (branch predictor), 79 (D-cache)
Execute asynchronous programs on a specialized Event Sneak Peek (ESP) core
[Figure: heterogeneous multi-core processor with conventional CPU cores plus an ESP core]
[Figure: the browser engine runs asynchronous JavaScript events alongside the WebCore stages Parse, CSS, Layout, and Render, which prior work accelerates (Zhu & Reddi, ISCA ’14)]
How to customize a core for asynchronous programs?
HTML5 asynchronous programming model guarantees sequential execution of events
Opportunity: Event-Level Parallelism (ELP)
• The looper thread has advance knowledge of future events waiting in the event queue
• Events are functionally independent
How to exploit this ELP?
#1: Parallel Execution
• Events in the queue are not provably independent
#2: Optimistic Concurrency
• Speculative parallelization (e.g., transactions)
• >99% of event pairs conflict, primarily through low-level memory dependencies
– Maintenance code
– Memory pool recycling
– …

Observation
• 98% of events “match” with 99% accuracy
– Control flow paths
– Addresses

Speculative pre-execution is a good match for this observation
How to customize a core for asynchronous programs?
Exploit ELP using speculative pre-execution
ESP Design: Expose event queue to hardware
[Figure: the software event queue is exposed through the ISA to a hardware (H/W) event queue]
ESP Design: Speculatively pre-execute future events on stalls
On a long stall (e.g., an LLC miss), the core jumps ahead to a future event: it isolates the speculative updates, memoizes the warm-up information it gathers, and later triggers that warm-up when the event actually executes, possibly millions of instructions later, for a speedup.
Realizing ESP design
Isolation Memoization Triggering
• Correctness: isolate speculative updates
• Performance: avoid destructive interference between execution contexts
Isolation of multiple execution contexts
• State to isolate: register state, memory state, branch predictor
• Register state: each ESP context gets its own PC and RRAT
• Memory state: small I- and D-Cachelets alongside the L1-I and L1-D caches isolate speculative updates
– Performance: avoid L1 pollution; cachelets capture 95% of reuse
• Branch predictor: the PIR tracks path history; isolating just the PIR (not the predictor tables) is adequate
Realizing ESP design
Isolation Memoization Triggering
Memoization of architectural bottlenecks
• Warm-up during speculative pre-execution alone is ineffective: future events might execute millions of instructions later
• I-List and D-List: record instruction and data addresses, along with the instruction count
• B-List: record branch outcomes (branch address, directions and targets), along with the instruction count
Realizing ESP design
Isolation Memoization Triggering
Triggering timely prefetches using memoized information
• Use the memoized lists to launch timely prefetches, and warm up the branch predictor ahead of branches
• Start a prefetch once the current instruction count comes within ~100 instructions of the count recorded with the address (current instruction count + ~100 > recorded count)
Baseline Architecture
[Figure: core pipeline with fetch unit (PC), RRAT, branch predictor (predictor tables + PIR), L1-I and L1-D caches with next-line (NL-I) and next-line + stride (NL-D,S) prefetchers, backed by an L2 cache; the event queue lives in software]
ESP Architecture
[Figure: the baseline core extended incrementally with an ESP mode and a hardware event queue; per-context PC and PIR registers; I- and D-Cachelets beside the L1 caches; the I-, D-, and B-Lists feeding the prefetchers; and finally two lookahead contexts, ESP-1 and ESP-2]
Methodology
• Timing: trace-driven simulator (Sniper)
– Instrumented Chromium
– Collected and simulated traces of JavaScript events
• Energy: McPAT and CACTI
Architectural Model
Core: 4-wide issue, OoO, 1.66 GHz
L1-(I,D) Cache: 32 KB, 2-way
L2 Cache: 2 MB, 16-way
Energy Modeling: Vdd = 1.2 V, 32 nm
Limitations of Runahead [Dundas, et al. ’97; Mutlu, et al. ’03]
• Runahead speculatively pre-executes past a data cache miss within the current event
• Reduces data cache misses, which are not a significant problem in web applications
• Cannot mitigate I-cache misses
• Does not exploit ELP
– No notion of events
– Future events are a rich source of independent instructions
Events are short

Web App    Action           # Events   # Instructions   Event Size (instr)
amazon     Buy headphones   7,787      433 million      55k
bing                                                    53k
cnn                                                     91k
facebook                                                232k
gdocs                                                   372k
gmaps                                                   472k
pixlr                                                   56k
• Short events execute varied tasks
• Large instruction footprint destroys cache locality
• Little hot code causes poor branch prediction
ESP outperforms other designs
Performance improvement w.r.t. no prefetching (%); baseline: next-line (NL) + stride prefetching
• Baseline: 14.0
• Runahead: 12.5; Runahead + NL: 21.3
• ESP: 21.8; ESP + NL: 32.1
Largest performance improvement comes from improved I-cache performance

Performance improvement (%)
       I-Cache   Branch Predictor   D-Cache
Max      52            69              79
ESP      21            28              32
ESP consumes less static energy, but expends more dynamic energy
ESP executes 21% more instructions, but consumes only 8% more energy
[Chart: static and dynamic energy of NL and ESP, normalized to no prefetching]
Hardware area overhead
[Chart: area of the cachelets, lists, and registers for ESP-1 and ESP-2; labeled values 12.6 KB and 1.2 KB]
Summary
• Accelerators for asynchronous programs
• ESP exploits Event-Level Parallelism (ELP)
– Expose event queue to hardware
– Speculatively pre-execute future events
• Performance improvement: 16%
Jumping ahead two events is sufficient
[Chart: number of cache lines (log scale, 1 to 10,000) needed to capture Max / 95% / 85% of reuse, for Normal execution and lookahead depths ESP1 through ESP8]
Impact of JS execution on response time [Chow, et al. ’14]
[Figure: breakdown of client delay into JavaScript, DOM, CSS, network, and server components]