TRANSCRIPT
Accelerating Asynchronous Programs through Event Sneak Peek
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
17 June 2015
University of Michigan, Electrical Engineering and Computer Science
Asynchronous programs are ubiquitous: Mobile Web, Internet-of-Things, Servers (node.js), Sensor networks
Asynchronous programming hides I/O latency
[Figure: a synchronous sequential model runs Task 1, Task 2, and Task 3 back-to-back, waiting for I/O between them; the asynchronous model overlaps computation with the I/O waits, yielding a speedup]
Asynchronous programming is well-suited to handle a wide array of asynchronous inputs
• Computation is driven by events
• The Hollywood Principle (“Don’t call us, we’ll call you”)
Illustration: Asynchronous Programming Model
[Figure: a looper thread waits on the event queue and pops one event at a time for execution, e.g., onClick, getLocation, onImageLoad]
Conventional architecture is not optimized for asynchronous programs
• Short events execute varied tasks
• Large instruction footprint destroys cache locality
• Little hot code causes poor branch prediction
Large performance improvement potential in asynchronous programs

                                Web Apps   SPECint 2006   PARSEC
L1-I MPKI                          24          2.3          0.7
L1-D miss rate (%)                 4.4         5.5          1.3
Branch misprediction rate (%)      9.8         8.4          6.3

Maximum performance improvement for Web Apps (%): 52 (I-cache), 69 (branch predictor), 79 (D-cache)
Execute asynchronous programs on a specialized Event Sneak Peek (ESP) core
[Figure: heterogeneous multi-core processor with conventional CPU cores plus an ESP core]
[Figure: the browser engine runs asynchronous JavaScript events alongside the WebCore stages Parse, CSS, Layout, and Render, which prior work accelerates (Zhu & Reddi, ISCA ’14)]
How to customize a core for asynchronous programs?
HTML5 asynchronous programming model guarantees sequential execution of events
Opportunity: Event-Level Parallelism (ELP)
• The looper thread has advance knowledge of future events waiting in the event queue
• Events are functionally independent
How to exploit this ELP?
#1: Parallel Execution
• Events in the queue are not provably independent
#2: Optimistic Concurrency
• Speculative parallelization (e.g., transactions)
• >99% of event pairs conflict, primarily through low-level memory dependencies
– Maintenance code
– Memory pool recycling
– …

Observation
• 98% of events “match” with 99% accuracy
– Control flow paths
– Addresses

Speculative pre-execution is a good match for this observation
How to customize a core for asynchronous programs?
Exploit ELP using speculative pre-execution
ESP Design: Expose event queue to hardware
[Figure: the software event queue is exposed through the ISA to a hardware (H/W) event queue]
ESP Design: Speculatively pre-execute future events on stalls
On a long stall (e.g., an LLC miss), the core jumps ahead to a future event: it isolates the speculative updates, memoizes the warm-up information it gathers, and later triggers that warm-up when the event actually executes, possibly millions of instructions later, for a speedup.
Realizing ESP design
Isolation Memoization Triggering
• Correctness: isolate speculative updates
• Performance: avoid destructive interference between execution contexts
Isolation of multiple execution contexts
• State to isolate: register state, memory state, branch predictor
• Register state: each ESP context gets its own PC and RRAT
• Memory state: small I- and D-Cachelets alongside the L1-I and L1-D caches isolate speculative updates
– Performance: avoid L1 pollution; cachelets capture 95% of reuse
• Branch predictor: the PIR tracks path history; isolating just the PIR (not the predictor tables) is adequate
Realizing ESP design
Isolation Memoization Triggering
Memoization of architectural bottlenecks
• Warm-up during speculative pre-execution alone is ineffective: future events might execute millions of instructions later
• I-List and D-List: record instruction and data addresses, along with the instruction count
• B-List: record branch outcomes (branch address, directions and targets), along with the instruction count
Realizing ESP design
Isolation Memoization Triggering
Triggering timely prefetches using memoized information
• Use the memoized lists to launch timely prefetches, and warm up the branch predictor ahead of branches
• Start a prefetch once the current instruction count comes within ~100 instructions of the count recorded with the address (current instruction count + ~100 > recorded count)
Baseline Architecture
[Figure: core pipeline with fetch unit (PC), RRAT, branch predictor (predictor tables + PIR), L1-I and L1-D caches with next-line (NL-I) and next-line + stride (NL-D,S) prefetchers, backed by an L2 cache; the event queue lives in software]
ESP Architecture
[Figure: the baseline core extended incrementally with an ESP mode and a hardware event queue; per-context PC and PIR registers; I- and D-Cachelets beside the L1 caches; the I-, D-, and B-Lists feeding the prefetchers; and finally two lookahead contexts, ESP-1 and ESP-2]
Methodology
• Timing: trace-driven simulator (Sniper)
– Instrumented Chromium
– Collected and simulated traces of JavaScript events
• Energy: McPAT and CACTI
Architectural Model
Core: 4-wide issue, OoO, 1.66 GHz
L1-(I,D) Cache: 32 KB, 2-way
L2 Cache: 2 MB, 16-way
Energy Modeling: Vdd = 1.2 V, 32 nm
Limitations of Runahead [Dundas, et al. ’97; Mutlu, et al. ’03]
• Runahead speculatively pre-executes past a data cache miss within the current event
• Reduces data cache misses, which are not a significant problem in web applications
• Cannot mitigate I-cache misses
• Does not exploit ELP
– No notion of events
– Future events are a rich source of independent instructions
Events are short

Web App    Action           # Events   # Instructions   Event Size (instr)
amazon     Buy headphones   7,787      433 million      55k
bing                                                    53k
cnn                                                     91k
facebook                                                232k
gdocs                                                   372k
gmaps                                                   472k
pixlr                                                   56k
• Short events execute varied tasks
• Large instruction footprint destroys cache locality
• Little hot code causes poor branch prediction
ESP outperforms other designs
Performance improvement w.r.t. no prefetching (%); baseline: next-line (NL) + stride prefetching
• Baseline: 14.0
• Runahead: 12.5; Runahead + NL: 21.3
• ESP: 21.8; ESP + NL: 32.1
Largest performance improvement comes from improved I-cache performance

Performance improvement (%)
       I-Cache   Branch Predictor   D-Cache
Max      52            69              79
ESP      21            28              32
ESP consumes less static energy, but expends more dynamic energy
ESP executes 21% more instructions, but consumes only 8% more energy
[Chart: static and dynamic energy of NL and ESP, normalized to no prefetching]
Hardware area overhead
[Chart: area of the cachelets, lists, and registers for ESP-1 and ESP-2; labeled values 12.6 KB and 1.2 KB]
Summary
• Accelerators for asynchronous programs
• ESP exploits Event-Level Parallelism (ELP)
– Expose event queue to hardware
– Speculatively pre-execute future events
• Performance improvement: 16%
Jumping ahead two events is sufficient
[Chart: number of cache lines (log scale, 1 to 10,000) needed to capture Max / 95% / 85% of reuse, for Normal execution and lookahead depths ESP1 through ESP8]
Impact of JS execution on response time [Chow, et al. ’14]
[Figure: breakdown of client delay into JavaScript, DOM, CSS, network, and server components]