TRANSCRIPT
COMPUTER ARCHITECTURE GROUP
Exploring the Performance Implications of Memory Safety
Primitives in Many-core Processors Executing Multi-threaded Workloads
Masab Ahmad, Syed Kamran Haider, Farrukh Hijaz, Marten van Dijk, Omer Khan
University of Connecticut
Agenda
• Motivation
• Characterization Methodology
• Characterization Results
• Insights and Possible Improvements
Security Vulnerabilities in Processors
[Figure: the hardware/software stack (CPU, OS, DRAM, I/O, disk, other devices) annotated with threats: vulnerable applications, buffer overflow attacks, code injection, a compromised OS, privacy leakage, malicious I/O, DoS attacks, data tampering, and side channels. Defending against these threats trades performance against security.]
Key Questions
• What are the performance implications of these security schemes?
• How do they affect performance in the context of multicores?
• IEEE Computer Vision 2022 predicts multicores as a key enabling technology by 2022
• To get started, we look at buffer overflow protection schemes
• How do they affect tradeoffs in multicores?
A Primer on Buffer Overflow Attacks
• The return address on the stack is located after the program contents (the buffer and other variables)
• An attacker enters a crafted, oversized string that overwrites the return address (a code sketch follows the figure below)
• The return address now points to the malicious code
[Figure: stack layout at addresses a through a+n, holding the buffer, other variables, and the return address. The overflow attack writes past the buffer and overwrites the return address with the address of injected malicious code; the attacker takes control.]
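To make the mechanism concrete, here is a minimal C sketch of such a vulnerable function (illustrative code, not from the talk): strcpy performs no bounds check, so an oversized input overruns the stack buffer and can reach the saved return address.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative stack-smashing vulnerability. If `input` is longer
     * than 15 characters, strcpy() writes past the end of `buffer`,
     * clobbering adjacent stack contents, including the saved return
     * address. */
    void vulnerable(const char *input) {
        char buffer[16];
        strcpy(buffer, input);      /* no bounds check on the copy */
        printf("buffer: %s\n", buffer);
    }

    int main(int argc, char **argv) {
        if (argc > 1)
            vulnerable(argv[1]);    /* attacker-controlled input */
        return 0;
    }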
Security Solutions and Prior Works
• Dynamic Information Flow Tracking (DIFT):
  • Flags and tags all data arriving from I/O
  • Raises an exception if tagged data propagates into program control flow
  • High false-positive rate, zero false negatives (hence not used in applications)
• Context-based checking:
  • Creates a "context", an identifier for each variable/data structure in the program
  • Checks the context on each variable read/write
  • Requires a large data structure (one element per array element) and suffers from pointer-aliasing problems (hence also not used in applications)
• Bounds checking (see the sketch below):
  • Creates a base-and-bound metadata structure for the program
  • Reads and writes are checked against, and update, this metadata
  • Close to zero false positives and negatives (used widely, e.g. in safe languages)
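As a rough illustration of the bounds-checking idea (a minimal sketch with hypothetical names, not SoftBound's actual interface), base and bound metadata travel with a pointer, and every access is checked against them:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical bounds-checked pointer: metadata carried alongside
     * the raw pointer, in the spirit of SoftBound-style schemes. */
    typedef struct {
        int *ptr;
        int *base;   /* first valid address */
        int *bound;  /* one past the last valid address */
    } checked_ptr;

    checked_ptr checked_alloc(size_t n) {
        int *p = calloc(n, sizeof(int));
        return (checked_ptr){ p, p, p + n };
    }

    /* Every read is preceded by a base/bound check. */
    int checked_read(checked_ptr cp, size_t i) {
        if (cp.ptr + i < cp.base || cp.ptr + i >= cp.bound) {
            fprintf(stderr, "bounds violation at index %zu\n", i);
            abort();
        }
        return cp.ptr[i];
    }

    int main(void) {
        checked_ptr a = checked_alloc(4);
        (void)checked_read(a, 2);   /* in bounds: ok        */
        (void)checked_read(a, 4);   /* out of bounds: abort */
        return 0;
    }

The extra comparisons on every access, plus the loads of the metadata itself, are exactly the per-access costs whose multicore behavior this talk characterizes.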
Bound Checking Schemes
• Hardware-based schemes:
  • CHERI
  • HardBound
  • WatchdogLite
• Software-based schemes:
  • SoftBound
  • Cyclone
  • SafeCode
  • CCured
  • and many others
  • Safe languages such as Python and Java
[Figure: with base and bound metadata attached to the buffer, the overflow attack cannot write beyond the bound. Bounds checks sit in the compute pipeline and add extra memory accesses for bounds metadata kept in the caches and DRAM.]
Performance Implications in Multicores
[Figure: a multicore in which each core's compute pipeline performs bounds checks against bounds metadata. Consequences: compute stalls for bounds checks, extra memory accesses to DRAM for metadata, cache pollution and interference in the shared cache, increased coherence bottlenecks, synchronization bottlenecks as more time is spent in critical code sections, and reduced scalability.]
• Previous works have not analyzed memory safety schemes in the context of multicores
• All prior works use sequential benchmarks such as SPEC
Performance Implications in Multicores: Critical Code
• Critical code sections, such as atomically locked program functions, are affected the most.
• Due to bounds-checking stalls, a thread holds a locked section for a longer time, depriving other threads of the chance to exploit scalability.
Code Without SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    pthread_mutex_unlock(&lock);
    return 0;

Code With SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    get_security_metadata();  /* pseudocode: SoftBound loads base/bound metadata */
    softbound_checks();       /* pseudocode: checks stall inside the locked section */
    pthread_mutex_unlock(&lock);
    return 0;
Agenda (next: Characterization Methodology)
System Model
[Figure: executables (.EXE) compiled with SoftBound run on a tiled multicore processor modeled in the Graphite simulator; application data and SoftBound metadata reside in DRAM.]
System Parameters
• The Graphite simulator is used to analyze parallel benchmarks
  • 256 cores @ 1 GHz
  • Results for both in-order and out-of-order cores
  • L1-I cache: 32 KB per core
  • L1-D cache: 32 KB per core
  • L2 cache: 256 KB per core
  • OOO ROB size: 168 entries
• An 8-core Intel machine is used as well (trends similar to the Graphite results; results in the paper)
• The POSIX threading model is used
• Evaluation includes results for 1-256 threads
Evaluated Benchmarks
• Prefix Scan – 16M elements
  • Each thread gets a chunk to scan; the master thread then reduces the other threads' partial scans to get the final solution (see the sketch after this list)
  • Barriers are used for explicit synchronization
• Matrix Multiply – 1K x 1K
  • Each thread tiles a chunk of the matrix and multiplies it to produce its part of the final matrix
  • Barriers are used for explicit synchronization
• Breadth-First Search (BFS) – 1M vertices, 16M edges
  • Each thread gets a chunk of the graph to search
  • Atomic locks are used on all graph vertices so that no race conditions occur on shared vertices
• Dijkstra – 16K vertices, 134M edges
  • Checking neighboring nodes for shortest-path distances is parallelized among threads
  • Explicit barriers gate each node check
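For concreteness, a minimal pthreads sketch of the prefix-scan decomposition above (hypothetical code, not the benchmark's source; as a variant of the master-thread reduction described above, each thread applies its own chunk offset after the barrier):

    #include <pthread.h>
    #include <stdlib.h>

    #define N       (1 << 24)    /* 16M elements, as in the benchmark */
    #define THREADS 8

    static long *data;
    static long chunk_sum[THREADS];
    static pthread_barrier_t barrier;

    /* Phase 1: each thread scans its own chunk. Phase 2 (after the
     * barrier): each thread adds the sum of all preceding chunks to
     * its elements, completing the global prefix scan. */
    static void *scan_worker(void *arg) {
        long id = (long)arg;
        long lo = id * (N / THREADS), hi = lo + N / THREADS;

        for (long i = lo + 1; i < hi; i++)
            data[i] += data[i - 1];
        chunk_sum[id] = data[hi - 1];

        pthread_barrier_wait(&barrier);   /* explicit synchronization */

        long offset = 0;
        for (long j = 0; j < id; j++)
            offset += chunk_sum[j];
        for (long i = lo; i < hi; i++)
            data[i] += offset;
        return NULL;
    }

    int main(void) {
        pthread_t t[THREADS];
        data = malloc(N * sizeof(long));
        for (long i = 0; i < N; i++) data[i] = 1;
        pthread_barrier_init(&barrier, NULL, THREADS);
        for (long i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, scan_worker, (void *)i);
        for (long i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        free(data);                /* data[i] now equals i + 1 */
        return 0;
    }

Under SoftBound, every data[i] access in these loops would additionally load and check bounds metadata, which is where the characterized overheads come from.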
Agenda (next: Characterization Results)
Benchmark Characterization: Prefix Scan
[Plot: normalized completion time vs. thread count (1 to 256), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization components.]
• Shows weak scaling
Benchmark Characterization: Prefix Scan (baseline vs. SoftBound)
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound; 'S' marks results with SoftBound.]
• SoftBound has higher overheads due to extra compute and memory accesses
• Concurrency hides SoftBound's overhead at high thread counts
• More compute than communication in this workload
Benchmark Characterization: Matrix Multiply
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound, with an inset zooming in on 160-256 threads; 'S' marks results with SoftBound.]
• Significant overhead with SoftBound
• Memory-bound applications lead to more metadata and security checks
• Concurrency does reduce SoftBound's overhead at high thread counts, but the overhead is still quite significant
• SoftBound can match the efficiency of the baseline by using additional threads
Benchmark Characterization: Breadth-First Search (BFS)
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound, with an inset zooming in on 160-256 threads; 'S' marks results with SoftBound.]
• Another graph workload, but with higher scalability and locality
• Concurrency helps reduce SoftBound's overhead at high thread counts
• However, the overhead is still quite significant because of the fine-grained synchronization in this workload
Benchmark Characterization: Dijkstra
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound, with an inset zooming in on 16-96 threads; 'S' marks results with SoftBound.]
• SoftBound results in more on-chip and off-chip data accesses due to the large working set
• Scalability is limited by fine-grained communication
• SoftBound checks within critical sections hurt performance
Benchmark Characterization: Summary of Slowdowns
[Plot: SoftBound's normalized overhead (1x to 4x) vs. thread count (0 to 250) for PREFIX. The baseline scales faster initially; concurrency then improves SoftBound.]
• Concurrency helps hide SoftBound's overheads
Benchmark Characterization: Summary of Slowdowns
[Plot: slowdown due to SoftBound vs. thread count for MATMUL. SoftBound scales initially; the larger metadata size then inhibits scalability.]
• Concurrency helps hide SoftBound's overheads initially, at ~32-64 threads
• It then worsens, since the metadata cuts into the available local cache capacity
Benchmark Characterization: Summary of Slowdowns
[Plot: slowdown due to SoftBound vs. thread count for BFS. SoftBound scales initially; the larger metadata size then inhibits scalability.]
• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• It then worsens, since the metadata impacts critical sections
Benchmark Characterization: Summary of Slowdowns
[Plot: slowdown due to SoftBound vs. thread count for DIJK. SoftBound scales a bit; the larger metadata size then inhibits scalability, and cache effects appear at high thread counts.]
• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• It then worsens, since the metadata impacts critical sections
Benchmark Characterization: Slowdowns in Out-of-Order Cores
[Plot: slowdown due to SoftBound vs. thread count for DIJK, PREFIX, BFS, and MATMUL on out-of-order cores.]
• All prior results used in-order cores
• A reduction in performance degradation is observed
• OOO cores improve critical code sections
• Latency hiding helps SoftBound scale better
Benchmark Characterization: In-Order vs. Out-of-Order
[Plot: SoftBound overhead, broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization, for in-order (IN) vs. out-of-order (OOC) cores on DIJK, Prefix, BFS, and Matrix Multiply, each at the thread count showing the highest speedup. OOO improvements appear mainly in compute-bound workloads.]
• Parallel SoftBound results for both in-order and OOO core types
• OOO cores cannot improve parallel performance much; alternative architectural improvements are needed
Agenda (next: Insights and Possible Improvements)
Insights from Characterization
• Most bottlenecks in multi-threaded SoftBound workloads stem from memory accesses and synchronization
• Latency hiding with OoO cores does improve SoftBound slightly
• Increasing the thread count allows SoftBound to perform as well as the baseline does at a lower thread count
• However, additional software/architectural mechanisms are needed to further improve performance
Potential Future Improvements
• Improve SoftBound's metadata structures
  • Compression
  • Better layout (e.g., a better tree-type structure)
• Use parallelization strategies (e.g., blocking) that are aware of the security metadata's presence
• Devise a prefetcher for security metadata (see the sketch below)
  • Prefetch SoftBound metadata from DRAM
  • Hides SoftBound's latency
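The talk proposes an architectural prefetcher; as a software analogue only, here is a hedged sketch (hypothetical metadata layout, not SoftBound's real table) that issues non-binding prefetches with GCC's __builtin_prefetch for the metadata of pointers about to be checked:

    #include <stddef.h>

    /* Hypothetical per-pointer metadata record, indexed by a hash of
     * the pointer's address (SoftBound's actual metadata layout differs). */
    typedef struct { void *base, *bound; } bounds_meta;

    #define TABLE_BITS 20
    static bounds_meta meta_table[1u << TABLE_BITS];
    #define META_SLOT(p) (((size_t)(p) >> 3) & ((1u << TABLE_BITS) - 1))

    #define PREFETCH_DIST 8   /* iterations of lookahead */

    /* Check a batch of pointers, prefetching the metadata entry needed
     * PREFETCH_DIST iterations from now so the bounds check does not
     * stall on a DRAM access for the metadata itself. */
    void check_all(void **ptrs, size_t n,
                   void (*bounds_check)(void *, bounds_meta *)) {
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&meta_table[META_SLOT(ptrs[i + PREFETCH_DIST])],
                                   0 /* read */, 1 /* low temporal locality */);
            bounds_check(ptrs[i], &meta_table[META_SLOT(ptrs[i])]);
        }
    }

    /* Minimal demo: a no-op check over a small pointer batch. */
    static void noop_check(void *p, bounds_meta *m) { (void)p; (void)m; }

    int main(void) {
        void *batch[64];
        for (size_t i = 0; i < 64; i++) batch[i] = &batch[i];
        check_all(batch, 64, noop_check);
        return 0;
    }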