
COMPUTER ARCHITECTURE GROUP

Exploring the Performance Implications of Memory Safety

Primitives in Many-core Processors Executing Multi-threaded Workloads

Masab Ahmad, Syed Kamran Haider, Farrukh Hijaz, Marten van Dijk, Omer Khan

University of Connecticut

Agenda
• Motivation
• Characterization Methodology
• Characterization Results
• Insights and Possible Improvements

Agenda: Motivation


Security Vulnerabilities in Processors

[Diagram: a system of CPU, DRAM, I/O, disk, OS, and applications annotated with threats: buffer overflow attacks and code injection against vulnerable applications, a compromised OS and privacy leakage, malicious I/Os and DoS attacks, data tampering from other devices, and side channels. Defending against these threats trades performance for security.]

Key Questions
• What are the performance implications of these security schemes?
• How do they affect performance in the context of multicores?
• The IEEE Computer Society 2022 vision report predicts multicores as a key enabling technology by 2022
• To get started, we look at buffer overflow protection schemes
• How do they affect tradeoffs in multicores?

A Primer on Buffer Overflow Attacks
• The stack return address is located after the program contents
• An attacker enters a modified string that overwrites the return address
• The return address now points to the malicious code

[Diagram: stack layout with a buffer at addresses a through a+n, other variables, and the return address above them. The overflow attack writes past the buffer and overwrites the return address; when the attacker comes in, control transfers to the malicious code and the attacker takes control.]
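As a concrete illustration (a minimal sketch, not taken from the slides), the classic vulnerable pattern in C copies attacker-controlled input into a fixed-size stack buffer without a length check:

    #include <string.h>

    /* Hypothetical example: 'input' is attacker-controlled.
       strcpy() copies until the NUL terminator, so an input longer than
       16 bytes writes past 'buffer' and can overwrite the saved return
       address further up the stack frame. */
    void vulnerable(const char *input) {
        char buffer[16];
        strcpy(buffer, input);   /* no bounds check: buffer overflow */
    }

When the function returns, the corrupted return address transfers control to whatever address the attacker wrote, for example injected malicious code.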

Security Solutions and Prior Works
• Dynamic Information Flow Tracking (DIFT):
  • Flags and tags all data arriving from I/O
  • Raises exceptions if tagged data propagates into program control flow
  • High false-positive rates, zero false negatives (hence not used in applications)
• Context-based Checking:
  • Creates a "context", an identifier for each variable/data structure in the program
  • Checks on each variable read/write
  • Requires a large data structure, one element per array element, and suffers pointer-aliasing problems (hence also not used in applications)
• Bounds Checking:
  • Creates a base-and-bounds metadata structure for a program
  • Reads and writes are checked against and update this metadata
  • Close to zero false positives and negatives (used widely in safe languages, etc.)


Bound Checking Schemes
• Hardware based schemes:
  • CHERI
  • HardBound
  • WatchdogLite
• Software based schemes:
  • SoftBound
  • Cyclone
  • SafeCode
  • CCured
  • And many others
  • Safe languages such as Python and Java

[Diagram: with base and bound metadata attached to the buffer, an overflow attack cannot write beyond the bound. Bounds checks are performed alongside the compute pipeline and require extra memory accesses to bounds metadata held in the caches and in DRAM.]
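Conceptually, a bounds-checking scheme such as SoftBound associates a base and a bound with each pointer and checks every memory access against them. The following is a minimal sketch of that idea only; the helper names are hypothetical, and real SoftBound instrumentation is compiler-generated and keeps its metadata in a disjoint shadow space rather than passing it explicitly:

    #include <stdint.h>
    #include <stdlib.h>

    /* Per-pointer metadata: first valid byte and one past the last. */
    typedef struct {
        uintptr_t base;
        uintptr_t bound;
    } bounds_t;

    /* Check conceptually inserted before every load/store through a pointer. */
    static inline void bounds_check(const void *p, size_t size, bounds_t md) {
        uintptr_t addr = (uintptr_t)p;
        if (addr < md.base || addr + size > md.bound)
            abort();   /* out-of-bounds access detected */
    }

    int read_elem(const int *arr, bounds_t md, size_t i) {
        bounds_check(&arr[i], sizeof(int), md);  /* extra compute + metadata access */
        return arr[i];
    }

Every checked access costs extra instructions plus loads of the metadata, which is the source of the compute stalls and extra memory traffic characterized in the rest of the talk.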

Performance Implications in Multicores

[Diagram: two cores, each with a compute pipeline performing bounds checks against bounds metadata, share a cache and DRAM that also hold the bounds metadata. Consequences: compute stalls for bounds checks, extra memory accesses, cache pollution and interference, increased coherence bottlenecks, synchronization bottlenecks as more time is spent in critical code sections, and reduced scalability.]

• Previous works have not analyzed memory safety schemes in the context of multicores
• All prior works use sequential benchmarks such as SPEC

Performance Implications in Multicores: Critical Code
• Critical code sections, such as atomically locked program functions, are most affected.
• Due to bounds-checking stalls, a thread holds a locked section for a longer time, depriving other threads of the chance to exploit scalability.

Code without SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    pthread_mutex_unlock(&lock);
    return 0;

Code with SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    get_security_metadata();   /* inserted by SoftBound instrumentation */
    softbound_checks();        /* executes while the lock is still held */
    pthread_mutex_unlock(&lock);
    return 0;
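A self-contained sketch of the same effect is shown below. It is illustrative only: softbound_metadata_lookup and softbound_check are stand-ins for the compiler-inserted SoftBound instrumentation, not a real SoftBound API.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_buf[64];

    /* Stand-ins for SoftBound's metadata lookup and bounds check. */
    static void softbound_metadata_lookup(void *p) { (void)p; }
    static void softbound_check(void *p)           { (void)p; }

    static void *worker(void *arg) {
        int idx = *(int *)arg;
        pthread_mutex_lock(&lock);
        /* The instrumentation executes inside the critical section, so the
           lock is held longer and the other threads wait longer. */
        softbound_metadata_lookup(&shared_buf[idx]);
        softbound_check(&shared_buf[idx]);
        shared_buf[idx] += 1;              /* the protected update itself */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        int ids[4] = {0, 1, 2, 3};
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("%d\n", shared_buf[0]);
        return 0;
    }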

 

Agenda: Characterization Methodology

System Model

[Diagram: executables (.EXE) compiled with SoftBound run on a tiled multicore processor modeled in the Graphite simulator; DRAM holds both the application data and the SoftBound metadata.]

System Parameters
• The Graphite simulator is used to analyze parallel benchmarks
  • 256 cores @ 1 GHz
  • Results for both in-order and out-of-order cores
  • L1-I cache: 32 KB per core
  • L1-D cache: 32 KB per core
  • L2 cache: 256 KB per core
  • OOO ROB size: 168 entries
• An 8-core Intel machine is used as well (trends similar to the Graphite results; results in the paper)
• The POSIX threading model is used
• Evaluation includes results for 1-256 threads

Evaluated Benchmarks
• Prefix Scan: 16M elements
  • Each thread gets a chunk to scan, then the master thread reduces the scans of the other threads to get the final solution (a sketch of this chunked scan follows this list)
  • Barriers are used for explicit synchronization
• Matrix Multiply: 1K x 1K
  • Each thread tiles a chunk of the matrix and multiplies it to get the final matrix
  • Barriers are used for explicit synchronization
• Breadth First Search (BFS): 1M vertices, 16M edges
  • Each thread gets a chunk of the graph to search on
  • Atomic locks are used on all graph vertices to ensure that no race conditions occur on shared vertices
• Dijkstra: 16K vertices, 134M edges
  • Checking neighboring nodes for shortest-path distances is parallelized among threads
  • Explicit barriers are used to progress each node check
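A minimal sketch of the chunked prefix-scan pattern described above, using the POSIX threading model from the evaluation. It is a simplified, hypothetical version (fixed thread count, power-of-two sizes); the benchmark used in the study may differ in detail:

    #include <pthread.h>

    #define N (1 << 24)   /* 16M elements */
    #define T 8           /* thread count (the study sweeps 1-256) */

    static int data[N];
    static int chunk_sum[T];
    static pthread_barrier_t bar;

    /* Each thread scans its own chunk, then the master thread (t == 0)
       reduces the per-chunk sums into offsets for the other threads. */
    static void *scan_chunk(void *arg) {
        long t = (long)arg;
        long lo = t * (N / T), hi = lo + (N / T);

        for (long i = lo + 1; i < hi; i++)
            data[i] += data[i - 1];            /* local inclusive scan */
        chunk_sum[t] = data[hi - 1];
        pthread_barrier_wait(&bar);            /* explicit synchronization */

        if (t == 0)
            for (long c = 1; c < T; c++)
                chunk_sum[c] += chunk_sum[c - 1];
        pthread_barrier_wait(&bar);

        if (t > 0)                             /* add offset of earlier chunks */
            for (long i = lo; i < hi; i++)
                data[i] += chunk_sum[t - 1];
        return NULL;
    }

    int main(void) {
        pthread_t th[T];
        pthread_barrier_init(&bar, NULL, T);
        for (long t = 0; t < T; t++)
            pthread_create(&th[t], NULL, scan_chunk, (void *)t);
        for (long t = 0; t < T; t++)
            pthread_join(th[t], NULL);
        pthread_barrier_destroy(&bar);
        return 0;
    }

With SoftBound, every data[] and chunk_sum[] access in these loops would additionally consult bounds metadata, which is where the overheads characterized next come from.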

Agenda: Characterization Results

Benchmark Characterization: Prefix Scan

[Chart: normalized completion time vs. thread count (1, 2, 4, 8, 16, 32, 64, 96, 128, 160, 192, 224, 256), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time.]

• Shows weak scaling


Benchmark Characterization: Prefix Scan

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time.]

• SoftBound has higher overheads due to extra compute and memory accesses
• Concurrency hides SoftBound's overhead at high thread counts; there is more compute than communication in this workload
• 'S' shows results with SoftBound

Benchmark Characterization: Matrix Multiply

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time, with a zoomed inset for 160-256 threads.]

• Significant overhead with SoftBound: memory-bound applications lead to more metadata and security checks
• Concurrency does reduce SoftBound's overhead at high thread counts; however, the overhead is still quite significant
• One can reach the efficiency of the baseline by using additional threads for SoftBound (the chart marks where SoftBound equals the baseline)
• 'S' shows results with SoftBound

Benchmark Characterization: Breadth First Search (BFS)

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time, with a zoomed inset for 160-256 threads.]

• Another graph workload, but with higher scalability and locality
• Concurrency helps reduce SoftBound's overhead at high thread counts
• However, the overhead is still quite significant because of the fine-grained synchronization in this workload
• 'S' shows results with SoftBound

Benchmark Characterization: Dijkstra

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time, with a zoomed inset for 16-96 threads.]

• SoftBound results in higher on-chip and off-chip data accesses due to the large working set
• Scalability is limited due to fine-grained communication
• SoftBound checks within critical sections hurt performance
• 'S' shows results with SoftBound

Benchmark Characterization: Summary of Slowdowns

[Chart: SoftBound's normalized overhead (1x-4x) vs. thread count (0-250) for PREFIX. The baseline scales faster initially; the overhead then shrinks as concurrency improves.]

• Concurrency helps hide SoftBound's overheads

Benchmark Characterization: Summary of Slowdowns

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for MATMUL. SoftBound scales initially; the larger metadata size then inhibits scalability.]

• Concurrency helps hide SoftBound's overheads initially, at ~32-64 threads
• The slowdown then worsens, since metadata impacts the available local cache size

Benchmark Characterization: Summary of Slowdowns

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for BFS. SoftBound scales initially; the larger metadata size then inhibits scalability.]

• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• The slowdown then worsens, since metadata impacts critical sections

Benchmark Characterization: Summary of Slowdowns

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for DIJK. SoftBound scales a bit; the larger metadata size then inhibits scalability, and cache effects appear.]

• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• The slowdown then worsens, since metadata impacts critical sections

Benchmark Characterization: Slowdowns in Out-of-Order Cores

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for DIJK, PREFIX, BFS, and MATMUL on out-of-order cores.]

• All prior results had in-order cores
• A reduction in performance degradation is observed
• OOO cores improve critical code sections
• Latency hiding helps SoftBound scale better

Benchmark Characterization: In-Order vs. Out-of-Order

[Chart: SoftBound overhead for in-order (IN) and out-of-order (OOO) cores for DIJK, Prefix, BFS, and Matrix Multiply, broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time. OOO improvements appear mainly in compute-bound workloads.]

• Parallel SoftBound results for both in-order and OOO core types, at the thread counts showing the highest speedups
• OOO cores cannot improve parallel performance much
• Alternative architectural improvements are needed

Agenda: Insights and Possible Improvements

Insights from Characterization
• Most bottlenecks in multi-threaded SoftBound workloads stem from memory accesses and synchronization
• Latency hiding with OoO cores does improve SoftBound slightly
• Increasing the thread count allows SoftBound to perform as well as the baseline does at a lower thread count
• However, additional software/architectural mechanisms are needed to further improve performance

Potential Future Improvements
• Improve SoftBound's metadata structures
  • Compression
  • Better layout (e.g. a better tree type of structure)
• Use parallelization strategies (e.g. blocking) that are aware of the security metadata's presence
• Devise a prefetcher for security metadata (a software sketch follows this list)
  • Prefetch SoftBound metadata from DRAM
  • Hides SoftBound's latency
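As one low-cost illustration of the prefetching idea, the sketch below issues software prefetches for the metadata of upcoming accesses so the bounds information is already cached when the check executes. It is a hypothetical software-only approximation of the proposed hardware prefetcher, and metadata_table with its one-entry-per-element layout is an assumption rather than SoftBound's actual structure:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uintptr_t base, bound; } bounds_t;

    #define N 1024
    static int data[N];
    /* Hypothetical shadow table: one bounds entry per element. */
    static bounds_t metadata_table[N];

    static long sum_checked(size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Prefetch metadata a few iterations ahead so the bounds check
               does not stall on a cache miss (GCC/Clang builtin). */
            if (i + 8 < n)
                __builtin_prefetch(&metadata_table[i + 8], 0, 1);

            bounds_t md = metadata_table[i];                 /* metadata load */
            uintptr_t p = (uintptr_t)&data[i];
            if (p < md.base || p + sizeof(int) > md.bound)   /* bounds check  */
                return -1;
            sum += data[i];
        }
        return sum;
    }

    int main(void) {
        for (size_t i = 0; i < N; i++) {
            data[i] = 1;
            metadata_table[i].base  = (uintptr_t)data;
            metadata_table[i].bound = (uintptr_t)(data + N);
        }
        printf("%ld\n", sum_checked(N));   /* prints 1024 */
        return 0;
    }

A hardware metadata prefetcher would provide the same latency hiding without the extra instructions and without relying on a predictable access pattern.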
