TRANSCRIPT
COMPUTER ARCHITECTURE GROUP
Exploring the Performance Implications of Memory Safety
Primitives in Many-core Processors Executing Multi-threaded Workloads
Masab Ahmad, Syed Kamran Haider, Farrukh Hijaz, Marten van Dijk, Omer Khan
University of Connecticut
Agenda
• Motivation
• Characterization Methodology
• Characterization Results
• Insights and Possible Improvements
Security Vulnerabilities in Processors
[Figure: the hardware/software stack (CPU, OS, DRAM, I/O, disk, other devices) annotated with threats: vulnerable applications, buffer overflow attacks, code injection, a compromised OS, privacy leakage, malicious I/O, DoS attacks, data tampering, and side channels. Defending against these threats trades performance against security.]
Key Questions
• What are the performance implications of these security schemes?
• How do they affect performance in the context of multicores?
• IEEE Computer Vision 2022 predicts multicores as a key enabling technology by 2022
• To get started, we look at buffer overflow protection schemes
• How do they affect tradeoffs in multicores?
A Primer on Buffer Overflow Attacks
• The return address on the stack is located after the program contents (the buffer and other variables)
• An attacker enters a crafted, oversized string that overwrites the return address (a code sketch follows the figure below)
• The return address now points to the malicious code
[Figure: stack layout at addresses a through a+n, holding the buffer, other variables, and the return address. The overflow attack writes past the buffer and overwrites the return address with the address of injected malicious code; the attacker takes control.]
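To make the mechanism concrete, here is a minimal C sketch of such a vulnerable function (illustrative code, not from the talk): strcpy performs no bounds check, so an oversized input overruns the stack buffer and can reach the saved return address.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative stack-smashing vulnerability. If `input` is longer
     * than 15 characters, strcpy() writes past the end of `buffer`,
     * clobbering adjacent stack contents, including the saved return
     * address. */
    void vulnerable(const char *input) {
        char buffer[16];
        strcpy(buffer, input);      /* no bounds check on the copy */
        printf("buffer: %s\n", buffer);
    }

    int main(int argc, char **argv) {
        if (argc > 1)
            vulnerable(argv[1]);    /* attacker-controlled input */
        return 0;
    }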
Security Solutions and Prior Works
• Dynamic Information Flow Tracking (DIFT):
  • Flags and tags all data arriving from I/O
  • Raises an exception if tagged data propagates into program control flow
  • High false-positive rate, zero false negatives (hence not used in applications)
• Context-based checking:
  • Creates a "context", an identifier for each variable/data structure in the program
  • Checks the context on each variable read/write
  • Requires a large data structure (one element per array element) and suffers from pointer-aliasing problems (hence also not used in applications)
• Bounds checking (see the sketch below):
  • Creates a base-and-bound metadata structure for the program
  • Reads and writes are checked against, and update, this metadata
  • Close to zero false positives and negatives (used widely, e.g. in safe languages)
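As a rough illustration of the bounds-checking idea (a minimal sketch with hypothetical names, not SoftBound's actual interface), base and bound metadata travel with a pointer, and every access is checked against them:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical bounds-checked pointer: metadata carried alongside
     * the raw pointer, in the spirit of SoftBound-style schemes. */
    typedef struct {
        int *ptr;
        int *base;   /* first valid address */
        int *bound;  /* one past the last valid address */
    } checked_ptr;

    checked_ptr checked_alloc(size_t n) {
        int *p = calloc(n, sizeof(int));
        return (checked_ptr){ p, p, p + n };
    }

    /* Every read is preceded by a base/bound check. */
    int checked_read(checked_ptr cp, size_t i) {
        if (cp.ptr + i < cp.base || cp.ptr + i >= cp.bound) {
            fprintf(stderr, "bounds violation at index %zu\n", i);
            abort();
        }
        return cp.ptr[i];
    }

    int main(void) {
        checked_ptr a = checked_alloc(4);
        (void)checked_read(a, 2);   /* in bounds: ok        */
        (void)checked_read(a, 4);   /* out of bounds: abort */
        return 0;
    }

The extra comparisons on every access, plus the loads of the metadata itself, are exactly the per-access costs whose multicore behavior this talk characterizes.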
Bound Checking Schemes
• Hardware-based schemes:
  • CHERI
  • HardBound
  • WatchdogLite
• Software-based schemes:
  • SoftBound
  • Cyclone
  • SafeCode
  • CCured
  • and many others
  • Safe languages such as Python and Java
[Figure: with base and bound metadata attached to the buffer, the overflow attack cannot write beyond the bound. Bounds checks sit in the compute pipeline and add extra memory accesses for bounds metadata kept in the caches and DRAM.]
Performance Implications in Multicores
[Figure: a multicore in which each core's compute pipeline performs bounds checks against bounds metadata. Consequences: compute stalls for bounds checks, extra memory accesses to DRAM for metadata, cache pollution and interference in the shared cache, increased coherence bottlenecks, synchronization bottlenecks as more time is spent in critical code sections, and reduced scalability.]
• Previous works have not analyzed memory safety schemes in the context of multicores
• All prior works use sequential benchmarks such as SPEC
Performance Implications in Multicores: Critical Code
• Critical code sections, such as atomically locked program functions, are affected the most.
• Due to bounds-checking stalls, a thread holds a locked section for a longer time, depriving other threads of the chance to exploit scalability.
Code Without SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    pthread_mutex_unlock(&lock);
    return 0;

Code With SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    get_security_metadata();  /* pseudocode: SoftBound loads base/bound metadata */
    softbound_checks();       /* pseudocode: checks stall inside the locked section */
    pthread_mutex_unlock(&lock);
    return 0;
Agenda (next: Characterization Methodology)
System Model
[Figure: executables (.EXE) compiled with SoftBound run on a tiled multicore processor modeled in the Graphite simulator; application data and SoftBound metadata reside in DRAM.]
System Parameters
• The Graphite simulator is used to analyze parallel benchmarks
  • 256 cores @ 1 GHz
  • Results for both in-order and out-of-order cores
  • L1-I cache: 32 KB per core
  • L1-D cache: 32 KB per core
  • L2 cache: 256 KB per core
  • OOO ROB size: 168 entries
• An 8-core Intel machine is used as well (trends similar to the Graphite results; results in the paper)
• The POSIX threading model is used
• Evaluation includes results for 1-256 threads
Evaluated Benchmarks
• Prefix Scan – 16M elements
  • Each thread gets a chunk to scan; the master thread then reduces the other threads' partial scans to get the final solution (see the sketch after this list)
  • Barriers are used for explicit synchronization
• Matrix Multiply – 1K x 1K
  • Each thread tiles a chunk of the matrix and multiplies it to produce its part of the final matrix
  • Barriers are used for explicit synchronization
• Breadth-First Search (BFS) – 1M vertices, 16M edges
  • Each thread gets a chunk of the graph to search
  • Atomic locks are used on all graph vertices so that no race conditions occur on shared vertices
• Dijkstra – 16K vertices, 134M edges
  • Checking neighboring nodes for shortest-path distances is parallelized among threads
  • Explicit barriers gate each node check
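For concreteness, a minimal pthreads sketch of the prefix-scan decomposition above (hypothetical code, not the benchmark's source; as a variant of the master-thread reduction described above, each thread applies its own chunk offset after the barrier):

    #include <pthread.h>
    #include <stdlib.h>

    #define N       (1 << 24)    /* 16M elements, as in the benchmark */
    #define THREADS 8

    static long *data;
    static long chunk_sum[THREADS];
    static pthread_barrier_t barrier;

    /* Phase 1: each thread scans its own chunk. Phase 2 (after the
     * barrier): each thread adds the sum of all preceding chunks to
     * its elements, completing the global prefix scan. */
    static void *scan_worker(void *arg) {
        long id = (long)arg;
        long lo = id * (N / THREADS), hi = lo + N / THREADS;

        for (long i = lo + 1; i < hi; i++)
            data[i] += data[i - 1];
        chunk_sum[id] = data[hi - 1];

        pthread_barrier_wait(&barrier);   /* explicit synchronization */

        long offset = 0;
        for (long j = 0; j < id; j++)
            offset += chunk_sum[j];
        for (long i = lo; i < hi; i++)
            data[i] += offset;
        return NULL;
    }

    int main(void) {
        pthread_t t[THREADS];
        data = malloc(N * sizeof(long));
        for (long i = 0; i < N; i++) data[i] = 1;
        pthread_barrier_init(&barrier, NULL, THREADS);
        for (long i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, scan_worker, (void *)i);
        for (long i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        free(data);                /* data[i] now equals i + 1 */
        return 0;
    }

Under SoftBound, every data[i] access in these loops would additionally load and check bounds metadata, which is where the characterized overheads come from.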
Agenda (next: Characterization Results)
Benchmark Characterization: Prefix Scan
[Plot: normalized completion time vs. thread count (1 to 256), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization components.]
• Shows weak scaling
Benchmark Characterization: Prefix Scan (baseline vs. SoftBound)
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound; 'S' marks results with SoftBound.]
• SoftBound has higher overheads due to extra compute and memory accesses
• Concurrency hides SoftBound's overhead at high thread counts
• More compute than communication in this workload
Benchmark Characterization: Matrix Multiply
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound, with an inset zooming in on 160-256 threads; 'S' marks results with SoftBound.]
• Significant overhead with SoftBound
• Memory-bound applications lead to more metadata and security checks
• Concurrency does reduce SoftBound's overhead at high thread counts, but the overhead is still quite significant
• SoftBound can match the efficiency of the baseline by using additional threads
Benchmark Characterization: Breadth-First Search (BFS)
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound, with an inset zooming in on 160-256 threads; 'S' marks results with SoftBound.]
• Another graph workload, but with higher scalability and locality
• Concurrency helps reduce SoftBound's overhead at high thread counts
• However, the overhead is still quite significant because of the fine-grained synchronization in this workload
Benchmark Characterization: Dijkstra
[Plot: normalized completion time for 1 to 256 threads, baseline vs. SoftBound, with an inset zooming in on 16-96 threads; 'S' marks results with SoftBound.]
• SoftBound results in more on-chip and off-chip data accesses due to the large working set
• Scalability is limited by fine-grained communication
• SoftBound checks within critical sections hurt performance
Benchmark Characterization: Summary of Slowdowns
[Plot: SoftBound's normalized overhead (1x to 4x) vs. thread count (0 to 250) for PREFIX. The baseline scales faster initially; concurrency then improves SoftBound.]
• Concurrency helps hide SoftBound's overheads
Benchmark Characterization: Summary of Slowdowns
[Plot: slowdown due to SoftBound vs. thread count for MATMUL. SoftBound scales initially; the larger metadata size then inhibits scalability.]
• Concurrency helps hide SoftBound's overheads initially, at ~32-64 threads
• It then worsens, since the metadata cuts into the available local cache capacity
Benchmark Characterization: Summary of Slowdowns
[Plot: slowdown due to SoftBound vs. thread count for BFS. SoftBound scales initially; the larger metadata size then inhibits scalability.]
• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• It then worsens, since the metadata impacts critical sections
Benchmark Characterization: Summary of Slowdowns
[Plot: slowdown due to SoftBound vs. thread count for DIJK. SoftBound scales a bit; the larger metadata size then inhibits scalability, and cache effects appear at high thread counts.]
• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• It then worsens, since the metadata impacts critical sections
Benchmark Characterization: Slowdowns in Out-of-Order Cores
[Plot: slowdown due to SoftBound vs. thread count for DIJK, PREFIX, BFS, and MATMUL on out-of-order cores.]
• All prior results used in-order cores
• A reduction in performance degradation is observed
• OOO cores improve critical code sections
• Latency hiding helps SoftBound scale better
Benchmark Characterization: In-Order vs. Out-of-Order
[Plot: SoftBound overhead, broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization, for in-order (IN) vs. out-of-order (OOC) cores on DIJK, Prefix, BFS, and Matrix Multiply, each at the thread count showing the highest speedup. OOO improvements appear mainly in compute-bound workloads.]
• Parallel SoftBound results for both in-order and OOO core types
• OOO cores cannot improve parallel performance much; alternative architectural improvements are needed
Agenda (next: Insights and Possible Improvements)
Insights from Characterization
• Most bottlenecks in multi-threaded SoftBound workloads stem from memory accesses and synchronization
• Latency hiding with OoO cores does improve SoftBound slightly
• Increasing the thread count allows SoftBound to perform as well as the baseline does at a lower thread count
• However, additional software/architectural mechanisms are needed to further improve performance
Potential Future Improvements
• Improve SoftBound's metadata structures
  • Compression
  • Better layout (e.g., a better tree-type structure)
• Use parallelization strategies (e.g., blocking) that are aware of the security metadata's presence
• Devise a prefetcher for security metadata (see the sketch below)
  • Prefetch SoftBound metadata from DRAM
  • Hides SoftBound's latency
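The talk proposes an architectural prefetcher; as a software analogue only, here is a hedged sketch (hypothetical metadata layout, not SoftBound's real table) that issues non-binding prefetches with GCC's __builtin_prefetch for the metadata of pointers about to be checked:

    #include <stddef.h>

    /* Hypothetical per-pointer metadata record, indexed by a hash of
     * the pointer's address (SoftBound's actual metadata layout differs). */
    typedef struct { void *base, *bound; } bounds_meta;

    #define TABLE_BITS 20
    static bounds_meta meta_table[1u << TABLE_BITS];
    #define META_SLOT(p) (((size_t)(p) >> 3) & ((1u << TABLE_BITS) - 1))

    #define PREFETCH_DIST 8   /* iterations of lookahead */

    /* Check a batch of pointers, prefetching the metadata entry needed
     * PREFETCH_DIST iterations from now so the bounds check does not
     * stall on a DRAM access for the metadata itself. */
    void check_all(void **ptrs, size_t n,
                   void (*bounds_check)(void *, bounds_meta *)) {
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&meta_table[META_SLOT(ptrs[i + PREFETCH_DIST])],
                                   0 /* read */, 1 /* low temporal locality */);
            bounds_check(ptrs[i], &meta_table[META_SLOT(ptrs[i])]);
        }
    }

    /* Minimal demo: a no-op check over a small pointer batch. */
    static void noop_check(void *p, bounds_meta *m) { (void)p; (void)m; }

    int main(void) {
        void *batch[64];
        for (size_t i = 0; i < 64; i++) batch[i] = &batch[i];
        check_all(batch, 64, noop_check);
        return 0;
    }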