
COMPUTER ARCHITECTURE GROUP

Exploring the Performance Implications of Memory Safety

Primitives in Many-core Processors Executing Multi-threaded Workloads

Masab Ahmad, Syed Kamran Haider, Farrukh Hijaz, Marten van Dijk, Omer Khan

University of Connecticut

Agenda
• Motivation
• Characterization Methodology
• Characterization Results
• Insights and Possible Improvements

Agenda: Motivation


Security Vulnerabilities in Processors

[Diagram: a system of CPU, DRAM, I/O, disk, OS, and applications annotated with threats: buffer overflow attacks and code injection against vulnerable applications, a compromised OS and privacy leakage, malicious I/Os and DoS attacks, data tampering from other devices, and side channels. Defending against these threats trades performance for security.]

Key Questions
• What are the performance implications of these security schemes?
• How do they affect performance in the context of multicores?
• The IEEE Computer Society 2022 vision report predicts multicores as a key enabling technology by 2022
• To get started, we look at buffer overflow protection schemes
• How do they affect tradeoffs in multicores?

A Primer on Buffer Overflow Attacks
• The stack return address is located after the program contents
• An attacker enters a modified string that overwrites the return address
• The return address now points to the malicious code

[Diagram: stack layout with a buffer at addresses a through a+n, other variables, and the return address above them. The overflow attack writes past the buffer and overwrites the return address; when the attacker comes in, control transfers to the malicious code and the attacker takes control.]
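As a concrete illustration (a minimal sketch, not taken from the slides), the classic vulnerable pattern in C copies attacker-controlled input into a fixed-size stack buffer without a length check:

    #include <string.h>

    /* Hypothetical example: 'input' is attacker-controlled.
       strcpy() copies until the NUL terminator, so an input longer than
       16 bytes writes past 'buffer' and can overwrite the saved return
       address further up the stack frame. */
    void vulnerable(const char *input) {
        char buffer[16];
        strcpy(buffer, input);   /* no bounds check: buffer overflow */
    }

When the function returns, the corrupted return address transfers control to whatever address the attacker wrote, for example injected malicious code.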

Security Solutions and Prior Works
• Dynamic Information Flow Tracking (DIFT):
  • Flags and tags all data arriving from I/O
  • Raises exceptions if tagged data propagates into program control flow
  • High false-positive rates, zero false negatives (hence not used in applications)
• Context-based Checking:
  • Creates a "context", an identifier for each variable/data structure in the program
  • Checks on each variable read/write
  • Requires a large data structure, one element per array element, and suffers pointer-aliasing problems (hence also not used in applications)
• Bounds Checking:
  • Creates a base-and-bounds metadata structure for a program
  • Reads and writes are checked against and update this metadata
  • Close to zero false positives and negatives (used widely in safe languages, etc.)


Bound Checking Schemes
• Hardware based schemes:
  • CHERI
  • HardBound
  • WatchdogLite
• Software based schemes:
  • SoftBound
  • Cyclone
  • SafeCode
  • CCured
  • And many others
  • Safe languages such as Python and Java

[Diagram: with base and bound metadata attached to the buffer, an overflow attack cannot write beyond the bound. Bounds checks are performed alongside the compute pipeline and require extra memory accesses to bounds metadata held in the caches and in DRAM.]
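Conceptually, a bounds-checking scheme such as SoftBound associates a base and a bound with each pointer and checks every memory access against them. The following is a minimal sketch of that idea only; the helper names are hypothetical, and real SoftBound instrumentation is compiler-generated and keeps its metadata in a disjoint shadow space rather than passing it explicitly:

    #include <stdint.h>
    #include <stdlib.h>

    /* Per-pointer metadata: first valid byte and one past the last. */
    typedef struct {
        uintptr_t base;
        uintptr_t bound;
    } bounds_t;

    /* Check conceptually inserted before every load/store through a pointer. */
    static inline void bounds_check(const void *p, size_t size, bounds_t md) {
        uintptr_t addr = (uintptr_t)p;
        if (addr < md.base || addr + size > md.bound)
            abort();   /* out-of-bounds access detected */
    }

    int read_elem(const int *arr, bounds_t md, size_t i) {
        bounds_check(&arr[i], sizeof(int), md);  /* extra compute + metadata access */
        return arr[i];
    }

Every checked access costs extra instructions plus loads of the metadata, which is the source of the compute stalls and extra memory traffic characterized in the rest of the talk.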

Performance Implications in Multicores

[Diagram: two cores, each with a compute pipeline performing bounds checks against bounds metadata, share a cache and DRAM that also hold the bounds metadata. Consequences: compute stalls for bounds checks, extra memory accesses, cache pollution and interference, increased coherence bottlenecks, synchronization bottlenecks as more time is spent in critical code sections, and reduced scalability.]

• Previous works have not analyzed memory safety schemes in the context of multicores
• All prior works use sequential benchmarks such as SPEC

Performance Implications in Multicores: Critical Code
• Critical code sections, such as atomically locked program functions, are most affected.
• Due to bounds-checking stalls, a thread holds a locked section for a longer time, depriving other threads of the chance to exploit scalability.

Code without SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    pthread_mutex_unlock(&lock);
    return 0;

Code with SoftBound:

    int a = 1;
    pthread_mutex_lock(&lock);
    do_parallel_work();
    get_security_metadata();   /* inserted by SoftBound instrumentation */
    softbound_checks();        /* executes while the lock is still held */
    pthread_mutex_unlock(&lock);
    return 0;
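A self-contained sketch of the same effect is shown below. It is illustrative only: softbound_metadata_lookup and softbound_check are stand-ins for the compiler-inserted SoftBound instrumentation, not a real SoftBound API.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_buf[64];

    /* Stand-ins for SoftBound's metadata lookup and bounds check. */
    static void softbound_metadata_lookup(void *p) { (void)p; }
    static void softbound_check(void *p)           { (void)p; }

    static void *worker(void *arg) {
        int idx = *(int *)arg;
        pthread_mutex_lock(&lock);
        /* The instrumentation executes inside the critical section, so the
           lock is held longer and the other threads wait longer. */
        softbound_metadata_lookup(&shared_buf[idx]);
        softbound_check(&shared_buf[idx]);
        shared_buf[idx] += 1;              /* the protected update itself */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        int ids[4] = {0, 1, 2, 3};
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("%d\n", shared_buf[0]);
        return 0;
    }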

 

Agenda: Characterization Methodology

System Model

[Diagram: executables (.EXE) compiled with SoftBound run on a tiled multicore processor modeled in the Graphite simulator; DRAM holds both the application data and the SoftBound metadata.]

System Parameters
• The Graphite simulator is used to analyze parallel benchmarks
  • 256 cores @ 1 GHz
  • Results for both in-order and out-of-order cores
  • L1-I cache: 32 KB per core
  • L1-D cache: 32 KB per core
  • L2 cache: 256 KB per core
  • OOO ROB size: 168 entries
• An 8-core Intel machine is used as well (trends similar to the Graphite results; results in the paper)
• The POSIX threading model is used
• Evaluation includes results for 1-256 threads

Evaluated Benchmarks
• Prefix Scan: 16M elements
  • Each thread gets a chunk to scan, then the master thread reduces the scans of the other threads to get the final solution (a sketch of this chunked scan follows this list)
  • Barriers are used for explicit synchronization
• Matrix Multiply: 1K x 1K
  • Each thread tiles a chunk of the matrix and multiplies it to get the final matrix
  • Barriers are used for explicit synchronization
• Breadth First Search (BFS): 1M vertices, 16M edges
  • Each thread gets a chunk of the graph to search on
  • Atomic locks are used on all graph vertices to ensure that no race conditions occur on shared vertices
• Dijkstra: 16K vertices, 134M edges
  • Checking neighboring nodes for shortest-path distances is parallelized among threads
  • Explicit barriers are used to progress each node check
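A minimal sketch of the chunked prefix-scan pattern described above, using the POSIX threading model from the evaluation. It is a simplified, hypothetical version (fixed thread count, power-of-two sizes); the benchmark used in the study may differ in detail:

    #include <pthread.h>

    #define N (1 << 24)   /* 16M elements */
    #define T 8           /* thread count (the study sweeps 1-256) */

    static int data[N];
    static int chunk_sum[T];
    static pthread_barrier_t bar;

    /* Each thread scans its own chunk, then the master thread (t == 0)
       reduces the per-chunk sums into offsets for the other threads. */
    static void *scan_chunk(void *arg) {
        long t = (long)arg;
        long lo = t * (N / T), hi = lo + (N / T);

        for (long i = lo + 1; i < hi; i++)
            data[i] += data[i - 1];            /* local inclusive scan */
        chunk_sum[t] = data[hi - 1];
        pthread_barrier_wait(&bar);            /* explicit synchronization */

        if (t == 0)
            for (long c = 1; c < T; c++)
                chunk_sum[c] += chunk_sum[c - 1];
        pthread_barrier_wait(&bar);

        if (t > 0)                             /* add offset of earlier chunks */
            for (long i = lo; i < hi; i++)
                data[i] += chunk_sum[t - 1];
        return NULL;
    }

    int main(void) {
        pthread_t th[T];
        pthread_barrier_init(&bar, NULL, T);
        for (long t = 0; t < T; t++)
            pthread_create(&th[t], NULL, scan_chunk, (void *)t);
        for (long t = 0; t < T; t++)
            pthread_join(th[t], NULL);
        pthread_barrier_destroy(&bar);
        return 0;
    }

With SoftBound, every data[] and chunk_sum[] access in these loops would additionally consult bounds metadata, which is where the overheads characterized next come from.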

Agenda: Characterization Results

Benchmark Characterization: Prefix Scan

[Chart: normalized completion time vs. thread count (1, 2, 4, 8, 16, 32, 64, 96, 128, 160, 192, 224, 256), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time.]

• Shows weak scaling


Benchmark Characterization: Prefix Scan

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time.]

• SoftBound has higher overheads due to extra compute and memory accesses
• Concurrency hides SoftBound's overhead at high thread counts; there is more compute than communication in this workload
• 'S' shows results with SoftBound

Benchmark Characterization: Matrix Multiply

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time, with a zoomed inset for 160-256 threads.]

• Significant overhead with SoftBound: memory-bound applications lead to more metadata and security checks
• Concurrency does reduce SoftBound's overhead at high thread counts; however, the overhead is still quite significant
• One can reach the efficiency of the baseline by using additional threads for SoftBound (the chart marks where SoftBound equals the baseline)
• 'S' shows results with SoftBound

Benchmark Characterization: Breadth First Search (BFS)

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time, with a zoomed inset for 160-256 threads.]

• Another graph workload, but with higher scalability and locality
• Concurrency helps reduce SoftBound's overhead at high thread counts
• However, the overhead is still quite significant because of the fine-grained synchronization in this workload
• 'S' shows results with SoftBound

Benchmark Characterization: Dijkstra

[Chart: normalized completion time vs. thread count (1-256), baseline vs. SoftBound ('S' columns), broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time, with a zoomed inset for 16-96 threads.]

• SoftBound results in higher on-chip and off-chip data accesses due to the large working set
• Scalability is limited due to fine-grained communication
• SoftBound checks within critical sections hurt performance
• 'S' shows results with SoftBound

Benchmark Characterization: Summary of Slowdowns

[Chart: SoftBound's normalized overhead (1x-4x) vs. thread count (0-250) for PREFIX. The baseline scales faster initially; the overhead then shrinks as concurrency improves.]

• Concurrency helps hide SoftBound's overheads

Benchmark Characterization: Summary of Slowdowns

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for MATMUL. SoftBound scales initially; the larger metadata size then inhibits scalability.]

• Concurrency helps hide SoftBound's overheads initially, at ~32-64 threads
• The slowdown then worsens, since metadata impacts the available local cache size

Benchmark Characterization: Summary of Slowdowns

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for BFS. SoftBound scales initially; the larger metadata size then inhibits scalability.]

• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• The slowdown then worsens, since metadata impacts critical sections

Benchmark Characterization: Summary of Slowdowns

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for DIJK. SoftBound scales a bit; the larger metadata size then inhibits scalability, and cache effects appear.]

• Concurrency helps hide SoftBound's overheads initially, at ~2-8 threads
• The slowdown then worsens, since metadata impacts critical sections

Benchmark Characterization: Slowdowns in Out-of-Order Cores

[Chart: slowdown due to SoftBound (1x-4x) vs. thread count (0-250) for DIJK, PREFIX, BFS, and MATMUL on out-of-order cores.]

• All prior results had in-order cores
• A reduction in performance degradation is observed
• OOO cores improve critical code sections
• Latency hiding helps SoftBound scale better

Benchmark Characterization: In-Order vs. Out-of-Order

[Chart: SoftBound overhead for in-order (IN) and out-of-order (OOO) cores for DIJK, Prefix, BFS, and Matrix Multiply, broken down into Compute, L1Cache-L2Home, L2Home-OffChip, and Synchronization time. OOO improvements appear mainly in compute-bound workloads.]

• Parallel SoftBound results for both in-order and OOO core types, at the thread counts showing the highest speedups
• OOO cores cannot improve parallel performance much
• Alternative architectural improvements are needed

Agenda: Insights and Possible Improvements

Insights from Characterization
• Most bottlenecks in multi-threaded SoftBound workloads stem from memory accesses and synchronization
• Latency hiding with OoO cores does improve SoftBound slightly
• Increasing the thread count allows SoftBound to perform as well as the baseline does at a lower thread count
• However, additional software/architectural mechanisms are needed to further improve performance

Potential Future Improvements
• Improve SoftBound's metadata structures
  • Compression
  • Better layout (e.g. a better tree type of structure)
• Use parallelization strategies (e.g. blocking) that are aware of the security metadata's presence
• Devise a prefetcher for security metadata (a software sketch follows this list)
  • Prefetch SoftBound metadata from DRAM
  • Hides SoftBound's latency
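As one low-cost illustration of the prefetching idea, the sketch below issues software prefetches for the metadata of upcoming accesses so the bounds information is already cached when the check executes. It is a hypothetical software-only approximation of the proposed hardware prefetcher, and metadata_table with its one-entry-per-element layout is an assumption rather than SoftBound's actual structure:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uintptr_t base, bound; } bounds_t;

    #define N 1024
    static int data[N];
    /* Hypothetical shadow table: one bounds entry per element. */
    static bounds_t metadata_table[N];

    static long sum_checked(size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Prefetch metadata a few iterations ahead so the bounds check
               does not stall on a cache miss (GCC/Clang builtin). */
            if (i + 8 < n)
                __builtin_prefetch(&metadata_table[i + 8], 0, 1);

            bounds_t md = metadata_table[i];                 /* metadata load */
            uintptr_t p = (uintptr_t)&data[i];
            if (p < md.base || p + sizeof(int) > md.bound)   /* bounds check  */
                return -1;
            sum += data[i];
        }
        return sum;
    }

    int main(void) {
        for (size_t i = 0; i < N; i++) {
            data[i] = 1;
            metadata_table[i].base  = (uintptr_t)data;
            metadata_table[i].bound = (uintptr_t)(data + N);
        }
        printf("%ld\n", sum_checked(N));   /* prints 1024 */
        return 0;
    }

A hardware metadata prefetcher would provide the same latency hiding without the extra instructions and without relying on a predictable access pattern.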
