TRANSCRIPT
Hardware Mechanisms for Distributed Dynamic
Software Analysis
Joseph L. Greathouse
Advisor: Prof. Todd Austin
May 10, 2012
Software Errors Abound
NIST: Software errors cost the U.S. ~$60 billion/year.
FBI: Security issues cost the U.S. $67 billion/year; more than ⅓ of that is from viruses, network intrusion, etc.
Example of a Modern Bug
Nov. 2010 OpenSSL security flaw. Two threads (Thread 1: mylen=small, Thread 2: mylen=large) race on a lazily-initialized shared pointer, ptr, which starts out NULL:

    if (ptr == NULL) {
        len = thread_local->mylen;
        ptr = malloc(len);
        memcpy(ptr, data, len);
    }
Example of a Modern Bug (cont.)
One racy interleaving (Thread 1: mylen=small, Thread 2: mylen=large):

    Thread 1: if (ptr == NULL)           // sees NULL
    Thread 2: if (ptr == NULL)           // also sees NULL
    Thread 1: len1 = thread_local->mylen;
    Thread 1: ptr = malloc(len1);
    Thread 2: len2 = thread_local->mylen;
    Thread 2: ptr = malloc(len2);        // Thread 1's allocation is LEAKED
    Thread 1: memcpy(ptr, data1, len1);
    Thread 2: memcpy(ptr, data2, len2);

Both threads pass the NULL check, so one allocation leaks and both copies target the same buffer, whose size may not match the length being copied.
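A minimal, self-contained C sketch of this bug pattern (hypothetical names and sizes, not the actual OpenSSL code; GCC's __thread stands in for the slide's thread_local structure). Both threads can see ptr == NULL, so one allocation is leaked and the surviving buffer's size may not match the length used by the other thread's memcpy:

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    static char *ptr = NULL;            /* shared, lazily initialized        */
    static __thread size_t mylen;       /* per-thread length                 */
    static const char data[4096] = "payload";

    static void *worker(void *arg)
    {
        mylen = (size_t)arg;            /* Thread 1: small, Thread 2: large  */

        if (ptr == NULL) {              /* RACE: both threads may see NULL   */
            size_t len = mylen;
            ptr = malloc(len);          /* the second malloc leaks the first */
            memcpy(ptr, data, len);     /* len may not match ptr's real size */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)16);   /* mylen = small */
        pthread_create(&t2, NULL, worker, (void *)4096); /* mylen = large */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        free(ptr);
        return 0;
    }

Built with -pthread, this runs without visible damage most of the time; only an unlucky interleaving leaks memory or overflows the smaller buffer, which is exactly why dynamic analysis is needed to catch it.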
Hardware Plays a Role in this Problem
In spite of proposed hardware solutions:
- Hardware data race recording
- Deterministic execution/replay
- Atomicity violation detectors
- Bug-free memory models
- Bulk memory commits
- Transactional memory
Dynamic Software Analyses
Analyze the program as it runs:
+ Find errors on any executed path
– LARGE overheads, and only one path is tested at a time
Typical slowdowns:
- Taint analysis: 2-200x
- Dynamic bounds checking: 2-80x
- Data race detection (e.g. Inspector XE): 2-300x
- Memory checking (e.g. MemCheck): 5-50x
Goals of this Thesis
- Allow high-quality dynamic software analyses: find difficult bugs that weaker analyses miss
- Distribute the tests to large populations: must be low overhead or users will get angry
- Sampling + hardware to accomplish this: each user tests only a small part of the program, and each test should be helped by hardware
Meeting These Goals - Thesis Overview
Allow high-quality dynamic software analyses along two axes: the analysis (dataflow analysis, data race detection) and the support it gets (software support, hardware support).
Meeting These Goals - Thesis Overview
9
Dataflow Analysis Sampling (CGO’11)
Dataflow Analysis Sampling (CGO’11)
Dataflow Analysis Sampling
(MICRO’08)
Dataflow Analysis Sampling
(MICRO’08)Dat
aflo
wA
naly
sis
Dat
a R
ace
Det
ectio
n
Software Support Hardware Support
Unlimited Watchpoint
System (ASPLOS’12)
Unlimited Watchpoint
System (ASPLOS’12)
Hardware-Assisted Demand-DrivenRace Detection (ISCA’11)
Hardware-Assisted Demand-DrivenRace Detection (ISCA’11)
Distribute the tests + Sample the analyses
Distribute the tests + Sample the analyses
Tests must below overhead
Tests must below overhead
Outline
Problem Statement
Distributed Dynamic Dataflow Analysis
Demand-Driven Data Race Detection
Unlimited Watchpoints
10
Outline
Problem Statement
Distributed Dynamic Dataflow Analysis
Demand-Driven Data Race Detection
Unlimited Watchpoints
11
Distributed Dynamic Dataflow Analysis
- Split the analysis across large populations
- Observe more runtime states
- Report problems the developer never thought to test
(Each user runs an instrumented program and reports potential problems back to the developer.)
The Problem: OVERHEADS
Analyze the program as it runs:
+ Full system state; find errors on any executed path
– LARGE runtime overheads; only one path tested at a time
- Taint analysis (e.g. TaintCheck): 2-200x
- Dynamic bounds checking: 2-80x
- Data race detection (e.g. Thread Analyzer): 2-300x
- Memory checking (e.g. MemCheck): 5-50x
Current Options Limited
Today the choice is binary: run the complete analysis (high overhead) or no analysis at all.
Solution: Sampling
- Lower overheads by skipping some of the analyses, anywhere between complete analysis and no analysis
- Lower overheads mean more users: few users can tolerate high overhead, many users can tolerate low overhead
Sampling Allows Distribution
From complete analysis toward no analysis: developers, beta testers, end users.
Many users testing at little overhead see more errors than one user at high overhead.
Example Dynamic Dataflow Analysis
- Associate: x = read_input()  (input data gets meta-data)
- Clear: validate(x)  (meta-data removed once the value is checked)
- Propagate: y = x * 1024;  w = x + 42;  a += y;  z = y * 75
- Check: Check w, Check a, Check z
Data and its meta-data (shadow state) flow through the program together.
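A minimal C sketch of the associate/propagate/clear/check steps as explicit shadow-memory operations (the shadow map, helper names, and the ordering are illustrative assumptions, not the actual tool):

    #include <stdint.h>
    #include <stdio.h>

    #define SHADOW_SIZE 65536
    static uint8_t shadow[SHADOW_SIZE];        /* one taint bit per address  */

    static uint8_t *sh(const void *p) { return &shadow[(uintptr_t)p % SHADOW_SIZE]; }

    static int  read_input(void)  { return 1234; }   /* e.g. from the network   */
    static void validate(int *x)  { *sh(x) = 0; }    /* Clear: x is now trusted */
    static void check(const char *what, const void *p)
    {
        if (*sh(p)) printf("WARNING: %s uses untrusted data\n", what);
    }

    int main(void)
    {
        int x, y, w, a = 0, z;

        x = read_input();  *sh(&x) = 1;         /* Associate meta-data        */
        y = x * 1024;      *sh(&y) = *sh(&x);   /* Propagate                  */
        w = x + 42;        *sh(&w) = *sh(&x);   /* Propagate                  */
        a += y;            *sh(&a) |= *sh(&y);  /* Propagate (merge)          */
        z = y * 75;        *sh(&z) = *sh(&y);   /* Propagate                  */

        check("Check w", &w);                   /* Check: fires               */
        validate(&x);                           /* Clear x's meta-data        */
        check("Check a", &a);   /* still fires: a was derived before validate */
        check("Check z", &z);
        return 0;
    }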
Sampling Dataflows
- Sampling must be aware of meta-data
- Remove meta-data from skipped dataflows
Dataflow Sampling
The application runs under a sampling tool that wraps the analysis tool:
- Meta-data detection decides when the analysis tool must run
- An overhead (OH) threshold decides when to clear meta-data and stop following a dataflow
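A sketch of the sampling tool's overhead-threshold decision (the budget accounting and the helper are assumptions for illustration):

    #include <stdbool.h>

    static const double kOverheadBudget = 0.10;  /* analysis may use 10% of time    */
    static double time_in_analysis, time_total;  /* maintained by the sampling tool */

    static void clear_metadata_for_current_dataflow(void) { /* tool-specific */ }

    /* Called when meta-data detection sees the program touch shadowed data:
     * run the analysis tool only while under the overhead budget; otherwise
     * drop (clear) this dataflow so execution stays near native speed.       */
    static bool should_analyze(void)
    {
        if (time_total > 0.0 && time_in_analysis / time_total > kOverheadBudget) {
            clear_metadata_for_current_dataflow();
            return false;
        }
        return true;
    }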
Finding Meta-Data
- Goal: no additional overhead when no meta-data is present (needs hardware support)
- Take a fault when touching shadowed data
- Solution: virtual memory watchpoints - the V→P mapping for pages holding shadowed data is set to fault on access
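One user-space way to get a fault only when shadowed data is touched is to revoke permissions on the pages that hold it. A minimal Linux sketch with mprotect and a SIGSEGV handler (a real system would re-protect the page after the access, e.g. by single-stepping):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char  *page;              /* page holding shadowed data */
    static size_t page_size;

    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        char *addr = (char *)si->si_addr;
        if (addr >= page && addr < page + page_size) {
            /* Shadowed data touched: this is where the analysis tool runs. */
            mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* retry access */
        } else {
            _exit(1);                /* a real crash, not a watchpoint */
        }
    }

    int main(void)
    {
        page_size = (size_t)sysconf(_SC_PAGESIZE);
        page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page[0] = 42;                               /* shadowed value       */
        mprotect(page, page_size, PROT_NONE);       /* arm the watchpoint   */

        page[0]++;        /* faults; the handler unprotects; access retries */
        printf("value = %d\n", page[0]);
        return 0;
    }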
Prototype Setup
- Xen + QEMU taint-analysis sampling system; network packets are untrusted
- Performance tests: network throughput (example: ssh_receive)
- Sampling accuracy tests: real-world security exploits
[Figure: Xen hypervisor hosting Linux and its applications with a shadow page table, plus an Admin VM running the taint analysis in QEMU; the OHM sits in the network stack.]
Performance of Dataflow Sampling
[Chart: ssh_receive throughput under dataflow sampling, compared with throughput with no analysis.]
Accuracy with Background Tasks
[Chart: exploit-detection accuracy while ssh_receive runs in the background.]
Outline
Problem Statement
Distributed Dynamic Dataflow Analysis
Demand-Driven Data Race Detection
Unlimited Watchpoints
24
Dynamic Data Race Detection
Add checks around every memory access
Find inter-thread sharing
Synchronization between write-shared accesses? No? Data race.
25
SW Race Detection is Slow
[Chart: software race-detection slowdowns on the Phoenix and PARSEC benchmark suites.]
Inter-thread Sharing is What's Important
Both threads execute the same code (the NULL check, len = thread_local->mylen, malloc, memcpy), but:
- Accesses to thread-local data: NO SHARING
- Accesses to shared data that no other thread touches concurrently: NO INTER-THREAD SHARING EVENTS
Only the few accesses involved in inter-thread sharing matter to the race detector.
Very Little Dynamic Sharing
[Chart: fraction of dynamic memory accesses that participate in inter-thread sharing, Phoenix and PARSEC.]
Run the Analysis On Demand
- Local accesses stay in the multi-threaded application at native speed
- An inter-thread sharing monitor watches for sharing
- Only accesses involved in inter-thread sharing are passed to the software race detector
Finding Inter-thread Sharing
Virtual memory watchpoints?
– ~100% of accesses to shared pages cause page faults
– Granularity gap: page protections are per-process, not per-thread
Hardware Sharing Detector
- HITM: a cache-coherence event in which a load hits a line that is Modified in another core's cache, i.e. W→R data sharing
- Hardware performance counters can count HITM events and raise an interrupt on them
[Figure: Core 1 writes Y=5 and its line goes to Modified; Core 2 then reads Y, the request hits the modified line (HITM), and the performance counter triggers a fault.]
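A hedged sketch of arming such a counter with Linux's perf_event_open so that a HITM event delivers a signal that can switch the race detector on. The raw HITM event code is processor-specific and is left as a placeholder assumption; signal-based overflow delivery is one of several ways to get the notification:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/perf_event.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define HITM_RAW_EVENT 0x0   /* ASSUMPTION: CPU-specific HITM event code goes here */

    static volatile sig_atomic_t analysis_enabled;

    static void on_hitm(int sig) { (void)sig; analysis_enabled = 1; }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type           = PERF_TYPE_RAW;
        attr.size           = sizeof(attr);
        attr.config         = HITM_RAW_EVENT;
        attr.sample_period  = 1;        /* interrupt on every HITM event */
        attr.wakeup_events  = 1;
        attr.exclude_kernel = 1;
        attr.disabled       = 1;

        int fd = syscall(__NR_perf_event_open, &attr, 0 /* this thread */,
                         -1 /* any cpu */, -1, 0);
        if (fd < 0) return 1;

        signal(SIGIO, on_hitm);                     /* overflow -> SIGIO -> handler */
        fcntl(fd, F_SETOWN, getpid());
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the threads; while analysis_enabled is set, route memory
           accesses through the software race detector ... */
        return 0;
    }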
Potential Accuracy & Perf. Problems
- Limitations of performance counters: Intel's HITM event only finds W→R data sharing
- Limitations of cache events: SMT sharing can't be counted; cache evictions cause missed events
- Events are delivered through the kernel
On-Demand Analysis on Real HW
For each executed instruction:
- If analysis is disabled (>97% of the time): keep executing natively unless a HITM interrupt arrives, in which case enable the analysis
- If analysis is enabled (<3% of the time): run the software race detector; if no sharing has been seen recently, disable the analysis again
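The same flow in C-like form, as an instrumentation tool might implement it (helper names are hypothetical):

    extern int  hitm_interrupt_pending(void);         /* set by the perf handler */
    extern void software_race_check(void *addr, int is_write);
    extern int  sharing_seen_recently(void);

    static int analysis_enabled;

    /* Per-access dispatch for demand-driven race detection. */
    static void on_memory_access(void *addr, int is_write)
    {
        if (!analysis_enabled) {
            if (hitm_interrupt_pending())        /* sharing just happened     */
                analysis_enabled = 1;            /* turn the detector on      */
            else
                return;                          /* >97% of the time: native  */
        }

        software_race_check(addr, is_write);     /* full SW race detection    */

        if (!sharing_seen_recently())            /* sharing has died down     */
            analysis_enabled = 0;                /* back to native speed      */
    }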
Performance Increases
[Chart: demand-driven detection vs. continuous software race detection on Phoenix and PARSEC; speedups up to 51x.]
Accuracy vs. continuous analysis: 97%
Outline
Problem Statement
Distributed Dynamic Dataflow Analysis
Demand-Driven Data Race Detection
Unlimited Watchpoints
35
Watchpoints Work for Many Analyses
- Bounds checking
- Taint analysis
- Data race detection
- Deterministic execution
- Transactional memory
- Speculative parallelization
Desired Watchpoint Capabilities
- Large number: store in memory, cache on chip
- Fine-grained: watch the full virtual address space
- Per-thread: cached per hardware thread
- Ranges: a range cache
Coarse-grained schemes raise false faults on unwatched data that happens to sit next to watched data.
Range Cache
Each entry holds a start address, an end address, the watchpoint state, and a valid bit. Initially one valid entry covers the whole space:
  0x0 - 0xffff_ffff   Not Watched
After "set addresses 0x5 - 0x2000 R-Watched" the cache holds three entries:
  0x0    - 0x4            Not Watched
  0x5    - 0x2000         R-Watched
  0x2001 - 0xffff_ffff    Not Watched
A load to address 0x400 is compared against every entry in parallel (start ≤ 0x400 and 0x400 ≤ end), hits the R-Watched range, and raises a watchpoint interrupt.
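A software model of that lookup (in hardware every entry is compared in parallel; the entry count and field names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define RC_ENTRIES 128

    struct rc_entry {
        uint64_t start, end;            /* inclusive virtual-address range */
        bool     r_watch, w_watch;
        bool     valid;
    };

    /* State after "set addresses 0x5 - 0x2000 R-Watched" on a 32-bit space. */
    static struct rc_entry rc[RC_ENTRIES] = {
        { 0x0,    0x4,         false, false, true },
        { 0x5,    0x2000,      true,  false, true },
        { 0x2001, 0xffffffffu, false, false, true },
    };

    /* Returns true if the access must raise a watchpoint interrupt.  A miss
     * (no covering valid entry) would trap to the software handler instead. */
    static bool rc_check(uint64_t addr, bool is_write)
    {
        for (int i = 0; i < RC_ENTRIES; i++)             /* parallel in HW   */
            if (rc[i].valid && rc[i].start <= addr && addr <= rc[i].end)
                return is_write ? rc[i].w_watch : rc[i].r_watch;
        return false;                                    /* range-cache miss */
    }

Here rc_check(0x400, false) hits the middle entry and reports a read watchpoint, matching the load in the slide.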
Watchpoint System Design
- Store ranges in main memory: per-thread ranges, with a per-core range cache
- Software handler on a range-cache miss or overflow
- A write-back range cache works as a write filter for watchpoint changes
- Precise, user-level watchpoint faults
[Figure: T1 and T2 watchpoint ranges in memory; Core 1 and Core 2 each hold a range cache; watchpoint changes are written back to memory.]
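A sketch of the software side: one thread's watchpoints live in ordinary memory as disjoint, sorted ranges, and the handler run on a range-cache miss refills the cache from that table (structure and names are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    struct wp_range { uint64_t start, end; bool r_watch, w_watch; };
    struct wp_table { struct wp_range *ranges; int count; };   /* sorted, disjoint */

    extern void rc_install(const struct wp_range *r);   /* fill one RC entry */

    /* Range-cache miss: binary-search the thread's table for the range that
     * covers addr and install it.  On eviction, dirty entries are written back
     * to this table, so the range cache also acts as a write filter.          */
    static void rc_miss_handler(const struct wp_table *t, uint64_t addr)
    {
        int lo = 0, hi = t->count - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if      (addr < t->ranges[mid].start) hi = mid - 1;
            else if (addr > t->ranges[mid].end)   lo = mid + 1;
            else { rc_install(&t->ranges[mid]); return; }
        }
        /* Address not covered: install a "not watched" range for the gap. */
    }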
Experimental Evaluation Setup
- Trace-based timing simulator built on Pin
- Taint analysis on SPEC INT2000; race detection on Phoenix and PARSEC
- Comparing only the shadow-value checks
Watchpoint-Based Taint Analysis
[Chart: slowdowns of the compared schemes (10x, 19x, 23x, 28x, 30x, 206x, 423x, 1429x); with a 128-entry range cache, or a 64-entry range cache plus a 2 KB bitmap (RC+Bitmap), the watchpoint-based analysis runs at roughly a 20% slowdown.]
Watchpoint-Based Data Race Detection
[Chart: overhead of demand-driven race detection with the RC+Bitmap watchpoint system; labeled points at +10% and +20%.]
Future Directions
- Dataflow tests find bugs only in executed code; what about code that is never executed?
- Sampling + demand-driven race detection: good synergy between the two, as with taint analysis
- Further watchpoint hardware studies: a clearer microarchitectural analysis; more software systems and different algorithms
Conclusions
- Sampling allows distributed dataflow analysis (software and hardware dataflow-analysis sampling)
- Existing hardware can speed up race detection (hardware-assisted demand-driven data race detection)
- Watchpoint hardware is useful everywhere (the unlimited watchpoint system)
Together, these enable distributed dynamic software analysis.
45
Thank You
BACKUP SLIDES
46
Finding Errors
- Brute force: code review, fuzz testing, whitehat/grayhat hackers
  – Time-consuming, difficult
- Static analysis: automatically analyze the source with formal reasoning and compiler checks
  – Intractable in general, requires expert input, has no runtime system state
47
Dynamic Dataflow Analysis
Associate meta-data with program values
Propagate/Clear meta-data while executing
Check meta-data for safety & correctness
Forms dataflows of meta/shadow information
48
Demand-Driven Dataflow Analysis
Only analyze shadowed data: the native application runs on non-shadowed data, and meta-data detection switches execution into the instrumented application only when shadowed data is touched.
Results by Ho et al. (lmbench best case; results when everything is tainted):
  System                       Slowdown
  Taint analysis               101.7x
  On-demand taint analysis     1.98x
Sampling Allows Distribution (backup)
From complete analysis toward no analysis: developer, beta testers, end users.
Many users testing at little overhead see more errors than one user at high overhead.
Cannot Naïvely Sample Code
Skipping individual instructions breaks the dataflow:
- Skipping validate(x) after x = read_input() leaves the meta-data in place, so Check w raises a false positive
- Skipping a propagation instruction (e.g. y = x * 1024) drops the meta-data, so Check a and Check z become false negatives
Dataflow Sampling Example
Skip whole dataflows instead of individual instructions: when the dataflow starting at x = read_input() is skipped, its meta-data is removed, so later checks can at worst miss that flow (a false negative) but never report a false positive.
Benchmarks
- Performance: network throughput (example: ssh_receive)
- Accuracy of the sampling analysis: real-world security exploits
Name Error Description
Apache Stack overflow in Apache Tomcat JK Connector
Eggdrop Stack overflow in Eggdrop IRC bot
Lynx Stack overflow in Lynx web browser
ProFTPD Heap smashing attack on ProFTPD Server
Squid Heap smashing attack on Squid proxy server
Performance of Dataflow Sampling (2)
[Chart: netcat_receive throughput under dataflow sampling vs. throughput with no analysis.]
Performance of Dataflow Sampling (3)
[Chart: ssh_transmit throughput under dataflow sampling vs. throughput with no analysis.]
Accuracy at Very Low Overhead
- Max time in analysis: 1% every 10 seconds; analysis always stops after the threshold
- Lowest probability of detecting the exploits:
Name Chance of Detecting Exploit
Apache 100%
Eggdrop 100%
Lynx 100%
ProFTPD 100%
Squid 100%
Accuracy with Background Tasks (2)
[Chart: exploit-detection accuracy with netcat_receive running alongside the benchmark.]
Outline (from the proposal)
- Problem Statement
- Proposed Solutions: Distributed Dynamic Dataflow Analysis; Testudo: Hardware-Based Dataflow Sampling; Demand-Driven Data Race Detection
- Future Work
- Timeline
Virtual Memory Not Ideal
[Chart: page-fault overheads when using virtual-memory watchpoints, netcat_receive.]
Word-Accurate Meta-Data
- What happens when the on-chip meta-data cache overflows? Increase the size of main memory? Store into virtual memory?
- Instead: use sampling to throw away meta-data
[Figure: the pipeline accesses a data cache and, alongside it, a word-accurate meta-data (sample) cache.]
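A software model of the "throw it away" idea: a small fixed-size shadow cache in which a conflicting insert simply drops the old entry, so each run (or each user) ends up checking a different sample of the dataflows. Sizes, hashing, and the drop-on-conflict policy are illustrative; the proposed hardware uses its own replacement policy:

    #include <stdint.h>

    #define SAMPLE_CACHE_ENTRIES 1024

    struct shadow_entry { uintptr_t addr; uint8_t taint; uint8_t valid; };
    static struct shadow_entry cache[SAMPLE_CACHE_ENTRIES];

    static struct shadow_entry *slot(uintptr_t addr)
    {
        return &cache[(addr >> 2) % SAMPLE_CACHE_ENTRIES];    /* direct-mapped */
    }

    /* On a conflict the old entry is silently dropped: that dataflow simply
     * stops being tracked, which is the sampling.                            */
    static void shadow_set(uintptr_t addr, uint8_t taint)
    {
        struct shadow_entry *e = slot(addr);
        e->addr = addr; e->taint = taint; e->valid = 1;
    }

    static uint8_t shadow_get(uintptr_t addr)
    {
        struct shadow_entry *e = slot(addr);
        return (e->valid && e->addr == addr) ? e->taint : 0;
    }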
On-Chip Sampling Mechanism
[Chart: average number of executions needed to cover the dataflows with 512-entry and 1024-entry sample caches; values up to 17,601.]
Useful for Scaling to Complex Analyses
If each shadow operation uses 1000 instructions:
[Chart: average % overhead with a 1024-entry sample cache on a telnet server benchmark; 17.3% overhead at roughly 3,500 executions, dropping to 0.3% at roughly 169,000 executions.]
Example of Data Race Detection
Thread 1 (mylen=small) and Thread 2 (mylen=large) interleave the same sequence as before: the NULL check, len = thread_local->mylen, malloc, and memcpy.
The detector asks two questions: is ptr write-shared? Is synchronization interleaved between the sharing accesses?
Demand-Driven Analysis Algorithm
65
Demand-Driven Analysis on Real HW
66
Performance Difference
[Chart: performance comparison on Phoenix and PARSEC.]
Demand-Driven Analysis Accuracy
[Chart: races found per benchmark (found/known): 1/1, 2/4, 3/3, 4/4, 3/3, 4/4, 4/4, 2/4.]
Accuracy vs. continuous analysis: 97%
Accuracy on Real Hardware
Races detected / races known, by sharing type:

        kmeans      facesim     ferret      freqmine    vips        x264        streamcluster
  W→W   1/1 (100%)  0/1 (0%)    -           -           1/1 (100%)  -           1/1 (100%)
  R→W   -           0/1 (0%)    2/2 (100%)  2/2 (100%)  1/1 (100%)  3/3 (100%)  1/1 (100%)
  W→R   -           2/2 (100%)  1/1 (100%)  2/2 (100%)  1/1 (100%)  3/3 (100%)  1/1 (100%)

        SpiderMonkey-0  SpiderMonkey-1  SpiderMonkey-2  NSPR-1      Memcached-1  Apache-1
  W→W   9/9 (100%)      1/1 (100%)      1/1 (100%)      3/3 (100%)  -            1/1 (100%)
  R→W   3/3 (100%)      -               1/1 (100%)      1/1 (100%)  1/1 (100%)   7/7 (100%)
  W→R   8/8 (100%)      1/1 (100%)      2/2 (100%)      4/4 (100%)  -            2/2 (100%)
Hardware-Assisted Watchpoints
- HW interrupt when touching watched data
- SW knows it's touching important data, at no overhead for unwatched accesses
- Normally used for debugging
[Figure: bytes 0-7 holding values A-H; an R-watch on bytes 2-4 makes a load of byte 2 fault, and a W-watch on bytes 6-7 makes a write to byte 7 fault.]
Existing Watchpoint Solutions
- Watchpoint registers: limited number (4-16), small reach (4-8 bytes)
- Virtual memory: coarse-grained, per-process, only aligned ranges
- ECC mangling: per physical address, affects all cores, no ranges
Meeting These Requirements
- Unlimited number of watchpoints: store in memory, cache on chip
- Fine-grained: watch full virtual addresses
- Per-thread: watchpoints cached per core/thread, selected by TID registers
- Ranges: a range cache
The Need for Many Small Ranges
- Some watchpoints suit ranges well: with 32-bit addresses, 2 ranges at 64 bits (start + end) each take only 16 bytes
- Others need a large number of small watchpoints: 51 single-byte watchpoints stored as ranges take 51 x 64 bits = 408 bytes, but only 51 bits as a bitmap
- Taint analysis has good ranges; byte-accurate race detection does not
Watchpoint System Design II
Make some range-cache entries point to bitmaps:
- An entry still carries a start address, end address, R/W watch bits, and a valid bit, plus a 'B' bit marking it as a pointer to a per-byte watchpoint bitmap
- Ranges and bitmaps both live in memory; the core holds a range cache and a bitmap cache that are accessed in parallel
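A sketch of an entry that either describes a whole range or points to a per-byte bitmap, as in the slide (the field layout and the 2-bits-per-byte encoding are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    struct rc_entry_v2 {
        uint64_t start, end;          /* inclusive virtual-address range      */
        bool     valid;
        bool     is_bitmap;           /* the 'B' bit: payload is a bitmap ptr */
        union {
            struct { bool r_watch, w_watch; } range;   /* whole range watched */
            uint8_t *bitmap;          /* 2 bits per byte: bit0 = R, bit1 = W  */
        } u;
    };

    static bool entry_check(const struct rc_entry_v2 *e, uint64_t addr, bool is_write)
    {
        if (!e->valid || addr < e->start || addr > e->end)
            return false;                                  /* not this entry  */
        if (!e->is_bitmap)
            return is_write ? e->u.range.w_watch : e->u.range.r_watch;

        uint64_t off  = addr - e->start;                   /* byte in range   */
        uint8_t  bits = (uint8_t)((e->u.bitmap[off / 4] >> ((off % 4) * 2)) & 0x3);
        return is_write ? (bits & 0x2) != 0 : (bits & 0x1) != 0;
    }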
Watchpoint-Based Taint Analysis (128-entry range cache only)
[Chart: the same comparison with a 128-entry range cache and no bitmap; labels include 10x, 19x, 23x, 28x, 30x, 206x, 423x, 1429x and roughly a 20% slowdown.]
Width Test
76