Recent Advances in System Software Dependability
Feng Zhou (周枫), NetEase
2007-12-21, Tsinghua University
http://zhoufeng.net
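About Me
• Feng Zhou (周枫)
• 1996-2002: Tsinghua University, bachelor's and master's degrees in Computer Science
• 2002-2007: UC Berkeley, Ph.D. in CS
• 2007-present: NetEase, Senior Vice President
• Research interests: OS, Internet Services, Programming Languages, Networking, Information Retrieval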
A Trend in Software
“The Movement to Safe Languages”
• Type safety → no memory safety errors (segfaults)
Java, C#, Python, Ruby, PHP, Perl…
Vs.
C, C++, ASM
Why Safe Languages?
Pros
• Easier to program
• Less error-prone
• Easier to analyze
→ More dependable software with less effort
Cons
• Slower
• Less control (over memory, I/O …)
• Poorly suited to real-time work
Status Quo
• Server: ~80%
– All server-side Java, Ruby-on-Rails…
• Client: ~30%
– Visual Studio, Eclipse…
• What about system software? Almost 0
– Why? See the cons of safe languages: slower, less control, poor real-time performance.
The Problem
• System software: mainly operating systems
• Dependability: “properties that allow one to rely on a system”
– Reliability, Security, Safety, Availability
How do we make more dependable system software?
Current State
• Problems dealing with latest or peculiar hardware
• Worms spread because of the many security flaws
• More problems for long-running servers or large clusters [Bligh et al. 07]
Problem scope
Our Focus
Why is this important?
• The cost of defects is increasing
• Society is increasingly driven by computers
• More and more computers online → remote exploits
• Critical infrastructure using commodity systems – “OS monoculture”
Why is this hard?
• Reason 1: Complexity
– Windows NT 3.1 (1993): 6M LOC
– Windows Vista (2006): 50M LOC
– Linux kernel: 8.3M LOC, growing by 86 lines/hr over the last 2 years
More reasons
• Reason 2: Unsafe languages used (C/C++)
– Buffer overruns can be eliminated with safe languages
• Reason 3: Recent trends
– Multicore → more parallelism
• “The Free Lunch is Over”, Herb Sutter, 2005
– ccNUMA, smarter devices
• ACPI bytecode language
Roadmap
• OS Dependability with Hardware Protection
– Swift et al. 03-04
• OS Dependability with Program Analysis
– Zhou et al. 04-06
• OS Dependability with Virtual Machines
– Criswell et al. 07
OS Dependability With Hardware
What Causes Most Crashes?
• Device drivers!
– Run in the same protection domain as the kernel
• Drivers are often buggier than the kernel
– Device drivers cause 85% of Windows XP crashes
– Drivers are 7 times buggier than the kernel in Linux
– Xbox hacked due to memory bugs in games
Better driver dependability → Better OS dependability
Crashes [figures omitted]
Goal [figure omitted]
* Figures courtesy of Mike Swift et al.
Requirements
• Isolation
• Recovery
• Non-intrusive
– No/very few code changes
Principles & Goal
• Principles & Assumptions
– Drivers are generally well-behaved and benign
– Design for mistakes (not abuse)
• Doesn’t need to be perfect
– Design for fault resistance (not fault tolerance)
• Goal: a practical “best-effort” system
Nooks
• Linux 2.4 kernel and drivers
• “Nooks” kernel patch
– Isolation
– Recovery
• Compatible with existing code
Isolation – Memory [figure omitted]
Isolation – Control Transfer [figures omitted]
Isolation – Data Access [figures omitted]
Isolation – Interposition [figures omitted]
* Figures courtesy of Mike Swift et al.
Isolation Summary
• Isolation
– Lightweight Kernel Protection Domain
– eXtension Procedure Call (XPC)
– Copy-in/Copy-out
– Wrappers
Failure Detection [figures omitted]
Restart [figures omitted]
* Figures courtesy of Mike Swift et al.
Driver State/Session Recovery
• Drivers lose state after restart
– E.g. file handles + history of ioctls configuring the drivers
– This causes apps to fail
• Shadow drivers
– Kernel agents for recovering drivers
– Observe kernel-driver communication during normal operation
– Restore driver state after restart
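The shadow-driver idea above can be sketched in a few lines of C. This is only an illustration under stated assumptions: toy_driver, shadow, shadow_ioctl, and shadow_recover are hypothetical names, the "driver" is a struct with two settings, and a restart is modeled by zeroing that struct. It is not the Nooks implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOGGED 32

/* Hypothetical toy driver: its configuration is lost on restart. */
struct toy_driver { int volume; int rate; };

enum { SET_VOLUME = 1, SET_RATE = 2 };

struct request { int cmd; int arg; };

/* The shadow keeps a history of the configuration requests it has seen. */
struct shadow {
    struct request log[MAX_LOGGED];
    size_t count;
};

/* The driver's own request handler; returns -1 for unknown commands. */
static int driver_ioctl(struct toy_driver *d, int cmd, int arg)
{
    switch (cmd) {
    case SET_VOLUME: d->volume = arg; return 0;
    case SET_RATE:   d->rate = arg;   return 0;
    default:         return -1;
    }
}

/* Normal operation: pass the request through and remember it if it succeeded. */
static int shadow_ioctl(struct shadow *s, struct toy_driver *d, int cmd, int arg)
{
    assert(s != NULL && d != NULL);
    int rc = driver_ioctl(d, cmd, arg);
    if (rc == 0 && s->count < MAX_LOGGED) {
        s->log[s->count].cmd = cmd;
        s->log[s->count].arg = arg;
        s->count++;
    }
    return rc;
}

/* Recovery: model a restart by resetting the driver, then replay the history. */
static int shadow_recover(const struct shadow *s, struct toy_driver *d)
{
    d->volume = 0;
    d->rate = 0;
    for (size_t i = 0; i < s->count; i++)
        if (driver_ioctl(d, s->log[i].cmd, s->log[i].arg) != 0)
            return -1;
    return 0;
}
```

Applications keep calling shadow_ioctl as before; after shadow_recover the restarted driver holds the same settings, so the restart stays invisible to them.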
Native Linux [figure omitted]
Normal Behavior [figure omitted]
During Recovery [figure omitted]
* Figures courtesy of Mike Swift et al.
Evaluation
• Pros
– General solution
– Covers both isolation and recovery
– Good availability
– Low overhead when accessing memory
• Cons
– High overhead when crossing domains
– System specific
– Coarse-grained protection: does not prevent a driver from corrupting itself
Results [figure omitted]
* Figure courtesy of Mike Swift et al.
More Results
• Nooks: 23,000 LOC
• Shadow Manager: 600 LOC
• Overhead: up to 100%
– Main overhead is domain crossing
Nooks Recap
• Device drivers are a major source of OS crashes.
• Nooks isolates drivers by putting them inside separate hardware protection domains
• Recovery is done by restarting drivers
• Driver state can be restored by the “shadow driver” technique
• Performance is O.K. Code changes are reasonable.
OS Dependability With Program Analysis
Review
• Separate hardware protection domains: Nooks [Swift et al], L4 [LeVasseur et al], Xen [Fraser et al]
– Relatively high overhead due to cross-domain calls, system specific
• Binary instrumentation: SFI [Wahbe et al, Small/Seltzer]
– High overhead, coarse-grained
• What can be done at the C language level?
– Add fine-grained type safety, to extensions only
– A way to recover from failures
Vision
• What a safe language provides:
– Array indexing stays within object bounds
– No uses of null/invalid pointers
– All operations are type safe
– No uses of dangling pointers
– Control flow obeys program semantics
– …
A Language-Based Approach to Extension Safety
• Light annotations in extension code and host API
– Buffer bounds, non-null pointers, null-terminated strings, tagged unions
• Deputy src-to-src compiler emits safety checks when necessary
• Key: compatible extension-host binary interface
• Runtime tracks resource usage and restores system invariants at failure time
[Figure: annotated source → Deputy → C with checks → GCC → driver module. The driver module, the SafeDrive runtime & recovery, and the Linux kernel share the kernel address space.]
Deputy: Motivation
struct {
    unsigned int len;
    int * data;
} x;
for (i = 0; i < x.len; i++) {
    … x.data[i] …
}
• Common C code
• How to check memory safety?
• C pointers do not express the extent of buffers (unlike Java)
Previous Approach: Fat Pointers
• Used in CCured and Cyclone
• Compiler inserts extra bounds variables
• Changes memory layout
• Cannot be applied modularly
struct {
    unsigned int len;
    int * data;
    int * data_b;
    int * data_e;
} x;
for (i = 0; i < x.len; i++) {
    if (x.data + i < x.data_b) abort();
    if (x.data + i >= x.data_e) abort();
    … x.data[i] …
}
Deputy Bounds Annotations
struct {
    unsigned int len;
    int * count(len) data;
} x;
for (i = 0; i < x.len; i++) {
    if (i < 0 || i >= x.len) abort();
    … x.data[i] …
}
• Annotations use existing bounds info in programs, or constants
• Compiler emits runtime checks
• No memory layout change → can be applied one extension at a time
• Many checks can be optimized away
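To make the emitted check concrete, here is a minimal sketch in plain C of what a count(len) annotation asks the compiler to enforce. Everything named here is an assumption for illustration: COUNT is a stand-in macro that expands to nothing (real Deputy parses the annotation itself), and int_buf and checked_get are hypothetical names. A failed check returns an error code instead of calling abort() so the behaviour is easy to observe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: expands to nothing, so an ordinary C compiler
   accepts the annotated declaration and the memory layout is unchanged. */
#define COUNT(n)

struct int_buf {
    unsigned int len;
    int *COUNT(len) data;   /* data points to len ints */
};

/* Hand-written version of the check a bounds-checking compiler would emit
   before x->data[i]. Returns 0 and stores the element on success,
   -1 if the index is out of bounds. */
static int checked_get(const struct int_buf *x, unsigned int i, int *out)
{
    assert(x != NULL && out != NULL);
    if (i >= x->len)        /* i is unsigned, so i < 0 cannot occur */
        return -1;
    *out = x->data[i];
    return 0;
}
```

Because the bound comes from the existing len field, the struct has the same size and layout with or without the annotation, which is what lets annotated and unannotated modules interoperate.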
Deputy Features
• Bounds: safe, count(n), bound(lo, hi)
– Default: safe
• Other annotations
– Null-terminated string/buffer
– Tagged unions
– Open arrays
– Checks for printf() arguments
• Automatic bounds variables for local variables reduce the annotation burden
Deputy Guarantees
• Deputy guarantees type safety if:
– Programmer correctly annotates globals and function parameters used by the extension
– Deallocation does not create dangling pointers
– Trusted casts are correct
– External modules / trusted code establish and preserve Deputy annotations
Failure Handling
• Everything runs inside the same protection domain
• After Deputy check failure: could just halt
• More useful: clean-up extension and let host continue
• Assumption: restarts should fix most transient failures
[Figure: annotated driver → Deputy → C with checks → GCC → driver module. The driver module, the SafeDrive runtime & recovery, and the Linux kernel share the kernel address space.]
Update Tracking and Restarts
• Free resources and undo state changes done by driver
• Kernel API functions “wrapped” to do update tracking
– Compensations: spin_lock(l) vs. spin_unlock(l)
• After failure, undo updates in LIFO order
• Then restart driver
[Figure: annotated driver → Deputy → C with checks → GCC → driver module. Inside the kernel address space: driver module, wrappers, update tracking, recovery, and the Linux kernel.]
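Update tracking with LIFO undo can be sketched as a small log of compensations. The sketch below is illustrative only: update_log, wrapped_lock, wrapped_unlock, undo_all, and the toy_lock type are hypothetical names standing in for wrapped kernel API functions such as spin_lock()/spin_unlock(), and the released_seq field exists only so the undo order can be observed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UPDATES 64

typedef void (*undo_fn)(void *arg);

struct update { undo_fn undo; void *arg; };

struct update_log {
    struct update entries[MAX_UPDATES];
    size_t count;
};

/* Toy lock; released_seq records the order in which undo released it. */
struct toy_lock { int held; int released_seq; };

static int release_counter;

/* Compensation for taking a lock: release it. */
static void undo_lock(void *arg)
{
    struct toy_lock *l = arg;
    l->held = 0;
    l->released_seq = ++release_counter;
}

/* Record a compensation; returns -1 if the log is full. */
static int log_update(struct update_log *log, undo_fn undo, void *arg)
{
    assert(log != NULL && undo != NULL);
    if (log->count == MAX_UPDATES)
        return -1;
    log->entries[log->count].undo = undo;
    log->entries[log->count].arg = arg;
    log->count++;
    return 0;
}

/* Wrapper the driver calls instead of the raw lock operation. */
static int wrapped_lock(struct update_log *log, struct toy_lock *l)
{
    if (log_update(log, undo_lock, l) != 0)
        return -1;
    l->held = 1;
    return 0;
}

/* Normal unlock path: drop the newest matching record, then unlock. */
static void wrapped_unlock(struct update_log *log, struct toy_lock *l)
{
    for (size_t i = log->count; i > 0; i--) {
        struct update *u = &log->entries[i - 1];
        if (u->undo == undo_lock && u->arg == l) {
            for (size_t j = i; j < log->count; j++)
                log->entries[j - 1] = log->entries[j];
            log->count--;
            break;
        }
    }
    l->held = 0;
}

/* After a failure: run the remaining compensations newest-first. */
static void undo_all(struct update_log *log)
{
    while (log->count > 0) {
        struct update *u = &log->entries[--log->count];
        u->undo(u->arg);
    }
}
```

Note that the compensations contain no driver code: undo_all only calls host-side functions recorded by the wrappers, so recovery does not depend on the failed driver behaving correctly.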
Return Gracefully from Failure
Invariants:
• No driver code is executed after failure
• No kernel function is forced to return early
[Figure 1: call stack with frames Kernel:foo(), Driver:bar1(), and Driver:bar2(), labeled “Err code”]
[Figure 2: call stack with frames Kernel:foo1(), Driver:bar1(), Kernel:foo2(), and Driver:bar2(), with lock() and unlock() calls marked]
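The first invariant can be illustrated with setjmp/longjmp at the kernel-driver boundary. This is a sketch of one possible mechanism, not a description of how SafeDrive is implemented; call_driver, check_failed, and toy_driver_read are hypothetical names, and the sketch does not address the second invariant (kernel frames nested between driver frames).

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf recovery_point;  /* set at the kernel/driver boundary */
static int driver_failed;       /* once set, the driver is never entered again */

/* Called by an inserted safety check when it fails. */
static void check_failed(void)
{
    longjmp(recovery_point, 1);
}

/* Toy driver entry point with a hand-written bounds check. */
static int toy_driver_read(int index)
{
    static const int table[4] = {10, 20, 30, 40};
    if (index < 0 || index >= 4)
        check_failed();         /* does not return */
    return table[index];
}

/* Boundary wrapper: the kernel calls the driver only through this function.
   On a failed check, control comes back here, all driver frames are
   abandoned, and the kernel caller sees an ordinary error code. */
static int call_driver(int (*entry)(int), int arg, int *result)
{
    assert(entry != NULL && result != NULL);
    if (driver_failed)
        return -1;
    if (setjmp(recovery_point) != 0) {
        driver_failed = 1;
        return -1;
    }
    *result = entry(arg);
    return 0;
}
```

From the kernel's point of view a failed check looks like a driver call that returned an error, which is a case kernel code already has to handle.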
Discussion
• Compared to Nooks
– Significantly less interception → much simpler overall
– Deputy does fine-grained per-allocation checks → no separate heap/stack
– No help from virtual memory hardware
– Works for user-level applications and safe languages
• Compared to C++/Java exceptions
– Compensation does not contain any code from driver
– Only restores host state, not extension state
Implementation
• Deputy compiler: 20K lines of OCaml
• Kernel patch to 2.6.15.5: 1K lines
• Kernel headers patch: 1.9K lines
• Patch for 6 drivers in 4 categories
– Network: e1000, tg3
– USB: usb-storage
– Sound: intel8x0, emu10k1
– Video: nvidia
Evaluation: Recovery Rate
• Inject random errors with compile-time injection: 5 errors from one of 7 categories each time
– Faults chosen following empirical studies [Sullivan & Chillarege 91], [Christmansson & Chillarege 96]
– Scan overrun, loop fault, corrupt parameter, off-by-one, flipped condition, missing call, missing assignment
• Load buggy e1000 driver w/ and w/o SafeDrive
• Exercise by downloading an 89MB file, verifying it and unloading the driver
• Then rerun with original driver
Recovery Rate Results
                        Crashes     Failures    Passes
SafeDrive off           44          21          75
SafeDrive on
  Static error          10          0           3
  Runtime error         34          2           5
  No problem detected   0           19          67
  Recovery successes    44 (100%)   2 (100%)    8 (100%)
• 140 runs, 20 per fault category
• SafeDrive is effective at detecting and recovering from crashing problems, and can detect some statically.
Annotation Burden
[Bar chart, log scale from 100 to 100,000 lines: Original LOC, Deputy Annotations, and Recovery Wrappers for e1000, tg3, usb-storage, intel8x0, emu10k1, and nvidia]
• 1%-4% of lines with Deputy annotations
• Recovery wrappers can be automatically generated
Annotations Break-down
                 Lines Changed   Bounds   Strings   Tagged Unions   Trusted Code
All 6 drivers    1544            379      83        2               390
Kernel headers   1866            187      260       8               140
• Common reasons for trusted casts and trusted code
– Polymorphic private data, e.g. netdev->priv
– Small number of cases where buffer bounds are not available
– Code manipulating pointer values directly, e.g. PTR_ERR(x)
Performance
[Bar chart: relative change (%) in throughput and CPU, SafeDrive vs. native, axis from -20 to 20. Benchmarks: e1000 TCP recv, e1000 UDP recv, e1000 TCP send, e1000 UDP send, tg3 TCP recv, tg3 TCP send, usb-storage untar, emu10k aplay, intel8x0 aplay, nvidia xinit]
• Nooks (Linux 2.4) for comparison: e1000 TCP recv 46% (vs. 4% with SafeDrive), e1000 TCP send 111% (vs. 12%)
Conclusion
• SafeDrive does fine-grained memory safety checking for extensions with low overhead and few code changes
• A recovery scheme for in-process extensions via restarts
• It is feasible to get much of the safety guarantee of type-safe languages for extensions without abandoning existing systems written in C
• Language technology makes extension isolation easier
OS Dependability With Virtual Machines
Safe Execution Environment
• The environment provided by a safe programming language.
– E.g. no out-of-bound access, etc.
• Executes the entire OS inside the environment (VM)
• Looking at ISA level (vs. kernel module, language)
• Benefits
– Better security (not necessarily reliability/availability!)
– More guarantees → new OS designs
– Makes the OS more analyzable
Secure Virtual Architecture
• Compiler-based virtual machine
– Uses analysis & transformation techniques from compilers
– Supports commodity operating systems (e.g., Linux)
• Typed virtual instruction set enables sophisticated program analysis
• Provides a safe execution environment for commodity OSs
[Figure: Commodity OS, Virtual ISA, Compiler + VM, Native ISA, Hardware]
* Figure courtesy of John Criswell et al.
SVA Architecture
[Figure: layered diagram with labels Applications, OS Kernel, Drivers, OS Memory Allocator, Safety Checking Compiler, SVA ISA, SVA Virtual Machine, Safety Verifier, Native Code Generator, Memory Safety Run-time Library, SVA-OS Run-time Library, Native ISA, Hardware]
* Figure courtesy of John Criswell et al.
Software Flow
[Figure: stages Safety Checking Compiler, Safety Verifier, and Code Generator, split into compile time and install/load/run time. Artifacts in order: Kernel/Application Source, Bytecode with Safe Types (LLVM-like), Bytecode + Run-Time Checks, Native Code, Hardware. A box marks the TCB.]
* Figure courtesy of John Criswell et al.
Memory Safety Checking
• Do not want to use “fat pointers”
– The same reason as SafeDrive
• No annotations available either
– SafeDrive uses annotations
• So what can be done?
– Record all allocations at run time and look them up at each access…
Naïve Safety Checks
Kernel code:
    P1 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(P1, reg_size(inode_cache));
    …
    Dest = &P1[index];
    bounds = pchk_get_bounds(P1);
    pchk_check_bounds(P1, Dest, bounds);
    …
    P2 = vmalloc(size1);
    pchk_reg_obj(P2, size1);
    P3 = kmem_cache_alloc(file_cache);
    pchk_reg_obj(P3, reg_size(file_cache));
    P4 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(P4, reg_size(inode_cache));
SVM metadata: a single memory-tracking data structure holding P1, P2, P3, and P4 [figure omitted]
• Jones-Kelly method
• Run-time lookups are too slow
• Limited opportunity to remove run-time checks
* Figure courtesy of John Criswell et al.
SVM Memory Safety
Kernel code:
    P1 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(MP1, P1, reg_size(inode_cache));
    …
    Dest = &P1[index];
    bounds = pchk_get_bounds(MP1, P1);
    pchk_check_bounds(P1, Dest, bounds);
    P2 = vmalloc(size1);
    pchk_reg_obj(MP2, P2, size1);
    P3 = kmem_cache_alloc(file_cache);
    pchk_reg_obj(MP3, P3, reg_size(file_cache));
    P4 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(MP1, P4, reg_size(inode_cache));
SVM metadata: separate tracking structures MP1 (P1, P4), MP2 (P2), and MP3 (P3) [figure omitted]
* Figure courtesy of John Criswell et al.
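The difference from the naïve scheme is that each object is registered in, and looked up from, a smaller per-partition structure. A minimal sketch follows, with hypothetical names (metapool, pool_reg_obj, pool_check_bounds) that only mirror the pchk_* calls on the slide; the linear search is an assumption made for brevity and stands in for whatever lookup structure a real system would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_CAPACITY 16

struct object { uintptr_t start; size_t size; };

/* One tracking structure per partition of the kernel's objects. */
struct metapool {
    struct object objs[POOL_CAPACITY];
    size_t count;
};

/* Register an allocation in its pool; returns -1 if the pool is full. */
static int pool_reg_obj(struct metapool *mp, const void *p, size_t size)
{
    assert(mp != NULL && p != NULL);
    if (mp->count == POOL_CAPACITY)
        return -1;
    mp->objs[mp->count].start = (uintptr_t)p;
    mp->objs[mp->count].size = size;
    mp->count++;
    return 0;
}

/* Returns 1 if src lies in an object registered in mp and dest lies in
   that same object; returns 0 otherwise. */
static int pool_check_bounds(const struct metapool *mp,
                             const void *src, const void *dest)
{
    uintptr_t s = (uintptr_t)src, d = (uintptr_t)dest;
    for (size_t i = 0; i < mp->count; i++) {
        const struct object *o = &mp->objs[i];
        if (s >= o->start && s - o->start < o->size)
            return d >= o->start && d - o->start < o->size;
    }
    return 0;   /* src is not registered in this pool */
}
```

Each lookup touches only the objects of one pool, and a pointer checked against the wrong pool is rejected outright.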
Type Safe Partitions
• Alias analysis performs type inference
• Type-homogeneous partitions reduce run-time checks:
– No load/store checks
– No indirect call checks
– Harmless dangling pointers
• Type-unsafe partitions require all run-time checks
[Figure: memory divided into a blue partition and a red partition]
* Figure courtesy of John Criswell et al.
SVA Prototype
• Ported Linux to SVA instruction set
– Similar to porting to new hardware architecture
– Compiled using LLVM
• Wrote SVA-OS as run-time library linked into kernel
• Provide safety guarantees to entire kernel except:
– Memory management code
– Architecture-dependent utility library
– Architecture-independent utility library
Linux Modification
Section           Original LOC   SVA-OS   Allocators   Analysis   Total Modified
Arch-indep Core   9,822          41       76           3          120
Net Drivers       399,872        12       0            6          18
Net Protocols     169,832        23       0            29         52
Core FS           18,499         78       0            19         97
Ext3              5,207          0        0            1          1
Total Indep       603,232        154      76           58         288
Arch-dep          29,237         4,777    0            1          4,778
Web Server Bandwidth
• Memory safety overhead less than 59%
• Transfers of larger file sizes show acceptable overhead
[Chart: percent bandwidth reduction relative to native (0% to 70%) vs. file size in KB (1 to 128), for thttpd and apache]
* Figure courtesy of John Criswell et al.
Server Latency Overhead
[Chart: percent overhead relative to native (0% to 140%) vs. file size in KB (1 to 256), for thttpd and apache]
• Many short data transfers suffer high memory safety overhead
• Overhead acceptable for larger file sizes
Acceptable overhead for security-critical servers
* Figure courtesy of John Criswell et al.
Exploits
• Tried 5 memory exploits that work on Linux 2.4.22
• The one uncaught exploit targets code that was not instrumented with checks
BugTraq ID   Kernel Component      Caught?
11956        Console Driver        Yes
10179        TCP/IP                Yes
11917        TCP/IP                Yes
12911        Bluetooth Protocol    Yes
13589        ELF/Support Library   No
Conclusion
Ways to detect errors in system software:
• With hardware protection
– Drivers in separate domains
• With program analysis
– Insert software checks
• With virtual machines
– Software checks inserted at JIT time
And ways to recover from these errors
Other directions
• Rewrite the OS: Singularity (MSR, 2005-present)
• Production driver framework: Windows Driver Foundation
• Static verification: Windows Static Driver Verifier (EuroSys 06)
References
• Nooks, SOSP 03, OSDI 04
• SafeDrive, OSDI 06
• Secure Virtual Architecture, SOSP 07
Thanks!
Contact me: [email protected]
http://zhoufeng.net