Recent Advances in System Software Dependability. Zhou Feng, NetEase. 2007-12-21, Tsinghua University

Recent Advances in System Software Dependability

Zhou Feng, NetEase

2007-12-21, Tsinghua University

http://zhoufeng.net

About Me

• Zhou Feng

• 1996-2002: Tsinghua University, B.S. and M.S. in Computer Science

• 2002-2007: UC Berkeley, Ph.D. in CS

• 2007-: NetEase, Senior Vice President

• Research interests: OS, Internet Services, Programming Languages, Networking, Information Retrieval

A Trend in Software

“The Movement to Safe Languages”

• Type-safety → No memory safety errors (segfaults)

Java, C#, Python, Ruby, PHP, Perl…

Vs.

C, C++, ASM

Why Safe Languages?

Pros

• Easier to program

• Less error-prone

• Easier to analyze

Cons

• Slower

• Less control (over memory, I/O …)

• Bad for real-time things

More dependable software with less effort

Status Quo

• Server: ~80%

– All server-side Java, Ruby-on-Rails…

• Client: ~30%

– Visual Studio, Eclipse…

• What about system software?

– Almost 0

– Why? See the “cons” on the previous slide: slower, less control, poor real-time performance

The Problem

• System software: mainly operating systems

• Dependability: “properties that allow one to rely on a system”

– Reliability, Security, Safety, Availability

How do we make more dependable system software?

Current State

• Problems dealing with latest or peculiar hardware

• Worms because of so many security flaws

• More problems for long running servers or large clusters [Bligh et al. 07]

Problem scope

[Figure: problem scope, with one area marked “Our Focus”]

Why is this important?

• The cost of defects is increasing

• Society is increasingly driven by computers

• More and more computers online → remote exploits

• Critical infrastructure using commodity systems – “OS monoculture”

Why is this hard?

• Reason 1: Complexity

– Windows Vista (2006): 50M LOC; Windows NT 3.1 (1993): 6M LOC

– Linux kernel: 8.3M LOC; 86 lines/hr over the last 2 years

More reasons

• Reason 2: Unsafe languages used (C/C++)

– Buffer overruns can be eliminated with safe languages

• Reason 3: Recent trends

– Multicore → more parallelism

• “The Free Lunch is Over”, Herb Sutter, 2005

– ccNUMA, smarter devices

• ACPI byte language

Roadmap

• OS Dependability with Hardware Protection

– Swift et al. 03-04

• OS Dependability with Program Analysis

– Zhou et al. 04-06

• OS Dependability with Virtual Machines

– Criswell et al. 07

OS Dependability with Hardware Protection

What Causes Most Crashes?

• Device drivers!

– Run in the same protection domain as the kernel

• Drivers are often buggier than the kernel

– Device drivers cause 85% of Windows XP crashes

– Drivers are 7 times buggier than the kernel in Linux

– Xbox hacked due to memory bugs in games

Better driver dependability → Better OS dependability

Crashes

* Figure courtesy of Mike Swift et al.


Goal


Requirements

• Isolation

• Recovery

• Non-intrusive

– No/very few code changes

Principles & Goal

• Principles & Assumptions

– Drivers are generally well-behaved and benign

– Design for mistakes (not abuse)

• Doesn’t need to be perfect

– Design for fault resistance (not fault tolerance)

• Goal: a practical “best-effort” system

Nooks

• Linux 2.4 kernel and drivers

• “Nooks” kernel patch

– Isolation

– Recovery

• Compatible with existing code

Isolation - Memory


Isolation – Control Transfer


Isolation – Data Access


Isolation – Interposition


Isolation Summary

• Isolation

– Lightweight Kernel Protection Domain

– eXtension Procedure Call (XPC)

– Copy-in/Copy-out

– Wrappers

Failure Detection


Restart


Driver State/Session Recovery

• Drivers lose state after restart

– E.g. file handles + history of ioctls configuring the drivers

– This causes apps to fail

• Shadow drivers

– Kernel agents for recovering drivers

– Observe kernel-driver communication during normal operation

– Restore driver state after restart
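The record-and-replay idea behind shadow drivers can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: the "driver" is a toy whose whole state is a small settings table, and every name here (`toy_driver`, `shadow_ioctl`, `shadow_recover`) is hypothetical rather than taken from Nooks.

```c
#include <stddef.h>

#define MAX_LOG 32
#define N_SETTINGS 8

/* Toy driver: its only state is a table of settings, lost on restart. */
struct toy_driver { int setting[N_SETTINGS]; };

/* Simulates a restart: all driver state is wiped. */
void driver_reset(struct toy_driver *d)
{
    size_t i;
    for (i = 0; i < N_SETTINGS; i++)
        d->setting[i] = 0;
}

/* Configuration request. Returns 0 on success, -1 for an unknown key. */
int driver_ioctl(struct toy_driver *d, unsigned int key, int value)
{
    if (key >= N_SETTINGS)
        return -1;
    d->setting[key] = value;
    return 0;
}

/* Shadow: passively logs the configuration requests that succeeded. */
struct shadow {
    struct { unsigned int key; int value; } log[MAX_LOG];
    size_t n;
};

/* Normal operation: forward to the driver and record what succeeded. */
int shadow_ioctl(struct shadow *s, struct toy_driver *d,
                 unsigned int key, int value)
{
    int rc = driver_ioctl(d, key, value);
    if (rc == 0 && s->n < MAX_LOG) {
        s->log[s->n].key = key;
        s->log[s->n].value = value;
        s->n++;
    }
    return rc;
}

/* Recovery: restart the driver, then replay the log in original order. */
int shadow_recover(const struct shadow *s, struct toy_driver *d)
{
    size_t i;
    driver_reset(d);
    for (i = 0; i < s->n; i++)
        if (driver_ioctl(d, s->log[i].key, s->log[i].value) != 0)
            return -1;
    return 0;
}
```

Applications keep calling `shadow_ioctl` as if nothing happened; after a failure, `shadow_recover` brings the restarted driver back to the configuration the applications expect.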

Native Linux


Normal Behavior


During Recovery


Evaluation

• Pros

– General solution

– Covers both isolation and recovery

– Good availability

– Low overhead when accessing memory

• Cons

– High overhead when crossing domains

– System specific

– Coarse-grained protection: does not prevent a driver from corrupting itself

Results


More Results

• Nooks: 23,000 LOC

• Shadow Manager: 600 LOC

• Overhead: up to 100%

– Main overhead is domain crossing

Nooks Recap

• Device drivers are a major source of OS crashes.

• Nooks isolates drivers by putting them inside separate hardware protection domains

• Recovery is done by restarting drivers

• Driver state can be restored by the “shadow driver” technique

• Performance is O.K. Code changes are reasonable.

OS Dependability with Program Analysis

Review

• Separate hardware protection domains: Nooks [Swift et al], L4 [LeVasseur et al], Xen [Fraser et al]

– Relatively high overhead due to cross-domain calls, system specific

• Binary instrumentation: SFI [Wahbe et al, Small/Seltzer]

– High overhead, coarse-grained

• What can be done at the C language level?

– Add fine-grained type safety, to extensions only

– A way to recover from failures

Vision

• What a safe language provides:

– Array indexing stays within object bounds

– No uses of null/invalid pointers

– All operations are type safe

– No uses of dangling pointers

– Control flow obeys program semantics

– …

A Language-Based Approach to Extension Safety

• Light annotations in extension code and host API

– Buffer bounds, non-null pointers, null-terminated strings, tagged unions

• Deputy src-to-src compiler emits safety checks when necessary

• Key: compatible extension-host binary interface

• Runtime tracks resource usage and restores system invariants at failure time

[Figure: annotated source → Deputy → C with checks → GCC → driver module, loaded into the kernel address space alongside the SafeDrive runtime & recovery and the Linux kernel]

Deputy: Motivation

struct {
    unsigned int len;
    int * data;
} x;

for (i = 0; i < x.len; i++) {
    … x.data[i] …
}

• Common C code

• How to check memory safety?

• C pointers do not express extent of buffers (unlike Java)

Previous Approach: Fat Pointers

• Used in CCured and Cyclone

• Compiler inserts extra bounds variables

• Changes memory layout

• Cannot be applied modularly

struct {
    unsigned int len;
    int * data;
    int * data_b;
    int * data_e;
} x;

for (i = 0; i < x.len; i++) {
    if (x.data+i < x.data_b) abort();
    if (x.data+i >= x.data_e) abort();
    … x.data[i] …
}
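To make the "changes memory layout" point concrete, here is a small C sketch comparing a thin and a fat version of the same struct. The struct and function names are hypothetical and are not taken from CCured or Cyclone; the bounds test is written with offsets so the sketch itself never forms an out-of-range pointer.

```c
#include <stddef.h>

/* Thin pointer: what existing C code and binary interfaces expect. */
struct thin_buf {
    unsigned int len;
    int *data;
};

/* Fat pointer: extra base and end fields travel with the pointer. */
struct fat_buf {
    unsigned int len;
    int *data;
    int *data_b;   /* first valid element */
    int *data_e;   /* one past the last valid element */
};

/* Returns 1 if element i, counted from data, lies in [data_b, data_e). */
int fat_in_bounds(const struct fat_buf *x, size_t i)
{
    size_t start = (size_t)(x->data - x->data_b);   /* offset of data */
    size_t total = (size_t)(x->data_e - x->data_b); /* object size */
    return start <= total && i < total - start;
}
```

Because `sizeof(struct fat_buf)` is larger than `sizeof(struct thin_buf)`, code compiled with fat pointers cannot share this struct with a kernel compiled without them, which is why the approach cannot be applied to one module at a time.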

Deputy Bounds Annotations

struct {
    unsigned int len;
    int * count(len) data;
} x;

for (i = 0; i < x.len; i++) {
    if (i < 0 || i >= x.len) abort();
    … x.data[i] …
}

• Annotations use existing bounds info in programs, or constants

• Compiler emits runtime checks

• No memory layout change → can be applied to one extension at a time

• Many checks can be optimized away
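The following C sketch illustrates the same idea in a form an ordinary compiler accepts. It assumes the annotation can be modeled as a macro that expands to nothing (real Deputy is a source-to-source compiler, not a macro), and it returns an error code where the slide calls abort(). `COUNT`, `struct buf`, `checked_get`, and `checked_sum` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical stand-in for the annotation: it expands to nothing, so an
 * ordinary C compiler sees the original, unchanged struct layout. */
#define COUNT(n)

struct buf {
    unsigned int len;
    int *COUNT(len) data;   /* "data points to len ints" */
};

/* The kind of check a Deputy-like compiler would emit before x->data[i].
 * It reuses the existing len field instead of adding bounds fields.
 * Returns 0 and stores the element on success, -1 on a violation. */
int checked_get(const struct buf *x, unsigned int i, int *out)
{
    if (i >= x->len)
        return -1;
    *out = x->data[i];
    return 0;
}

/* Sums the buffer through the checked accessor; -1 on any violation. */
int checked_sum(const struct buf *x, int *sum)
{
    int v;
    unsigned int i;
    *sum = 0;
    for (i = 0; i < x->len; i++) {
        if (checked_get(x, i, &v) != 0)
            return -1;
        *sum += v;
    }
    return 0;
}
```

In `checked_sum` the loop condition already implies `i < x->len`, so the check inside `checked_get` is redundant there; this is the kind of check an optimizer can remove.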

Deputy Features

• Bounds: safe, count(n), bound(lo, hi)

– Default: safe

• Other annotations

– Null terminated string/buffer

– Tagged unions

– Open arrays

– Checks for printf() arguments

• Automatic bounds variables for local variables → reduced annotation burden

Deputy Guarantees

• Deputy guarantees type safety if:

– Programmer correctly annotates globals and function parameters used by the extension

– Deallocation does not create dangling pointers

– Trusted casts are correct

– External modules / trusted code establish and preserve Deputy annotations

Failure Handling

• Everything runs inside the same protection domain

• After Deputy check failure: could just halt

• More useful: clean-up extension and let host continue

• Assumption: restarts should fix most transient failures

[Figure: annotated driver → Deputy → C with checks → GCC → driver module, with the SafeDrive runtime & recovery]

Update Tracking and Restarts

• Free resources and undo state changes done by driver

• Kernel API functions “wrapped” to do update tracking

– Compensations: spin_lock(l) vs. spin_unlock(l)

• After failure, undo updates in LIFO order

• Then restart driver
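Update tracking with LIFO undo can be sketched as a small compensation log. This is a toy model, not SafeDrive's runtime: the lock is a plain struct rather than a kernel spinlock, the log is a fixed-size global array, and all names (`wrapped_lock`, `wrapped_unlock`, `undo_all`, `undo_order`) are hypothetical.

```c
#include <stddef.h>

#define MAX_UPDATES 64

/* One tracked update: the compensation to run if the driver fails. */
struct update {
    void (*undo)(void *arg);
    void *arg;
};

struct update undo_log[MAX_UPDATES];
size_t n_updates;

/* Trace of which lock ids were released during recovery (for inspection). */
int undo_order[MAX_UPDATES];
size_t n_undone;

/* Toy lock standing in for a kernel spinlock. */
struct toy_lock { int id; int held; };

/* Compensation for taking a lock: release it and note the order. */
void toy_unlock_comp(void *arg)
{
    struct toy_lock *l = arg;
    l->held = 0;
    if (n_undone < MAX_UPDATES)
        undo_order[n_undone++] = l->id;
}

/* Wrapper for "lock": do the operation, then log its compensation.
 * Returns 0 on success, -1 if the log is full. */
int wrapped_lock(struct toy_lock *l)
{
    if (n_updates == MAX_UPDATES)
        return -1;
    l->held = 1;
    undo_log[n_updates].undo = toy_unlock_comp;
    undo_log[n_updates].arg = l;
    n_updates++;
    return 0;
}

/* Wrapper for "unlock": the driver reversed the update itself,
 * so drop the newest matching log entry. */
void wrapped_unlock(struct toy_lock *l)
{
    size_t i = n_updates;
    l->held = 0;
    while (i > 0) {
        i--;
        if (undo_log[i].arg == l) {
            for (; i + 1 < n_updates; i++)
                undo_log[i] = undo_log[i + 1];
            n_updates--;
            return;
        }
    }
}

/* After a failure: run the remaining compensations newest-first. */
void undo_all(void)
{
    while (n_updates > 0) {
        n_updates--;
        undo_log[n_updates].undo(undo_log[n_updates].arg);
    }
}
```

Note that the compensations are supplied by the wrappers, so recovery never has to run any code from the failed driver.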

[Figure: annotated driver → Deputy → C with checks → GCC → driver module; SafeDrive runtime components shown: wrappers, update tracking, recovery]

Return Gracefully from Failure

Invariants:

• No driver code is executed after failure

[Figure: call stack with frames Kernel:foo(), Driver:bar2(), Driver:bar1(); an error code is returned]

Return Gracefully from Failure

Invariants:

• No driver code is executed after failure

• No kernel function is forced to return early

[Figure: call stack with frames Kernel:foo1(), Driver:bar2(), Driver:bar1(), Kernel:foo2(), annotated with lock() and unlock()]

Discussion

• Compared to Nooks

– Significantly less interception → much simpler overall

– Deputy does fine-grained per-allocation checks → no separate heap/stack

– No help from virtual memory hardware

– Works for user-level applications and safe languages

• Compared to C++/Java exceptions

– Compensation does not contain any code from driver

– Only restores host state, not extension state

Implementation

• Deputy compiler: 20K lines of OCaml

• Kernel patch to 2.6.15.5: 1K lines

• Kernel headers patch: 1.9K lines

• Patch for 6 drivers in 4 categories

– Network: e1000, tg3

– USB: usb-storage

– Sound: intel8x0, emu10k1

– Video: nvidia

Evaluation: Recovery Rate

• Inject random errors with compile-time injection: 5 errors from one of 7 categories each time

– Faults chosen following empirical studies [Sullivan & Chillarege 91], [Christmansson & Chillarege 96]

– Scan overrun, loop fault, corrupt parameter, off-by-one, flipped condition, missing call, missing assignment

• Load buggy e1000 driver w/ and w/o SafeDrive

• Exercise by downloading an 89MB file, verifying it, and unloading the driver

• Then rerun with original driver

Recovery Rate Results

                        Crashes     Failures    Passes
SafeDrive off           44          21          75
SafeDrive on:
  Static error          10          0           3
  Runtime error         34          2           5
  No problem detected   0           19          67
Recovery successes      44 (100%)   2 (100%)    8 (100%)

• 140 runs, 20 per fault category

• SafeDrive is effective at detecting and recovering from crashing problems, and can detect some statically.

Annotation Burden

[Figure: bar chart (log scale, 100 to 100,000 lines) of Original LOC, Deputy Annotations, and Recovery Wrappers for e1000, tg3, usb-storage, intel8x0, emu10k1, and nvidia]

• 1%-4% of lines with Deputy annotations

• Recovery wrappers can be automatically generated

Annotations Break-down

                 Lines Changed   Bounds   Strings   Tagged Unions   Trusted Code
All 6 drivers    1544            379      83        2               390
Kernel headers   1866            187      260       8               140

• Common reasons for trusted casts and trusted code

– Polymorphic private data, e.g. netdev->priv

– Small number of cases where buffer bounds are not available

– Code manipulating pointer values directly, e.g. PTR_ERR(x)

Performance

[Figure: bar chart of throughput and CPU, in relative % for SafeDrive vs. native (axis from -20 to 20), for these benchmarks: e1000 TCP recv, e1000 UDP recv, e1000 TCP send, e1000 UDP send, tg3 TCP recv, tg3 TCP send, usb-storage untar, emu10k1 aplay, intel8x0 aplay, nvidia xinit]

• For comparison, Nooks (Linux 2.4): e1000 TCP recv: 46% (vs. 4% with SafeDrive), e1000 TCP send: 111% (vs. 12% with SafeDrive)

Conclusion

• SafeDrive does fine-grained memory safety checking for extensions with low overhead and few code changes

• A recovery scheme for in-process extensions via restarts

• It is feasible to get much of the safety guarantee in type-safe languages in extensions without abandoning existing systems in C

• Language technology makes extension isolation easier

OS Dependability with Virtual Machines

Safe Execution Environment

• The environment provided by a safe programming language.

– E.g. no out-of-bound access, etc.

• Execute the entire OS inside the environment (VM)

• Works at the ISA level (vs. the kernel module or language level)

• Benefits

– Better security (not necessarily reliability/availability!)

– More guarantees → new OS designs

– Make the OS more analyzable

Secure Virtual Architecture

• Compiler-based virtual machine

– Uses analysis & transformation techniques from compilers

– Supports commodity operating systems (e.g., Linux)

• Typed virtual instruction set enables sophisticated program analysis

• Provide a safe execution environment for commodity OSs

[Figure: Commodity OS on top of the Virtual ISA; Compiler + VM between the Virtual ISA and the Native ISA; Hardware below]

* Figure courtesy of John Criswell et al.

SVA Architecture

[Figure: SVA architecture. Components shown: Applications, OS Kernel, Drivers, OS Memory Allocator, Safety Checking Compiler, SVA ISA, SVA Virtual Machine, Safety Verifier, Native Code Generator, Memory Safety Run-time Library, SVA-OS Run-time Library, Native ISA, Hardware]


Software Flow

[Figure: software flow. Compile time: Kernel/Application Source, Safety Checking Compiler, Bytecode with Safe Types (LLVM-like). Install/load/run time: Safety Verifier, Bytecode + Run-Time Checks, Code Generator, Native Code, Hardware. Part of the pipeline is marked as the TCB.]


Memory Safety Checking

• Do not want to use “fat pointers”

– The same reason as SafeDrive

• No annotations available either

– SafeDrive uses annotations

• Then?

– Record every allocation at runtime and look up the bounds at each access…

Naïve Safety Checks

Kernel code:

    P1 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(P1, reg_size(inode_cache));
    …
    Dest = &P1[index];
    bounds = pchk_get_bounds(P1);
    pchk_check_bounds(P1, Dest, bounds);
    …
    P2 = vmalloc(size1);
    pchk_reg_obj(P2, size1);
    P3 = kmem_cache_alloc(file_cache);
    pchk_reg_obj(P3, reg_size(file_cache));
    P4 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(P4, reg_size(inode_cache));

SVM metadata: a single memory tracking data structure holding P1, P2, P3, and P4

• Jones-Kelly method

• Run-time lookups are too slow

• Limited opportunity to remove run-time checks
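A minimal C sketch of this naive scheme follows. It is an illustration only: the function names echo the slide's pchk_* calls but the signatures and bodies are hypothetical, and a flat array with linear search stands in for whatever lookup structure a real implementation would use.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_OBJS 128

struct bounds { uintptr_t lo, hi; };   /* object occupies [lo, hi) */

/* One global table for every allocation in the kernel. */
struct bounds objs[MAX_OBJS];
size_t n_objs;

/* Record a new allocation. Returns 0 on success, -1 if the table is full. */
int reg_obj(const void *p, size_t size)
{
    if (n_objs == MAX_OBJS)
        return -1;
    objs[n_objs].lo = (uintptr_t)p;
    objs[n_objs].hi = (uintptr_t)p + size;
    n_objs++;
    return 0;
}

/* Find the object containing p by searching every registered object.
 * Returns NULL if p is not inside any of them. */
const struct bounds *get_bounds(const void *p)
{
    size_t i;
    uintptr_t a = (uintptr_t)p;
    for (i = 0; i < n_objs; i++)
        if (a >= objs[i].lo && a < objs[i].hi)
            return &objs[i];
    return NULL;
}

/* Returns 1 if an access of size bytes at address dest stays inside b. */
int check_bounds(const struct bounds *b, uintptr_t dest, size_t size)
{
    return b != NULL && dest >= b->lo && dest <= b->hi
        && size <= b->hi - dest;
}
```

Every checked access pays for a search over all live objects, which is the cost the slide refers to as "run-time lookups are too slow".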


SVM Memory Safety

Kernel code:

    P1 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(MP1, P1, reg_size(inode_cache));
    …
    Dest = &P1[index];
    bounds = pchk_get_bounds(MP1, P1);
    pchk_check_bounds(P1, Dest, bounds);
    P2 = vmalloc(size1);
    pchk_reg_obj(MP2, P2, size1);
    P3 = kmem_cache_alloc(file_cache);
    pchk_reg_obj(MP3, P3, reg_size(file_cache));
    P4 = kmem_cache_alloc(inode_cache);
    pchk_reg_obj(MP1, P4, reg_size(inode_cache));

SVM metadata: separate tracking structures per partition: MP1 (P1, P4), MP2 (P2), MP3 (P3)
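The partitioned variant can be sketched by giving each metapool its own small table, so a lookup only searches objects in the same partition. As before this is a hypothetical illustration: `struct metapool`, `pool_reg_obj`, `pool_get_bounds`, and `pool_check` are made-up names, and in the real system the assignment of pointers to metapools comes from compiler analysis rather than from the programmer.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_CAP 32

struct bounds { uintptr_t lo, hi; };   /* object occupies [lo, hi) */

/* A metapool: tracking data for one partition of the kernel's objects. */
struct metapool {
    struct bounds objs[POOL_CAP];
    size_t n;
};

/* Register an allocation in its metapool. Returns 0, or -1 if full. */
int pool_reg_obj(struct metapool *mp, const void *p, size_t size)
{
    if (mp->n == POOL_CAP)
        return -1;
    mp->objs[mp->n].lo = (uintptr_t)p;
    mp->objs[mp->n].hi = (uintptr_t)p + size;
    mp->n++;
    return 0;
}

/* Look up p, searching only the objects of this one metapool. */
const struct bounds *pool_get_bounds(const struct metapool *mp,
                                     const void *p)
{
    size_t i;
    uintptr_t a = (uintptr_t)p;
    for (i = 0; i < mp->n; i++)
        if (a >= mp->objs[i].lo && a < mp->objs[i].hi)
            return &mp->objs[i];
    return NULL;
}

/* Returns 1 if size bytes at dest lie inside the object that contains
 * base, where base must be registered in metapool mp. */
int pool_check(const struct metapool *mp, const void *base,
               uintptr_t dest, size_t size)
{
    const struct bounds *b = pool_get_bounds(mp, base);
    return b != NULL && dest >= b->lo && dest <= b->hi
        && size <= b->hi - dest;
}
```

Smaller per-partition tables make each lookup cheaper, and a pointer that strays into an object of a different partition fails the lookup outright.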


Type Safe Partitions

• Alias analysis performs type inference

• Type-homogeneous partitions reduce run-time checks:

– No load/store checks

– No indirect call checks

– Harmless dangling pointers

• Type-unsafe partitions require all run-time checks

[Figure: memory divided into a blue partition and a red partition]


SVA Prototype

• Ported Linux to SVA instruction set

– Similar to porting to new hardware architecture

– Compiled using LLVM

• Wrote SVA-OS as run-time library linked into kernel

• Provide safety guarantees to entire kernel except:

– Memory management code

– Architecture-dependent utility library

– Architecture-independent utility library

Linux Modification

Section           Original LOC   SVA-OS   Allocators   Analysis   Total Modified
Arch-indep Core   9,822          41       76           3          120
Net Drivers       399,872        12       0            6          18
Net Protocols     169,832        23       0            29         52
Core FS           18,499         78       0            19         97
Ext3              5,207          0        0            1          1
Total Indep       603,232        154      76           58         288
Arch-dep          29,237         4,777    0            1          4,778

Web Server Bandwidth

• Memory safety overhead less than 59%

• Transfers of larger file sizes show acceptable overhead

[Figure: percent bandwidth reduction relative to native (0% to 70%) vs. file size in KB (1 to 128), for thttpd and apache]


Server Latency Overhead

[Figure: percent overhead relative to native (0% to 140%) vs. file size in KB (1 to 256), for thttpd and apache]

• Many short data transfers suffer high memory safety overhead

• Overhead acceptable for larger file sizes

Acceptable overhead for security-critical servers

* Figure courtesy of John Criswell et al.

Exploits

• Tried 5 memory exploits that work on Linux 2.4.22

• The one uncaught exploit was in code not instrumented with checks

BugTraq ID   Kernel Component      Caught?
11956        Console Driver        Yes
10179        TCP/IP                Yes
11917        TCP/IP                Yes
12911        Bluetooth Protocol    Yes
13589        ELF/Support Library   No

Conclusion

Ways to detect errors in system software:

• With hardware protection

– Drivers in separate domains

• With program analysis

– Insert software checks

• With virtual machines

– Software checks inserted at JIT time

And ways to recover from these errors

Other directions

• Rewrite the OS: Singularity (MSR ‘05-present)

• Production driver framework: Windows Driver Foundation

• Static verification: Windows Static Driver Verifier (EuroSys 06)

References

• Nooks, SOSP 03, OSDI 04

• SafeDrive, OSDI 06

• Secure Virtual Architecture, SOSP 07

Thanks!

Contact me: [email protected]

http://zhoufeng.net
