ReVirt: Enabling Intrusion Analysis through Virtual Machine Logging And Replay Authors: George W. Dunlap Samuel T. King Sukru Cinar Murtaza A. Basrai Peter M. Chen Presentation by: Will Hrudey

Post on 19-Dec-2015

Introduction

ReVirt is an intrusion analysis solution that facilitates post-attack analysis

ReVirt applies VM and fault-tolerance techniques to enable the administrator to replay the long-term, instruction-by-instruction execution of a computer system

ReVirt runs the target operating system (OS) and applications in a VM whose monitor runs as a kernel module in a host OS, allowing:

– Migration of logging from the target OS to the host OS below the VM

– Playback of the target system’s execution before, during, and after an intruder compromises the system

Motivation

The improvement of today’s computer system security is an urgent and difficult problem

The complexity and rapid change in software systems prevent developers from verifying their code to eliminate all vulnerabilities

Administrators have to routinely cope with computer break-ins

– CERT Coordination Center reports a steady increase in the number of incidents handled and vulnerabilities reported over the past 4 years

Goals

Solve two problems with current audit logging:

1. Improve the integrity of the logger because:

– Existing loggers depend on the integrity of the OS

– Attackers can disable, modify or delete system logs

– Kernels are large and complex, so they tend to contain many bugs

Solution: Encapsulate the target system within a VM and place logging below the VM

2. Improve the completeness of the logger because:

– Existing loggers don’t save enough data to replay and analyze attacks, so the administrator still has to guess what happened

– Existing loggers can’t account for non-determinism

Solution: Utilize checkpointing, logging and roll-forward recovery

Virtual Machines

A virtual-machine monitor (VMM) is a software layer that emulates the hardware of a complete computer system

The VMM creates an abstraction called a virtual machine (VM)

The host platform that the VMM runs on can be another OS (the host OS) or the bare hardware

– So the VMM runs in a separate domain from the guest OS and applications

Although the VMM can still be compromised, it makes a better trusted computing base (TCB) than the guest OS due to its narrow interface and small size

Virtual Machines

The VMM interface is similar to the physical hardware whereas the interface provided by a typical OS is much richer

The narrower interface restricts the actions an attacker can take, and the smaller code size makes the VMM easier to verify

VMs can be classified by how similar they are to the host hardware

– On one end, VMs such as IBM VM/370 export an interface that is backwards compatible with the host hardware. OSs and applications intended to run on the host platform can run on these VMMs without change

– On the other end, language-level VMs like the Java VM export an interface completely different from the host hardware. These VMMs can run only OSs and applications written specifically for them

Virtual Machines

[Figure: Different VM configurations: Direct-on-Host (DoH) and OS-on-OS (OoO)]

UMLinux

ReVirt uses UMLinux as the virtual machine

– VMM in UMLinux exports an interface similar but not identical to the host hardware

– VMM custom optimizations in the underlying OS increase speed

Virtual machine in UMLinux runs as a user process on the host

– Guest OS and guest applications run inside this host user process

– Guest OS uses host services (system calls and signals) as the interface to peripheral devices, hence the OS-on-OS architecture

By contrast, the normal structure, in which target applications run directly on the host OS, is the Direct-on-Host architecture

UMLinux

VMM in UMLinux is a loadable kernel module in the host OS

– Module is called before/after each signal and system call to/from the VM process

– Most instructions executed within the VM execute directly on host CPU

Memory accesses are translated by the host’s MMU based on translations that are set up via the host OS’s memory system calls

A host X application displays console output and reads keyboard input

The VMM module maintains a virtual privilege level (VPL)

– Set to kernel when transferring control to the guest kernel

– Set to user when transferring control to a guest application

UMLinux

If the current VPL is kernel, the VMM knows the guest OS made the system call; it checks that this is a call the guest OS should be making, then passes it on to the host OS

If the current VPL is user, the VMM knows the guest application made the system call and it sends a SIGUSR1 to the guest OS to notify it

– SIGUSR1 signal handler in the guest kernel is the equivalent of the system-call trap handler in a normal OS

SIGALRM, SIGIO, and SIGSEGV signals are used to emulate the hardware timer, I/O device interrupts, and memory exceptions

UMLinux emulates the enabling/disabling of interrupts by masking signals

The TCB is comprised of the VMM kernel module and the host OS
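The VPL-based handling of system calls described on these slides can be sketched roughly as follows. This is only an illustrative model, not UMLinux code: the `VMMModule` class, its method names, and the contents of the allow-list are assumptions made for the example.

```python
from enum import Enum


class VPL(Enum):
    """Virtual privilege level tracked by the VMM module."""
    KERNEL = "kernel"
    USER = "user"


# Hypothetical allow-list: the slides only say the VMM checks that the
# call is one the guest OS should be making, not which calls those are.
ALLOWED_GUEST_KERNEL_CALLS = {"read", "write", "mmap", "munmap", "kill"}


class VMMModule:
    """Toy model of the kernel module that intercepts VM system calls."""

    def __init__(self):
        self.vpl = VPL.KERNEL      # the guest kernel runs first
        self.host_calls = []       # calls passed on to the host OS
        self.guest_signals = []    # signals delivered to the guest kernel

    def return_to_user(self):
        # Control is transferred to a guest application.
        self.vpl = VPL.USER

    def on_syscall(self, name):
        if self.vpl is VPL.KERNEL:
            # The guest OS made the call: validate it, then forward it.
            if name not in ALLOWED_GUEST_KERNEL_CALLS:
                return "rejected"
            self.host_calls.append(name)
            return "forwarded"
        # A guest application made the call: notify the guest kernel,
        # whose SIGUSR1 handler acts as the system-call trap handler.
        self.guest_signals.append("SIGUSR1")
        self.vpl = VPL.KERNEL
        return "redirected"


vmm = VMMModule()
print(vmm.on_syscall("write"))   # guest kernel call, allowed
vmm.return_to_user()
print(vmm.on_syscall("open"))    # guest application call
```

The point of the sketch is that a guest application's system call never reaches the host OS directly; only calls issued while the VPL is kernel are candidates for forwarding.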

UMLinux

Attacker strategies:

From above

DoH: Attacker can cause application processes to exploit any/all host OS functionality in dangerous ways

OoO: Attacker can take similar avenues to attack the guest OS; however, the VMM limits the available system calls to < 7% and the guest OS can only access a limited number of host files and devices

From below

DoH: Attacker can send dangerous network packets to the host to compromise lower levels of the protocol stack

OoO: Less of the host OS network stack is exposed to the same dangerous packets

Logging And Replaying

Logging is used to recover state

– Start from a checkpoint of a prior state, then roll forward using the log

Most events are deterministic and needn’t be logged; however, any host system calls that can yield non-deterministic results must be logged

Non-deterministic events are categorized as either time or external input

– Time refers to the point in the execution stream at which an event takes place

– External input is data received from a non-logged entity (keyboard, mouse, etc)

Output to peripherals does not affect the replay process

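Log records are added and saved to disk in a manner similar to the Linux syslogd daemon

To identify where an asynchronous virtual interrupt occurred, the program counter (PC) and the number of branches executed since the last interrupt are logged

During replay, ReVirt prevents new asynchronous virtual interrupts from perturbing the replaying VM process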
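The checkpoint-plus-log idea on this slide can be illustrated with a toy roll-forward model. It is a sketch under stated assumptions: the "guest" is a made-up state machine, the rule that every third step reads external input is hypothetical, and the function names (`roll_forward`, `record_run`, `replay_run`) are invented for the example. Only the non-deterministic inputs are logged; everything else is recomputed.

```python
import copy
import random

MODULUS = 1_000_003


def roll_forward(checkpoint, steps, read_input, log=None):
    """Run `steps` steps from a checkpoint; optionally log external input."""
    state = copy.deepcopy(checkpoint)   # never mutate the checkpoint itself
    for step in range(steps):
        if step % 3 == 0:
            # Non-deterministic event (hypothetical): external input arrives.
            value = read_input(step)
            if log is not None:
                log.append((step, value))   # record when and what
            state["acc"] = (state["acc"] * 31 + value) % MODULUS
        else:
            # Deterministic event: recomputed during replay, never logged.
            state["acc"] = (state["acc"] + step) % MODULUS
    return state


def record_run(checkpoint, steps):
    """Original execution: input comes from an unpredictable source."""
    log = []
    final = roll_forward(checkpoint, steps,
                         lambda step: random.randrange(256), log)
    return final, log


def replay_run(checkpoint, steps, log):
    """Replay: the same steps, with input taken from the log instead."""
    logged = dict(log)
    return roll_forward(checkpoint, steps, lambda step: logged[step])


checkpoint = {"acc": 7}
original, log = record_run(checkpoint, 20)
replayed = replay_run(checkpoint, 20, log)
print(original == replayed)
```

Because the deterministic steps are not logged, the log stays small relative to the amount of execution it lets the administrator reconstruct.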

Logging And Replaying

ReVirt goes through two phases to find the right instruction at which to deliver the original asynchronous virtual interrupt

– 1st phase has branch_retired generate an interrupt after most branches

– 2nd phase is needed to stop at exactly the right instruction
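The two phases can be modelled on a recorded instruction trace as in the sketch below. This is a simplified illustration rather than ReVirt's mechanism: the trace format, the `margin` parameter, and the function name `find_delivery_point` are assumptions, and the second phase is modelled as checking the PC and branch count before each remaining instruction.

```python
def find_delivery_point(trace, target_pc, target_branches, margin=2):
    """Return the trace index at which to deliver the logged interrupt.

    trace: list of (pc, is_branch) pairs, with is_branch being 0 or 1.
    The interrupt is delivered just before the instruction at target_pc
    once exactly target_branches branches have executed.
    """
    branches = 0
    i = 0

    # Phase 1: run without checks until most of the branches have
    # executed, standing in for the branch counter raising an interrupt.
    stop_at = max(target_branches - margin, 0)
    while i < len(trace) and branches < stop_at:
        _, is_branch = trace[i]
        branches += is_branch
        i += 1

    # Phase 2: proceed carefully, comparing both the PC and the branch
    # count, so that execution stops at exactly the right instruction.
    while i < len(trace):
        pc, is_branch = trace[i]
        if pc == target_pc and branches == target_branches:
            return i
        branches += is_branch
        i += 1
    return None


# A three-instruction loop executed five times; 0x18 is the back-branch.
loop_body = [(0x10, 0), (0x14, 0), (0x18, 1)]
trace = loop_body * 5
print(find_delivery_point(trace, target_pc=0x14, target_branches=3))
```

The loop example shows why the PC alone is not enough: the instruction at `0x14` executes five times, and only the branch count distinguishes the iteration in which the interrupt originally arrived.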

Replay can occur on any host with the same processor type as the original host

Most non-deterministic sources generate small amounts of log data

Received network messages can generate massive logs

– Cooperative logging can reduce the amount of logged network data: the receiver doesn’t need to log message data that the sender can recreate via replay

– Requires cooperating computers to trust each other to regenerate the same message data during replay

Logging And Replaying

Administrator tools used to understand the attack:

– Tools that run inside the guest VM to probe the VM state (edit files, list current processes, etc.)

– Tools that run outside the guest VM to analyze the state of a VM (X server, debuggers, disk analyzer, etc.)

Experiments: Testbed

VM is configured to use 192 MB of physical memory

Virtual hard disk is stored on a raw disk partition

Experiments: Objective

Measure Virtualization Overhead:

– Application runtimes within UMLinux vs. runtimes on the host OS

– Evaluates 5 workloads with a warm cache averaged over 3 runs

Validate Correctness:

– Micro-benchmarks run in the VM to verify virtual interrupts are being replayed at the same point at which they occurred during logging

– Macro-benchmark verifies ReVirt faithfully plays back input from external systems

Measure Logging And Replaying Overhead:

– Quantify the time and space overhead of logging

– Checkpoint overhead is not included

Attack Analysis:

– Exploit the ptrace race condition and verify replay

Experiments: Virtualization Overhead

Experiments: Logging / Replaying

Future Work

Make checkpointing faster and more convenient

– Accelerate disk copy done during checkpointing

– Enable the VMM to checkpoint a running VM

Reduce host OS size used to support UMLinux

Build higher level analysis tools to leverage ability to replay detailed, long-term executions

Move the X server into another VM

Use ReVirt as a building block for new security services

Cooperative logging in ReVirt?

Conclusion

ReVirt adopts VM and fault-tolerance techniques to enable replay of long-term, instruction-by-instruction execution to facilitate attack analysis

Target OS and applications run within the VM

ReVirt can replay execution before, during and after an intrusion

ReVirt logs all non-deterministic events so it can replay non-deterministic attacks and executions

ReVirt provides arbitrarily detailed observations about what transpired

ReVirt is implemented as a set of modifications to the host OS

ReVirt adds “reasonable?” time and space overhead

Observations

Total overhead for kernel-intensive workloads: up to 66%

– Is this overhead justifiable?

– Should have reported total overhead in tables for increased clarity

Checkpoint time and space overhead not characterized

Host OS can still be compromised

– No quantitative data to support the claim that a narrower interface is more secure

Tests seem to focus on overhead rather than the ability to enable analysis

There are no specific tools to analyze potentially large ReVirt logs

Log growth could be much larger since the SPECWeb99 benchmark was based on only 15 simultaneous connections

Replay must start from a powered-off VM state; is this practical?

How portable is ReVirt to other guest/host OSs?

“No perceptible time overhead” is a weak measurement. Better metric?

No multiprocessor support yet (published in late 2002)

Discussion

1. The authors state that they “believe that even an overhead of 58% is not prohibitive for sites that value security.” (p11) I believe that an overhead of 58% is pretty big, especially for busy systems. How much of a concern is this really?

2. They show the average space per day that logging takes. But does this include the daily snapshot as well? If you're running a lot of guest OSs concurrently, couldn't this become a bottleneck (or does ReVirt only run one guest OS at a time)? They give results for both virtualization overhead and logging overhead, but not both at the same time (which is the real-world scenario). Is there any indication of how much the total overhead is?

Discussion

3. The authors talk about checkpointing in a few areas of the paper. They claim it will be a rare event and so do not test the time and space overhead to run one. They then say that their future work is to “make checkpointing faster and more convenient.”

I wonder how slow and inconvenient checkpointing is at this point for them to avoid testing it (or releasing the test results)? I think this should have been included in the paper as, even though checkpointing may not happen often, it is still part of the system overhead.

Discussion

4. If ReVirt detects the non-deterministic events that occurred during the attack, what can it do to prevent further attacks? Is it possible to isolate them?

5. Is UMLinux the only guest OS that can be used in ReVirt? Have any other OSs been ported to ReVirt? What about the further development of ReVirt or some system like it?

Discussion

6. The authors introduce ReVirt to address two shortcomings of current systems: integrity and completeness. They state that the "current system loggers lack integrity because they assume the operating system kernel is trustworthy." However, they also indicate that "even the VMM may be subject to security breaches," but that the VMM is more trustworthy than the operating system because the interface is narrower.

Does a narrower interface really make that much of a difference in securing the system? Can't attackers still do a lot of damage?

Discussion

7. They talk about how this approach is useful in analyzing an attack, and in section 5.4 give an example of this. But to do so they introduced a vulnerability and then used the logging method to analyze an attack that they themselves initiated. While the example may have some validity, it would have been nice to see something that they didn't set up themselves.

Discussion

8. Cooperative logging is cited as being capable of significantly reducing storage, as no LAN data needs to be logged (it can just be regenerated); however, you lose the ability to replay independent machines without replaying the whole network (or so it seems). Are there any schemes that let you do both?

9. They use a modified version of Linux 2.4.18 as the host OS. I’m wondering how modified it is? They claim that the host OS is safe from attack, but because it is still just an ordinary OS, I’m not sure about this. What do you think?

Discussion

10. ReVirt logs all input from external devices. Could these logs be used to pick up passwords from keyboard input or other security input (e.g. fingerprint readers and files from memory sticks)?

11. "ReVirt log all input from external entities. These include most virtual devices: keyboard, mouse, network interface card, ..." When we want to analyze the intrusion of a highly-used web server, logging all input from the network device seems quite expensive (I believe it would be much more than 1.4 GB/day as shown in the experiment). Any solution for that?

Discussion

12. So does it make more sense to add this VM layer just so we can track, or is it just easier? (i.e. what are the arguments for not having a VM layer?)

13. When they used ReVirt to analyze an attack, they only tested it with one attack. I think a broader range of attacks should have been tested to get an accurate account of what ReVirt can do. What do you think about this?

14. What kind of analysis tools do the authors suggest or provide? They were able to find an error, but only when they themselves knew exactly what they were looking for.

Discussion

15. In section 4.4, the paper mentions alternative architectures for logging and replay. Basically, they compare the OS-on-OS structure with the direct-on-host structure. How about a direct-on-VMM structure? Does removing the host OS improve the performance and stability of ReVirt?

16. In section 6, the paper compares hypervisors with ReVirt and argues that they target different goals. However, since hypervisors already have similar logging functionality, why not design ReVirt as a plugin (i.e. a special VM) for some hypervisors?

Discussion

17. Is there some other way to improve security that does not involve loading the VMM as a kernel module?

18. The guest doesn't run X itself, but rather connects to a remote X server (say on the host). Doesn't this introduce a hook that a malicious user could use to gain access to (or at least destabilize) the host?

Discussion

19. Why does ReVirt have only a single disk checkpoint, taken while the virtual machine is powered off? Why did they not think to add other checkpoints? Why did they "envision checkpointing being a rare event"? Is this because they don't see their system being attacked more frequently than that?