Lightweight Monitoring of the Progress of Remotely Executing Computations


Post on 22-Jan-2016


  • Lightweight Monitoring of the Progress of Remotely Executing Computations

    Shuo Yang, Ali R. Butt, Y. Charlie Hu, Samuel P. Midkiff

    Purdue University

  • Harvesting Unused Resources

    - Typical workloads are bursty:
      - Periods of little or no processing
      - Periods of insufficient CPU resources
    - Idle cycles are not usable in the future
    - Exploiting the value of wasted idle resources:
      - Achieves more available processing capability for free or at low cost
      - Smooths out the workload

  • The Need for Remote Monitoring

    - Centralized cycle sharing: SETI@Home, Genome@Home, IBM (with United Devices), Condor, Microsoft (with GridIron), etc.
    - P2P-based cycle sharing (Butt et al. [VM04]):
      - Individual nodes can utilize the system → more incentive
      - Nodes can be across administrative domains → more available resources
    - Remote execution motivates remote monitoring:
      - Unreliable resources
      - Untrusted resources

  • Review of GridCop [Yang et al. PPoPP05]

    [Diagram: the submitted job (H-code), with a reporting module, runs in a sandboxed JVM on the host machine; it sends progress and partial-computation messages to a processing module (S-code) running in the submitter's JVM]

  • Our New Contribution: Key Differences from GridCop

    - Uses probabilistic code instrumentation
    - Prevents replay attacks (like GridCop)
    - Requires no recomputation → reduces network traffic and submitter-machine overhead
    - Ties the progress information closely to the program structure:
      - Makes spoofing more difficult
      - PC values reflect the internal nature of the program's binary code

  • Outline

    - Overview
    - Design of the Lightweight Monitoring Mechanism
    - Experimental Results
    - Related Research and Conclusions

  • System Overview: Code Generation

    [Diagram: the code generation system transforms the original code into host-code and submitter-code]

    - Host-code, executed on the host: emits progress information (beacons) during the computation
    - Submitter-code, executed on the submitter: processes the beacons

  • System Overview

    [Diagram: the submitted job (H-code), with its reporting module, runs on the host machine and sends beacons to the beacon-processing module (S-code) on the submitter]

  • Basic Idea of FSA Tracking

    - Beacons are placed at significant execution points along the CFG
    - Beacons can be viewed as states in an FSA
      - They can be placed at any site satisfying the compiler's instrumentation criteria, e.g. MPI call sites in this paper
    - The host emits beacon messages at significant execution points → an FSA emitting transition symbols
    - The submitter processes the beacon messages → a mirror FSA recognizing legal transitions

  • An FSA Example

    main() {
      mpi_irecv();    // S1
      if (predicate) {
        mpi_send();   // S2
      }
      mpi_wait();     // S3
    }

    [FSA diagram: states S1, S2, S3; transitions S1 → S2 and S2 → S3, plus S1 → S3 when the branch is not taken]

  • Binary file Location Beacon (BLB)

    - BLB values are the virtual addresses of instructions in the virtual memory of a process → states in the FSA

    [Diagram: process address space with stack, heap, bss, initialized data, and code segment; in the code segment:
      804a641: call mpi_irecv
      804a679: call mpi_send
      804a69b: call mpi_wait]

  • PC Values: Labels Driving the Transitions in the FSA

    main() {
      pc = getPC();
      mpi_irecv();         // 0x804a641
      deposit_beacon(pc);
      if (predicate) {
        pc = getPC();
        mpi_send();        // 0x804a679
        deposit_beacon(pc);
      }
      pc = getPC();
      mpi_wait();          // 0x804a69b
      deposit_beacon(pc);
    }

    [FSA diagram: states @804a641, @804a679, @804a69b, with transitions labeled by the corresponding PC values]

    - The compiler inserts a getPC() call in front of each BLB
    - getPC() returns the address of the next instruction

  • Tracking the Progress of an MPI Program

    main() {
      pc = getPC();
      mpi_irecv();         // 0x804a641
      deposit_beacon(pc);
      if (predicate) {
        pc = getPC();
        mpi_send();        // 0x804a679
        deposit_beacon(pc);
      }
      pc = getPC();
      mpi_wait();          // 0x804a69b
      deposit_beacon(pc);
    }

    [FSA diagram: states @804a641, @804a679, @804a69b, with transitions labeled by the corresponding PC values]

  • Attacks on the FSA Mechanism

    - Susceptible to a replay attack:
      - Record the stream of beacons from a previous run
      - Replay the stream in a future run (cheating to gain undeserved compensation)
    - Reverse engineering the binary executable to understand the control flow graph:
      - Expensive → NP-hard in the worst case ([Wang, PhD thesis, University of Virginia])

  • Probabilistic BLB

    - Each MPI function call site is a BLB candidate, but not necessarily a BLB site
    - A candidate is used as a BLB site with probability PB in (0, 1)
    - Effect: an individual MPI function call site may be a BLB in the FSA in one code generation, but not in the next

  • Probabilistic BLBs Guard Against Attacks

    - The same job can have a different FSA each time it is submitted to the host
      - This leads to a different legal beacon-value stream
    - Defeats the replay attack by making it detectable
    - Reverse engineering by binary analysis must be repeated by a cheating host on every run
      - Break once, spoof only once → too expensive!

  • One FSA with Probabilistic BLBs

    main() {
      pc = getPC();
      mpi_irecv();         // 0x804a641
      deposit_beacon(pc);
      if (predicate) {
        pc = getPC();
        mpi_send();        // 0x804a679
        deposit_beacon(pc);
      }
      pc = getPC();
      mpi_wait();          // 0x804a69b
      deposit_beacon(pc);
    }

    [FSA diagram: states @804a641, @804a679, @804a69b]

  • Another FSA with Probabilistic BLBs

    main() {
      pc = getPC();
      mpi_irecv();         // 0x804a641
      deposit_beacon(pc);
      if (predicate) {
        mpi_send();        // 0x804a679  (not instrumented this time)
      }
      pc = getPC();
      mpi_wait();          // 0x804a69b
      deposit_beacon(pc);
    }

    [FSA diagram: states @804a641 and @804a69b only]

  • Outline

    - Overview
    - Design of the Lightweight Monitoring Mechanism
    - Experimental Results
    - Related Research and Conclusions

  • Experimental Setup

    - Submitter machine @ UIUC (thanks to Josep Torrellas):
      - Intel 3 GHz Xeon, 512 KB cache, 1 GB main memory
      - Running a Linux 2.4.20 kernel
    - Host machines @ Purdue:
      - A cluster of 8 Pentium IV machines (each node with 512 KB cache and 512 MB main memory), interconnected by Fast Ethernet
      - Running FreeBSD 4.7 and MPICH 1.2.5
    - Network access:
      - Both sites connected to their campus networks via Ethernet
      - UIUC ↔ Purdue: representing a typical scenario of cycle sharing across a WAN

  • Benchmarks & Evaluation Metrics

    - NAS Parallel Benchmarks (NPB) 3.2: a benchmark suite for evaluating the performance of parallel computational resources
    - Run-time computation overhead
    - Network traffic overhead (network resources are not free)
    - Beacon distribution over time (the capability to track progress incrementally)

  • Host-Side Computation Overhead for Different Numbers of Nodes

    - Overhead = (Tmonitoring - Toriginal) / Toriginal * 100%
    - A lower bar is better
    - Does not increase monotonically with the number of processes

    [Chart data:]

          2 nodes   4 nodes   8 nodes
    EP    1.20%     1.23%     1.24%
    IS    1.37%     1.64%     1.65%
    MG    1.50%     1.68%     1.63%
    CG    1.74%     2.05%     1.90%

  • Host-Side Computation Overhead under Different Input Sizes

    - Overhead = (Tmonitoring - Toriginal) / Toriginal * 100%
    - A lower bar is better
    - Lower overhead for larger problem sizes

    [Chart data: different input sizes on 8 nodes]

          size B   size C
    EP    1.24%    1.18%
    IS    1.65%    1.46%
    MG    1.63%    0.75%
    CG    1.90%    1.16%

  • Submitter-Side Computation Cost

    - Overhead = time(submitter code) / execution time
    - An imperfect metric: the number depends on the submitter's hardware, the submitter's workload, the host's speed, etc.

  • Network Traffic Incurred by Monitoring

    - Bytes sent over the network between the host and submitter machines, divided by the total execution time
    - Low bandwidth usage

  • Beacon Distribution over Time

    - A uniform distribution enables incremental tracking

  • Outline

    - Overview
    - Design of the Lightweight Monitoring Mechanism
    - Experimental Results
    - Related Research and Conclusions

  • Related Research

    - L. F. Sarmenta [CCGrid01], W. Du et al. [ICDCS04]:
      - A host performs the same computation on different inputs
      - Needs a central manager
    - Yang et al. [PPoPP05]:
      - Partially duplicates the computation
      - Incurs more network traffic, associated with the recomputation
    - Hofmeyr et al. [J. of Computer Security 98], Chen and Wagner [CCS02]:
      - Use system call sequences to detect intrusions
      - Approaches for achieving host security

  • Conclusions

    - Lightweight monitoring over a WAN/Internet is possible
    - No changes to the host-side system are required
    - Instrumentation can be performed automatically

  • Host-Side Overhead Details (Slide 22)

    - Overhead = (Tmonitoring - Toriginal) / Toriginal
    - Does not increase monotonically with an increase in the number of processes (Nprocess)
    - When Nprocess increases:
      - The denominator, Toriginal, decreases
      - The numerator, Tmonitoring - Toriginal, also decreases (the number of MPI calls decreases, decreasing the overhead of BLB message generation)
      - Synchronization: there is always one extra thread per process, no matter how many processes are running

  • Host-Side Overhead Details (Slide 23)

    - Overhead = (Tmonitoring - Toriginal) / Toriginal
    - Results in lower overhead for larger problem sizes
    - When the problem size increases:
      - The denominator (Toriginal) increases
      - The numerator (Tmonitoring - Toriginal) stays similar, since the number of MPI calls is similar