
Replicating Memory Behavior for Performance Skeletons

By

Aditya Toomula
PC-Doctor Inc., Reno, NV

Jaspal Subhlok
University of Houston, Houston, TX


Resource Selection for Grid Applications

[Figure: an application with components Data, Sim 1, GUI, Model, Pre and Stream must be mapped onto a network. Where is the best performance?]


Motivation

Estimating the performance of an application in a dynamically changing grid environment.
- Estimation based on generic system probes (such as NWS) is expensive and error-prone.

Estimating performance for micro-architectural simulations.
- Executing the full application is prohibitively expensive.


Our Approach

Predict application performance by running a small program that is representative of the actual distributed application.

[Figure: the application (Data, Sim 1, GUI, Model, Pre, Stream) and the network it runs on.]


Performance Skeletons

A performance skeleton is a synthetically generated, short-running program. The skeleton reflects the performance of the application it represents in any execution scenario.
- E.g., the skeleton's execution time is always 1/1000th of the application's execution time.

For this to hold, an application and its skeleton should have similar execution activities:
- Communication activity
- CPU activity
- Memory access pattern
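The fixed-ratio idea can be written out as simple arithmetic. The sketch below is only an illustration: it assumes the ratio is measured once on a reference run and then applied to a skeleton run in a new environment, and the function names are hypothetical rather than taken from the slides.

```python
def scaling_factor(app_seconds, skeleton_seconds):
    """Ratio of application time to skeleton time on a reference run."""
    if skeleton_seconds <= 0:
        raise ValueError("skeleton time must be positive")
    return app_seconds / skeleton_seconds


def predict_app_time(skeleton_seconds, factor):
    """Predict the application time from a skeleton run in a new scenario."""
    return skeleton_seconds * factor


# Reference run: application 2000 s, skeleton 2 s, so the ratio is 1000.
factor = scaling_factor(2000.0, 2.0)
# New scenario: the skeleton takes 3.5 s, so the prediction is 3500 s.
predicted = predict_app_time(3.5, factor)
```

The prediction is only as good as the assumption that the ratio stays constant, which is why the skeleton must mimic the application's communication, CPU and memory activity.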


Memory Skeleton

Given an executable application, construct a short-running skeleton program whose memory access behavior is representative of the application.

An application and its memory skeleton should have similar cache performance for any cache hierarchy.

Solution approach: create a program that recreates the memory accesses in a sequence of representative slices of the executing program.


Challenges in creating a Memory Skeleton

• The memory trace is prohibitively large even for a few minutes of execution.
  - Solution approach: sampling and compression, lossy if necessary.

• Recreating memory accesses from a trace is difficult.
  - The cache is corrupted by management code.
  - Recreation has substantial overhead: several instructions have to be executed to issue a memory access request.
  - Solution approach: avoid cache corruption and allow reordering that minimizes the overhead per access.


Memory Access Behavior of Applications

Applications exhibit two types of locality.

Spatial locality: if one memory location is accessed, then nearby memory locations are also likely to be accessed.

Temporal locality: if a location is accessed once, it is likely to be accessed again soon.

Both locality properties should be preserved in the memory skeleton.


Automatic Skeleton Construction Framework

[Figure: a CREATE SKELETON step transforms the application (Data, Sim 1, GUI, Model, Pre, Stream) into its skeleton.]

- Collect data address trace samples of the application.
- Summarize the trace samples.
- Generate the memory skeleton.


Address Trace Collection

Instrument the application executable with a Valgrind tool.
- Generates the address trace of the application.
- Access to source code is not required.

Issues
- Collecting a full trace incurs an unacceptable storage and time overhead.

Hence, the address trace must be sampled during trace collection itself; collecting full traces of applications is prohibitively expensive.

Address Trace Collection (contd…)

Trace Sampling
- Divide the trace into trace slices, each a set of consecutive memory references.
- The tool can be periodically switched on and off to capture these slices.
- Slices can be collected at random or uniform intervals.

The slice size should be at least one order of magnitude greater than the largest cache expected, in order to capture temporal locality.
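Uniform slice selection can be sketched as follows. This is a minimal illustration on a trace held in a Python list; the real tool switches tracing on and off during execution rather than slicing a stored trace, and the function names and parameters here are hypothetical.

```python
def uniform_slice_starts(trace_length, num_slices, slice_size):
    """Start offsets of num_slices equal-sized slices spread uniformly over the trace."""
    if num_slices <= 0 or slice_size <= 0:
        raise ValueError("num_slices and slice_size must be positive")
    stride = trace_length // num_slices
    if slice_size > stride:
        raise ValueError("slices would overlap or run past the end of the trace")
    return [i * stride for i in range(num_slices)]


def sample_trace(trace, num_slices, slice_size):
    """Return the sampled slices, each a run of consecutive references."""
    starts = uniform_slice_starts(len(trace), num_slices, slice_size)
    return [trace[start:start + slice_size] for start in starts]


# Toy trace of 100 references, sampled as 5 slices of 4 references each.
slices = sample_trace(list(range(100)), num_slices=5, slice_size=4)
```

Random intervals would replace the fixed stride with randomly drawn start offsets; the consecutive-reference property of each slice stays the same.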


Trace Compaction

The recorded trace is still large and expensive to recreate.

Compress the trace using the following two ideas:

- The exact address in a trace is not critical; a nearby address will work. This may affect spatial locality.

- Slight reordering of the address trace does not affect performance. This may affect temporal locality.

This is lossy compression, but its impact on locality can be made negligible.


Trace Compaction (contd…)

Divide the address space into lines roughly the size of a typical cache line, and record only the line number, not the full address.
- The impact on spatial locality should be minimal.

Divide the temporal sequence of line numbers into clusters. Reordering within a cluster is allowed.
- The cluster size should be much smaller than the smallest expected cache size, so that temporal locality is not affected by reordering.
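The two compaction steps can be sketched together. This is a minimal illustration, assuming 64-byte lines and a deliberately tiny cluster size; the per-cluster summary as (line number, frequency) pairs follows the pairs mentioned under skeleton generation, while the function name and the sorted ordering inside a cluster are assumptions.

```python
from collections import Counter

LINE_SIZE = 64     # assumed line size in bytes
CLUSTER_SIZE = 4   # assumed cluster length in references, tiny for illustration


def compact_trace(addresses, line_size=LINE_SIZE, cluster_size=CLUSTER_SIZE):
    """Map addresses to line numbers, then summarize each cluster of
    consecutive references as sorted (line number, frequency) pairs."""
    lines = [address // line_size for address in addresses]
    clusters = []
    for start in range(0, len(lines), cluster_size):
        counts = Counter(lines[start:start + cluster_size])
        # Order inside a cluster is discarded; only the counts are kept.
        clusters.append(sorted(counts.items()))
    return clusters


trace = [0, 8, 64, 70, 128, 0, 4096, 4100]
compacted = compact_trace(trace)
# compacted == [[(0, 2), (1, 2)], [(0, 1), (2, 1), (64, 2)]]
```

Both losses are visible here: addresses 0 and 8 collapse into line 0, and the access order within each group of four references is no longer recorded.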


Memory Skeleton Generation

Create a C program that synthetically generates the sequence of memory references recorded in the previous step.

Challenge: minimizing extraneous address references.
- Any executing program has memory accesses of its own.
- Generate a loop structure for each cluster.
- Reading line number-frequency pairs once leads to a series of actual memory references from the trace without intervening address reads.
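The loop structure can be illustrated as below. The actual skeleton is a C program; this Python sketch only shows the control flow (one read of a pair, then a run of accesses to that line) and does not control real hardware memory accesses. The function name, the 64-byte line size and the read-modify-write used as the "access" are assumptions.

```python
def replay_clusters(clusters, memory, line_size=64):
    """Touch memory once per recorded reference, cluster by cluster.

    clusters: list of clusters, each a list of (line number, frequency) pairs.
    memory:   a bytearray large enough to hold every referenced line.
    Returns the number of references issued.
    """
    issued = 0
    for cluster in clusters:
        for line, frequency in cluster:
            offset = line * line_size        # pair is read once ...
            for _ in range(frequency):       # ... then drives a run of accesses
                memory[offset] = (memory[offset] + 1) & 0xFF
                issued += 1
    return issued


memory = bytearray(4 * 64)
clusters = [[(0, 2), (1, 2)], [(0, 1), (2, 1)]]
count = replay_clusters(clusters, memory)
```

The inner loop is where the overhead is amortized: each pair read from the trace yields `frequency` accesses without further trace reads in between.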


Skeleton Generation (contd…)

Eliminating cache corruption
- Reading trace data from disk impacts the memory simulation.
- Use a second machine: read the data on one machine and send it through sockets to the other machine, where the simulation is done.
- The socket buffer is kept very small.
- This also reduces overhead on the main simulation machine.


Skeleton Generation (contd…)

Allocating memory
- The regions of virtual memory that will actually be used are not known prior to simulation.
- Dynamic block allocation: a block of memory of substantial size is allocated when an address reference is made to a location that is not yet allocated.
- Maintain a sparse index table to access the blocks.
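Dynamic block allocation with a sparse index can be sketched as follows. This is an illustration under stated assumptions: a Python dict stands in for the sparse index table, the 1 MiB block size is arbitrary, and the class and method names are hypothetical.

```python
BLOCK_SIZE = 1 << 20  # assumed block size (1 MiB)


class SparseMemory:
    """Allocate fixed-size blocks lazily, on first reference to an address."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # sparse index table: block index -> bytearray

    def _locate(self, address):
        index, offset = divmod(address, self.block_size)
        block = self.blocks.get(index)
        if block is None:
            # First reference into this region: allocate the whole block.
            block = bytearray(self.block_size)
            self.blocks[index] = block
        return block, offset

    def touch(self, address):
        """Issue one access to the address and return the updated byte."""
        block, offset = self._locate(address)
        block[offset] = (block[offset] + 1) & 0xFF
        return block[offset]


mem = SparseMemory(block_size=4096)
mem.touch(10)
mem.touch(10)
mem.touch(10_000_000)  # far-away address, allocates a second block only
```

Only the regions actually referenced consume memory, so a trace spanning a large virtual address range does not require allocating that whole range up front.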


Experiments and Results

Skeletons were constructed for the Class-W NAS serial benchmarks (and the Class-A IS benchmark).

Experiments were conducted on dual-CPU 1.7 GHz Intel Xeon machines with a 256 KB 8-way set-associative L2 cache and 64-byte lines, running Linux 2.4.7.

Objectives:
- Prediction of the cache miss ratio of the corresponding applications.
- Predictions across different memory hierarchies.

Trace slices were picked uniformly throughout the trace for all the experiments.
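The stated cache geometry implies 256 KB / (8 ways × 64 bytes) = 512 sets. The sketch below models such a cache to show how a miss ratio is computed from an address stream. It is illustrative only: the slides do not say how miss rates were measured, LRU replacement is an assumption, and the class name is hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set-associative cache model with (assumed) LRU replacement."""

    def __init__(self, size_bytes=256 * 1024, ways=8, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Record one access; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        lru = self.sets[line % self.num_sets]
        self.accesses += 1
        if line in lru:
            lru.move_to_end(line)        # mark as most recently used
            return True
        self.misses += 1
        if len(lru) >= self.ways:
            lru.popitem(last=False)      # evict the least recently used line
        lru[line] = None
        return False

    def miss_ratio(self):
        return self.misses / self.accesses if self.accesses else 0.0


cache = SetAssociativeCache()
hits = [cache.access(address) for address in (0, 8, 64)]
```

Running the same model over an application trace and over its skeleton's reference stream, with different `size_bytes` values, is one way to compare miss ratios across memory hierarchies.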


Prediction of Cache Miss Ratios

Comparison of the data cache miss rates of the benchmarks and the corresponding memory skeletons.
- Average error < 5%.
- The IS application is an exception.

- Trace sampling ratio = 10%
- Trace slice size > 10 million references
- Number of slices picked > 10

[Figure: bar chart of application cache miss rate and skeleton cache miss rate. x-axis: NAS benchmark applications (bt, cg, is, lu, mg, sp); y-axis: cache miss rate (%), 0 to 7.]


Impact of trace slice selection

Data cache miss rates for different sets of trace slices in skeletons.

Actual data cache miss rates:
- IS: 3.9%
- BT: 2.76%
- MG: 1.57%

• Traces for the IS, BT and MG benchmarks were divided into 100 uniform slices.

• 10 different versions of the skeletons were generated, each using a different set of 10 uniformly spaced trace slices.
  - MG and BT have similar cache miss rates in all cases.
  - IS shows significant variation in cache miss prediction with different sets of slices. Reason: IS execution goes through different phases with different memory access behavior, unlike CG and BT.

[Figure: cache miss rate (%), 0 to 6, versus trace slice picked (1 to 10), with one series each for is, bt and mg.]
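One way to form the 10 slice sets is shown below. This is an assumed reading of the experiment, not a confirmed description: set k takes slices k, k + 10, k + 20, and so on, so every set is uniformly spaced and no slice is shared between sets. The function name is hypothetical.

```python
def slice_sets(total_slices=100, slices_per_set=10):
    """Partition slice indices into sets of uniformly spaced slices."""
    if total_slices % slices_per_set != 0:
        raise ValueError("total_slices must be a multiple of slices_per_set")
    stride = total_slices // slices_per_set
    return [list(range(offset, total_slices, stride)) for offset in range(stride)]


sets = slice_sets()
# sets[0] holds slices 0, 10, ..., 90; sets[3] starts at slice 3.
```

Under this reading, an application with distinct phases (as reported for IS) can give different miss rates per set, because each set samples the phases at different points.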


Impact of trace slice selection (contd…)

- The greater the number of trace slices in an application trace, the smaller the size of each trace slice.

Data cache miss rates of IS skeletons with different sets of slices and for different numbers of slices.

- Having a large number of slices in the trace captures the multiphase behavior of applications.

[Figure: cache miss rate (%), 0 to 7, versus trace slice picked (1 to 10), with one series each for 50, 100, 150 and 200 slices, and the true cache miss ratio shown for reference.]


Impact of trace slice size

[Figure: cache miss error (%), -30 to 50, versus trace slice size (10^6 references) at 33.5, 16.7, 11.1, 8.4, 6.7, 5.6, 4.2, 3.4, 2.8, 1.7, 1.1, 0.5, 0.2, 0.13 and 0.07.]

Error in cache miss prediction with memory skeletons for different trace slice sizes for MG.

- The cache miss ratio prediction error increases rapidly when the slice sizes are reduced below a certain point.


Prediction across hardware platforms

Cache miss comparison of CG benchmark and its skeleton across different memory hierarchies.

Cache miss ratios were predicted fairly accurately with error < 5% across all machines.

[Figure: application cache miss rate and skeleton cache miss rate (%), 0 to 12, for level 2 cache sizes of 128K, 256K and 512K.]


Conclusion and Discussions

This work presents a methodology to build memory skeletons for predicting application cache miss ratios across hardware platforms.

A step towards building good performance skeletons
- Extends our group's previous work on skeletons to memory characteristics.

Major contribution
- Low-overhead generation of memory accesses from a trace.

Conclusion and Discussions (contd…)

Limitations
- Instruction references
- Space and time overhead
- Timing accuracy
- Integration with communication and CPU events