

Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment

The prominent role of the memory hierarchy as one of the major bottlenecks in achieving good program performance has motivated the search for ways of capturing the memory performance of an application/machine pair that are practical in terms of time and space, yet detailed enough to yield useful and relevant information.

The strategy that we endorse periodically samples events during program execution, producing an event trace that is both manageable and informative. Additionally, we developed a fast and flexible performance evaluation framework with which to analyze and understand the performance data contained within the sampled event traces. We have shown the potential of our performance evaluation methodology by using it to analyze a disparate set of performance issues for large, complex applications running on a multiprocessor system. For example, we have applied our methodology to characterize performance issues such as memory access performance, process migration, compulsory and conflict misses, and false sharing. To date, we have studied the memory subsystem performance of several complex applications, including the TPC-C and SPECsfs benchmarks, executing on different configurations of the IBM eServer pSeries 690.

Additionally, we have begun to investigate the effectiveness of our performance evaluation framework when studying memory subsystem performance in a virtualized environment. Virtualization allows multiple execution environments to time-share the same physical hardware in an effort to increase machine utilization. However, there is an inherent performance overhead associated with sharing a fixed set of hardware resources. The goal of our work is to identify and analyze the performance overhead associated with virtualization using our performance evaluation framework. To date, we have studied the memory subsystem performance of TRADE3, an on-line stock brokerage application, executing on different configurations of the IBM eServer p5 570, a commercial server designed to support virtualization.

Department of Computer Science

Diana Villa, Ph.D. Candidate

Mitesh Meswani, Ph.D. Candidate

Dr. Patricia Teller, Professor

Bret Olszewski

Mala Anand

Carole Gottlieb

Austin, TX

1 Data Collection

Environment
- IBM eServer p5 570 (p570) architecture
- 1.65 GHz POWER5 processor
- 4-processor configuration

Workload
- TRADE3: on-line stock brokerage application
- Three-tier configuration: Websphere, DB2, application code

Data
- Collected via event-based sampling (record periodic occurrence of monitored event)
- Organized as sampled event traces (one per CPU)

Event Record

  PID     TID     Timestamp    Effective Instruction Address  Effective Data Address
  372872  184469  0.328104637  000000000000A8C4               0000000000218880
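As an illustration of the event record layout above, the following minimal sketch parses one trace line into typed fields. It assumes the fields are whitespace-separated and that both addresses are hexadecimal; the names `EventRecord` and `parse_record` are hypothetical and are not part of the framework described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """One sampled event, mirroring the five columns of the event record."""
    pid: int
    tid: int
    timestamp: float   # kept as printed in the trace; unit not assumed here
    instr_addr: int    # effective instruction address
    data_addr: int     # effective data address


def parse_record(line: str) -> EventRecord:
    """Parse one whitespace-separated trace line (assumed text format)."""
    pid, tid, timestamp, instr_addr, data_addr = line.split()
    return EventRecord(
        pid=int(pid),
        tid=int(tid),
        timestamp=float(timestamp),
        instr_addr=int(instr_addr, 16),  # addresses are assumed to be hex
        data_addr=int(data_addr, 16),
    )


# The sample record shown in the event record table.
record = parse_record(
    "372872 184469 0.328104637 000000000000A8C4 0000000000218880"
)
print(record.pid, hex(record.instr_addr), hex(record.data_addr))
```

Keeping the addresses as integers makes later grouping by cache line, page, or segment a matter of simple shifts.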

2 Events Profiled

- L2-cache data load misses: require the CPU to access off-chip memory to be resolved
- Classified according to the level at which they are resolved and the state of the requested block

[Figure: 4-processor configuration of the p570. Two modules, DCM 0 and DCM 1, each containing two processors (P P) with L2, L3, and MEM.]

Table: Load Latencies of 4-processor Configuration
  Columns: L2-Cache Access Resolution Site, Load Latency
  Resolution sites: L2 cache, L2.75 cache (different DCM), L3 cache, L3.75 cache (different DCM), LMEM, and RMEM (LMEM, different DCM)
  Load latencies: 14, 91, 121, 205, 281, and 307 cycles

3 Performance Framework

- MySQL databases catalog and store sampled event traces
- Java tools interface with the databases to load sampled event traces and run queries
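The load-then-query workflow can be sketched as follows. This is not the framework's code: it uses Python's built-in `sqlite3` module as a stand-in for MySQL and the Java tools, and the table name `events`, its column names, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical sampled events: (cpu, pid, tid, timestamp, instr_addr, data_addr).
SAMPLES = [
    (0, 372872, 184469, 0.328104637, 0xA8C4, 0x218880),
    (0, 372872, 184469, 0.328110000, 0xA8C8, 0x218900),
    (1, 400001, 200001, 0.328120000, 0xB000, 0x218880),
]


def load_trace(conn, rows):
    """Create the (assumed) events table and bulk-insert the trace rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "cpu INTEGER, pid INTEGER, tid INTEGER, "
        "ts REAL, instr_addr INTEGER, data_addr INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def loads_per_pid(conn):
    """Example query: number of sampled data loads attributed to each PID."""
    cur = conn.execute(
        "SELECT pid, COUNT(*) FROM events GROUP BY pid ORDER BY pid"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
load_trace(conn, SAMPLES)
print(loads_per_pid(conn))
```

Storing the traces in a relational database lets each new performance question be expressed as a query rather than as a new trace-processing program.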

[Figure: framework diagram. Components shown: Data Collection Environment (TRADE3, p570), Sampled Event Traces (PID, TID, Timestamp, InstrAddr, DataAddr), Load DB Java Tool, Database, Report Generator Java Tool, Reports, Graphs.]

Sample report rows:

  5  BufferPool     56893  29384
  6  Data,BSS,Heap   8799   4855
  1  Kernel         23485   9840

[Graph: Distribution of L3 Data Load Hits. Axes: address region (Kernel; Text; Data,BSS,Heap; BufferPool; SharedData; Stack; U-BlockandKernelStack; KERN_HEAP) and fraction of data loads (0 to 0.5). Series: Unique cache line, Hit %.]

[Graph: Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment. Axes: Page [0-65536] and hit/cache line count (0 to 400). Series: Total loads, Unique cache line.]
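A per-page report like the buffer pool graph could be derived from the sampled data addresses roughly as sketched below. The constants are assumptions, not values stated in this poster: 4 KB pages and 256 MB segments (which together give 65536 pages per segment, consistent with the page axis) and 128-byte cache lines. The function name and sample addresses are hypothetical.

```python
from collections import defaultdict

PAGE_SHIFT = 12                 # assumed 4 KB pages
LINE_SHIFT = 7                  # assumed 128-byte cache lines
SEGMENT_MASK = (1 << 28) - 1    # assumed 256 MB segments


def page_distribution(data_addrs):
    """Return {page index: (total loads, unique cache lines)}.

    All addresses are expected to fall within a single segment; the
    caller is responsible for filtering the trace accordingly.
    """
    totals = defaultdict(int)
    lines = defaultdict(set)
    for addr in data_addrs:
        page = (addr & SEGMENT_MASK) >> PAGE_SHIFT  # page index in segment
        totals[page] += 1
        lines[page].add(addr >> LINE_SHIFT)         # cache line identifier
    return {page: (totals[page], len(lines[page])) for page in sorted(totals)}


# Hypothetical data addresses within one segment.
addrs = [0x70001000, 0x70001008, 0x70001080, 0x70002000]
distribution = page_distribution(addrs)
print(distribution)
```

Comparing the two numbers per page separates pages that are touched often on a few lines from pages whose loads are spread over many distinct lines.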

4 Virtualization

- Virtualize resources to facilitate time-sharing of the hardware by different execution environments
- Emergence of virtualization technology in new environments (e.g., newer architectures, open source)
- POWER Hypervisor facilitates resource sharing and supports as many as 254 active partitions

5 Data Analysis and Results

- Performance overhead associated with virtualization is due to sharing a fixed set of hardware resources
- Goal: observe differences in data-load behavior that could represent the performance overhead
- Compared executions of TRADE3 in non-virtualized (1P) and virtualized (5P) environments
- Observed an increased locality of reference for 5P data loads in memory
  - Indicates a possible increase in capacity/conflict misses in the 5P case due to contention for hardware resources

[Figure: virtualized configuration. Partitions APP 1/OS 1, APP 2/OS 2, APP 3/OS 3, APP 4/OS 4, ..., APP N/OS N run on the POWER Hypervisor, above DCM 0 and DCM 1, each with two processors (P P), L2, L3, and MEM.]

[Graph: TRADE3 - Websphere Group MEM Data Load Hits by Address Region. Axes: address region (Kernel, WorkingStorage, Data, SharedLibraryCode, SharedLibraryData, Other) and fraction of data loads (0 to 1). Series: 1P DLH, 1P UCL, 5P DLH, 5P UCL.]

[Graph: TRADE3 - Websphere Group MEM Data Load Hits by Segment for Data Region. Axes: segment (000000003 to 000000007) and fraction of data loads (0 to 1). Series: 1P DLH, 1P UCL, 5P DLH, 5P UCL.]
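The 1P versus 5P comparison by segment could be computed along the following lines. This sketch reads DLH as data load hits and UCL as unique cache lines, based on the graph titles and legends, and assumes 256 MB segments and 128-byte cache lines. The function name and both sets of addresses are hypothetical, not measured data.

```python
from collections import Counter

SEGMENT_SHIFT = 28   # assumed 256 MB segments
LINE_SHIFT = 7       # assumed 128-byte cache lines


def segment_fractions(data_addrs):
    """Return {segment: (fraction of load hits, fraction of unique lines)}."""
    hits = Counter(addr >> SEGMENT_SHIFT for addr in data_addrs)
    unique_lines = {addr >> LINE_SHIFT for addr in data_addrs}
    # Map each unique cache line back to the segment that contains it.
    lines = Counter(
        line >> (SEGMENT_SHIFT - LINE_SHIFT) for line in unique_lines
    )
    total_hits = sum(hits.values())
    total_lines = sum(lines.values())
    return {
        f"{seg:09X}": (hits[seg] / total_hits, lines[seg] / total_lines)
        for seg in sorted(hits)
    }


# Hypothetical sampled data addresses for the two configurations.
ONE_P = [0x30000000, 0x30000080, 0x40000000, 0x50000000]
FIVE_P = [0x30000000, 0x30000000, 0x30000000, 0x40000000]

one_p = segment_fractions(ONE_P)
five_p = segment_fractions(FIVE_P)
print(one_p)
print(five_p)
```

In this made-up example, segment 000000003 receives a larger share of the loads in the 5P case without a larger share of the unique cache lines, which is the kind of pattern that would indicate increased locality of reference.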

6 Publications

2005

- Villa, D., Meswani, M., Teller, P.J., and Olszewski, B., "Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment", to appear in the Proceedings of the 1st International Workshop on Operating System Interference in High Performance Applications, September 2005, St. Louis, MO.
- Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 6th Annual Austin Center for Advanced Studies (CAS) Conference, February 2005, Austin, TX.

2004

- Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Los Alamos Computer Science Institute Symposium (LACSI), October 2004, Santa Fe, NM.
- Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 12th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, Volendam, The Netherlands.
- Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "A Framework for Profiling Multiprocessor Memory Performance", Proceedings of the 10th International Conference on Parallel and Distributed Systems (ICPADS), July 2004, Long Beach, CA.
- Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Austin Center for Advanced Studies (CAS) Conference, February 2004, Austin, TX.

2003

- Villa, D. (2003). Using Sampled Performance Monitor Event Traces to Characterize Application Behavior. Unpublished master's thesis, The University of Texas at El Paso, El Paso, TX.
- Morgan, T., Villa, D., Teller, P.J., Olszewski, B., and Acosta, J., "L2 Miss Profiling on the p690 for a Large-scale Database Application", Proceedings of the 4th Annual Austin Center for Advanced Studies (CAS) Conference, February 2003, Austin, TX.