Software Performance Monitoring

Daniele Francesco Kruse

July 2010

Summary

1. Performance Monitoring
2. Performance Counters
3. Core and Nehalem PMUs - Overview
4. Nehalem: Overview of the architecture
5. μops flow in Nehalem pipeline
6. Perfmon2
7. Cycle Accounting Analysis
8. The 4-way Performance Monitoring
9. PfmCodeAnalyser: a new tool for fast monitoring
10. David Levinthal: the expert
11. Conclusions

Performance Monitoring

DEF: The action of collecting information related to how an application or system performs

HOW: Obtain micro-architectural level information from hardware performance counters

WHY: To identify bottlenecks, and possibly remove them, in order to improve application performance

Performance Counters

• All recent processor architectures include a processor–specific PMU

• The Performance Monitoring Unit contains several performance counters

• Performance counters can count micro-architectural events from many hardware sources (CPU pipeline, caches, bus, etc.)


Core and Nehalem PMUs - Overview

Intel Core Microarchitecture PMU

• 3 fixed counters (INSTRUCTIONS_RETIRED, UNHALTED_CORE_CYCLES, UNHALTED_REFERENCE_CYCLES)

• 2 programmable counters

Intel Nehalem Microarchitecture PMU

• 3 fixed core-counters (INSTRUCTIONS_RETIRED, UNHALTED_CORE_CYCLES, UNHALTED_REFERENCE_CYCLES)

• 4 programmable core-counters

• 1 fixed uncore-counter (UNCORE_CLOCK_CYCLES)

• 8 programmable uncore-counters

Nehalem: Overview of the architecture

[Diagram: four cores (Core 0 to Core 3) on top of the shared components]

Per core:

• L1D Cache: writeback

• L1I Cache: writeback

• L2 Cache: unified, writeback, not-inclusive

• TLBs: DTLB0 & ITLB (1st Level), STLB (unified 2nd Level)

Shared:

• Level 3 Cache: shared, writeback (lazy write) and inclusive (contains L1 & L2 of each core)

• Integrated Memory Controller: link to local memory (DDR3)

• Quick Path Interconnect: link to I/O hub (& to other processors, if present)

μops flow in Nehalem pipeline

• We are mainly interested in UOPS_EXECUTED (dispatched) and UOPS_RETIRED (the useful ones).

• Mispredicted UOPS_ISSUED may be eliminated before being executed.

[Pipeline diagram. Blocks: Instruction Fetch & Branch Prediction Unit, Decoder, Resource Allocator, Reservation Station, Execution Units, Re-Order Buffer, Retirement & Writeback. Events marked on the flow: UOPS_ISSUED, UOPS_EXECUTED, UOPS_RETIRED]

Perfmon2

• A generic API to access the PMU (libpfm)

• Developed by Stéphane Eranian

• Portable across all new processor micro-architectures

• Supports system-wide and per-thread monitoring

• Supports counting and sampling

[Layer diagram]

• User space: Pfmon, other libpfm-based apps, libpfm

• Linux Kernel: Generic Perfmon, Architectural Perfmon

• CPU Hardware: PMU

Cycle Accounting Analysis

[Tree diagram]

Total Cycles (Application total execution time)

• Issuing μops

    • Retiring μops (useful work)

    • Not retiring μops (useless work)

• Not Issuing μops

    • Stalled (no work): Store-Fwd, L2 miss, L2 hit, LCP, L1 TLB miss
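The top levels of the tree amount to simple bookkeeping: every cycle either issues μops or is stalled, and issuing cycles split into useful and useless work. A minimal sketch of that arithmetic follows. The `CycleBreakdown` type and its field names are hypothetical, and the sketch assumes the three branch values have already been obtained from the counters; it does not say which events produce them.

```cpp
// Hypothetical container for the top levels of the cycle accounting tree.
// The three fields are assumed inputs, already derived from counter data.
struct CycleBreakdown {
    double retiring;      // cycles issuing uops that retire (useful work)
    double not_retiring;  // cycles issuing uops that never retire (useless work)
    double stalled;       // cycles not issuing uops (no work)

    // Cycles in which uops are issued, whether or not they retire.
    double issuing() const { return retiring + not_retiring; }

    // Total cycles: every cycle is either issuing or stalled.
    double total() const { return issuing() + stalled; }

    // Fraction of the total taken by one branch (0 when the total is 0).
    double share(double part) const {
        return total() > 0.0 ? part / total() : 0.0;
    }
};
```

For example, with made-up values `CycleBreakdown b{600.0, 100.0, 300.0}`, `b.total()` is 1000 and `b.share(b.stalled)` is 0.3, meaning 30% of the cycles did no work.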

The 4-way Performance Monitoring

1. Overall Analysis

2. Symbol Level Analysis

3. Module Level Analysis

4. Modular Symbol Level Analysis

[Diagram: the four analyses arranged along two axes, Overall (pfmon) vs. Modular and Counting vs. Sampling]

PfmCodeAnalyser: a new tool for fast monitoring

• It is unreasonable (and useless) to run a complete analysis for every change in the code

• Often we are interested in only a small part of the code and in a single event

• Solution: a fast, precise and light “singleton” class called PfmCodeAnalyser

• How to use it:

#include <PfmCodeAnalyser.h>

// Select the event to count and start monitoring
PfmCodeAnalyser::Instance("INSTRUCTIONS_RETIRED").start();

// code to monitor

// Stop monitoring the region
PfmCodeAnalyser::Instance().stop();
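The slides do not show how PfmCodeAnalyser is implemented. To illustrate only the start/stop singleton pattern it exposes, here is a hypothetical stand-in called `RegionTimer`. It is not the real class: it accumulates elapsed wall-clock nanoseconds with `std::chrono` instead of reading a PMU event, and all of its member names are assumptions made for this sketch.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in, NOT the real PfmCodeAnalyser: it measures elapsed
// wall-clock time rather than a hardware performance counter.
class RegionTimer {
public:
    // Single shared instance, created on first use.
    static RegionTimer& Instance() {
        static RegionTimer instance;
        return instance;
    }

    // Mark the beginning of a monitored region.
    void start() {
        begin_ = Clock::now();
        running_ = true;
    }

    // Close the region and add its duration to the running total.
    void stop() {
        if (!running_) return;  // ignore stop() without a matching start()
        const auto elapsed = Clock::now() - begin_;
        total_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        ++count_;
        running_ = false;
    }

    std::uint64_t count() const { return count_; }   // number of regions closed
    std::int64_t total_ns() const { return total_ns_; }
    double average_ns() const {
        return count_ ? static_cast<double>(total_ns_) / count_ : 0.0;
    }

private:
    using Clock = std::chrono::steady_clock;

    RegionTimer() = default;
    RegionTimer(const RegionTimer&) = delete;
    RegionTimer& operator=(const RegionTimer&) = delete;

    Clock::time_point begin_{};
    std::int64_t total_ns_ = 0;
    std::uint64_t count_ = 0;
    bool running_ = false;
};
```

Calling `RegionTimer::Instance().start()` and `RegionTimer::Instance().stop()` around a region several times accumulates a total, a number of counts and an average, which mirrors the kind of summary PfmCodeAnalyser reports per event.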

PfmCodeAnalyser: a new tool for fast monitoring

PfmCodeAnalyser::Instance("INSTRUCTIONS_RETIRED", 0, 0,
                          "UNHALTED_CORE_CYCLES", 0, 0,
                          "ARITH:CYCLES_DIV_BUSY", 0, 0,
                          "UOPS_RETIRED:ANY", 0, 0).start();

Event: INSTRUCTIONS_RETIRED
Total count: 105000018525
Number of counts: 10
Average count: 10500001852.5

Event: UNHALTED_CORE_CYCLES
Total count: 56009070544
Number of counts: 10
Average count: 5600907054.4

Event: ARITH:CYCLES_DIV_BUSY
Total count: 28000202972
Number of counts: 10
Average count: 2800020297.2

Event: UOPS_RETIRED:ANY
Total count: 138003585913
Number of counts: 10
Average count: 13800358591.3
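Raw counts like these are usually easier to read as ratios. The sketch below combines four counter averages into three derived figures. The `DerivedMetrics` struct and the `derive` helper are hypothetical names introduced here, and reading ARITH:CYCLES_DIV_BUSY divided by UNHALTED_CORE_CYCLES as "fraction of cycles with the divider busy" is an interpretation, not something the slides state.

```cpp
// Hypothetical helper types for turning counter averages into ratios.
struct DerivedMetrics {
    double cycles_per_instruction;  // UNHALTED_CORE_CYCLES / INSTRUCTIONS_RETIRED
    double uops_per_instruction;    // UOPS_RETIRED:ANY / INSTRUCTIONS_RETIRED
    double divider_busy_fraction;   // ARITH:CYCLES_DIV_BUSY / UNHALTED_CORE_CYCLES
};

// Ratio of two counter values; returns 0 when the denominator is 0.
inline double ratio(double numerator, double denominator) {
    return denominator != 0.0 ? numerator / denominator : 0.0;
}

// Combine four counter averages into the derived metrics above.
inline DerivedMetrics derive(double instructions, double cycles,
                             double div_busy_cycles, double uops_retired) {
    return DerivedMetrics{
        ratio(cycles, instructions),
        ratio(uops_retired, instructions),
        ratio(div_busy_cycles, cycles),
    };
}
```

Called with the averages printed above, `derive(10500001852.5, 5600907054.4, 2800020297.2, 13800358591.3)` gives roughly 0.53 cycles per instruction, about 1.31 μops per instruction, and a divider that is busy in about half of the cycles.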

David Levinthal: the expert

• Intel senior engineer specialized in performance monitoring of applications

• He is going to be here from the 13th to the 30th of July

• He will give a detailed lecture about performance monitoring on the 21st or 22nd of July in 513-1-024

• Anyone who’s interested in participating may contact us for details

Conclusions

1. We gave a very brief introduction to the Nehalem multicore architecture and to Performance Monitoring using hardware performance counters

2. A consistent set of events has been monitored across the 4 different analysis approaches in CMSSW, Gaudi and Geant4 (Cycle Accounting Analysis)

3. A monitoring tool has been developed for quick performance monitoring: PfmCodeAnalyser

4. The report and background of the work done for CMSSW is available at http://cern.ch/dkruse/pfmon.pdf


Questions?
