Download - Software Performance Monitoring
Software Performance MonitoringSoftware Performance Monitoring
Daniele Francesco KruseDaniele Francesco Kruse
July 2010July 2010
SummarySummary
1. Performance Monitoring2. Performance Counters3. Core and Nehalem PMUs – Overview4. Nehalem : Overview of the architecture5. μops flow in Nehalem pipeline6. Perfmon27. Cycle Accounting Analysis8. The 4-way Performance Monitoring9. PfmCodeAnalyser : a new tool for fast
monitoring10.David Levinthal : the expert11.Conclusions
2
Performance MonitoringPerformance Monitoring
DEFDEF : The action of collecting information : The action of collecting information relatedrelated
to how an application or system performsto how an application or system performs
HOWHOW : Obtain micro-architectural level : Obtain micro-architectural level informationinformation
from hardware performance countersfrom hardware performance counters
WHYWHY : To identify bottlenecks, and possibly : To identify bottlenecks, and possibly remove themremove them
in order to improve application performancein order to improve application performance3
Performance CountersPerformance Counters
• All recent processor architectures include a processor–specific PMU
• The Performance Monitoring Unit contains several performance counters
• Performance counters are able to count micro-architectural events from many hardware sources (cpu pipeline, caches, bus, etc…)
4
CoreCore and and NehalemNehalem PMUs - Overview PMUs - Overview
5
Intel Intel CoreCore Microarchitecture PMU Microarchitecture PMU• 3 fixed counters
(INSTRUCTIONS_RETIRED, UNHALTED_CORE_CYCLES, UNHALTED_REFERENCE_CYCLES)
• 2 programmable counters
Intel Intel NehalemNehalem Microarchitecture PMU Microarchitecture PMU
• 3 fixed core-counters(INSTRUCTIONS_RETIRED, UNHALTED_CORE_CYCLES, UNHALTED_REFERENCE_CYCLES)
• 4 programmable core-counters• 1 fixed uncore-counter (UNCORE_CLOCK_CYCLES)
• 8 programmable uncore-counters
Nehalem Nehalem : Overview of the architecture: Overview of the architecture
6
Core 0Core 0 Core 1Core 1 Core 2Core 2 Core 3Core 3
Level 3 CacheLevel 3 Cacheshared, writeback (lazy write) and inclusive (contains L1 shared, writeback (lazy write) and inclusive (contains L1
& L2 of each core)& L2 of each core)
Integrated Memory Integrated Memory ControllerController
Link to local memory Link to local memory (DDR3)(DDR3)
Quick Path Quick Path InterconnectInterconnectLink to I/O hub Link to I/O hub
(& to other processors, if (& to other processors, if present)present)
L1D L1D CacheCache
L1I L1I CacheCache
L2 CacheL2 Cache
TLBsTLBs
writebackwriteback
writebackwriteback
unified, writeback, not-inclusiveunified, writeback, not-inclusive
DTLB0 & ITLB (1DTLB0 & ITLB (1stst Level), STLB (unified Level), STLB (unified 22ndnd Level) Level)
μμops flow in ops flow in Nehalem Nehalem pipelinepipeline
7
• We are mainly interested in UOPS_EXECUTED (dispatched) and UOPS_RETIRED (the useful ones).
• Mispredicted UOPS_ISSUED may be eliminated before being executed.
Instruction Fetch& Branch Prediction
Unit
Decoder
Retirement&
Writeback
Re-OrderBuffer
ResourceAllocator
Execution
Units
Reservation
Station
UOPS_ISSUEDUOPS_ISSUED
UOPS_RETIREDUOPS_RETIRED
UOPS_EXECUTEDUOPS_EXECUTED
Perfmon2Perfmon2
• A generic API to access the PMU (libpfm)• Developed by Stéphane Eranian• Portable across all new processor micro-
architectures• Supports system-wide and per-thread
monitoring• Supports counting and sampling
CPU Hardware
Linux KernelGeneric PerfmonGeneric Perfmon
Architectural Architectural PerfmonPerfmon
PMUPMU
User spacePfmonPfmon Other libpfm-based Other libpfm-based AppsApps
libpfmlibpfm
8
Cycle Accounting AnalysisCycle Accounting Analysis
Total Cycles (Application total execution time)Total Cycles (Application total execution time)
Issuing Issuing μμopsops Not Issuing Not Issuing μμopsops
StalledStalled(no work)(no work)
Not retiring Not retiring μμopsops
(useless work)(useless work)
Retiring Retiring μμopsops
(useful (useful work)work)
Store-FwdStore-FwdL2 L2 missmiss
L2 L2 hithit
LCLCPP
L1 TLB L1 TLB missmiss 9
The 4-way Performance MonitoringThe 4-way Performance Monitoring
10
1. OverallAnalysis
2. Symbol LevelAnalysis
3. Module LevelAnalysis
4. Modular SymbolLevel Analysis
Overall (pfmon)
Modular
Sampling
Counting
PfmCodeAnalyserPfmCodeAnalyser : a new tool for fast monitoring : a new tool for fast monitoring
11
• Unreasonable (and useless) to run a complete analysis for every change in code
• Often interested in only small part of code and in one single event
• Solution: a fast, precise and light “singleton” class called PfmCodeAnalyser
• How to use it:
#include<PfmCodeAnalyser.h>
PfmCodeAnalyser::Instance(“INSTRUCTIONS_RETIRED”).start();
//code to monitor
PfmCodeAnalyser::Instance().stop();
PfmCodeAnalyserPfmCodeAnalyser : a new tool for fast monitoring : a new tool for fast monitoring
12
PfmCodeAnalyser::Instance("INSTRUCTIONS_RETIRED", 0, 0, "UNHALTED_CORE_CYCLES", 0, 0, "ARITH:CYCLES_DIV_BUSY", 0, 0, "UOPS_RETIRED:ANY", 0, 0).start();
Event: INSTRUCTIONS_RETIRED Total count:105000018525Number of counts:10 Average count:10500001852.5
Event: UNHALTED_CORE_CYCLES Total count:56009070544Number of counts:10 Average count:5600907054.4
Event: ARITH:CYCLES_DIV_BUSY Total count:28000202972Number of counts:10 Average count:2800020297.2
Event: UOPS_RETIRED:ANY Total count:138003585913Number of counts:10 Average count:13800358591.3
David LevinthalDavid Levinthal : the expert : the expert
13
• Intel senior engineer specialized in performance monitoring of applications
• He is going to be here from the 13th to the 30th of July
• He will give a detailed lecture about performance monitoring on the 21st or 22nd of July in 513-1-024
• Anyone who’s interested in partecipating may contact us for details
ConclusionsConclusions
14
1. We gave a very brief introduction to Nehalem multicore architecture and to Performance Monitoring using hardware performance counters
2. A consistent set of events has been monitored across the 4 different analysis approaches in CMSSW, Gaudi and Geant4 (Cycle Accounting Analysis)
3. A monitoring tool has been developed for quick performance monitoring: PfmCodeAnalyser
4. The report and background of the work done for CMSSW is available at http://cern.ch/dkruse/pfmon.pdf
Questions ?Questions ?