OPTERON HARDWARE PERFORMANCE COUNTERS
Richard Smith
Technical Specialist
Data Centre Practice
Sun Microsystems Australia
2
Why does my code take xxxxx seconds to execute?
Opteron pipeline
3 instructions/cycle at 2.6 GHz (cycles per second)
10^15 instr ==> 35 hrs ???
Many factors involved:
● Instruction Level Parallelism (ILP)
● Memory latency
● System bandwidth
● Thread Level Parallelism (TLP)
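The 35 hour figure can be reproduced with a few lines of arithmetic. This is only a sketch of the best case: the function name is illustrative, and it assumes the pipeline sustains its peak of 3 instructions/cycle at 2.6 GHz with none of the stall sources listed above.

```python
def ideal_runtime_hours(instructions, instr_per_cycle=3, clock_hz=2.6e9):
    """Best-case run time if every cycle retires the peak instruction count."""
    seconds = instructions / (instr_per_cycle * clock_hz)
    return seconds / 3600.0


hours = ideal_runtime_hours(1e15)
print(f"{hours:.1f} hours")
```

Any measured run time above this bound is the cost of limited ILP, memory latency, bandwidth limits, or unused thread-level parallelism.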
3
Opteron Microarchitecture Features
● Deep OOO integer and FP execution
● Fetch/Decode 3 instructions/cycle (max 16 bytes)
● 3-way integer + 3-way address + 3-way FP exec
● 64KB L1 D$ and 1MB L2 D$ on-chip
● x86_64 extensions to x86 (“x64”)
● Integrated DDR1 memory controller
● 3 x 16b HyperTransport interfaces
4
Opteron Performance Counters
PerfEvtSel0 / PerfCtr0
PerfEvtSel1 / PerfCtr1
PerfEvtSel2 / PerfCtr2
PerfEvtSel3 / PerfCtr3
48-bit counters: bits 63-48 are reserved
PerfEvtSel bit layout:
bits 31-24   CNT_MASK
bit 23       INV
bit 22       EN
bit 20       INT
bit 19       PC
bit 18       EDGE
bit 17       OS
bit 16       USR
bits 15-8    UNIT_MASK
bits 7-0     EVENT_MASK
Processor Functional Unit: (FP, LS, DC, BU, IC, FR, NB)
0x004108cb ==> EN, USR, scalar SSE+SSE2 instr, Retired FPU instr
cputrack -c pic2=FR_retired_fpu_instr,umask2=0x08 ...
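To see how a raw value such as 0x004108cb breaks down, the sketch below splits a PerfEvtSel value into its fields. The function name is illustrative, and the bit positions (USR 16, OS 17, EDGE 18, PC 19, INT 20, EN 22, INV 23) are my reading of the register layout on this slide; check them against the BIOS and Kernel Developer's Guide before relying on them.

```python
def decode_perfevtsel(value):
    """Split a 32-bit PerfEvtSel value into named fields (assumed layout)."""
    def bit(n):
        return bool((value >> n) & 1)

    return {
        "event_mask": value & 0xFF,          # bits 7-0
        "unit_mask": (value >> 8) & 0xFF,    # bits 15-8
        "usr": bit(16),
        "os": bit(17),
        "edge": bit(18),
        "pc": bit(19),
        "int": bit(20),
        "en": bit(22),
        "inv": bit(23),
        "cnt_mask": (value >> 24) & 0xFF,    # bits 31-24
    }


fields = decode_perfevtsel(0x004108CB)
print(hex(fields["event_mask"]), hex(fields["unit_mask"]),
      fields["en"], fields["usr"], fields["os"])
```

For the slide's example this yields event CBh (retired FPU instructions) with unit mask 08h, counted in user mode only, which matches the cputrack invocation shown.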
5
Solaris and HW Counters
• #26094 BIOS and Kernel Developer's Guide
• Lists more than known by Solaris kernel
> New counters for processor revisions D and E
• Nevada source available at http://opensolaris.org
• cc -D_KERNEL -xarch=amd64 -xmodel=kernel -c opteron_pcbe.c
• ld -r -o pcbe.AuthenticAMD.15 opteron_pcbe.o
• Virtualised counter support built-in
> Linux requires perfctr patch
6
Using the Counters
• cputrack(1)
• cpustat(1M)
• libcpc(3LIB)
• collect (Studio 11 collector/analyzer)
• perfctr (Linux)
• PAPI
NB: Some counters are not duplicated on dual-core Opteron (rev E and older)
7
Dual-core Opteron
[Diagram: CPU0 and CPU1, each with a 1MB L2 Cache, connect through the System Request Interface and Crossbar Switch to the Memory Controller and to HyperTransport links HT0, HT1 and HT2]
8
HyperTransport
[Diagram: Device A <==> Device B; each direction is 2, 4, 8, 16, or 32 bits @ 200 to 1000MHz]
HT is a scalable point-to-point link
Coherent HT is used to connect processors
Non-coherent HT is used for I/O connectivity (PCI semantics map neatly)
1xx, 2xx and 8xx cpus differ in number of coherent HT interfaces
Basic unit of transmission is a Dword (4 bytes)
2B per clock edge @ 1000MHz ==> 4GB/s in each direction
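The peak figure follows from link width, clock rate and double data rate signalling. A minimal sketch of that arithmetic, with an illustrative function name and the assumption that data moves on both clock edges:

```python
def ht_peak_gb_per_s(width_bits, clock_mhz):
    """Peak one-way HyperTransport bandwidth in GB/s (10^9 bytes/s)."""
    bytes_per_edge = width_bits / 8
    edges_per_second = clock_mhz * 1e6 * 2   # two edges per clock
    return bytes_per_edge * edges_per_second / 1e9


print(ht_peak_gb_per_s(16, 1000))  # 16-bit link at 1000MHz
print(ht_peak_gb_per_s(16, 800))   # 16-bit link at 800MHz
```

A 16-bit link carries 2B per edge, so 1000MHz gives 4 GB/s each way and 800MHz gives 3.2 GB/s each way.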
9
HyperTransport on V20Z
[Diagram: cpu 0 and cpu 1, each with HT links 0, 1 and 2, plus I/O]
Dword types (4B Dwords):
• Command
• Data (max 64B payload)
• Buffer Release (NOP?)
• NOP
HT link: 800 MHz DDR
800 MHz x 4B/clock = 3.2 GB/s each way
800M Dword/s full duplex
Measured via cpustat(1M):
link    cpu 0    cpu 1
0       0        0
1       600M     800M
2       800M     0
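One way to read the measured table is as link utilisation. The sketch below assumes the table entries are Dwords per second summed over all Dword types, and compares them with the 800M Dword/s capacity of the link; the helper name and the dictionary layout are illustrative only.

```python
DWORD_BYTES = 4
LINK_CAPACITY_DWORDS = 800e6   # 800 MHz x 4B/clock, per direction


def link_utilisation(dwords_per_s):
    """Fraction of one-way link capacity used by the measured Dword rate."""
    return dwords_per_s / LINK_CAPACITY_DWORDS


# Values copied from the slide's table: (cpu, link) -> Dwords per second.
measured = {
    ("cpu 0", 1): 600e6,
    ("cpu 0", 2): 800e6,
    ("cpu 1", 1): 800e6,
}

for (cpu, link), rate in sorted(measured.items()):
    gb_per_s = rate * DWORD_BYTES / 1e9
    print(f"{cpu} link {link}: {gb_per_s:.1f} GB/s, "
          f"{link_utilisation(rate):.0%} of capacity")
```

A link that reports 800M Dword/s is saturated even when the workload is idle, because NOP Dwords fill otherwise unused slots; the unit masks are what separate useful traffic from filler.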
10
NUMA Architecture (AMD: SUMA)
Each hop adds 30 to 40ns latency
Minimising #hops improves performance and reduces system bandwidth consumed
11
Memory Bandwidth Test (per sec)
[Diagram: the test executes on one cpu; HT Dword counts are shown for each direction of the link between cpu 0 and cpu 1]
Local Memory: 4891 MB/s
76M probes
153M Cmd + 0M Data + 76M BufRel
76M Cmd + 0M Data + 76M BufRel
Remote Memory: 2402 MB/s
37M probes
113M Cmd + 1M Data + 150M BufRel
150M Cmd + 600M Data + 37M BufRel
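The counter values can be cross-checked against the reported bandwidths. This sketch rests on two assumptions that are not stated on the slide: each Data Dword carries 4 bytes of payload, and one probe is issued per 64B cache line transferred.

```python
DWORD_BYTES = 4
CACHE_LINE_BYTES = 64   # assumption: one probe per 64B line


def mb_per_s_from_data_dwords(data_dwords_per_s):
    """Payload bandwidth implied by the Data Dword rate on the HT link."""
    return data_dwords_per_s * DWORD_BYTES / 1e6


def mb_per_s_from_probes(probes_per_s):
    """Memory bandwidth implied by the probe rate."""
    return probes_per_s * CACHE_LINE_BYTES / 1e6


print(mb_per_s_from_data_dwords(600e6))  # remote case, compare with 2402 MB/s
print(mb_per_s_from_probes(37e6))        # remote case, compare with 2402 MB/s
print(mb_per_s_from_probes(76e6))        # local case, compare with 4891 MB/s
```

Under those assumptions the remote case gives 2400 MB/s from Data Dwords and 2368 MB/s from probes, and the local case gives 4864 MB/s from probes, all within rounding of the measured figures. The local case moves no Data Dwords over HT, only probe traffic.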
12
HT Usage via cpustat
cpustat -c \
pic0=NB_ht_bus2_bandwidth,umask0=0x01,\
pic1=NB_ht_bus2_bandwidth,umask1=0x02,\
pic2=NB_ht_bus2_bandwidth,umask2=0x04,\
pic3=NB_ht_bus2_bandwidth,umask3=0x08,sys \
...
-c pic0=NB_probe_result,umask0=0x0f,sys \
-p 1 1 &
NB: Only one set of HT counters on dual-core cpus
13
Local vs Remote Memory Access
Revision E?
• Event 0xE9 CPU/IO Requests to Memory/IO
> umask 0xA8 Local => Local
> umask 0x98 Local => Remote
• Doesn't distinguish between reads and writes
14
Opteron Pipeline
[Diagram: Opteron pipeline block diagram]
Front end: L1 Instruction Cache (64KB), Branch Prediction, Fetch, Scan/Align, Fastpath, Microcode Engine
Decode: µOPs, Int Decode & Rename, FP Decode & Rename, Instruction Control Unit (72 entries)
Integer: 3 x Res, each feeding an AGU and an ALU; MULT
FP: 36-entry FP scheduler feeding FADD, FMUL, FMISC
Memory: 44-entry Load/Store Queue, L1 Data Cache (64KB), L2 Cache, Bus Unit
System: System Request Queue, Crossbar, Memory Controller, HyperTransport™
15
Pipeline Throughput
• 76h BU_cpu_clk_unhalted
• C0h FR_retired_x86_instr_w_excp_intr
• C1h FR_retired_uops
• CBh FR_retired_fpu_instr
> x87, MMX, packed and scalar SSE[2]
• 00h FP_dispatched_fpu_ops
> add, multiply, store, ...
• 01h FP_cycles_no_fpu_ops_retired
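These counters are most useful as ratios. A minimal sketch of the usual derived metrics follows; the function name is illustrative and the sample counts are invented for the example, not measurements.

```python
def pipeline_summary(cycles, x86_instr, uops, fpu_instr):
    """Derive throughput ratios from raw counter totals.

    cycles    : BU_cpu_clk_unhalted
    x86_instr : FR_retired_x86_instr_w_excp_intr
    uops      : FR_retired_uops
    fpu_instr : FR_retired_fpu_instr
    """
    return {
        "ipc": x86_instr / cycles,
        "uops_per_instr": uops / x86_instr,
        "fpu_fraction": fpu_instr / x86_instr,
    }


# Hypothetical one-second sample on a 2.6 GHz core.
summary = pipeline_summary(cycles=2_600_000_000,
                           x86_instr=1_300_000_000,
                           uops=1_950_000_000,
                           fpu_instr=325_000_000)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

An IPC well below the peak of 3 is the cue to look at the stall and memory counters on the following slides.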
16
Understanding Pipeline Stalls
• D1h FR_dispatch_stalls
• D2h FR_dispatch_stall_branch_abort_to_retire
• D5h FR_dispatch_stall_reorder_buffer_full
> maximum of 72 inflight instructions (24 x 3 lanes)
• D6h FR_dispatch_stall_resv_stations_full
> ALU and AGU ops 24 entries (8 x 3 schedulers)
• D7h FR_dispatch_stall_fpu_full
> 36 FP instructions across 3 schedulers
• 23h LS_buffer_2_full
> 12 LS1 entries and 32 LS2 entries
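To decide which structure is the bottleneck, each stall counter can be expressed as a fraction of unhalted cycles and ranked. This sketch assumes the stall events count cycles; the helper name and the sample numbers are invented for illustration.

```python
def rank_stalls(cycles, stall_cycles):
    """Return (event, fraction of cycles) pairs, largest first."""
    fractions = {name: count / cycles for name, count in stall_cycles.items()}
    return sorted(fractions.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts gathered over 1,000,000 unhalted cycles.
sample = {
    "FR_dispatch_stall_branch_abort_to_retire": 40_000,
    "FR_dispatch_stall_reorder_buffer_full": 250_000,
    "FR_dispatch_stall_resv_stations_full": 90_000,
    "FR_dispatch_stall_fpu_full": 10_000,
}

for name, fraction in rank_stalls(1_000_000, sample):
    print(f"{fraction:6.1%}  {name}")
```

Several stall conditions can hold in the same cycle, so the fractions need not sum to the FR_dispatch_stalls total.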
17
Prefetch Activity
• 67h BU_data_prefetch
> Prefetch attempts and cancelled prefetches
> Includes HW prefetcher activity
• 4Bh DC_dispatched_prefetch_instr
> Prefetches are strong: not dropped on DTLB miss
> Load (T0/T1/T2)
> Store (PrefetchW)
> NTA (for low-reuse data, avoids polluting L2)
18
Cache Counters Flow
[Diagram: counter flow along the path L1 -> L2 -> memory controller -> memory, with HT alongside]
Counters shown along the path: dc accesses, dc misses, dc victim, refill from l2, refill from system, l2 fill, l2 miss, prefetch, (hw prefetch), probe, nb_sized_command, nb_ht_busx_bandwidth, page hit/miss/conflict
Very approximately!
19
DDR Memory Access
[Diagram: timing of page hit, page miss and page conflict accesses]
Trp (precharge delay)
Trcd (RAS to CAS delay)
Tcl (CAS latency)
● Opteron generally uses 16B wide memory transfers
● 16B x 400MT/s ==> 6400MB/s
● 200MHz x 2 edges
● bank select ==> row select ==> column select
● latency dependent on hit in memory controller “open page” cache
● How do physical addresses (PA) map to (bank, row, column)?
NB_mem_ctrlr_page_access
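The bandwidth figure and the three access cases can be sketched numerically. The latency model assumes the usual DRAM convention (a page hit pays Tcl only, a page miss adds Trcd, a page conflict adds Trp as well), and the 3-3-3 timings are hypothetical placeholders rather than values from the slide.

```python
def peak_bandwidth_mb_per_s(width_bytes=16, clock_mhz=200, edges=2):
    """Peak DDR bandwidth: transfer width x clock x edges per clock."""
    return width_bytes * clock_mhz * edges


def access_latency_ns(kind, trp=3, trcd=3, tcl=3, clock_mhz=200):
    """DRAM array latency for a page hit, miss or conflict (assumed model)."""
    cycle_ns = 1000.0 / clock_mhz
    cycles = {
        "hit": tcl,                      # row already open
        "miss": trcd + tcl,              # bank idle: activate row first
        "conflict": trp + trcd + tcl,    # wrong row open: precharge first
    }[kind]
    return cycles * cycle_ns


print(peak_bandwidth_mb_per_s())
for kind in ("hit", "miss", "conflict"):
    print(kind, access_latency_ns(kind), "ns")
```

With these placeholder timings a conflict costs three times as long as a hit at the DRAM, which is why the mix reported by NB_mem_ctrlr_page_access matters.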
20
Controlling Memory Locality
• lgroups on Solaris 10
> future: hierarchy of memory locality
• processor_bind() and madvise()
> keep a thread and its data together
• ppgsz -oheap=2m for large pages
> fewer DTLB misses: 32 x 2m vs 512 x 4k
• meminfo() and/or pmap -sx
> currently kernel cage issues may lead to memory fragmentation, so large pages are not always available
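The "32 x 2m vs 512 x 4k" comparison can be read as address space covered by a fixed number of translations. That reading is my assumption, as are the names below; the sketch simply multiplies entry count by page size.

```python
KIB = 1024
MIB = 1024 * KIB


def reach_bytes(entries, page_bytes):
    """Address space covered by a given number of page translations."""
    return entries * page_bytes


large_pages = reach_bytes(32, 2 * MIB)    # 32 x 2m
small_pages = reach_bytes(512, 4 * KIB)   # 512 x 4k

print(large_pages // MIB, "MiB covered with 2m pages")
print(small_pages // MIB, "MiB covered with 4k pages")
print(large_pages // small_pages, "x more coverage per set of entries")
```

On this reading, 32 large-page entries cover 64 MiB while 512 small-page entries cover only 2 MiB, so a heap on 2m pages needs far fewer translations.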
Richard Smith
OPTERON SYSTEM PERFORMANCE