
OPTERON HARDWARE PERFORMANCE COUNTERS

Richard Smith
Technical Specialist, Data Centre Practice
Sun Microsystems Australia


Why does my code take xxxxx seconds to execute?

10^15 instr ==> Opteron pipeline ==> 35 hrs ???

3 instructions/cycle
2.6 GHz (cycles per second)

Many factors involved:
● Instruction Level Parallelism (ILP)
● Memory latency
● System bandwidth
● Thread Level Parallelism (TLP)
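The 35 hour figure follows from the peak issue rate alone. A minimal sketch of that arithmetic, assuming the pipeline retires 3 instructions on every cycle (the function name is illustrative, not from the slides):

```python
def ideal_runtime_seconds(instructions, instr_per_cycle, clock_hz):
    """Lower bound on run time if the pipeline sustained its peak rate."""
    return instructions / (instr_per_cycle * clock_hz)


# Figures from the slide: 10^15 instructions, 3 instr/cycle, 2.6 GHz.
seconds = ideal_runtime_seconds(1e15, 3, 2.6e9)
hours = seconds / 3600
print(f"{seconds:.0f} s is about {hours:.1f} hrs")
```

Any measured run time above this bound is time lost to the factors listed above, which is what the counters help to attribute.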


Opteron Microarchitecture Features

● Deep OOO integer and FP execution
● Fetch/Decode 3 instructions/cycle (max 16 bytes)
● 3-way integer + 3-way address + 3-way FP exec
● 64KB L1 D$ and 1MB L2 D$ on-chip
● x86_64 extensions to x86 (“x64”)
● Integrated DDR1 memory controller
● 3 x 16b HyperTransport interfaces


Opteron Performance Counters

Four event-select/counter register pairs:
PerfEvtSel0 / PerfCtr0
PerfEvtSel1 / PerfCtr1
PerfEvtSel2 / PerfCtr2
PerfEvtSel3 / PerfCtr3

48-bit counters: bits 63-48 are reserved

PerfEvtSel bit layout:

bits    field
31-24   CNT_MASK
23      INV
22      EN
20      INT
19      PC
18      EDGE
17      OS
16      USR
15-8    UNIT_MASK
7-0     EVENT_MASK

Events are grouped by Processor Functional Unit (FP, LS, DC, BU, IC, FR, NB)

0x004108cb ==> EN, USR, scalar SSE+SSE2 instr, Retired FPU instr
cputrack -c pic2=FR_retired_fpu_instr,umask2=0x08 ...
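The worked value 0x004108cb can be unpacked field by field. The sketch below is illustrative only: the function names are made up, and the single-bit positions other than EN and USR are an assumption taken from the AMD register layout rather than from the slide's example, so they should be checked against the BKDG.

```python
# Single-bit flags in PerfEvtSel. EN (22) and USR (16) are confirmed by the
# slide's worked example; the other positions are assumed from the AMD layout.
FLAG_BITS = {"USR": 16, "OS": 17, "EDGE": 18, "PC": 19, "INT": 20, "EN": 22, "INV": 23}


def decode_perfevtsel(value):
    """Split a 32-bit PerfEvtSel value into its fields."""
    return {
        "event_mask": value & 0xFF,
        "unit_mask": (value >> 8) & 0xFF,
        "cnt_mask": (value >> 24) & 0xFF,
        "flags": sorted(n for n, b in FLAG_BITS.items() if value & (1 << b)),
    }


def encode_perfevtsel(event_mask, unit_mask=0, flags=("EN", "USR"), cnt_mask=0):
    """Build a PerfEvtSel value from its fields (inverse of the decoder)."""
    value = (event_mask & 0xFF) | ((unit_mask & 0xFF) << 8) | ((cnt_mask & 0xFF) << 24)
    for name in flags:
        value |= 1 << FLAG_BITS[name]
    return value


fields = decode_perfevtsel(0x004108CB)
print(fields)
print(hex(encode_perfevtsel(0xCB, unit_mask=0x08)))
```

Event 0xCB with unit mask 0x08 is the same selection that the cputrack line expresses symbolically as FR_retired_fpu_instr with umask2=0x08.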


Solaris and HW Counters

• #26094 BIOS and Kernel Developer's Guide
• Lists more than known by Solaris kernel
  > New counters for processor revisions D and E
• Nevada source available at http://opensolaris.org
• cc -D_KERNEL -xarch=amd64 -xmodel=kernel -c opteron_pcbe.c
• ld -r -o pcbe.AuthenticAMD.15 opteron_pcbe.o
• Virtualised counter support built-in
  > Linux requires perfctr patch


Using the Counters

• cputrack(1)
• cpustat(1M)
• libcpc(3LIB)
• collect (Studio 11 collector/analyzer)
• perfctr (Linux)
• PAPI

NB: Some counters are not duplicated on dual-core Opteron (rev E and older)


Dual-core Opteron

[Block diagram: CPU0 and CPU1, each with a 1MB L2 Cache, attach to the System Request Interface and Crossbar Switch, which connect to the Memory Controller and to HT0, HT1 and HT2]


HyperTransport

[Diagram: Device A <==> Device B; each direction is 2, 4, 8, 16, or 32 bits wide @ 200 to 1000MHz]

HT is a scalable point-to-point link
Coherent HT is used to connect processors
Non-coherent HT is used for I/O connectivity (PCI semantics map neatly)
1xx, 2xx and 8xx cpus differ in number of coherent HT interfaces
Basic unit of transmission is a Dword (4 bytes)
2B per clock edge @ 1000MHz ==> 4GB/s in each direction
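The per-direction link rate is just width times clock times two edges. A rough sketch of that calculation, assuming a 16-bit link as on the Opteron and double-data-rate signalling (the function name is hypothetical):

```python
def ht_bytes_per_second(width_bits, clock_hz, edges_per_clock=2):
    """Peak one-way HyperTransport rate: bytes per edge x edges x clock."""
    return (width_bits / 8) * edges_per_clock * clock_hz


DWORD_BYTES = 4

at_1000mhz = ht_bytes_per_second(16, 1000e6)  # 2B per edge at 1000MHz
at_800mhz = ht_bytes_per_second(16, 800e6)    # the same link clocked at 800MHz

print(f"1000MHz: {at_1000mhz / 1e9:.1f} GB/s each way")
print(f" 800MHz: {at_800mhz / 1e9:.1f} GB/s each way, "
      f"{at_800mhz / DWORD_BYTES / 1e6:.0f}M Dword/s")
```

Expressing the rate in Dwords per second is convenient because the HT bandwidth counters count Dwords, not bytes.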


HyperTransport on V20Z

[Diagram: cpu 0 and cpu 1, each with HT links 0, 1 and 2; I/O; HT link at 800 MHz DDR]

Traffic is counted in 4B Dwords:
• Command
• Data (max 64B payload)
• Buffer Release (NOP?)
• NOP

800 MHz x 4B/clock = 3.2 GB/s each way
800M Dword/s full duplex

Measured via cpustat(1M):

link    cpu 0    cpu 1
0       0        0
1       600M     800M
2       800M     0


NUMA Architecture (AMD: SUMA)

Each hop adds 30 to 40ns latency
Minimising #hops improves performance and reduces system bandwidth consumed


Memory Bandwidth Test (per sec)

[Two diagrams: the test executes on one cpu; HT Dword counts are shown for each direction of the link between cpu 0 and cpu 1]

Local Memory: 4891 MB/s
76M probes
153M Cmd + 0M Data + 76M BufRel
76M Cmd + 0M Data + 76M BufRel

Remote Memory: 2402 MB/s
37M probes
113M Cmd + 1M Data + 150M BufRel
150M Cmd + 600M Data + 37M BufRel
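The remote memory result can be cross-checked against the HT Dword counts. The sketch below assumes each counted Dword is 4 bytes and that the link carries at most 800M Dword/s in one direction, as on the V20Z slide; the function names are illustrative.

```python
DWORD_BYTES = 4
LINK_CAPACITY_DWORDS = 800e6  # one direction of an 800 MHz DDR, 16-bit link


def dwords_to_mb_per_s(dwords_per_s):
    """Convert an HT Dword rate to MB/s (1 MB = 10^6 bytes)."""
    return dwords_per_s * DWORD_BYTES / 1e6


def link_utilisation(cmd, data, bufrel):
    """Fraction of one link direction consumed by all counted Dword types."""
    return (cmd + data + bufrel) / LINK_CAPACITY_DWORDS


# Busier direction in the remote memory case: 150M Cmd + 600M Data + 37M BufRel
payload = dwords_to_mb_per_s(600e6)
busy = link_utilisation(150e6, 600e6, 37e6)

print(f"data payload: {payload:.0f} MB/s (measured: 2402 MB/s)")
print(f"link utilisation: {busy:.1%}")
```

On these numbers the 600M data Dwords account for almost exactly the measured remote bandwidth, and command plus buffer-release overhead takes most of what is left of the link.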


HT Usage via cpustat

cpustat -c \
pic0=NB_ht_bus2_bandwidth,umask0=0x01,\
pic1=NB_ht_bus2_bandwidth,umask1=0x02,\
pic2=NB_ht_bus2_bandwidth,umask2=0x04,\
pic3=NB_ht_bus2_bandwidth,umask3=0x08,sys \
...
-c pic0=NB_probe_result,umask0=0x0f,sys \
-p 1 1 &

NB: Only one set of HT counters on dual-core cpus
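Typing the event specification by hand for each link is error-prone, so it can be generated. This is only a sketch: the helper name is made up, and it assumes the other links follow the same NB_ht_busN_bandwidth naming pattern as bus 2, which should be confirmed against the events cpustat reports on the target system.

```python
def ht_bandwidth_spec(bus, umasks=(0x01, 0x02, 0x04, 0x08), system_mode=True):
    """Build the argument for one 'cpustat -c' covering one HT link.

    One unit mask is placed on each of the four counters, as in the
    command above. The event name pattern for buses other than 2 is an
    assumption.
    """
    event = f"NB_ht_bus{bus}_bandwidth"
    parts = [f"pic{i}={event},umask{i}=0x{mask:02x}" for i, mask in enumerate(umasks)]
    if system_mode:
        parts.append("sys")
    return ",".join(parts)


spec = ht_bandwidth_spec(2)
print("cpustat -c " + spec + " 1")
```

Because the Opteron has only four counters, each link needs its own -c set, and cpustat then alternates between the sets.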


Local vs Remote Memory Access
Revision E?

• Event 0xE9 CPU/IO Requests to Memory/IO
  > umask 0xA8 Local => Local
  > umask 0x98 Local => Remote
• Doesn't distinguish between reads and writes


Opteron Pipeline

[Block diagram of the Opteron pipeline. Front end: L1 Instruction Cache (64KB), Branch Prediction, Fetch, Scan/Align, Fastpath and Microcode Engine. Control: Instruction Control Unit (72 entries), Int Decode & Rename, FP Decode & Rename. Integer: three reservation stations (Res), each feeding an AGU and an ALU, plus MULT. Floating point: 36-entry FP scheduler feeding FADD, FMUL and FMISC. Memory: 44-entry Load/Store Queue, L1 Data Cache (64KB), L2 Cache, Bus Unit, System Request Queue, Crossbar, Memory Controller, HyperTransport(TM)]


Pipeline Throughput

• 76h BU_cpu_clk_unhalted
• C0h FR_retired_x86_instr_w_excp_intr
• C1h FR_retired_uops
• CBh FR_retired_fpu_instr
  > x87, MMX, packed and scalar SSE[2]
• 00h FP_dispatched_fpu_ops
  > add, multiply, store, ...
• 01h FP_cycles_no_fpu_ops_retired
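These raw counts become useful once they are turned into ratios. A minimal sketch of the usual derived metrics, assuming a peak of 3 instructions per cycle as stated earlier; the sample counter values are invented for illustration and are not measurements.

```python
PEAK_INSTR_PER_CYCLE = 3


def pipeline_metrics(cycles, instr, uops, fpu_instr):
    """Derive throughput ratios from four raw counter totals.

    cycles    - BU_cpu_clk_unhalted
    instr     - FR_retired_x86_instr_w_excp_intr
    uops      - FR_retired_uops
    fpu_instr - FR_retired_fpu_instr
    """
    ipc = instr / cycles
    return {
        "ipc": ipc,
        "fraction_of_peak": ipc / PEAK_INSTR_PER_CYCLE,
        "uops_per_instr": uops / instr,
        "fpu_fraction": fpu_instr / instr,
    }


# Hypothetical one-second sample on a 2.6 GHz core.
metrics = pipeline_metrics(
    cycles=2_600_000_000,
    instr=1_300_000_000,
    uops=1_950_000_000,
    fpu_instr=325_000_000,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The four inputs fit the four available counters, so they can be collected in a single run without multiplexing.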


Understanding Pipeline Stalls

• D1h FR_dispatch_stalls
• D2h FR_dispatch_stall_branch_abort_to_retire
• D5h FR_dispatch_stall_reorder_buffer_full
  > maximum of 72 inflight instructions (24 x 3 lanes)
• D6h FR_dispatch_stall_resv_stations_full
  > ALU and AGU ops 24 entries (8 x 3 schedulers)
• D7h FR_dispatch_stall_fpu_full
  > 36 FP instructions across 3 schedulers
• 23h LS_buffer_2_full
  > 12 LS1 entries and 32 LS2 entries


Prefetch Activity

• 67h BU_data_prefetch
  > Prefetch attempts and cancelled prefetches
  > Includes HW prefetcher activity
• 4Bh DC_dispatched_prefetch_instr
  > Prefetches are strong: not dropped on DTLB miss
  > Load (T0/T1/T2)
  > Store (PrefetchW)
  > NTA (for low-reuse data, avoids polluting L2)


Cache Counters Flow

[Flow diagram, very approximately: L1 ==> L2 ==> memory controller ==> memory, with HT alongside. Counters marked along the path: dc accesses, dc misses, dc victim, refill from l2, refill from system, l2 fill, l2 miss, prefetch, (hw prefetch), probe, page hit/miss/conflict, nb_sized_command, nb_ht_busx_bandwidth]


DDR Memory Access

[Timing diagram: a page conflict pays Trp (precharge delay) + Trcd (RAS to CAS delay) + Tcl (CAS latency); a page miss pays Trcd + Tcl; a page hit pays Tcl only]

● Opteron generally uses 16B wide memory transfers
● 16B x 400MT/s ==> 6400MB/s
● 200MHz x 2 edges
● bank select ==> row select ==> column select
● latency dependent on hit in memory controller “open page” cache
● How do physical addresses (PA) map to (bank, row, column)?

Counter: NB_mem_ctrlr_page_access
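Both the peak rate and the three access cases reduce to small sums. The sketch below assumes the standard DRAM behaviour in which a page hit pays only the CAS latency, a page miss adds the RAS to CAS delay, and a page conflict adds the precharge delay as well. The 3-3-3 timings are placeholder values, not figures from the slides.

```python
def peak_bandwidth_mb_s(width_bytes=16, bus_mhz=200, edges_per_clock=2):
    """Peak DDR transfer rate in MB/s: width x clock x edges."""
    return width_bytes * bus_mhz * edges_per_clock


def access_latency_cycles(kind, tcl, trcd, trp):
    """DRAM cycles before data starts to return, by page access type."""
    if kind == "hit":        # row already open: column select only
        return tcl
    if kind == "miss":       # bank idle: row select, then column select
        return trcd + tcl
    if kind == "conflict":   # wrong row open: precharge, row, column
        return trp + trcd + tcl
    raise ValueError(f"unknown access kind: {kind}")


print(peak_bandwidth_mb_s(), "MB/s peak")

# Hypothetical module timings, in memory clock cycles.
timings = {"tcl": 3, "trcd": 3, "trp": 3}
for kind in ("hit", "miss", "conflict"):
    print(kind, access_latency_cycles(kind, **timings), "cycles")
```

Weighting these latencies by the hit, miss and conflict counts from NB_mem_ctrlr_page_access gives a rough average DRAM access cost for a workload.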


Controlling Memory Locality

• lgroups on Solaris 10
  > future: hierarchy of memory locality
• processor_bind() and madvise()
  > keep a thread and its data together
• ppgsz -oheap=2m for large pages
  > fewer DTLB misses: 32 x 2m vs 512 x 4k
• meminfo() and/or pmap -sx
  > currently kernel cage issues may lead to memory fragmentation so large pages not always available
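The large page benefit can be put in terms of how much memory the DTLB maps without missing. This sketch reads the figures "32 x 2m vs 512 x 4k" as entry count times page size, which is an interpretation of the slide rather than something it states outright.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered by a TLB when every entry maps a distinct page."""
    return entries * page_bytes


large_pages = tlb_reach_bytes(32, 2 * MIB)   # 32 entries of 2MB pages
small_pages = tlb_reach_bytes(512, 4 * KIB)  # 512 entries of 4KB pages

print(f"2MB pages: {large_pages // MIB} MB reach")
print(f"4KB pages: {small_pages // MIB} MB reach")
print(f"ratio: {large_pages // small_pages}x")
```

A working set that fits within the larger reach but not the smaller one is where a change such as ppgsz -oheap=2m would be expected to pay off.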

Richard Smith
[email protected]

OPTERON SYSTEM PERFORMANCE
