Performance Monitoring & Queries on Intel GPUs
Lionel Landwerlin, 27 September 2018
● Hardware overview
● i915 interface
● Userspace tools
Hardware overview
[Figure: GPU block diagram. Block labels: Geom/FF, GA, VF, HS, DS, GS, TE, VFE, BLT, GAM, Media/FF, VE, VD, SFC, GTI, and three groups of eight EUs, each with an SP and an L3 block.]
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol04-configurations.pdf
Hardware overview
[Figure: the GPU block diagram from the previous slide, with the OA unit added.]
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol04-configurations.pdf
Hardware overview
OA unit:
● Writes snapshots of multiple registers to memory on:
  ○ context switch
  ○ programmed timer
  ○ frequency changes
  ○ request from command streamer (only on 3D engine)
● Snapshots written to:
  ○ OA buffer (circular buffer, up to 16 MB)
  ○ application address space
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol14-observability.pdf
Hardware overview
[Figure: GPU block diagram with the OA unit; the legend marks direct connections.]
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol04-configurations.pdf
Hardware overview
● Direct connection examples:
  ○ Vertex Shader Threads Dispatched
  ○ Hull Shader Threads Dispatched
  ○ Pixel Shader Threads Dispatched
  ○ 2x2s Rasterized Pixels
  ○ 2x2s Killed in PS (discard in fragment shader)
  ○ 2x2s Written To Render Target
  ○ Blended 2x2s Written to Render Target
  ○ 2x2s Requested from Sampler
  ○ Sampler L1 Cache Misses
  ○ Flexible EU counters
  ○ …
Mostly 3D counters
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol14-observability.pdf
Introduction
[Figure: GPU block diagram with the OA unit; the legend marks OA nodes, direct connections and indirect connections.]
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol04-configurations.pdf
Hardware overview
● Indirect connection examples:
  ○ GTI Depth Throughput
  ○ Sampler 0/1 Busy
  ○ L3 Cache Misses
  ○ Early Depth Bottleneck
  ○ Hi-Depth Cache Misses
  ○ Multisampling Color Cache Misses
  ○ Stencil Cache Misses
  ○ …
● HW programming needed to get specific information
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol14-observability.pdf
OA reports
[Figure: report layout, 256 bytes (Broadwell and above): Headers | A counters | B counters | C counters]
● Headers: timestamp + context ID + reason
● A counters: 32 (40 bits) + 4 (32 bits)
  ○ Mostly 3D counters
● B counters: 8 (32 bits)
● C counters: 8 (32 bits)
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-kbl-vol14-observability.pdf
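The sizes on the OA reports slide can be checked with a little arithmetic. The sketch below uses only the counts and widths listed above; the header size is not stated on the slide and is derived here as whatever remains of the 256-byte report, so treat it as an inference rather than a documented figure.

```python
# Counter counts and widths as listed on the slide.
A40_COUNT, A40_BITS = 32, 40   # 40-bit A counters
A32_COUNT, A32_BITS = 4, 32    # 32-bit A counters
B_COUNT, B_BITS = 8, 32
C_COUNT, C_BITS = 8, 32
REPORT_BYTES = 256             # Broadwell and above

counter_bits = (A40_COUNT * A40_BITS + A32_COUNT * A32_BITS
                + B_COUNT * B_BITS + C_COUNT * C_BITS)
counter_bytes = counter_bits // 8

# Inferred, not from the slide: space left over for the headers.
header_bytes = REPORT_BYTES - counter_bytes

print(counter_bytes, header_bytes)
```

Under these numbers the counters occupy 240 bytes, which leaves 16 bytes for the timestamp, context ID and reason fields.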
i915 Interface
Exclusive access to the OA unit because of B/C counter programming.
2 ways to use the i915 API:
● Query mode:
  ○ Have snapshots filtered by context ID
  ○ Use in addition to the MI_REPORT_PERF_COUNT instruction
● Monitoring mode:
  ○ All snapshots available (privileged access)
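i915 Interface
[Figure: userspace opens an i915/perf FD in the kernel through the DRM render node / master FD.]
● DRM_IOCTL_I915_PERF_OPEN on the DRM render node / master FD, with:
  ○ sampling period
  ○ configuration id
  ○ context id (optional)
● Operations on the i915/perf FD:
  ○ read()
  ○ poll()
  ○ close()
  ○ ioctl() enable/disable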
i915 Interface
[Figure: the GPU (HW) writes snapshots to memory; the kernel exposes them through the i915/perf FD; userspace receives each snapshot preceded by a header.]
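The header-plus-snapshot stream that userspace reads can be walked record by record. The following is a minimal sketch, not taken from the slides: it assumes the record header matches `struct drm_i915_perf_record_header` from `i915_drm.h` (a 32-bit type, a 16-bit pad and a 16-bit size that includes the header itself), little-endian byte order, and a sample record type of 1. It runs on synthetic bytes rather than a real perf FD, and `split_records` is a hypothetical helper name.

```python
import struct

# Assumed layout of struct drm_i915_perf_record_header:
# u32 type, u16 pad, u16 size (size counts the header itself).
HEADER = struct.Struct("<IHH")
RECORD_SAMPLE = 1  # assumed value of DRM_I915_PERF_RECORD_SAMPLE


def split_records(buf):
    """Yield (type, payload) for each record in a buffer returned by read()."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        rec_type, _pad, size = HEADER.unpack_from(buf, offset)
        if size < HEADER.size or offset + size > len(buf):
            break  # malformed or truncated record: stop rather than guess
        yield rec_type, buf[offset + HEADER.size:offset + size]
        offset += size


# Synthetic data standing in for what read() on the perf FD would return:
# two sample records, each carrying a 256-byte snapshot.
fake = b"".join(
    HEADER.pack(RECORD_SAMPLE, 0, HEADER.size + 256) + bytes([i]) * 256
    for i in range(2)
)
records = list(split_records(fake))
print([(rec_type, len(payload)) for rec_type, payload in records])
```

Advancing by the size field rather than by a fixed stride keeps the loop valid if the kernel interleaves records of other types and lengths with the samples.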
Userspace
● Metrics Discovery (used by Graphics Performance Analyzers / VTune)
  ○ https://github.com/intel/metrics-discovery
● GL_INTEL_performance_query extension
  ○ https://www.khronos.org/registry/OpenGL/extensions/INTEL/INTEL_performance_query.txt
● GPUTop
  ○ https://github.com/rib/gputop
OpenGL performance queries
We can’t extract all the performance counters in one pass.
Counters are grouped in query IDs:
● Render Metrics Basic
● Compute Metrics Basic
● Render Metrics for 3D Pipeline Profile
● Memory Reads Distribution
● Memory Writes Distribution
● Compute Metrics Extended
● Compute Metrics L3 Cache
● Metric set HDCAndSF
● Metric set L3_1
● Metric set L3_2
● Metric set L3_3
● Metric set RasterizerAndPixelBackend
● Metric set Sampler
● Metric set TDL_1
● Metric set TDL_2
● Compute Metrics Extra
● Media Vme Pipe
● Gpu Rings Busyness
OpenGL performance queries
GL_INTEL_performance_query:
● List query IDs:
  ○ glGetFirstPerfQueryIdINTEL() / glGetNextPerfQueryIdINTEL()
● List counters for a given query ID:
  ○ glGetPerfCounterInfoINTEL()
● Query data:
  ○ glCreatePerfQueryINTEL() / glBeginPerfQueryINTEL() / glEndPerfQueryINTEL()
● Get data:
  ○ glGetPerfQueryDataINTEL()
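The first/next pair above implies a simple enumeration loop. The sketch below illustrates only that pattern: `get_first` and `get_next` are hypothetical Python stand-ins for glGetFirstPerfQueryIdINTEL() and glGetNextPerfQueryIdINTEL(), which in the C API return the ID through an output pointer and need a current GL context. It also assumes, from a reading of the extension text, that the "next" call yields 0 after the last ID; that is worth checking against the specification.

```python
def list_query_ids(get_first, get_next):
    """Collect query IDs by walking first/next until 0 is returned."""
    ids = []
    query_id = get_first()
    while query_id != 0:
        ids.append(query_id)
        query_id = get_next(query_id)
    return ids


# Fake "driver" exposing three query IDs, so the loop can run without GL.
_FAKE_IDS = [1, 2, 3]


def fake_first():
    return _FAKE_IDS[0]


def fake_next(query_id):
    index = _FAKE_IDS.index(query_id) + 1
    return _FAKE_IDS[index] if index < len(_FAKE_IDS) else 0


print(list_query_ids(fake_first, fake_next))
```

With a real context, each returned ID would then be passed to the counter-listing call to discover what that metric set contains.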
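OpenGL performance queries
[Figure: application call sequence alongside the reports handled by the driver.]
Application:
glUseProgram()
… (more pipeline setup)
glBindBuffer()
glClear()
glBeginPerfQueryINTEL()
glDrawArrays()
glDrawArrays()
…
glEndPerfQueryINTEL()
glGetPerfQueryDataINTEL()
Driver:
● At glBeginPerfQueryINTEL(): one report (Headers, A counters, B counters, C counters)
● At glEndPerfQueryINTEL(): a second report (Headers, A counters, B counters, C counters)
● At glGetPerfQueryDataINTEL(): A counter values, B counter values, C counter values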
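The step from two raw reports to counter values can be illustrated with a subtraction that tolerates counter wraparound. This is a sketch under an assumption the slide does not spell out: that the values are the difference between the end and begin reports, taken modulo each counter's width. The function names and the sample numbers are made up for illustration.

```python
def counter_delta(begin, end, bits):
    """Difference between two raw counter reads, allowing one wraparound."""
    return (end - begin) % (1 << bits)


def query_result(begin_report, end_report, widths):
    """Per-counter deltas between a begin report and an end report."""
    return [counter_delta(b, e, w)
            for b, e, w in zip(begin_report, end_report, widths)]


# Two 40-bit counters and one 32-bit counter (made-up raw values).
widths = [40, 40, 32]
begin = [0xFFFFFFFFF0, 1000, 0xFFFFFFFF]
end = [0x10, 1500, 4]
values = query_result(begin, end, widths)
print(values)
```

The first and third counters wrap between the two reports, and the modulo keeps their deltas small and positive instead of hugely negative.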
OpenGL performance queries
https://github.com/janesma/apitrace
GPUTop
● Client/Server model:
  ○ Server runs on the target system to monitor
  ○ Clients connect to the server and process the extracted data
● 2 clients:
  ○ Command line tool:
    ■ records accumulated samples in CSV format
    ■ tracks an application’s usage
  ○ User interface:
    ■ Observe global usage
    ■ Draw timelines
GPUTop
Server:
$ sudo gputop
Global monitoring:
$ gputop-wrapper -m RenderBasic -c AvgGpuCoreFrequency,RasterizedPixels,Sampler0Busy
Application monitoring:
$ gputop-wrapper -m RenderBasic -c AvgGpuCoreFrequency,RasterizedPixels,Sampler0Busy -- glxgears
Output:
AvgGpuCoreFrequency, RasterizedPixels, Sampler0Busy
(Hz), (pixels), (%)
295.3 MHz, 145.6 M pixels, 6.44 %
295.6 MHz, 119.5 M pixels, 4.84 %
295.8 MHz, 169.4 M pixels, 7.02 %
295.6 MHz, 97.31 M pixels, 3.97 %
295.6 MHz, 120.1 M pixels, 4.87 %
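Output in this shape is easy to post-process. The sketch below parses the five sample rows shown on the slide and averages the Sampler0Busy column. It assumes each row is comma-separated and that the only unit prefix is "M" (mega), which is all the sample contains; the exact layout printed by a real gputop-wrapper run may differ, and `parse_cell` is a hypothetical helper.

```python
SAMPLE_ROWS = """\
295.3 MHz, 145.6 M pixels, 6.44 %
295.6 MHz, 119.5 M pixels, 4.84 %
295.8 MHz, 169.4 M pixels, 7.02 %
295.6 MHz, 97.31 M pixels, 3.97 %
295.6 MHz, 120.1 M pixels, 4.87 %
"""


def parse_cell(cell):
    """Turn '295.3 MHz', '145.6 M pixels' or '6.44 %' into a plain number."""
    number, _, unit = cell.strip().partition(" ")
    value = float(number)
    if unit.startswith("M"):  # assumption: 'M' is the only prefix used
        value *= 1e6
    return value


rows = [[parse_cell(cell) for cell in line.split(",")]
        for line in SAMPLE_ROWS.splitlines()]

# Columns: AvgGpuCoreFrequency (Hz), RasterizedPixels, Sampler0Busy (%).
mean_sampler_busy = sum(row[2] for row in rows) / len(rows)
print(f"mean Sampler0Busy: {mean_sampler_busy:.2f} %")
```

The same loop could feed a plotting or alerting script, since every cell ends up as a float in base units.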
GPUTop
GPUTop - timelines
GPUTop - high frequency sampling
Give performance queries a try: https://github.com/janesma/apitrace
Give GPUTop a try (kernel 4.14 recommended): https://github.com/rib/gputop
http://gputop.com
Questions?