
Page 1: How to write powerful parallel Applications - Polyhedron

How to write powerful parallel Applications

08:30-09:00  Welcome and Coffee

09:00-09:45  Introduction to the Intel Microarchitecture and Software Implications

09:45-10:15  Introduction to Software Design Cycle - From Serial to Parallel Applications

10:15-10:30  Break

10:30-11:30  How to Optimize Applications and Identify Areas for Parallelization

11:30-12:30  Introduction to Parallel Programming Methods

12:30-13:30  Lunch

13:30-14:30  Expressing Parallelism: Using Intel® C++ and Fortran Compilers, Professional Editions 10.1, for Performance, Multi-threading

14:30-15:15  Expressing Parallelism: Introducing Threading Through Libraries

15:15-15:30  Break

15:30-16:00  Pinpoint Program Inefficiencies and Threading Bugs - Data Races and Deadlocks

16:00-16:45  Performance Tuning Threaded Software using Intel VTune Performance Analyzer and Thread Profiler

16:45-17:15  Parallel Programming Techniques and Program Testing in Cluster Environments

Page 2:

Intel® Core™ Microarchitecture

Edmund Preiss, EMEA Software Solutions Group

Page 3:

Core™ Architecture

• Moore's Law and Processor Evolution
• Introduction to the Core architecture
  – New features added in 2007
  – Intro to 45nm technology -> shrink
• New Core™ Advanced Features
• Selected Software Implications

Page 4:

Implications of Moore's Law

[Chart: transistor count (10^3 up to 10^10) vs. cost per transistor (10^0 down to 10^-7) — as the number of transistors goes UP, the cost per transistor goes DOWN.]

Scaling + Wafer Size + Volume = Lower Costs

Source: WSTS/Dataquest/Intel; Fortune Magazine

Page 5:

New Microarchitecture History

x86 line: P5 -> P6 -> Intel NetBurst® -> Banias -> Intel® Core™; alongside EPIC* (Itanium®) and IXA* (xScale)

Examples:
– P5: Pentium®
– P6: Pentium® Pro, Pentium® II/III
– NetBurst: Pentium® 4, Pentium® D, Xeon®
– Banias: Pentium® M, Core Duo®
– Intel® Core™: Conroe, Woodcrest, Merom

* IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing

Page 6:

Intel Processor Family Design Cycles

A new microarchitecture every 2 years, with a shrink/derivative in between:

– 65nm: Presler · Yonah · Dempsey (shrink/derivative); Intel® Core™ Microarchitecture (new microarchitecture)
– 45nm: Penryn family (shrink/derivative); Nehalem (new microarchitecture)
– 32nm: shrink/derivative; new microarchitecture

Goals:
– Increase performance per given clock cycle
– Increase processor frequencies
– Extend energy efficiency
– Deliver lead product for 45nm high-k + metal gate process technology
– Deliver optimized processors across each product segment and power envelope

Page 7:

Details of the Intel Core Architecture

Page 8:

Intel Core Innovations

• Intel® Wide Dynamic Execution
• Intel® Advanced Digital Media Boost
• Intel® Intelligent Power Capability
• Intel® Smart Memory Access
• Intel® Advanced Smart Cache

[Diagram: Core 1 and Core 2 share the L2 cache and the bus — wider, deeper, smarter, faster.]

Page 9:

Core™ vs. NetBurst™ µ-arch: Overview

Processor component            Intel NetBurst™         Intel Core™
Pipeline stages                31                      14
Threads per core               2                       1
L1 cache org.                  12K uop I / 16K Data    32K I / 32K Data
L2 cache org.                  2 x 2MB                 1 x 4MB (shared)
Instr. decoders                1                       4
Integer units                  2 (2x core freq)        3 (1x core freq)
SIMD units                     2 x 64-bit              3 x 128-bit
FP units                       3 (Add/Mul/Div)         3 (Add/Mul/Div)
FP inst. issued per clock      1                       Up to 2 (Add + Mul or Div)
SIMD inst. issued per clock    1                       3
Power                          135W                    80W

Values are per core.

Page 10:

45nm Technology

• Penryn – code name for an enhanced Intel® Core™ microarchitecture at 45nm
  – Industry's first 45nm high-k processor technology
  – ~2x transistor density
  – >20% gain in transistor switching speed
  – ~30% decrease in transistor switching power
  – Dual core, quad core
  – Shared L2 cache
  – Intel 64 architecture
  – 128-bit SSE

[Diagram: "Penryn"/"Wolfdale"/"Wolfdale DP" dual-core package — 2 threads, 1 package (similar to the Intel® Core™ 2 Duo processor); two cores, each with a 32K I-cache and 32K D-cache, sharing a 6M L2 cache and the bus.]

Page 11:

Core™ Microarchitecture

[Block diagram, one pipeline per core: Instruction Fetch and Pre-Decode -> Instruction Queue -> Decode (with uCode ROM) -> Rename/Alloc -> Schedulers -> execution units (ALU/Branch, ALU/FAdd, ALU/FMul, each with MMX/SSE and FPMove; Load; Store) -> Retirement Unit (Reorder Buffer), backed by the L1 D-Cache and D-TLB. Both cores share the 2MB/4MB L2 cache and the FSB (up to 10.4 GB/s).]

Page 12:

Intel® Core™ Microarchitecture

Primary interfaces:
• Front end
• Execution
• Memory

[Per-core block diagram: Instruction Fetch and Pre-Decode -> Instruction Queue -> Decode (uCode ROM) -> Rename/Alloc -> Schedulers -> ALU/Branch, ALU/FAdd, ALU/FMul (each with MMX/SSE and FPmove), Load, Store -> Retirement Unit; Memory Order Buffer and L1 D-Cache/D-TLB; 2M/4M shared L2 cache; FSB up to 10.6 GB/s.]

Page 13:

Intel® Core™ Microarchitecture — Front End

[Block diagram as on the previous page, front end highlighted.]

• Up to 6 instructions per cycle can be sent to the IQ
• Typical programs average slightly less than 4 bytes per instruction
• 4 decoders: 1 "large" and 3 "small"
  – All decoders handle "simple" 1-uop instructions
  – The large decoder handles instructions of up to 4 uops
• Detects short loops and locks them in the instruction queue (IQ)
  – Reduced front-end power consumption – total saving of up to 14%

Page 14:

Without Macro-Fusion

Read five instructions from the instruction queue; each instruction gets decoded separately, so decoding takes two cycles:

    load  eax, [mem1]
    cmp   eax, [mem2]
    jne   targ
    inc   esp
    store [mem3], ebx

Cycle 1: dec0 = load, dec1 = cmp, dec2 = jne, dec3 = inc
Cycle 2: dec0 = store

Page 15:

With Intel's New Macro-Fusion

Read five instructions from the instruction queue; the fusable pair (cmp + jne) is sent to a single decoder, so all five decode in one cycle:

    load   eax, [mem1]
    cmpjne eax, [mem2], targ   ; cmp + jne fused into one uop
    inc    esp
    store  [mem3], ebx

Cycle 1: dec0 = load, dec1 = cmpjne (fused), dec2 = inc, dec3 = store

Page 16:

Slide 15 note: 66% improvement due to macro-fusion and the additional decoder.

Page 17:

Intel® Core™ Microarchitecture — Out-of-Order Execution

[Block diagram as before, execution highlighted.]

• 4 uops renamed / retired per clock
• Uops written to RS and ROB
  – RS waits for sources to arrive, allowing OOO execution
  – ROB waits for results to show up for retirement
• 6 dispatch ports from the RS
  – 3 execution ports (integer / fp / simd)
  – load
  – store (address)
  – store (data)
• 128-bit SSE implementation
  – Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
  – Port 1 has packed add (3 cycles, all precisions)
• FP data has one additional cycle of bypass latency
  – Do not mix SSE FP and SSE integer ops on the same register

Page 18:

Intel® Advanced Digital Media Boost

An SSE operation (SSE/SSE2/SSE3) computes X(i) op Y(i) across all packed elements of a 128-bit register. Previously this was decoded and executed in two 64-bit halves over two clock cycles; each Core executes the full 128-bit operation in a single cycle.

ADVANTAGE:
• Increased performance
• 128-bit single-cycle execution in each core
• Improved energy efficiency

Page 19:

Intel® Core™ Microarchitecture — Memory Sub-system

[Block diagram as before, memory sub-system highlighted.]

• Loads & stores – 128-bit load and 128-bit store per cycle
• Data prefetching
• Memory disambiguation
• Shared cache
• L1D cache prefetching
  – Data Cache Unit prefetcher (aka streaming prefetcher): recognizes ascending access patterns in recently loaded data and prefetches the next line into the processor's cache
  – Instruction-based stride prefetcher: prefetches based upon a load having a regular stride; can prefetch forward or backward 2 KB (1/2 the default page size)
• L2 cache prefetching: Data Prefetch Logic (DPL)
  – Prefetches data to the 2nd-level cache before the DCU requests it
  – Maintains 2 tables for tracking loads: upstream – 16 entries, downstream – 4 entries

Page 20:

Intel Smart Memory Access: Prefetchers

[Animation frame: a stream of loads — Load1 (oldest) through Load4 (youngest) — against the L1 data cache and the shared L2 data cache.]

Page 21:

Intel Smart Memory Access: Prefetchers — memory is too far away.

Page 22:

Intel Smart Memory Access: Prefetchers — caches are closer when they have the data.

Page 23:

Intel Smart Memory Access: Prefetchers — prefetchers detect the application's data reference patterns…

Page 24:

Intel Smart Memory Access: Prefetchers — …and bring the data closer to the data consumer.

Page 25:

Intel Smart Memory Access: Prefetchers — solving the problem of "where".

Page 26:

Some Implications of the Core 2 Architecture for Developers Who Want to Thread Their Apps

Page 27:

Advanced Smart Cache Benefits

– Two threads which "communicate" frequently should be scheduled to the same two cores sharing an L2 cache
– Use the thread/processor affinity feature in your applications

[Diagram: quad-core processor on the FSB — Core 1 and Core 2 share one L2 cache, Core 3 and Core 4 share another.]

Page 28:

Memory Related: Avoid False Sharing

What is false sharing?
• Multiple threads repeatedly write to the same cache line shared between processors — usually to different data within it
  – Cache lines get invalidated, forcing additional reads from memory
  – Threads read/write the same cache line very rapidly
  – Severe performance impact, in tight loops in particular

Page 29:

Some Words on Pipelines (1)

• Modern CPUs may be understood through their basic design paradigm, the so-called pipeline. The pipeline breaks the processing of a single instruction into independent parts that ideally execute in identical time windows.

• The independent parts of the processing are called pipeline stages.

• Since identical processing time in each stage can't be guaranteed, most pipeline stages control a buffer or queue that supplies instructions when the previous stage is still busy, or in which instructions can be stored when the next stage is still busy.

• Underflow or overflow of a queue will cause the respective stage to run idle — a pipeline stall.

[Diagram: Fetch -> Decode -> Allocate -> Execute -> Retire, with a buffer between each pair of stages; a full buffer signals "busy" to the stage before it, an empty buffer leaves the stage after it idle — a stall.]

Page 30:

Some Words on Pipelines (2)

• In order to achieve the best performance, pipeline stalls must be avoided
• Since Core 2 performance relies on speculative execution, a mispredicted branch leads to a pipeline flush to keep the executed instructions consistent
  – Pipeline flushes must be avoided
• Understanding the Core 2 pipeline and being able to detect pipeline problems will greatly improve the performance of your software
• Knowledge of the pipeline and its registers increases the understanding and efficient usage of the VTune Performance Analyzer
  – E.g. look for cache misses and branch mispredictions

Page 31:

Uop Flow – Refer to VTune Event Counters

[Diagram of the uop flow:
• Fetch/Decode: 32 KB instruction cache, branch target buffer, Next IP, microcode sequencer, instruction decode (4-issue), Register Allocation Table (RAT)
• Execute: Reservation Stations (RS, 32 entries) with scheduler/dispatch ports to FP Add, FP Div/Mul, Integer Shift/Rotate, SIMD and Integer Arithmetic units, Load, Store Address, and Store Data; Memory Order Buffer (MOB); 32 KB data cache; bus unit to the L2 cache
• Retire: Re-Order Buffer (ROB, 96 entries), IA register set]

Event counters:
• RESOURCE_STALLS measures the transfer from decode
• RS_UOPS_DISPATCHED measures at execution
• UOPS_RETIRED measures at retirement

Detailed description in the processor manuals: http://www.intel.com/products/processor/manuals/

Page 32:

Backup