
Page 1: How to write powerful parallel Applications - Polyhedron

How to write powerful parallel Applications

08:30-09:00  Welcome and Coffee

09:00-09:45  Introduction to the Intel Microarchitecture and Software Implications

09:45-10:15  Introduction to Software Design Cycle - From Serial to Parallel Applications

10:15-10:30  Break

10:30-11:30  How to Optimize Applications and Identify Areas for Parallelization

11:30-12:30  Introduction to Parallel Programming Methods

12:30-13:30  Lunch

13:30-14:30  Expressing Parallelism: Using Intel® C++ and Fortran Compilers, Professional Editions 10.1, for Performance, Multi-threading

14:30-15:15  Expressing Parallelism: Introducing Threading Through Libraries

15:15-15:30  Break

15:30-16:00  Pinpoint Program Inefficiencies and Threading Bugs - Data Races and Deadlocks

16:00-16:45  Performance Tuning Threaded Software using Intel VTune Performance Analyzer and Thread Profiler

16:45-17:15  Parallel Programming Techniques and Program Testing in Cluster Environments

Page 2:

Intel® Core™ Microarchitecture

Edmund Preiss, EMEA Software Solutions Group

Page 3:

Core™ Architecture

• Moore's Law and Processor Evolution
• Introduction to the Core architecture
  – New features added in 2007
  – Intro to 45nm technology -> shrink
• New Core™ Advanced Features
• Selected Software Implications

Page 4:

Implications of Moore's Law

[Chart: transistor count (10^3 up to 10^10) vs. cost per transistor (10^0 down to 10^-7) — as the number of transistors goes UP, the cost per transistor goes DOWN.]

Scaling + Wafer Size + Volume = Lower Costs

Source: WSTS/Dataquest/Intel; Fortune Magazine

Page 5:

New Microarchitecture History

x86 line: P5 -> P6 -> Intel NetBurst® -> Banias -> Intel® Core™; alongside EPIC* (Itanium®) and IXA* (xScale)

Examples:
– P5: Pentium®
– P6: Pentium® Pro, Pentium® II/III
– NetBurst: Pentium® 4, Pentium® D, Xeon®
– Banias: Pentium® M, Core Duo®
– Intel® Core™: Conroe, Woodcrest, Merom

* IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing

Page 6:

Intel Processor Family Design Cycles

A new microarchitecture every 2 years, with a shrink/derivative in between:

– 65nm: Presler · Yonah · Dempsey (shrink/derivative); Intel® Core™ Microarchitecture (new microarchitecture)
– 45nm: Penryn family (shrink/derivative); Nehalem (new microarchitecture)
– 32nm: shrink/derivative; new microarchitecture

Goals:
– Increase performance per given clock cycle
– Increase processor frequencies
– Extend energy efficiency
– Deliver lead product for 45nm high-k + metal gate process technology
– Deliver optimized processors across each product segment and power envelope

Page 7:

Details of the Intel Core Architecture

Page 8:

Intel Core Innovations

• Intel® Wide Dynamic Execution
• Intel® Advanced Digital Media Boost
• Intel® Intelligent Power Capability
• Intel® Smart Memory Access
• Intel® Advanced Smart Cache

[Diagram: Core 1 and Core 2 share the L2 cache and the bus — wider, deeper, smarter, faster.]

Page 9:

Core™ vs. NetBurst™ µ-arch: Overview

Processor component            Intel NetBurst™         Intel Core™
Pipeline stages                31                      14
Threads per core               2                       1
L1 cache org.                  12K uop I / 16K Data    32K I / 32K Data
L2 cache org.                  2 x 2MB                 1 x 4MB (shared)
Instr. decoders                1                       4
Integer units                  2 (2x core freq)        3 (1x core freq)
SIMD units                     2 x 64-bit              3 x 128-bit
FP units                       3 (Add/Mul/Div)         3 (Add/Mul/Div)
FP inst. issued per clock      1                       Up to 2 (Add + Mul or Div)
SIMD inst. issued per clock    1                       3
Power                          135W                    80W

Values are per core.

Page 10:

45nm Technology

• Penryn – code name for an enhanced Intel® Core™ microarchitecture at 45nm
  – Industry's first 45nm high-k processor technology
  – ~2x transistor density
  – >20% gain in transistor switching speed
  – ~30% decrease in transistor switching power
  – Dual core, quad core
  – Shared L2 cache
  – Intel 64 architecture
  – 128-bit SSE

[Diagram: "Penryn"/"Wolfdale"/"Wolfdale DP" dual-core package — 2 threads, 1 package (similar to the Intel® Core™ 2 Duo processor); two cores, each with a 32K I-cache and 32K D-cache, sharing a 6M L2 cache and the bus.]

Page 11:

Core™ Microarchitecture

[Block diagram, one pipeline per core: Instruction Fetch and Pre-Decode -> Instruction Queue -> Decode (with uCode ROM) -> Rename/Alloc -> Schedulers -> execution units (ALU/Branch, ALU/FAdd, ALU/FMul, each with MMX/SSE and FPMove; Load; Store) -> Retirement Unit (Reorder Buffer), backed by the L1 D-Cache and D-TLB. Both cores share the 2MB/4MB L2 cache and the FSB (up to 10.4 GB/s).]

Page 12:

Intel® Core™ Microarchitecture

Primary interfaces:
• Front end
• Execution
• Memory

[Per-core block diagram: Instruction Fetch and Pre-Decode -> Instruction Queue -> Decode (uCode ROM) -> Rename/Alloc -> Schedulers -> ALU/Branch, ALU/FAdd, ALU/FMul (each with MMX/SSE and FPmove), Load, Store -> Retirement Unit; Memory Order Buffer and L1 D-Cache/D-TLB; 2M/4M shared L2 cache; FSB up to 10.6 GB/s.]

Page 13:

Intel® Core™ Microarchitecture — Front End

[Block diagram as on the previous page, front end highlighted.]

• Up to 6 instructions per cycle can be sent to the IQ
• Typical programs average slightly less than 4 bytes per instruction
• 4 decoders: 1 "large" and 3 "small"
  – All decoders handle "simple" 1-uop instructions
  – The large decoder handles instructions of up to 4 uops
• Detects short loops and locks them in the instruction queue (IQ)
  – Reduced front-end power consumption – total saving of up to 14%

Page 14:

Without Macro-Fusion

Read five instructions from the instruction queue; each instruction gets decoded separately, so decoding takes two cycles:

    load  eax, [mem1]
    cmp   eax, [mem2]
    jne   targ
    inc   esp
    store [mem3], ebx

Cycle 1: dec0 = load, dec1 = cmp, dec2 = jne, dec3 = inc
Cycle 2: dec0 = store

Page 15:

With Intel's New Macro-Fusion

Read five instructions from the instruction queue; the fusable pair (cmp + jne) is sent to a single decoder, so all five decode in one cycle:

    load   eax, [mem1]
    cmpjne eax, [mem2], targ   ; cmp + jne fused into one uop
    inc    esp
    store  [mem3], ebx

Cycle 1: dec0 = load, dec1 = cmpjne (fused), dec2 = inc, dec3 = store

Page 16:

Slide 15 note: 66% improvement due to macro-fusion and the additional decoder.

Page 17:

Intel® Core™ Microarchitecture — Out-of-Order Execution

[Block diagram as before, execution highlighted.]

• 4 uops renamed / retired per clock
• Uops written to RS and ROB
  – RS waits for sources to arrive, allowing OOO execution
  – ROB waits for results to show up for retirement
• 6 dispatch ports from the RS
  – 3 execution ports (integer / fp / simd)
  – load
  – store (address)
  – store (data)
• 128-bit SSE implementation
  – Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
  – Port 1 has packed add (3 cycles, all precisions)
• FP data has one additional cycle of bypass latency
  – Do not mix SSE FP and SSE integer ops on the same register

Page 18:

Intel® Advanced Digital Media Boost

An SSE operation (SSE/SSE2/SSE3) computes X(i) op Y(i) across all packed elements of a 128-bit register. Previously this was decoded and executed in two 64-bit halves over two clock cycles; each Core executes the full 128-bit operation in a single cycle.

ADVANTAGE:
• Increased performance
• 128-bit single-cycle execution in each core
• Improved energy efficiency

Page 19:

Intel® Core™ Microarchitecture — Memory Sub-system

[Block diagram as before, memory sub-system highlighted.]

• Loads & stores – 128-bit load and 128-bit store per cycle
• Data prefetching
• Memory disambiguation
• Shared cache
• L1D cache prefetching
  – Data Cache Unit prefetcher (aka streaming prefetcher): recognizes ascending access patterns in recently loaded data and prefetches the next line into the processor's cache
  – Instruction-based stride prefetcher: prefetches based upon a load having a regular stride; can prefetch forward or backward 2 KB (1/2 the default page size)
• L2 cache prefetching: Data Prefetch Logic (DPL)
  – Prefetches data to the 2nd-level cache before the DCU requests it
  – Maintains 2 tables for tracking loads: upstream – 16 entries, downstream – 4 entries

Page 20:

Intel Smart Memory Access: Prefetchers

[Animation frame: a stream of loads — Load1 (oldest) through Load4 (youngest) — against the L1 data cache and the shared L2 data cache.]

Page 21:

Intel Smart Memory Access: Prefetchers — memory is too far away.

Page 22:

Intel Smart Memory Access: Prefetchers — caches are closer when they have the data.

Page 23:

Intel Smart Memory Access: Prefetchers — prefetchers detect the application's data reference patterns…

Page 24:

Intel Smart Memory Access: Prefetchers — …and bring the data closer to the data consumer.

Page 25:

Intel Smart Memory Access: Prefetchers — solving the problem of "where".

Page 26:

Some Implications of the Core 2 Architecture for Developers Who Want to Thread Their Apps

Page 27:

Advanced Smart Cache Benefits

– Two threads which "communicate" frequently should be scheduled to the same two cores sharing an L2 cache
– Use the thread/processor affinity feature in your applications

[Diagram: quad-core processor on the FSB — Core 1 and Core 2 share one L2 cache, Core 3 and Core 4 share another.]

Page 28:

Memory Related: Avoid False Sharing

What is false sharing?
• Multiple threads repeatedly write to the same cache line shared between processors — usually to different data within it
  – Cache lines get invalidated, forcing additional reads from memory
  – Threads read/write the same cache line very rapidly
  – Severe performance impact, in tight loops in particular

Page 29:

Some Words on Pipelines (1)

• Modern CPUs may be understood through their basic design paradigm, the so-called pipeline. The pipeline breaks the processing of a single instruction into independent parts that ideally execute in identical time windows.

• The independent parts of the processing are called pipeline stages.

• Since identical processing time in each stage can't be guaranteed, most pipeline stages control a buffer or queue that supplies instructions when the previous stage is still busy, or in which instructions can be stored when the next stage is still busy.

• Underflow or overflow of a queue will cause the respective stage to run idle — a pipeline stall.

[Diagram: Fetch -> Decode -> Allocate -> Execute -> Retire, with a buffer between each pair of stages; a full buffer signals "busy" to the stage before it, an empty buffer leaves the stage after it idle — a stall.]

Page 30:

Some Words on Pipelines (2)

• In order to achieve the best performance, pipeline stalls must be avoided
• Since Core 2 performance relies on speculative execution, a mispredicted branch leads to a pipeline flush to keep the executed instructions consistent
  – Pipeline flushes must be avoided
• Understanding the Core 2 pipeline and being able to detect pipeline problems will greatly improve the performance of your software
• Knowledge of the pipeline and its registers increases the understanding and efficient usage of the VTune Performance Analyzer
  – E.g. look for cache misses and branch mispredictions

Page 31:

Uop Flow – Refer to VTune Event Counters

[Diagram of the uop flow:
• Fetch/Decode: 32 KB instruction cache, branch target buffer, Next IP, microcode sequencer, instruction decode (4-issue), Register Allocation Table (RAT)
• Execute: Reservation Stations (RS, 32 entries) with scheduler/dispatch ports to FP Add, FP Div/Mul, Integer Shift/Rotate, SIMD and Integer Arithmetic units, Load, Store Address, and Store Data; Memory Order Buffer (MOB); 32 KB data cache; bus unit to the L2 cache
• Retire: Re-Order Buffer (ROB, 96 entries), IA register set]

Event counters:
• RESOURCE_STALLS measures the transfer from decode
• RS_UOPS_DISPATCHED measures at execution
• UOPS_RETIRED measures at retirement

Detailed description in the processor manuals: http://www.intel.com/products/processor/manuals/

Page 32:

Backup