simd - ucsbtyang/class/240a17/slides/simd.pdf · simd: single instruction, multiple data + •...

SIMD Programming

CS240A, 2017


Flynn* Taxonomy, 1966

• In 2013, SIMD and MIMD are the most common forms of parallelism in architectures – usually both in the same system!

• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
– Single program that runs on all processors of a MIMD
– Cross-processor execution coordination using synchronization primitives

• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
– Scientific computing, signal processing, multimedia (audio/video processing)

*Prof. Michael Flynn, Stanford

Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")

• A SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)

SIMD: Single Instruction, Multiple Data

• Scalar processing
– traditional mode
– one operation produces one result

• SIMD processing
– With Intel SSE/SSE2
– SSE = streaming SIMD extensions
– one operation produces multiple results

[Figure: scalar X + Y produces one result; SIMD X + Y on X = (x3, x2, x1, x0) and Y = (y3, y2, y1, y0) produces (x3+y3, x2+y2, x1+y1, x0+y0)]

Slide Source: Alex Klimovitski & Dean Macri, Intel Corporation
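The scalar/SIMD contrast can be sketched in C with SSE intrinsics. This is a minimal illustration, not code from the slides: the function names `add_scalar` and `add_simd4` are hypothetical, and the unaligned load/store intrinsics are used so that no assumption about the alignment of the arrays is needed.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Scalar processing: one addition produces one result. */
float add_scalar(float x, float y)
{
    return x + y;
}

/* SIMD processing: one packed add produces four results.
 * x, y, and out each point to 4 floats; no alignment is assumed,
 * so the unaligned load/store intrinsics are used. */
void add_simd4(const float *x, const float *y, float *out)
{
    __m128 vx = _mm_loadu_ps(x);       /* four floats from x */
    __m128 vy = _mm_loadu_ps(y);       /* four floats from y */
    __m128 vsum = _mm_add_ps(vx, vy);  /* four element-wise sums at once */
    _mm_storeu_ps(out, vsum);
}
```

A caller would pass three 4-element `float` arrays to `add_simd4`; each output element is the sum of the corresponding input elements.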

What does this mean to you?

• In addition to SIMD extensions, the processor may have other special instructions
– Fused Multiply-Add (FMA) instructions: x = y + c * z is so common that some processors execute the multiply/add as a single instruction, at the same rate (bandwidth) as + or * alone

• In theory, the compiler understands all of this
– When compiling, it will rearrange instructions to get a good "schedule" that maximizes pipelining and uses FMAs and SIMD
– It works with the mix of instructions inside an inner loop or other block of code

• But in practice the compiler may need your help
– Choose a different compiler, optimization flags, etc.
– Rearrange your code to make things more obvious
– Use special functions ("intrinsics") or write in assembly :-(
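One way to "make things more obvious" is to tell the compiler that arrays do not overlap. The sketch below is an illustration under assumptions, not part of the slides: the function name `axpy` is hypothetical, and whether the loop is actually vectorized or compiled to FMA instructions depends on the compiler, the target processor, and the flags (for example `-O3` together with `-mfma` or `-march=native` on GCC or Clang).

```c
#include <stddef.h>

/* y[i] = y[i] + c * z[i] for i in [0, n).
 * `restrict` promises the compiler that y and z do not overlap,
 * which removes one common obstacle to automatic vectorization. */
void axpy(size_t n, double c, const double *restrict z, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + c * z[i];
}
```

The loop body has exactly the x = y + c * z shape that an FMA instruction computes, so a compiler that targets FMA hardware is free to fuse the multiply and the add.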

Intel SIMD Extensions

• MMX: 64-bit registers, reusing floating-point registers [1997]

• SSE, SSE2/3/4: 8 new 128-bit registers [SSE: 1999]

• AVX: new 256-bit registers [2011]
– Space for expansion to 1024-bit registers

SSE/SSE2 SIMD on Intel

• SSE2 data types: anything that fits into 16 bytes, e.g., 16x bytes, 4x floats, 2x doubles

• Instructions perform add, multiply etc. on all the data in parallel

• Similar on GPUs, vector processors (but many more simultaneous operations)

Intel Architecture SSE2+ 128-Bit SIMD Data Types

[Figure: a 128-bit register divided into 16 bytes (16 / 128 bits), 8 words (8 / 128 bits), 4 doublewords (4 / 128 bits), or 2 quadwords (2 / 128 bits)]

• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
– Single-precision FP: doubleword (32 bits)
– Double-precision FP: quadword (64 bits)
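The element counts above follow directly from dividing 16 bytes by the element size. As a small sketch (the `LANES_*` constant names are made up for this illustration), the same arithmetic can be written against the SSE2 vector types from `<emmintrin.h>`:

```c
#include <emmintrin.h>  /* SSE2: __m128i, __m128, __m128d */
#include <stdint.h>

/* Number of elements of each type that fit in one 128-bit register. */
enum {
    LANES_BYTES   = sizeof(__m128i) / sizeof(int8_t),   /* 16 bytes */
    LANES_WORDS   = sizeof(__m128i) / sizeof(int16_t),  /* 8 words */
    LANES_DWORDS  = sizeof(__m128i) / sizeof(int32_t),  /* 4 doublewords */
    LANES_QWORDS  = sizeof(__m128i) / sizeof(int64_t),  /* 2 quadwords */
    LANES_FLOATS  = sizeof(__m128)  / sizeof(float),    /* 4 single-precision */
    LANES_DOUBLES = sizeof(__m128d) / sizeof(double)    /* 2 double-precision */
};
```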

Packed and Scalar Double-Precision Floating-Point Operations

[Figure: two panels, Packed and Scalar]

SSE/SSE2 Floating Point Instructions

• xmm: one operand is a 128-bit SSE2 register
• mem/xmm: other operand is in memory or an SSE2 register
• {SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
• {PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
• {SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
• {PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
• {A} means the 128-bit operand is aligned in memory
• {U} means the 128-bit operand is unaligned in memory
• {H} means move the high half of the 128-bit operand
• {L} means move the low half of the 128-bit operand
• Move does both load and store

Example: SIMD Array Processing

    for each f in array
        f = sqrt(f)

Scalar style:

    for each f in array
    {
        load f to floating-point register
        calculate the square root
        write the result from the register to memory
    }

SIMD style:

    for each 4 members in array
    {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        store the 4 results from the register to memory
    }
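The SIMD-style pseudocode maps fairly directly onto SSE intrinsics. The following is a sketch under assumptions rather than the slides' own code: the function name `sqrt_array_simd` is hypothetical, unaligned loads and stores are used so the array needs no special alignment, and the leftover elements (when the length is not a multiple of 4) are handled one at a time with the scalar-single variant of the same operation.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps */

/* Replace each element of a[0..n-1] with its square root. */
void sqrt_array_simd(float *a, size_t n)
{
    size_t i = 0;

    /* SIMD style: four square roots per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);  /* load 4 members */
        v = _mm_sqrt_ps(v);              /* 4 square roots in one operation */
        _mm_storeu_ps(a + i, v);         /* store the 4 results */
    }

    /* Scalar style: the remaining 0 to 3 elements, one at a time. */
    for (; i < n; i++) {
        __m128 v = _mm_load_ss(a + i);   /* load 1 member into the low lane */
        v = _mm_sqrt_ss(v);              /* 1 square root */
        _mm_store_ss(a + i, v);          /* store the low lane */
    }
}
```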

Data-Level Parallelism and SIMD

• SIMD wants adjacent values in memory that can be operated on in parallel

• Usually specified in programs as loops
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• How can we reveal more data-level parallelism than is available in a single iteration of a loop?

• Unroll the loop and adjust the iteration rate

Loop Unrolling in C

• Instead of the compiler doing loop unrolling, you could do it yourself in C
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten
    for(i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }

Generalizing Loop Unrolling

• A loop of n iterations
• k copies of the body of the loop
• Assuming (n mod k) ≠ 0
– Then we will run the loop with 1 copy of the body (n mod k) times
– and then with k copies of the body floor(n/k) times
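The general recipe can be sketched as a C function for k = 4. This is an illustration only: the function name `add_scalar_unrolled` and the macro `K` are hypothetical, and the array is indexed from 0 upward rather than from n down to 1 as in the slides' loops.

```c
#include <stddef.h>

#define K 4  /* number of copies of the loop body */

/* Add s to every element of x[0..n-1], with the loop unrolled K times. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    size_t i = 0;
    size_t head = n % K;

    /* 1 copy of the body, (n mod K) times */
    for (; i < head; i++)
        x[i] = x[i] + s;

    /* K copies of the body, floor(n / K) times */
    for (; i < n; i += K) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

After the first loop the number of remaining elements is a multiple of K, so the unrolled loop never reads or writes past the end of the array.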

General Loop Unrolling with a Head

• Handling loop iterations indivisible by step size.
    for(i=1003; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten
    for(i=1003; i>1000; i--)      // Handle the head (1003 mod 4)
        x[i] = x[i] + s;

    for(i=1000; i>0; i=i-4) {     // handle other iterations
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }

Tail method for general loop unrolling

• Handling loop iterations indivisible by step size.
    for(i=1003; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten
    for(i=1003; i>0 && i > 1003 % 4; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }

    for(i=1003 % 4; i>0; i--)     // special handling in the tail
        x[i] = x[i] + s;

Another loop unrolling example

Normal loop:

    int x;
    for (x = 0; x < 103; x++) {
        delete(x);
    }

After loop unrolling:

    int x;
    for (x = 0; x < 103/5*5; x += 5) {
        delete(x);
        delete(x + 1);
        delete(x + 2);
        delete(x + 3);
        delete(x + 4);
    }
    /* Tail */
    for (x = 103/5*5; x < 103; x++) {
        delete(x);
    }

Intel SSE Intrinsics

Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions

• Vector data type: __m128d

    Intrinsics:          Corresponding SSE instructions:

• Load and store operations:
    _mm_load_pd          MOVAPD / aligned, packed double
    _mm_store_pd         MOVAPD / aligned, packed double
    _mm_loadu_pd         MOVUPD / unaligned, packed double
    _mm_storeu_pd        MOVUPD / unaligned, packed double

• Load and broadcast across vector:
    _mm_load1_pd         MOVSD + shuffling/duplicating

• Arithmetic:
    _mm_add_pd           ADDPD / add, packed double
    _mm_mul_pd           MULPD / multiply, packed double
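A short sketch shows how these intrinsics combine in practice. It is an illustration under assumptions, not code from the slides: the function name `scale_add_pd` is hypothetical, and the unaligned variants are chosen so that the caller's arrays need no 16-byte alignment (the aligned variants `_mm_load_pd` / `_mm_store_pd` would require it).

```c
#include <emmintrin.h>  /* SSE2 intrinsics for packed doubles */
#include <stddef.h>

/* out[i] = a[i] * (*s) + b[i] for i in [0, n). */
void scale_add_pd(const double *a, const double *b, const double *s,
                  double *out, size_t n)
{
    __m128d vs = _mm_load1_pd(s);  /* [*s | *s], broadcast to both halves */
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load, 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vr = _mm_add_pd(_mm_mul_pd(va, vs), vb);  /* multiply, then add */
        _mm_storeu_pd(out + i, vr);         /* unaligned store, 2 doubles */
    }

    if (i < n)  /* at most one element is left */
        out[i] = a[i] * (*s) + b[i];
}
```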

Example 1: Use of SSE SIMD instructions

• for (i=0; i<n; i++) sum = sum + a[i];

• Set 128-bit temp = 0;
    for (i=0; i<n/4*4; i=i+4) {
        Add 4 integers with 128 bits from &a[i] to temp;
    }

    Tail: Copy out 4 integers of temp and add them together to sum.
    for (i=n/4*4; i<n; i++) sum += a[i];

Related SSE SIMD instructions

    __m128i _mm_setzero_si128()
        returns a 128-bit zero vector

    __m128i _mm_loadu_si128(__m128i *p)
        loads data stored at pointer p in memory into a 128-bit vector, returns this vector

    __m128i _mm_add_epi32(__m128i a, __m128i b)
        returns vector (a0+b0, a1+b1, a2+b2, a3+b3)

    void _mm_storeu_si128(__m128i *p, __m128i a)
        stores content of 128-bit vector "a" to memory starting at pointer p

Related SSE SIMD instructions

• Add 4 integers with 128 bits from &a[i] to temp vector with loop body temp = temp + a[i]

• Add 128 bits, then next 128 bits…

    __m128i temp = _mm_setzero_si128();
    __m128i temp1 = _mm_loadu_si128((__m128i *)(a+i));
    temp = _mm_add_epi32(temp, temp1);
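Putting the loop, the vector accumulation, and the tail together gives a complete function. The sketch below follows the outline of Example 1 but is not taken from the slides: the function name `sum_simd` and the temporary array `lanes` are hypothetical, and it assumes the total (and each of the four partial sums) fits in a 32-bit `int`.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Sum the n elements of a, four 32-bit integers at a time. */
int sum_simd(const int *a, int n)
{
    int lanes[4];
    int sum = 0;
    int i;
    __m128i temp = _mm_setzero_si128();  /* four partial sums, all zero */

    for (i = 0; i < n / 4 * 4; i = i + 4) {
        __m128i temp1 = _mm_loadu_si128((const __m128i *)(a + i));
        temp = _mm_add_epi32(temp, temp1);
    }

    /* Tail: copy out the 4 partial sums and add them together. */
    _mm_storeu_si128((__m128i *)lanes, temp);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Remaining n mod 4 elements, one at a time. */
    for (i = n / 4 * 4; i < n; i++)
        sum += a[i];

    return sum;
}
```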

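Example 2: 2x2 Matrix Multiply

Definition of Matrix Multiply:

$$C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$$

$$
\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}
\times
\begin{bmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{bmatrix}
=
\begin{bmatrix}
C_{1,1} = A_{1,1}B_{1,1} + A_{1,2}B_{2,1} & C_{1,2} = A_{1,1}B_{1,2} + A_{1,2}B_{2,2} \\
C_{2,1} = A_{2,1}B_{1,1} + A_{2,2}B_{2,1} & C_{2,2} = A_{2,1}B_{1,2} + A_{2,2}B_{2,2}
\end{bmatrix}
$$

For example:

$$
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\times
\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}
=
\begin{bmatrix}
C_{1,1} = 1 \cdot 1 + 0 \cdot 2 = 1 & C_{1,2} = 1 \cdot 3 + 0 \cdot 4 = 3 \\
C_{2,1} = 0 \cdot 1 + 1 \cdot 2 = 2 & C_{2,2} = 0 \cdot 3 + 1 \cdot 4 = 4
\end{bmatrix}
$$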
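Before vectorizing, it can help to have a plain scalar version of the definition to compare against. The sketch below is an assumption-labeled illustration, not the slides' code: the function name `matmul_colmajor` is hypothetical, and it stores the matrices in column order (the layout the SSE version in these slides uses), accumulating into C, so the caller is expected to zero C first.

```c
/* C = C + A * B for n x n matrices stored in column order:
 * element (i, j), counting from 0, lives at index i + j * n. */
void matmul_colmajor(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j++)          /* column of C and B */
        for (int k = 0; k < n; k++)      /* summation index */
            for (int i = 0; i < n; i++)  /* row of C and A */
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```

With A set to the 2x2 identity and B to the example matrix above, both in column order, the result equals B.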

Example: 2x2 Matrix Multiply

• Using the XMM registers
– 64-bit / double precision / two doubles per XMM reg

    C1 = [ C1,1 | C2,1 ]    C2 = [ C1,2 | C2,2 ]    (stored in memory in column order)
    B1 = [ Bi,1 | Bi,1 ]    B2 = [ Bi,2 | Bi,2 ]
    A  = [ A1,i | A2,i ]


• Initialization

• i = 1

    C1 = [ 0 | 0 ]          C2 = [ 0 | 0 ]
    B1 = [ B1,1 | B1,1 ]    B2 = [ B1,2 | B1,2 ]
    A  = [ A1,1 | A2,1 ]

– _mm_load_pd: loads 2 doubles into an XMM reg; stored in memory in column order
– _mm_load1_pd: SSE instruction that loads one double and stores it in both the high and low halves of the XMM register (duplicates the value in both halves of XMM)

Example: 2x2 Matrix Multiply

• First iteration intermediate result

• i = 1

    C1 = [ 0 + A1,1 B1,1 | 0 + A2,1 B1,1 ]    C2 = [ 0 + A1,1 B1,2 | 0 + A2,1 B1,2 ]
    B1 = [ B1,1 | B1,1 ]                      B2 = [ B1,2 | B1,2 ]

    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));

SSE instructions first do parallel multiplies and then parallel adds in XMM registers

• First iteration intermediate result

• i = 2

    C1 = [ 0 + A1,1 B1,1 | 0 + A2,1 B1,1 ]    C2 = [ 0 + A1,1 B1,2 | 0 + A2,1 B1,2 ]
    B1 = [ B2,1 | B2,1 ]                      B2 = [ B2,2 | B2,2 ]
    A  = [ A1,2 | A2,2 ]

– _mm_load_pd: stored in memory in column order


• Second iteration intermediate result

• i = 2

    C1 = [ A1,1 B1,1 + A1,2 B2,1 | A2,1 B1,1 + A2,2 B2,1 ] = [ C1,1 | C2,1 ]
    C2 = [ A1,1 B1,2 + A1,2 B2,2 | A2,1 B1,2 + A2,2 B2,2 ] = [ C1,2 | C2,2 ]
    B1 = [ B2,1 | B2,1 ]    B2 = [ B2,2 | B2,2 ]

Example: 2x2 Matrix Multiply (Part 1 of 2)

    #include <stdio.h>
    // header file for SSE compiler intrinsics
    #include <emmintrin.h>

    // NOTE: vector registers will be represented in comments as v1 = [a | b]
    // where v1 is a variable of type __m128d and a, b are doubles

    int main(void) {
        // allocate A, B, C aligned on 16-byte boundaries
        double A[4] __attribute__((aligned(16)));
        double B[4] __attribute__((aligned(16)));
        double C[4] __attribute__((aligned(16)));
        int lda = 2;
        int i = 0;
        // declare several 128-bit vector variables
        __m128d c1, c2, a, b1, b2;

        // Initialize A, B, C for example
        /* A = (note column order!)
           1 0
           0 1
         */
        A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

        /* B = (note column order!)
           1 3
           2 4
         */
        B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

        /* C = (note column order!)
           0 0
           0 0
         */
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2x2 Matrix Multiply (Part 2 of 2)

        // use aligned loads to set
        // c1 = [c_11 | c_21]
        c1 = _mm_load_pd(C + 0*lda);
        // c2 = [c_12 | c_22]
        c2 = _mm_load_pd(C + 1*lda);

        for (i = 0; i < 2; i++) {
            /* a =
               i = 0: [a_11 | a_21]
               i = 1: [a_12 | a_22]
             */
            a = _mm_load_pd(A + i*lda);
            /* b1 =
               i = 0: [b_11 | b_11]
               i = 1: [b_21 | b_21]
             */
            b1 = _mm_load1_pd(B + i + 0*lda);
            /* b2 =
               i = 0: [b_12 | b_12]
               i = 1: [b_22 | b_22]
             */
            b2 = _mm_load1_pd(B + i + 1*lda);

            /* c1 =
               i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
               i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
             */
            c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
            /* c2 =
               i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
               i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
             */
            c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
        }

        // store c1, c2 back into C for completion
        _mm_store_pd(C + 0*lda, c1);
        _mm_store_pd(C + 1*lda, c2);

        // print C
        printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
        return 0;
    }

Conclusion

• Flynn Taxonomy
• Intel SSE SIMD Instructions
– Exploit data-level parallelism in loops
– One instruction fetch that operates on multiple operands simultaneously
– 128-bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs through use of intrinsics
– Achieve efficiency beyond that of an optimizing compiler
