performance migration from intel westmere to intel sandy bridge

© 2012 IBM Corporation Performance migration from Intel Westmere to Intel Sandy Bridge thru Advanced Vector Extensions (AVX) Nagarajan Kathiresan IBM India Presented by Giri Prabhakar Contact: [email protected] [email protected]


Page 1: Performance migration from Intel Westmere to Intel Sandy Bridge

© 2012 IBM Corporation

Performance migration from Intel Westmere to Intel Sandy Bridge thru Advanced Vector Extensions (AVX)

Nagarajan Kathiresan

IBM India

Presented by Giri Prabhakar

Contact: [email protected]@in.ibm.com

Page 2: Performance migration from Intel Westmere to Intel Sandy Bridge
Page 3: Performance migration from Intel Westmere to Intel Sandy Bridge

IBM India © 2012 IBM Corporation

Source: Intel MMX, SSE and AVX

Page 4: Performance migration from Intel Westmere to Intel Sandy Bridge


“I must have the Intel compiler, it has sped up our application by two.” - A customer when moving from version 9.1 to version 10 of the Intel compiler

Source: Intel

Page 5: Performance migration from Intel Westmere to Intel Sandy Bridge
Page 6: Performance migration from Intel Westmere to Intel Sandy Bridge


Source: Intel AVX

Page 7: Performance migration from Intel Westmere to Intel Sandy Bridge


Source: Intel SSE & AVX

Page 8: Performance migration from Intel Westmere to Intel Sandy Bridge


Source: Intel Compiler tunings

Page 9: Performance migration from Intel Westmere to Intel Sandy Bridge

The following figure illustrates the data types used in the SSE and Intel® AVX instructions. Roughly, for Intel® AVX, any multiple of a 32-bit or 64-bit floating-point type that adds up to 128 or 256 bits is allowed, as well as multiples of any integer type that add up to 128 bits.

Source: Intel MMX, SSE and AVX
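As a small illustration of the packing rule described on this slide, the sketch below computes how many floating-point elements fit in one register. The helper name `lanes` is hypothetical and only divides the register width by the element width.

```shell
# lanes REGISTER_BITS ELEMENT_BITS
# Prints how many elements of the given width fit in one register.
# (Hypothetical helper for illustration; not part of any toolchain.)
lanes() {
  echo $(( $1 / $2 ))
}

printf 'SSE (128-bit): %s floats, %s doubles\n' "$(lanes 128 32)" "$(lanes 128 64)"
printf 'AVX (256-bit): %s floats, %s doubles\n' "$(lanes 256 32)" "$(lanes 256 64)"
```

So a 256-bit AVX register holds twice as many single- or double-precision values as a 128-bit SSE register.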

Page 10: Performance migration from Intel Westmere to Intel Sandy Bridge

About AVX Performance - Summary

AVX doubles the 128-bit SSE registers to 256 bits.

It introduces an entirely new instruction encoding (VEX).

The new encoding switches from 2-operand instructions to 3-operand instructions, allowing the destination register to be different from the source registers. Example:

addps r0, r1 # (r0 = r0 + r1)
vaddps r0, r1, r2 # (r0 = r1 + r2)

This new encoding is used not only for the new 256-bit instructions, but also for the 128-bit AVX versions of all the old SSE instructions. This means that existing SSE code can be improved without requiring a switch to 256-bit registers.

Switching to AVX is very easy: simply recompile with -mavx. In addition to using -mavx

Page 11: Performance migration from Intel Westmere to Intel Sandy Bridge


Source: Compiling for AVX, Intel
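Before compiling for AVX it can be useful to confirm that the target node actually reports the feature. The following is a minimal sketch, assuming a Linux system that exposes CPU flags in /proc/cpuinfo; the helper name `cpu_has_flag` is hypothetical.

```shell
# cpu_has_flag FLAG [CPUINFO_FILE]
# Succeeds if FLAG appears as a whole word in the first "flags" line.
# Assumes the Linux /proc/cpuinfo format; the file argument is optional
# and exists mainly so the check can be tried on a saved copy.
cpu_has_flag() {
  grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -qw -- "$1"
}

if cpu_has_flag avx; then
  echo "CPU reports AVX"
else
  echo "CPU does not report AVX (or /proc/cpuinfo is unavailable)"
fi
```

On a Sandy Bridge node the first message would be expected; on a Westmere node the flag list contains sse4_2 and aes but not avx.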

Page 12: Performance migration from Intel Westmere to Intel Sandy Bridge

Intel and GNU compiler for AVX

Intel's 12.1 compiler uses the OpenMP 3.1 standard, while the CP2K source code uses the OpenMP 2.5 standard.

Some OpenMP classes could not be compiled with the Intel compiler.

The GNU compiler is open source, and appears to be more 'in step' with the CP2K source.

However, it is “difficult” to get the system admin of a very large installation to make a root installation of a recent GNU compiler (version 4.3 or later).

Therefore, experiments were tried with a local build of GNU (Gfortran).

While -mavx does “work”, i.e., the code compiles, it doesn't “AVX vectorize”: it was found that the flags -march=corei7-avx -mtune=corei7-avx were necessary to enable AVX.
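One way to see whether a given flag set really produces 256-bit AVX code is to inspect the generated assembly for ymm registers. The sketch below is an assumption-laden illustration: the kernel `axpy_demo` and the helper `count_ymm` are hypothetical, and the result depends on the Gfortran version, so no particular counts are claimed.

```shell
# count_ymm: count assembly lines on stdin that reference 256-bit ymm registers.
count_ymm() {
  grep -c '%ymm[0-9]' || true
}

# Hypothetical test kernel (not from the slides): a simple vectorizable loop.
workdir=$(mktemp -d)
cat > "$workdir/axpy_demo.f90" <<'EOF'
subroutine axpy_demo(n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do
end subroutine axpy_demo
EOF

# Compare the two flag sets discussed on this slide, if gfortran is available.
if command -v gfortran >/dev/null 2>&1; then
  for flags in "-mavx" "-march=corei7-avx -mtune=corei7-avx"; do
    # $flags is left unquoted on purpose so it splits into separate options.
    n=$(gfortran -O3 $flags -S -o - "$workdir/axpy_demo.f90" 2>/dev/null | count_ymm) || n="n/a"
    echo "$flags: $n assembly lines using ymm registers"
  done
fi
rm -rf "$workdir"
```

A count of zero for a flag set suggests the loop was not vectorized with 256-bit registers under that compiler version.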

Page 13: Performance migration from Intel Westmere to Intel Sandy Bridge

How to build the Gfortran compiler locally

Gfortran dependent libraries:
– GNU Multiple Precision Library (GMP)
– MPFR Library (http://www.mpfr.org/)
– MPC Library (http://www.multiprecision.org/)
– Parma Polyhedra Library (PPL)
– CLooG-PPL or CLooG (ftp://gcc.gnu.org/pub/gcc/infrastructure/ as cloog-ppl-0.15.tar.gz)

Page 14: Performance migration from Intel Westmere to Intel Sandy Bridge

Gfortran Local Build

[Diagram: MPFR, MPC, GMP, PPL and CLooG(-PPL), leading to GFORTRAN]
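The build order implied by this diagram can be sketched as a small ordering check. The dependency table below is read off the configure options shown on the install slides (for example, MPC is configured with both MPFR and GMP); the helper names `deps_of` and `check_order` are hypothetical.

```shell
# deps_of PACKAGE: print the packages that must already be installed.
# The table reflects the --with-* options used on the install slides.
deps_of() {
  case $1 in
    gmp)       echo "" ;;
    mpfr)      echo "gmp" ;;
    mpc)       echo "gmp mpfr" ;;
    ppl)       echo "gmp" ;;
    cloog-ppl) echo "gmp ppl" ;;
    gcc)       echo "gmp mpfr mpc ppl cloog-ppl" ;;
  esac
}

# check_order "PKG1 PKG2 ...": verify every dependency appears earlier in the list.
check_order() {
  seen=" "
  for pkg in $1; do
    for dep in $(deps_of "$pkg"); do
      case $seen in
        *" $dep "*) ;;
        *) echo "error: $pkg needs $dep first" >&2; return 1 ;;
      esac
    done
    seen="$seen$pkg "
  done
  echo "ok: $1"
}

check_order "gmp mpfr mpc ppl cloog-ppl gcc"
```

Any order that passes the check is usable; GMP has to come first because every other library is configured against it.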

Page 15: Performance migration from Intel Westmere to Intel Sandy Bridge

FFTW_INC = /user/naga/hybrid/Endeavor/fftw/include
FFTW_LIB = /user/naga/hybrid/Endeavor/fftw/lib
CC = gcc
CPP =
FC = mpif90
LD = mpif90
AR = ar -r
CPPFLAGS =
DFLAGS = -D__GFORTRAN -D__FFTSG -D__LIBINT -D__parallel -D__SCALAPACK -D__BLACS -D__FFTW3 -D__MAX_CONTR=3 -D__GRID_CORE=2
FCFLAGS = -I$(FFTW_INC) -O3 -fopenmp -ffast-math -march=corei7-avx -mtune=corei7-avx -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS = $(FCFLAGS)
LIBS = /user/naga/hybrid/Endeavor/libint_cpp_wrapper.o \
       /user/naga/hybrid/Endeavor/libint/lib/libderiv.a \
       /user/naga/hybrid/Endeavor/libint/lib/libint.a \
       /user/naga/hybrid/Endeavor/libs/libscalapack.a \
       /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/blacsCinit_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/blacsF77init_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/lapack_LINUX.a \
       /user/naga/hybrid/Endeavor/libs/blas_LINUX.a \
       $(FFTW_LIB)/libfftw3.a \
       -lstdc++ -lpthread
OBJECTS_ARCHITECTURE = machine_gfortran.o

Page 16: Performance migration from Intel Westmere to Intel Sandy Bridge

CP2K Build

[Diagram: BLACS, FFTW, BLAS, LAPACK and SCALAPACK, leading to CP2K]

Page 17: Performance migration from Intel Westmere to Intel Sandy Bridge

CP2K Execution time

[Figure: bar chart of total execution time (as a ratio) for SNB GF 4.5, SNB GF 4.7, SNB GF 4.7 OPT and WSM GF 4.5. Lower is better.]

Page 18: Performance migration from Intel Westmere to Intel Sandy Bridge

MPI Synchronization time

[Figure: bar chart of MPI synchronization time (as a ratio) for GF 4.5, GF 4.7 and GF 4.7 Opt on SNB, and GF 4.5 on WSM. Lower is better.]

Page 19: Performance migration from Intel Westmere to Intel Sandy Bridge

MPI PERFORMANCE

[Figure: bar chart of performance in MB/s per MPI routine (MP_Bcast, MP_ISendRecv, MP_ISend, MP_IRecv, MP_Recv) for SDB Gfortran 4.5, SDB Gfortran 4.7, SDB Gfortran 4.7 Optimized and WSM Gfortran 4.5. Higher is better.]

Page 20: Performance migration from Intel Westmere to Intel Sandy Bridge

Acknowledgements / Technical advisory

Swamy Kandadai

Luigi Brochard

Raj Panda

Page 21: Performance migration from Intel Westmere to Intel Sandy Bridge


Page 22: Performance migration from Intel Westmere to Intel Sandy Bridge

Sandy Bridge vs Westmere

Sandy Bridge:
· 32 kB data + 32 kB instruction L1 cache (3 clocks) and 256 kB L2 cache (8 clocks) per core
· Shared L3 cache includes the processor graphics (LGA 1155)
· 64-byte cache line size
· Two load/store operations per CPU cycle for each memory channel
· Decoded micro-operation cache and enlarged, optimized branch predictor
· Improved performance for transcendental mathematics, AES encryption (AES instruction set), and SHA-1 hashing
· 256-bit/cycle ring bus interconnect between cores, graphics, cache and System Agent Domain
· Advanced Vector Extensions (AVX) 256-bit instruction set with wider vectors, new extensible syntax and rich functionality
· Intel Quick Sync Video, hardware support for video encoding and decoding
· Up to 8 physical cores or 16 logical cores through Hyper-threading

Westmere:
· Native six-core (Gulftown) and ten-core (Westmere-EX) processors.[8]
· A new set of instructions that gives over 3x the encryption and decryption rate of Advanced Encryption Standard (AES) processes compared to before.[9]
· Delivers seven new instructions (AES instruction set or AES-NI) that will be used by the AES algorithm. Also an instruction called PCLMULQDQ (see CLMUL instruction set) that will perform carry-less multiplication for use in cryptography.[10] These instructions will allow the processor to perform hardware-accelerated encryption, not only resulting in faster execution but also protecting against software targeted attacks.
· Integrated graphics, added into the processor package (dual core Arrandale and Clarkdale only).
· Improved virtualization latency.[11]
· New virtualization capability: "VMX Unrestricted mode support," which allows 16-bit guests to run (real mode and big real mode).
· Support for "Huge Pages" of 1 GB in size.

Source: Wikipedia

Page 23: Performance migration from Intel Westmere to Intel Sandy Bridge

Gfortran Local Build

[Diagram: MPFR, MPC, GMP, PPL and CLooG(-PPL)]

Page 24: Performance migration from Intel Westmere to Intel Sandy Bridge

MPFR Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2"
export CXXFLAGS="-m64 -O2"
export FFLAGS="-m64 -O2"
export FCFLAGS="-m64 -O2"
export LDFLAGS="-m64 -O2"
./configure --prefix=/user/naga/4.7.0/dlibs \
  --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
  | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install

Page 25: Performance migration from Intel Westmere to Intel Sandy Bridge

MPC Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2"
export CXXFLAGS="-m64 -O2"
export FFLAGS="-m64 -O2"
export FCFLAGS="-m64 -O2"
export LDFLAGS="-m64 -O2"
./configure --prefix=/user/naga/4.7.0/dlibs \
  --with-mpfr=/user/naga/4.7.0/dlibs \
  --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
  | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install

Page 26: Performance migration from Intel Westmere to Intel Sandy Bridge

PPL Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2"
export CXXFLAGS="-m64 -O2"
export FFLAGS="-m64 -O2"
export FCFLAGS="-m64 -O2"
export LDFLAGS="-m64 -O2"
./configure --prefix=/user/naga/4.7.0/dlibs \
  --with-libgmp-prefix=/user/naga/4.7.0/dlibs/lib \
  2>&1 | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install

Page 27: Performance migration from Intel Westmere to Intel Sandy Bridge

cloog-ppl-0.15.11

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2"
export CXXFLAGS="-m64 -O2"
export FFLAGS="-m64 -O2"
export FCFLAGS="-m64 -O2"
export LDFLAGS="-m64 -O2"
./configure --prefix=/user/naga/4.7.0/dlibs \
  --with-ppl=/user/naga/4.7.0/dlibs \
  --with-gmp=/user/naga/4.7.0/dlibs
make -j8 2>&1 | tee make.naga-64bit.log
make install
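The four dependency installs (MPFR, MPC, PPL, CLooG-PPL) share the same environment and differ only in their configure options. The dry-run sketch below collects those options in one place; the helper name `configure_args` is hypothetical, and the script only prints the commands rather than running them.

```shell
PREFIX=/user/naga/4.7.0/dlibs   # install prefix taken from the slides

# configure_args PACKAGE: print the ./configure options used for PACKAGE.
configure_args() {
  case $1 in
    mpfr)  echo "--prefix=$PREFIX --with-gmp=$PREFIX" ;;
    mpc)   echo "--prefix=$PREFIX --with-mpfr=$PREFIX --with-gmp=$PREFIX" ;;
    ppl)   echo "--prefix=$PREFIX --with-libgmp-prefix=$PREFIX/lib" ;;
    cloog) echo "--prefix=$PREFIX --with-ppl=$PREFIX --with-gmp=$PREFIX" ;;
    *)     echo "unknown package: $1" >&2; return 1 ;;
  esac
}

# Dry run: show what would be executed in each unpacked source tree.
for pkg in mpfr mpc ppl cloog; do
  echo "[$pkg] ./configure $(configure_args "$pkg")"
done
```

Keeping the prefix in a single variable makes it harder for one library to be configured against a different install tree than the others.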


Page 29: Performance migration from Intel Westmere to Intel Sandy Bridge

CP2K Build

[Diagram: BLACS, FFTW, BLAS, LAPACK and SCALAPACK, leading to CP2K]

Page 30: Performance migration from Intel Westmere to Intel Sandy Bridge

BLAS Install

Modify the make.inc:

FORTRAN = gfortran
OPTS = -O3 -ffast-math -funroll-loops -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
# If both OPTS lines are kept, make uses this later assignment.
OPTS = -O3
DRVOPTS = $(OPTS)
NOOPT =
LOADER = gfortran
LOADOPTS =

make
make install

Page 31: Performance migration from Intel Westmere to Intel Sandy Bridge

Modify the Bmake.inc file

BTOPdir = /user/naga/hybrid/Endeavor/BLACS
BLACSdir = $(BTOPdir)/LIB
BLACSDBGLVL = 0
BLACSFINIT = $(BLACSdir)/blacsF77init_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT = $(BLACSdir)/blacsCinit_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB = $(BLACSdir)/blacs_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
MPIdir = /opt/intel/impi/4.0.3.008/intel64
MPILIBdir = $(MPIdir)/lib
MPIINCdir = $(MPIdir)/include
MPILIB = -L$(MPILIBdir) -lmpich
F77 = mpif90
F77NO_OPTFLAGS =
F77FLAGS = $(F77NO_OPTFLAGS) -O
F77LOADER = $(F77)
F77LOADFLAGS =
CC = mpicc
CCFLAGS = -O4 -ffast-math -funroll-loops -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
# If both CCFLAGS lines are kept, make uses this later assignment.
CCFLAGS = -O4
CCLOADER = $(CC)
CCLOADFLAGS =

Page 32: Performance migration from Intel Westmere to Intel Sandy Bridge

fftw-3.2.2

export CC=gcc
export CFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize -ffree-form -march=corei7-avx -mtune=corei7-avx"
# If both CFLAGS exports are kept, this later one takes effect.
export CFLAGS="-O3"
export MPICC=mpicc
export F77=gfortran
export FFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize -ffree-form -march=corei7-avx -mtune=corei7-avx"
# If both FFLAGS exports are kept, this later one takes effect.
export FFLAGS="-O3"
./configure --prefix=/user/naga/4.7.0/cp2k-dlibs \
  --enable-mpi 2>&1 | tee config.naga.log

Page 33: Performance migration from Intel Westmere to Intel Sandy Bridge

Install scalapack-2.0.1

Modify SLmake.inc file

FC = mpif90
CC = mpicc
NOOPT = -O0
FCFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx
CCFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx
FCLOADER = $(FC)
CCLOADER = $(CC)
FCLOADFLAGS = $(FCFLAGS)
CCLOADFLAGS = $(CCFLAGS)
BLASLIB = /user/naga/hybrid/Endeavor/BLAS/blas_LINUX.a
LAPACKLIB = /user/naga/hybrid/Endeavor/lapack-3.4.0/liblapack.a
LIBS = $(LAPACKLIB) $(BLASLIB)