Something about SSE and beyond
![Page 1: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/1.jpg)
Something about SSE
Use SIMD to boost your program!
![Page 2: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/2.jpg)
CPU-Z
What is all this about?
![Page 3: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/3.jpg)
Outline
• What is SSE?
• Why SSE?
• How to use SSE?
• CPUID
• Useful References
• Discussions
![Page 5: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/5.jpg)
SSE
• Streaming SIMD Extensions
A set of CPU instructions dedicated to applications such as signal processing, scientific computation and 3D graphics.
![Page 6: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/6.jpg)
SIMD
• Single Instruction, Multiple Data
A CPU instruction is said to be SIMD when the same operation is applied to multiple data elements at the same time, i.e. it operates on a “vector” of data with a single instruction.
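To make the definition concrete, here is a minimal sketch that adds four floats first with scalar code and then with one packed SSE addition. The function names `add4_scalar` and `add4_sse` are made up for this illustration; the intrinsics come from `<xmmintrin.h>`.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: four separate additions, one element at a time. */
void add4_scalar(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* SIMD version: a single packed add (ADDPS) handles all four lanes. */
void add4_sse(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);   /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Both functions should produce the same four results; the difference is that the second one issues one arithmetic instruction instead of four.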
![Page 7: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/7.jpg)
Flynn’s taxonomy
• Flynn's taxonomy is a classification of computer architectures, proposed by Michael Flynn in 1966.

| | Single instruction stream | Multiple instruction streams |
|---|---|---|
| Single data stream | SISD | MISD |
| Multiple data streams | SIMD | MIMD |

PU: Processing Unit
![Page 8: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/8.jpg)
More on SSE
• Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to AMD's 3DNow!
• SSE contains 70 new instructions, most of which work on single-precision floating-point data.
• Intel's first IA-32 SIMD effort was the MMX instruction set.
• SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3, SSE4 and AVX.
• SSE was originally called Katmai New Instructions (KNI), Katmai being the code name for the first Pentium III core revision.
![Page 9: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/9.jpg)
SSE Registers
• SSE originally added eight new 128-bit registers known as XMM0 through XMM7. x86-64 later added eight more (XMM8 through XMM15), available in 64-bit mode.
• There is also a new 32-bit control/status register, MXCSR, which provides control and status bits for operations performed on XMM registers.
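As a small illustration of MXCSR, the sketch below reads the register and switches on its flush-to-zero control bit (bit 15). The helper names `read_mxcsr` and `enable_flush_to_zero` are hypothetical; `_mm_getcsr` and the `_MM_SET_FLUSH_ZERO_MODE` macro are provided by `<xmmintrin.h>`.

```c
#include <xmmintrin.h>

/* Returns the current MXCSR value (what STMXCSR would store). */
unsigned int read_mxcsr(void)
{
    return _mm_getcsr();
}

/* Sets the flush-to-zero control bit (bit 15) and returns the new MXCSR. */
unsigned int enable_flush_to_zero(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    return _mm_getcsr();
}
```

MXCSR is per-thread state, so a change like this affects only SSE operations executed by the calling thread.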
![Page 10: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/10.jpg)
SSE instructions
• Packed and scalar single-precision floating-point instructions
  - Data movement instructions
  - Arithmetic instructions
  - Logical instructions
  - Comparison instructions
  - Shuffle instructions
  - Conversion instructions
• 64-bit SIMD integer instructions
  - Operate on data in MMX registers and 64-bit memory locations.
• State management instructions
  - LDMXCSR
  - STMXCSR
• Cacheability control, prefetch, and memory ordering instructions
  - Give programs more control over the caching of data
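A short sketch touching several of the categories above (data movement, comparison, logical and shuffle). The function names are invented for this example; all intrinsics are from `<xmmintrin.h>`.

```c
#include <xmmintrin.h>

/* Comparison + logical: keep positive lanes, zero the others, no branches. */
void zero_negatives4(const float *in, float *out)
{
    __m128 v    = _mm_loadu_ps(in);                   /* data movement */
    __m128 mask = _mm_cmpgt_ps(v, _mm_setzero_ps());  /* all-ones where v > 0 */
    _mm_storeu_ps(out, _mm_and_ps(v, mask));          /* logical AND with mask */
}

/* Shuffle: reverse the order of the four lanes. */
void reverse4(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

The compare instruction produces a per-lane bit mask rather than a flag, which is why it combines naturally with the logical instructions.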
![Page 11: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/11.jpg)
Intel CPU SIMD technology evolution
![Page 13: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/13.jpg)
Advantages of SIMD
• Many real-world problems, especially in science and engineering, map well to computation on arrays.
• SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects (arrays).
• Typical applications are digital signal processing and graphics processing.
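A typical array computation of this kind can be sketched as follows: `y[i] = a * x[i] + y[i]` over a whole array, four elements per iteration, with a scalar loop for the leftover elements. The name `saxpy_sse` is chosen for this illustration only.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy_sse(size_t n, float a, const float *x, float *y)
{
    const __m128 va = _mm_set1_ps(a);   /* broadcast a into all four lanes */
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }

    /* Tail loop: the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The unaligned load/store intrinsics are used so the sketch works for any pointer; aligned variants can be faster on older CPUs but require 16-byte aligned data.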
![Page 15: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/15.jpg)
Think twice before you go
• What is your application?
• Is there a better algorithm?
• Will the effort eventually yield a performance gain? How much?
• Which SSE version suits it best?
• Does your CPU support SSE? If so, up to what version?
• Does your operating system have SSE support?
• How will you code the SSE programs? Assembly or high level?
• …
![Page 16: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/16.jpg)
Identify if applicable
• SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following characteristics:
  - Inherently parallel.
  - Recurring memory access patterns.
  - Localized recurring operations performed on the data.
  - Data-independent control flow.
• Support must be ensured on:
  - CPU
  - Operating System
• SIMD application candidates:
  - Speech compression algorithms and filters.
  - Speech recognition algorithms.
  - Video display and capture routines.
  - Rendering routines.
  - 3D graphics (geometry).
  - Image and video processing algorithms.
  - Spatial (3D) audio.
  - Physical modeling (graphics, CAD).
  - Workstation applications.
  - Encryption algorithms.
  - Complex arithmetic.
![Page 17: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/17.jpg)
Choose the right instructions – Refer to Intel Optimization Manual 2.9
• MMX
• SSE
• SSE2
• SSE3
• SSSE3
• SSE4
• AESNI and PCLMULQDQ
• AVX, FMA and AVX2
![Page 18: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/18.jpg)
Coding methodologies for SIMD
• Assembly
• Intrinsic
• Classes
• Automatic Vectorization
![Page 19: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/19.jpg)
Assembly
• Key loops can be coded directly in assembly language using an assembler or by using inline assembly (C-ASM) in C/C++ code.
• This model offers the opportunity for attaining the greatest performance, but this performance is not portable across different processor architectures.
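A minimal sketch of the inline-assembly approach, assuming GCC or Clang on x86 with their extended `asm` syntax (other compilers use different syntax or do not support inline assembly at all). The function name `add_ps_asm` is made up for this example.

```c
#include <xmmintrin.h>

/* Adds two vectors of four floats with a hand-written ADDPS instruction. */
__m128 add_ps_asm(__m128 a, __m128 b)
{
    /* "+x": a is kept in an XMM register and is both read and written.
       "x" : b is kept in an XMM register and is only read.
       AT&T operand order: source first, destination second. */
    __asm__("addps %1, %0" : "+x"(a) : "x"(b));
    return a;
}
```

The compiler still picks the registers here; only the instruction itself is written by hand, which is usually the least fragile way to mix assembly into C code.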
![Page 20: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/20.jpg)
Intrinsic
• Intrinsics provide access to the ISA functionality using C/C++ style coding instead of assembly language.
• https://software.intel.com/sites/landingpage/IntrinsicsGuide/#
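As an illustration of the intrinsic style, the following sketch computes a dot product using only SSE intrinsics from `<xmmintrin.h>`. The function name `dot_sse` and the particular horizontal-sum sequence are choices made for this example, not something prescribed by the slides.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Dot product of two float arrays of length n. */
float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();   /* four running partial sums */
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));

    /* Horizontal sum: fold lanes 2,3 onto lanes 0,1, then lane 1 onto lane 0. */
    __m128 sum2  = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float total  = _mm_cvtss_f32(_mm_add_ss(sum2, lane1));

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        total += a[i] * b[i];
    return total;
}
```

Because the partial sums are accumulated in a different order than a plain scalar loop, the result can differ in the last bits for general floating-point input.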
![Page 21: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/21.jpg)
| Header File | Instructions & CPU |
|---|---|
| x86intrin.h | x86 instructions |
| mmintrin.h | MMX (Pentium MMX!) |
| mm3dnow.h | 3DNow! (K6-2) (deprecated) |
| xmmintrin.h | SSE + MMX (Pentium 3, Athlon XP) |
| emmintrin.h | SSE2 + SSE + MMX (Pentium 4, Athlon 64) |
| pmmintrin.h | SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Athlon 64 San Diego) |
| tmmintrin.h | SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer) |
| popcntintrin.h | POPCNT (Core i7, Phenom; subset of SSE4.2 and SSE4A) |
| ammintrin.h | SSE4A + SSE3 + SSE2 + SSE + MMX (Phenom) |
| smmintrin.h | SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer) |
| nmmintrin.h | SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer) |
| wmmintrin.h | AES (Core i7 Westmere, Bulldozer) |
| immintrin.h | AVX, SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7 Sandy Bridge, Bulldozer) |
![Page 22: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/22.jpg)
Classes
• A set of C++ classes is defined and available in the Intel C++ Compiler to provide both a higher-level abstraction and more flexibility for programming with SIMD technology.
![Page 23: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/23.jpg)
Automatic Vectorization
• The Intel C++ Compiler provides an optimization mechanism by which loops, such as in Example 4-13, can be automatically vectorized, or converted into Streaming SIMD Extensions code.
• Compile this code using the -QAX and -QRESTRICT switches of the Intel C++ Compiler, version 4.0 or later.
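The kind of loop a vectorizing compiler can handle might look like the sketch below. This is not the Example 4-13 referenced above, only a comparable illustration; the function name `add_arrays` is hypothetical. The C99 `restrict` qualifier plays the role that the restrict switch is meant to enable: it promises that the arrays do not overlap.

```c
#include <stddef.h>

/* restrict tells the compiler that a, b and c never alias each other,
   so iterations are independent and can be executed several at a time. */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

With GCC, for instance, building at `-O3` typically enables the loop vectorizer for code like this; whether a given loop was actually vectorized is best confirmed from the compiler's vectorization report or the generated assembly.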
![Page 26: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/26.jpg)
CPUID
• CPU IDentification
• The CPUID instruction can be used to retrieve a variety of information about your CPU, such as its vendor string and model number, the size of its internal caches and (more interestingly) the list of CPU features it supports.
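A minimal sketch of querying CPUID from C, assuming GCC or Clang, whose `<cpuid.h>` header provides the `__get_cpuid` helper (MSVC offers a different interface). The struct and function names are invented for this example; the bit positions are the documented leaf 1 feature flags.

```c
#include <cpuid.h>    /* GCC/Clang helper header (assumption: not MSVC) */
#include <string.h>

/* Writes the 12-character vendor string plus a NUL into out.
   Returns 1 on success, 0 if the leaf is not supported. */
int cpu_vendor(char out[13])
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    memcpy(out + 0, &ebx, 4);   /* vendor string order is EBX, EDX, ECX */
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return 1;
}

struct simd_features { int sse, sse2, sse3, ssse3, sse41, sse42, avx; };

/* Fills f from CPUID leaf 1. Returns 1 on success, 0 otherwise. */
int cpu_simd_features(struct simd_features *f)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    f->sse   = (edx >> 25) & 1;
    f->sse2  = (edx >> 26) & 1;
    f->sse3  = (ecx >> 0)  & 1;
    f->ssse3 = (ecx >> 9)  & 1;
    f->sse41 = (ecx >> 19) & 1;
    f->sse42 = (ecx >> 20) & 1;
    f->avx   = (ecx >> 28) & 1;  /* CPU support only; OS support is a separate check */
    return 1;
}
```

A program would typically call `cpu_simd_features` once at startup and select a code path accordingly. For AVX, the CPU flag alone is not enough, since the operating system must also save and restore the wider registers.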
![Page 27: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/27.jpg)
CPUID evolution
1. Originally, Intel published code sequences that could detect minor implementation or architectural differences to identify processor generations.
2. With the advent of the Intel386 processor, Intel implemented processor signature identification that provided the processor family, model, and stepping numbers to software, but only upon reset.
3. As the Intel Architecture evolved, Intel extended the processor signature identification into the CPUID instruction. The CPUID instruction not only provides the processor signature, but also provides information about the features supported by and implemented on the Intel processor.
![Page 30: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/30.jpg)
Useful References
• http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
• http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (Chapter 4, Coding for SIMD Architectures, and Chapters 5, 6, 10 and 11)
• https://software.intel.com/en-us/isa-extensions
• https://www.scss.tcd.ie/Jeremy.Jones/CS4021/processor-identification-cpuid-instruction-note.pdf
• https://software.intel.com/en-us/articles/intel-software-development-emulator
• http://supercomputingblog.com/optimization/getting-started-with-sse-programming/
• http://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/
• http://wiki.osdev.org/CPUID
• http://sandpile.org/x86/cpuid.htm
• http://www.etallen.com/cpuid.html
![Page 31: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/31.jpg)
More to explore
• Memory alignment
• AVX
• FMA
• ARM NEON
• Intel® SHA Extensions
• Intel® VTune™ Amplifier
• Intel® VTune™ Performance Analyzer
• Intel® Software Development Emulator
• …
![Page 32: Something about SSE and beyond](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587854141a28ab68198b6de9/html5/thumbnails/32.jpg)
Thank You!
Lihang Li @ IEG