![Page 1: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/1.jpg)
Intel® Xeon® Phi Coprocessor High Performance Programming
Parallelizing a Simple Image Blurring Algorithm
Brian Gesiak
April 16th, 2014
Research Student, The University of Tokyo
@modocache
![Page 2: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/2.jpg)
Today
• Image blurring with a 9-point stencil algorithm
• Comparing performance
  • Intel® Xeon® Dual Processor
  • Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
  • Worst: Completely serial
  • Better: Adding loop vectorization
  • Best: Supporting multiple threads
• Further optimizations
  • Padding arrays for improved cache performance
  • Read-less writes, i.e., streaming stores
  • Using huge memory pages
![Page 8: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/8.jpg)
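Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

    typedef double real;
    typedef struct {
        real center;
        real next;
        real diagonal;
    } weight_t;

• weight.center: applied to the pixel itself
• weight.next: applied to the four edge-adjacent neighbors (north, south, east, west)
• weight.diagonal: applied to the four diagonal neighbors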
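
As a minimal sketch of how the three weights combine, the function below computes one output pixel from its 3x3 neighborhood. The helper name `stencil_at` and the row-major layout are assumptions made for illustration; they do not appear on the slides.

```c
#include <assert.h>

typedef double real;
typedef struct {
    real center;
    real next;
    real diagonal;
} weight_t;

/* Hypothetical helper: weighted sum of the 3x3 neighborhood around (x, y)
 * in a row-major image with `width` pixels per row. */
static real stencil_at(const real *img, int width, int height,
                       int x, int y, weight_t w) {
    /* Interior pixels only: every one of the eight neighbors must exist. */
    assert(x > 0 && x < width - 1 && y > 0 && y < height - 1);
    int c = y * width + x;
    real next = img[c - 1] + img[c + 1] + img[c - width] + img[c + width];
    real diag = img[c - width - 1] + img[c - width + 1]
              + img[c + width - 1] + img[c + width + 1];
    return w.center * img[c] + w.next * next + w.diagonal * diag;
}
```

When the weights sum to one (center plus four times `next` plus four times `diagonal`), a uniform image is left unchanged, which is a quick sanity check for any implementation.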
![Page 12: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/12.jpg)
Image Blurring
Applying a 9-Point Stencil to a Bitmap
Halo Effect
![Page 13: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/13.jpg)
Sample Application
• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
![Page 18: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/18.jpg)
Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

| Processor | Clock Frequency | Number of Cores | Memory Size/Type | Peak DP/SP FLOPs | Peak Memory Bandwidth |
| --- | --- | --- | --- | --- | --- |
| Intel® Xeon® Dual Processor | 2.6 GHz | 16 (8 x 2 CPUs) | 63 GB / DDR3 | 345.6 / 691.2 GigaFLOP/s | 85.3 GB/s |
| Intel® Xeon® Phi Coprocessor | 1.091 GHz | 61 | 8 GB / GDDR5 | 1.065 / 2.130 TeraFLOP/s | 352 GB/s |
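
The Phi's peak double-precision figure can be reproduced from its clock rate and core count. The per-cycle rate used here is an assumption about the hardware and is not stated on the slide: each core is taken to complete one 512-bit fused multiply-add per cycle, i.e. 8 double-precision lanes counted as 2 operations each.

$$
1.091\ \text{GHz} \times 61\ \text{cores} \times 16\ \frac{\text{FLOP}}{\text{cycle}\cdot\text{core}} \approx 1064.8\ \text{GFLOP/s} \approx 1.065\ \text{TFLOP/s}
$$

Single precision packs twice as many lanes into the same vector, which is consistent with the 2.130 TeraFLOP/s entry.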
![Page 29: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/29.jpg)
1st Comparison: Serial Execution

    void stencil_9pt(real *fin, real *fout, int width, int height,
                     weight_t weight, int count) {
        for (int i = 0; i < count; ++i) {
            for (int y = 1; y < height - 1; ++y) {
                // ...calculate center, east, northwest, etc.
                for (int x = 1; x < width - 1; ++x) {
                    fout[center] = weight.diagonal * fin[northwest] +
                                   weight.next * fin[west] +
                                   // ...add weighted, adjacent pixels
                                   weight.center * fin[center];
                    // ...increment locations
                    ++center; ++north; ++northeast;
                }
            }

            // Swap buffers for next iteration
            real *ftmp = fin;
            fin = fout;
            fout = ftmp;
        }
    }

Assumed vector dependency: the compiler cannot prove that fin and fout do not overlap.
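
The slide elides the index bookkeeping, so here is one way the whole routine could look when written out. It is a sketch under stated assumptions: the name `stencil_9pt_full`, the row-major layout, and the decision to return the buffer holding the final result are illustrative choices, not taken from the slides.

```c
#include <assert.h>

typedef double real;
typedef struct { real center; real next; real diagonal; } weight_t;

/* Applies the stencil `count` times and returns the buffer that holds the
 * newest data (the two buffers swap roles after every pass). Border pixels
 * are never written. */
static real *stencil_9pt_full(real *fin, real *fout, int width, int height,
                              weight_t weight, int count) {
    assert(width >= 3 && height >= 3 && fin != fout);
    for (int i = 0; i < count; ++i) {
        for (int y = 1; y < height - 1; ++y) {
            /* First interior pixel of row y, and the same column in the
             * rows directly above and below. */
            int center = y * width + 1;
            int north = center - width;
            int south = center + width;
            for (int x = 1; x < width - 1; ++x) {
                fout[center] =
                    weight.diagonal * fin[north - 1] +
                    weight.next     * fin[north] +
                    weight.diagonal * fin[north + 1] +
                    weight.next     * fin[center - 1] +
                    weight.center   * fin[center] +
                    weight.next     * fin[center + 1] +
                    weight.diagonal * fin[south - 1] +
                    weight.next     * fin[south] +
                    weight.diagonal * fin[south + 1];
                ++center; ++north; ++south;
            }
        }
        /* Swap buffers for the next iteration. */
        real *ftmp = fin; fin = fout; fout = ftmp;
    }
    return fin; /* after the final swap, fin points at the newest data */
}
```

Returning the pointer spares the caller from working out, based on whether `count` is odd or even, which of the two buffers ended up with the result.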
![Page 30: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/30.jpg)
1st Comparison: Serial Execution
Results

| Processor | Elapsed Wall Time | MegaFLOPS |
| --- | --- | --- |
| Intel® Xeon® Dual Processor | 244.178 seconds (4 minutes) | 4,107.658 |
| Intel® Xeon® Phi Coprocessor | 2,838.342 seconds (47.3 minutes) | 353.375 |

Compiled for the Xeon® Dual Processor:

    $ icc -openmp -O3 stencil.c -o stencil

Compiled for the Xeon® Phi Coprocessor:

    $ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Dual is about 11.6 times faster than Phi
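
The MegaFLOPS column appears consistent with counting 17 floating-point operations per pixel (9 multiplies and 8 adds) over the full 5,900 x 10,000 image for each of the 1,000 passes. That counting rule is inferred from the reported numbers rather than stated on the slides, and the function name below is hypothetical.

```c
#include <assert.h>

/* Assumption: 17 floating-point operations per pixel (9 multiplies and
 * 8 adds), counted for every pixel of the image on every pass. */
static double stencil_megaflops(long width, long height, long count,
                                double seconds) {
    assert(seconds > 0.0);
    const double flops_per_pixel = 17.0;
    double total = flops_per_pixel * (double)width * (double)height
                 * (double)count;
    return total / seconds / 1.0e6;
}
```

Calling `stencil_megaflops(5900, 10000, 1000, 244.178)` gives roughly 4,107.66, in line with the Xeon row, and the same call with 2838.342 seconds gives roughly 353.38 for the Phi.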
![Page 34: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/34.jpg)
2nd Comparison: Vectorization
Ignoring Assumed Vector Dependencies

    for (int i = 0; i < count; ++i) {
        for (int y = 1; y < height - 1; ++y) {
            // ...calculate center, east, northwest, etc.

            #pragma ivdep
            for (int x = 1; x < width - 1; ++x) {
                fout[center] = weight.diagonal * fin[northwest] +
                               weight.next * fin[west] +
                               // ...add weighted, adjacent pixels
                               weight.center * fin[center];
                // ...increment locations
                ++center; ++north; ++northeast;
            }
        }
        // ...
    }
![Page 39: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/39.jpg)
ivdep
Tells compiler to ignore assumed dependencies
• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so it assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies may not be ignored.
Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
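
For comparison, standard C99 offers a different way to make the same promise: the `restrict` qualifier. This is an alternative technique, not the one used in the slides, and the function `scale_row` is a hypothetical stand-in for the inner loop of the stencil.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: `restrict` tells the compiler that dst and src never
 * overlap, so it need not assume a dependency between the writes to dst
 * and the reads from src. */
static void scale_row(double *restrict dst, const double *restrict src,
                      size_t n, double weight) {
    assert(dst != src);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = weight * src[i];
    }
}
```

Like `ivdep`, this is a promise by the programmer rather than something the compiler checks: passing overlapping buffers is undefined behavior.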
![Page 40: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/40.jpg)
2nd Comparison: Vectorization
Results

| Processor | Elapsed Wall Time | MegaFLOPS | Change from 1st Comparison |
| --- | --- | --- | --- |
| Intel® Xeon® Dual Processor | 186.585 seconds (3.1 minutes) | 5,375.572 | 1.3 times faster |
| Intel® Xeon® Phi Coprocessor | 623.302 seconds (10.3 minutes) | 1,609.171 | 4.5 times faster |

Compiled for the Xeon® Dual Processor:

    $ icc -openmp -O3 stencil.c -o stencil

Compiled for the Xeon® Phi Coprocessor:

    $ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Dual is now only about 3.3 times faster than Phi
![Page 45: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/45.jpg)
3rd Comparison: Multithreading
Work Division Using Parallel For Loops

    for (int i = 0; i < count; ++i) {
        // Rows within one pass are independent, so the row loop is the one
        // divided among threads; passes still run one after another.
        #pragma omp parallel for
        for (int y = 1; y < height - 1; ++y) {
            // ...calculate center, east, northwest, etc.

            #pragma ivdep
            for (int x = 1; x < width - 1; ++x) {
                fout[center] = weight.diagonal * fin[northwest] +
                               weight.next * fin[west] +
                               // ...add weighted, adjacent pixels
                               weight.center * fin[center];
                // ...increment locations
                ++center; ++north; ++northeast;
            }
        }
        // ...
    }
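
A single pass can be written as its own function, which makes the unit of parallel work explicit. This is a sketch, assuming a row-major layout; the name `blur_pass` is hypothetical, and the `_OPENMP` guard is there only so the code also builds with OpenMP disabled.

```c
#include <assert.h>

typedef double real;
typedef struct { real center; real next; real diagonal; } weight_t;

/* One blur pass. Every output row depends only on the input buffer, so
 * rows can be computed by different threads without synchronization.
 * Successive passes are a different matter: each one reads the previous
 * pass's output, so the caller must run them in order. */
static void blur_pass(const real *fin, real *fout, int width, int height,
                      weight_t w) {
    assert(fin != fout);
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int y = 1; y < height - 1; ++y) {
        const real *n = fin + (y - 1) * width;  /* row above */
        const real *c = fin + y * width;        /* current row */
        const real *s = fin + (y + 1) * width;  /* row below */
        real *out = fout + y * width;
        for (int x = 1; x < width - 1; ++x) {
            out[x] = w.center * c[x]
                   + w.next * (n[x] + s[x] + c[x - 1] + c[x + 1])
                   + w.diagonal * (n[x - 1] + n[x + 1] + s[x - 1] + s[x + 1]);
        }
    }
}
```

A caller would invoke `blur_pass` once per iteration and swap the two buffer pointers between calls, keeping that outer loop serial.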
![Page 47: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/47.jpg)
3rd Comparison: Multithreading
Results

| Processor | Elapsed Wall Time (seconds) | MegaFLOPS |
| --- | --- | --- |
| Xeon® Dual Proc., 16 Threads | 43.862 | 22,867.185 |
| Xeon® Dual Proc., 32 Threads | 46.247 | 21,688.103 |
| Xeon® Phi, 61 Threads | 11.366 | 88,246.452 |
| Xeon® Phi, 122 Threads | 8.772 | 114,338.399 |
| Xeon® Phi, 183 Threads | 10.546 | 94,946.364 |
| Xeon® Phi, 244 Threads | 12.696 | 78,999.44 |

Best multithreaded run compared with the 2nd Comparison (vectorized, single thread):
• Xeon® Dual Processor, 16 threads: about 4x faster
• Xeon® Phi Coprocessor, 122 threads: about 71x faster

Phi is now 5 times faster than Dual
![Page 55: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/55.jpg)
Further Optimizations
1. Padded arrays
2. Streaming stores
3. Huge memory pages
![Page 58: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/58.jpg)
Optimization 1: Padded Arrays
Optimizing Cache Access
• We can add extra, unused data to the end of each row
• Doing so aligns heavily used memory addresses for efficient cache line access
![Page 59: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/59.jpg)
Optimization 1: Padded Arrays
![Page 60: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/60.jpg)
Optimization 1: Padded Arrays

    static const size_t kPaddingSize = 64;

    int main(int argc, const char **argv) {
        int height = 10000;
        int width = 5900;
        int count = 1000;

        size_t size = sizeof(real) * width * height;
        real *fin = (real *)malloc(size);
        real *fout = (real *)malloc(size);

        weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
        stencil_9pt(fin, fout, width, height, weight, count);

        // ...save results

        free(fin);
        free(fout);
        return 0;
    }
![Page 69: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/69.jpg)
Optimization 1: Padded Arrays

    static const size_t kPaddingSize = 64;

    int main(int argc, const char **argv) {
      int height = 10000;
      // Round each row up to a whole number of 64-byte cache lines.
      int width = ((5900*sizeof(real)+63)/64)*(64/sizeof(real));
      int count = 1000;

      // width already includes the padding, so it is not multiplied by kPaddingSize.
      size_t size = sizeof(real) * width * height;
      // Align both arrays to a 64-byte boundary.
      real *fin = (real *)_mm_malloc(size, kPaddingSize);
      real *fout = (real *)_mm_malloc(size, kPaddingSize);

      weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
      stencil_9pt(fin, fout, width, height, weight, count);

      // ...save results

      // Memory from _mm_malloc must be released with _mm_free.
      _mm_free(fin);
      _mm_free(fout);
      return 0;
    }
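The rounding expression above can be checked in isolation. The following is a minimal sketch rather than the presentation's code: `padded_width` and `kCacheLineBytes` are hypothetical names introduced here, and the 64-byte cache-line size is taken from `kPaddingSize`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes; mirrors kPaddingSize on the slide. */
enum { kCacheLineBytes = 64 };

/* Hypothetical helper: round a row of `width` elements up so that the row
 * occupies a whole number of cache lines. Returns the padded width in
 * elements, using the same integer arithmetic as the slide's expression. */
size_t padded_width(size_t width, size_t elem_size) {
  /* The formula only works when elements tile a cache line exactly. */
  assert(elem_size > 0 && kCacheLineBytes % elem_size == 0);
  size_t lines = (width * elem_size + kCacheLineBytes - 1) / kCacheLineBytes;
  return lines * (kCacheLineBytes / elem_size);
}
```

With 8-byte doubles, `padded_width(5900, sizeof(double))` gives 5904: a 5,900-element row is 47,200 bytes, which rounds up to 738 cache lines of 8 elements each. Every row then starts on a cache-line boundary, provided the array itself is 64-byte aligned.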
![Page 70: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/70.jpg)
Optimization 1: Padded Arrays
Accommodating for Padding
![Page 71: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/71.jpg)
Optimization 1: Padded Arrays
Accommodating for Padding

    #pragma omp parallel for
    for (int y = 1; y < height - 1; ++y) {

      // ...calculate center, east, northwest, etc.
      // The row stride is width, the padded row length in elements
      // (kPaddingSize is the 64-byte alignment, not a row length).
      int center = 1 + y * width + 1;
      int north = center - width;
      int south = center + width;
      int east = center + 1;
      int west = center - 1;
      int northwest = north - 1;
      int northeast = north + 1;
      int southwest = south - 1;
      int southeast = south + 1;

      #pragma ivdep
      // ...
    }
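To make the indexing concrete, here is a minimal, serial sketch of one sweep over a padded array. It is an illustration under assumptions rather than the presentation's implementation: `stencil_sweep_padded` and its `stride` parameter are hypothetical, `real` and `weight_t` repeat the definitions from earlier in the slides, and the logical width is kept separate from the padded stride, which is one possible design (the slides pass a single `width`).

```c
#include <assert.h>
#include <stddef.h>

typedef double real;
typedef struct { real center; real next; real diagonal; } weight_t;

/* One 9-point sweep over the interior of a `width` x `height` image whose
 * rows are stored `stride` elements apart (stride >= width). */
void stencil_sweep_padded(const real *fin, real *fout, int width, int height,
                          int stride, weight_t w) {
  assert(stride >= width);
  for (int y = 1; y < height - 1; ++y) {
    for (int x = 1; x < width - 1; ++x) {
      size_t c = (size_t)y * stride + x;  /* center */
      size_t n = c - stride;              /* same column, row above */
      size_t s = c + stride;              /* same column, row below */
      fout[c] = w.center * fin[c]
              + w.next * (fin[n] + fin[s] + fin[c - 1] + fin[c + 1])
              + w.diagonal * (fin[n - 1] + fin[n + 1] + fin[s - 1] + fin[s + 1]);
    }
  }
}
```

Only `stride` changes when padding is added. The logical `width` still bounds the inner loop, so the padding columns are never read or written.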
![Page 73: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/73.jpg)
Optimization 1: Padded Arrays
Results

| Processor | Elapsed Wall Time (seconds) | MegaFLOPS |
| --- | --- | --- |
| Xeon® Phi, 61 Threads | 11.644 | 86,138.371 |
| Xeon® Phi, 122 Threads | 8.973 | 111,774.803 |
| Xeon® Phi, 183 Threads | 10.326 | 97,132.546 |
| Xeon® Phi, 244 Threads | 11.469 | 87,452.707 |
![Page 76: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/76.jpg)
Optimization 2: Streaming Stores
Read-less Writes
![Page 78: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/78.jpg)
Optimization 2: Streaming Stores
Read-less Writes
• By default, a write on the Xeon® Phi coprocessor first reads the cache line containing the target address into cache, then modifies it.
• When calculating the weighted average for a pixel, our program never reads the old contents of the output array; each element of fout is simply overwritten. Streaming stores skip the read, so enabling them should result in better performance.
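A rough way to see the potential saving is to count the bytes moved per output pixel. The sketch below is an idealized model, not a measurement: it assumes every input element is fetched from memory exactly once (neighbors are served from cache) and that an ordinary store first reads the destination. The function name `bytes_per_pixel` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Idealized memory traffic per output element, in bytes.
 * A nonzero `streaming` models a store that skips the read of the
 * destination before writing it. */
size_t bytes_per_pixel(size_t elem_size, int streaming) {
  assert(elem_size > 0);
  size_t read_input = elem_size;                  /* fin, fetched once */
  size_t read_output = streaming ? 0 : elem_size; /* read before write */
  size_t write_output = elem_size;                /* fout written back */
  return read_input + read_output + write_output;
}
```

For 8-byte doubles the model gives 24 bytes per pixel with ordinary stores and 16 bytes with streaming stores, so at best about one third of the traffic disappears. Real gains depend on how close the loop is to being limited by memory bandwidth.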
![Page 79: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/79.jpg)
Optimization 2: Streaming Stores
Read-less Writes with Vector Nontemporal

    for (int i = 0; i < count; ++i) {
      for (int y = 1; y < height - 1; ++y) {
        // ...calculate center, east, northwest, etc.

        #pragma ivdep
        #pragma vector nontemporal
        for (int x = 1; x < width - 1; ++x) {
          fout[center] = weight.diagonal * fin[northwest] +
                         weight.next * fin[west] +
                         // ...add weighted, adjacent pixels
                         weight.center * fin[center];
          // ...increment locations
          ++center; ++north; ++northeast;
        }
      }
      // ...
    }
![Page 81: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/81.jpg)
Optimization 2: Streaming Stores
Results

| Processor | Elapsed Wall Time (seconds) | MegaFLOPS |
| --- | --- | --- |
| Xeon® Phi, 61 Threads | 13.588 | 73,978.915 |
| Xeon® Phi, 122 Threads | 8.491 | 111,774.803 |
| Xeon® Phi, 183 Threads | 8.663 | 115,773.405 |
| Xeon® Phi, 244 Threads | 9.507 | 105,498.781 |
![Page 84: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/84.jpg)
Optimization 3: Huge Memory Pages
![Page 89: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/89.jpg)
Optimization 3: Huge Memory Pages
• Virtual memory used by our program is mapped to physical memory in units called pages
• Recently used mappings are cached in a translation look-aside buffer (TLB)
• When a mapping is not in the TLB, it is found by a “page table walk”
• malloc and _mm_malloc use 4KB memory pages by default
• Larger pages cover the same memory with fewer mappings, so TLB misses and page table walks may become less frequent
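The effect of page size on the number of mappings can be estimated with simple arithmetic. The sketch below assumes 2 MB huge pages (an assumption, since the slides do not state the huge page size) and one padded array of 5,904 × 10,000 doubles. `pages_needed` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages of `page_size` bytes needed to cover `bytes` bytes. */
size_t pages_needed(size_t bytes, size_t page_size) {
  assert(page_size > 0);
  return (bytes + page_size - 1) / page_size;
}
```

One array occupies 8 × 5,904 × 10,000 = 472,320,000 bytes. `pages_needed(472320000, 4096)` is 115,313, while `pages_needed(472320000, 2 * 1024 * 1024)` is 226, so far fewer translations are needed to cover the same data.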
![Page 90: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/90.jpg)
Optimization 3: Huge Memory Pages

    // width is already the padded row length, so it is not multiplied by kPaddingSize.
    size_t size = sizeof(real) * width * height;
    real *fin = (real *)_mm_malloc(size, kPaddingSize);
![Page 91: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/91.jpg)
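Optimization 3: Huge Memory Pages

    // width is already the padded row length, so it is not multiplied by kPaddingSize.
    size_t size = sizeof(real) * width * height;
    // Replace _mm_malloc with an anonymous mapping backed by huge pages.
    // The last two arguments are the file descriptor (-1) and the offset (0).
    real *fin = (real *)mmap(0, size, PROT_READ|PROT_WRITE,
                             MAP_ANON|MAP_PRIVATE|MAP_HUGETLB, -1, 0);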
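`mmap` reports failure by returning `MAP_FAILED` rather than `NULL`, and a huge-page request fails when the system has no huge pages available. The following is a hedged sketch for Linux, not the presentation's code: `alloc_pages`, `free_pages`, `round_to_huge_page`, the 2 MB page size, and the fallback to ordinary pages are all assumptions introduced for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0 /* flag unavailable on this platform: ordinary pages */
#endif

/* Assumed huge page size; the real value is system dependent. */
#define HUGE_PAGE_BYTES ((size_t)2 * 1024 * 1024)

/* Round `size` up to a multiple of the assumed huge page size, because
 * munmap on a huge-page mapping expects such a length. */
size_t round_to_huge_page(size_t size) {
  return (size + HUGE_PAGE_BYTES - 1) / HUGE_PAGE_BYTES * HUGE_PAGE_BYTES;
}

/* Map `*size` bytes (rounded up in place), preferring huge pages.
 * Returns NULL if both attempts fail. */
void *alloc_pages(size_t *size) {
  assert(size != NULL && *size > 0);
  *size = round_to_huge_page(*size);
  int flags = MAP_ANON | MAP_PRIVATE;
  void *p = mmap(NULL, *size, PROT_READ | PROT_WRITE,
                 flags | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    /* No huge pages available: fall back to ordinary pages. */
    p = mmap(NULL, *size, PROT_READ | PROT_WRITE, flags, -1, 0);
  }
  return p == MAP_FAILED ? NULL : p;
}

/* Memory obtained from mmap is released with munmap, not free or _mm_free. */
int free_pages(void *p, size_t size) {
  return munmap(p, size);
}
```

A caller would pass the address of its `size` variable, keep the rounded value, and hand the same value back to `free_pages`. Whether the fallback is desirable is a design choice: it keeps the program running on systems without huge pages, at the cost of silently losing the optimization.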
![Page 93: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/93.jpg)
Optimization 3: Huge Memory Pages
Results

| Processor | Elapsed Wall Time (seconds) | MegaFLOPS |
| --- | --- | --- |
| Xeon® Phi, 61 Threads | 14.486 | 69,239.365 |
| Xeon® Phi, 122 Threads | 8.226 | 121,924.389 |
| Xeon® Phi, 183 Threads | 8.749 | 114,636.799 |
| Xeon® Phi, 244 Threads | 9.466 | 105,955.358 |
![Page 96: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/96.jpg)
Takeaways
• The key to achieving high performance is to use loop vectorization and multiple threads
• Completely serial programs run faster on standard processors than on the coprocessor
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor
• Other optimizations may be used to tweak performance
  • Data padding
  • Streaming stores
  • Huge memory pages
![Page 97: Intel® Xeon® Phi Coprocessor High Performance Programming](https://reader033.vdocuments.mx/reader033/viewer/2022051109/54843acd5806b5b3588b45d5/html5/thumbnails/97.jpg)
Sources and Additional Resources
• Today’s slides
  • http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders)
  • http://www.amazon.com/dp/0124104142
• Intel Documentation
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm