openacc course - gpu technology...

23
OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations with OpenACC Nikolai Sakharnykh, NVIDIA Developer Technologies

Upload: others

Post on 27-Apr-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

OpenACC CourseOffice Hour 3: Expressing Data Locality and Optimizations with OpenACC

Nikolai Sakharnykh, NVIDIA Developer Technologies

Page 2: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

2

Course Syllabus

Oct 1: Introduction to OpenACC

Oct 6: Office Hours

Oct 15: Profiling and Parallelizing with the OpenACC Toolkit

Oct 20: Office Hours

Oct 29: Expressing Data Locality and Optimizations with OpenACC

Nov 3: Office Hours

Nov 12: Advanced OpenACC Techniques

Nov 24: Office Hours

Page 3: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

3

Agenda

Review: Lab 3 Solution & Results

Q&A

Homework & Next Lecture

Page 4: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

4

Answered Questions and Recordingshttps://developer.nvidia.com/openacc-course

Lab 3: http://bit.ly/nvoacclab3

Page 5: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

5

Lab 3 Results

Page 6: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

6

Identify Available

Parallelism

Express Parallelism

Express Data Movement

Optimize Loop

Performance

Page 7: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

7

Explicit Data Movement: Copy In Matrix

void allocate_3d_poission_matrix(matrix &A, int N) {

int num_rows=(N+1)*(N+1)*(N+1);

int nnz=27*num_rows;

A.num_rows=num_rows;

A.row_offsets = (unsigned int*) \

malloc((num_rows+1)*sizeof(unsigned int));

A.cols = (unsigned int*)malloc(nnz*sizeof(unsigned int));

A.coefs = (double*)malloc(nnz*sizeof(double));

// Initialize Matrix

A.row_offsets[num_rows]=nnz;

A.nnz=nnz;

#pragma acc enter data copyin(A)

#pragma acc enter data \

copyin(A.row_offsets[:num_rows+1],A.cols[:nnz],A.coefs[:nnz])

}

Add data directives to matrix.h

Page 8: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

8

Explicit Data Movement: Vector

void allocate_vector(vector &v, unsigned int n) {

v.n=n;

v.coefs=(double*)malloc(n*sizeof(double));

#pragma acc enter data copyin(v)

#pragma acc enter data create(v.coefs[:n])

}

void free_vector(vector &v) {

double *vcoefs=v.coefs;

#pragma acc exit data delete(v.coefs)

#pragma acc exit data delete(v)

free(v.coefs);

}

void initialize_vector(vector &v,double val) {

for(int i=0;i<v.n;i++)

v.coefs[i]=val;

#pragma acc update device(v.coefs[:v.n])

}

Add data directives to vector.h

Page 9: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

9

Explicit Data Movement: Present Clause

#pragma acc kernels

present(row_offsets,cols,Acoefs,xcoefs,ycoefs)

{

for(int i=0;i<num_rows;i++) {

double sum=0;

int row_start=row_offsets[i];

int row_end=row_offsets[i+1];

for(int j=row_start;j<row_end;j++) {

unsigned int Acol=cols[j];

double Acoef=Acoefs[j];

double xcoef=xcoefs[Acol];

sum+=Acoef*xcoef;

}

ycoefs[i]=sum;

}

}

#pragma acc kernels present(xcoefs,ycoefs)

{

for(int i=0;i<n;i++) {

sum+=xcoefs[i]*ycoefs[i];

}

}

#pragma acc kernels

present(xcoefs,ycoefs,wcoefs)

{

for(int i=0;i<n;i++) {

wcoefs[i] =

alpha*xcoefs[i]+beta*ycoefs[i];

}

}

Add present to all kernels/parallel regions

Page 10: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

10

Optimize: Optimize Vector & Worker Loops

#pragma acc kernels present(row_offsets,cols,Acoefs,xcoefs,ycoefs)

{

#pragma acc loop device_type(nvidia) gang worker(4)

for(int i=0;i<num_rows;i++) {

double sum=0;

int row_start=row_offsets[i];

int row_end=row_offsets[i+1];

#pragma acc loop device_type(nvidia) vector(32)

for(int j=row_start;j<row_end;j++) {

unsigned int Acol=cols[j];

double Acoef=Acoefs[j];

double xcoef=xcoefs[Acol];

sum+=Acoef*xcoef;

}

ycoefs[i]=sum;

}

}

Add loop optimizations to matvec function (kernels)

Page 11: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

11

Optimize: Optimize Vector & Worker Loops

#pragma acc parallel loop present(row_offsets,cols,Acoefs,xcoefs,ycoefs) \

device_type(nvidia) gang worker vector_length(32) num_workers(4)

for(int i=0;i<num_rows;i++) {

double sum=0;

int row_start=row_offsets[i];

int row_end=row_offsets[i+1];

#pragma acc loop reduction(+:sum) device_type(nvidia) vector

for(int j=row_start;j<row_end;j++) {

unsigned int Acol=cols[j];

double Acoef=Acoefs[j];

double xcoef=xcoefs[Acol];

sum+=Acoef*xcoef;

}

ycoefs[i]=sum;

}

Add loop optimizations to matvec function (parallel loop)

Page 12: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

12

Lab 2 and 3 Speed-up

1.00X

3.61X 3.62X

2.62X

6.42X

1.00X1.14X 1.10X

1.54X

4.15X

0.00X

1.00X

2.00X

3.00X

4.00X

5.00X

6.00X

7.00X

Original Lab 2 Lab 3 (Data) Lab 3 (VL32) Lab 3 (Best)

K40 Speed-up QwikLab Speed-up

Speed-U

p

PGI 15.9 Compiler; NVIDIA Tesla K40 + Intel Xeon CPU E5-2698 v3 @ 2.30GHz (Haswell) (Green) ; GRID K520 GPU + Intel Xeon CPU E5-2670 0 @ 2.60GHz (Grey)

Page 13: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

13

OpenACC Multicore Speed-up

OpenACC was not designed for GPUs, it was designed for any parallel architecture.

PGI 15.10 officially releases support for targeting OpenACC directives on multi-core CPUs.

Rebuild the code with: -ta=multicore.

• The compiler does not currently support two targets in the same executable.

• CPU-specific loop optimizations can use device_type(host)

• Multicore target currently targets 1 gang per CPU core without vectorization.

Page 14: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

14

OpenACC Multicore Speed-up

pgc++ -fast -ta=multicore -Minfo=accel main.cpp -o cg

dot(const vector &, const vector &):

6, include "vector_functions.h"

14, Loop is parallelizable

Generating Multicore code

14, #pragma acc loop gang

waxpby(double, const vector &, double, const vector &, const vector &):

6, include "vector_functions.h"

29, Loop is parallelizable

Generating Multicore code

29, #pragma acc loop gang

matvec(const matrix &, const vector &, const vector &):

8, include "matrix_functions.h"

17, Loop is parallelizable

Generating Multicore code

17, #pragma acc loop gang

22, Loop is parallelizable

Compiler generates 1

gang per CPU core and

ignores

device_type(nvidia)

loop optimizations.

Page 15: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

15

OpenACC Multi-core Speed-up

1.00X

1.81X

3.27X

4.17X

4.54X4.68X 4.66X 4.70X 4.68X

0.00X

0.50X

1.00X

1.50X

2.00X

2.50X

3.00X

3.50X

4.00X

4.50X

5.00X

1 2 4 6 8 10 12 14 16CPU Cores

Speed-U

p

PGI 15.9 Compiler (-fast optimization), Intel Xeon CPU E5-2698 v3 @ 2.30GHz (Haswell)

Page 16: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

16

Q&A

Page 17: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

17

Q1: What does "unstructured" mean in the context of unstructured data movement?

Q2: Unstructured data directives, is it coming from OpenMP, or a new concept?

Q3: When we copy in the structure which contains embedded pointers that are pointing to host memory, OpenACC will automatically fix up those pointers to refer to the device copies?

Q4: A gang would be similar to a block in CUDA, a worker to a thread?

Q5: In which file can I see the code generated for the GPU after the code is compiled? Is necessary to ask for it using a compiler flag?

Questions from the last class

Page 18: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

18

Next Steps & Homework

Page 19: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

19

Homework Reminder

This week’s homework will build upon the previous lab by applying the two steps discussed today: Express Data Movement and Optimize Loops.

Go to http://bit.ly/nvoacclab3 from your web browser to take the free lab, or download the code from https://github.com/NVIDIA-OpenACC-Course/nvidia-openacc-course-sources/ to do the lab on your own machine.

If you have not already completed the previous labs, it’s highly recommended that you do http://bit.ly/nvoacclab2 first.

Page 20: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

20

Course Syllabus

Oct 1: Introduction to OpenACC

Oct 6: Office Hours

Oct 15: Profiling and Parallelizing with the OpenACC Toolkit

Oct 20: Office Hours

Oct 29: Expressing Data Locality and Optimizations with OpenACC

Nov 3: Office Hours

Nov 12: Advanced OpenACC Techniques

Nov 24: Office Hours

Page 21: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

21

OpenACC Kernels Matvec

matvec(const matrix &, const vector &, const vector &):

8, include "matrix_functions.h"

15, Generating copyout(ycoefs[:num_rows])

Generating

copyin(xcoefs[:],Acoefs[:],cols[:],row_offsets[:num_rows+1])

16, Loop is parallelizable

Accelerator kernel generated

Generating Tesla code

16, #pragma acc loop gang, vector(128) /* blockIdx.x

threadIdx.x */

20, Loop is parallelizable

Page 22: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

22

OpenACC and OpenMP

Number of Cores OpenACC (s) OpenMP(s)

1 38.689294 36.354030

2 21.616645 20.052727

4 11.856071 11.085337

6 9.370762 8.963102

8 8.594175 8.357504

10 8.322460 8.229950

12 8.253005 8.262911

14 8.249902 8.222371

16 8.253441 8.219130

OpenACC column uses the same, acc kernels code as the GPU and multicore target.

OpenMP column uses ompparallel for on the outer loop.

Same CPU used for both

PGI 15.10 Compiler (-fast optimization), Intel Xeon CPU E5-2698 v3 @ 2.30GHz (Haswell)

Page 23: OpenACC Course - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2015/webinar/openacc-course/offic… · OpenACC Course Office Hour 3: Expressing Data Locality and Optimizations

23

Multicore & GPU Speed-Up

1.00X

3.70X

6.42X

0.00X

1.00X

2.00X

3.00X

4.00X

5.00X

6.00X

7.00X

Original CPU Multicore K40 GPU

Speed-U

p

PGI 15.9 Compiler; NVIDIA Tesla K40 + Intel Xeon CPU E5-2698 v3 @ 2.30GHz (Haswell)