
Upload: archibald-foster

Post on 01-Jan-2016


1

Why Derived Data Types

Message data contains different data types
    Can use several separate messages, but performance may not be good

Message data involves non-contiguous memory locations
    Can copy the non-contiguous data to contiguous storage, then communicate, but this needs additional memory copies

Mixed data types:

    struct _tagStudent {
        int id;
        double grade;
        char note[100];
    };
    struct _tagStudent Students[25];

Non-contiguous data: the surface [i][j][0] of A[100][80][50] (figure: a box with axes i, j, k)

2

Derived Data Type

MPI’s solution: derived data type
    No additional memory copy
    Directly transfers data of various shapes and sizes

Idea: Specify the memory layout of data and corresponding basic data types.

Usage:
    1. Construct derived data type
    2. Commit derived data type
    3. Use it in communication routines where an MPI_Datatype argument is required
    4. Free derived data type

3

Type Map & Type Signature

A general data type consists of
    A sequence of basic data types
    A sequence of byte displacements

Type map: sequence of pairs (basic data type, displacement) for the general data type
    E.g. double A[2]: {(MPI_DOUBLE,0), (MPI_DOUBLE,8)}
    _tagStudent: {(MPI_INT,0), (MPI_DOUBLE,8), (MPI_CHAR,16), …}

Type signature: sequence of basic data types for the general data type
    E.g. double A[2]: {MPI_DOUBLE, MPI_DOUBLE}
    _tagStudent: {MPI_INT, MPI_DOUBLE, MPI_CHAR …}

4

Communication Buffer

Given a type map {(type0,disp0),(type1,disp1)} and base address buf, the communication buffer:
    Consists of 2 entries
    1st entry at address buf+disp0, of type type0
    2nd entry at address buf+disp1, of type type1

E.g. double A[2]
    1st entry at A, of type MPI_DOUBLE
    2nd entry at byte address A+8, of type MPI_DOUBLE

If the type map contains n entries, the same semantics apply to each entry
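To make the rule "entry k lives at buf + disp_k" concrete, here is a minimal sketch in plain C that walks a type map and copies each entry out of a buffer. It does not call MPI; `basic_type`, `typemap_entry` and `gather_typemap` are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for MPI basic types; not part of MPI. */
typedef enum { T_INT, T_DOUBLE, T_CHAR } basic_type;

typedef struct {
    basic_type type;  /* basic data type of the entry */
    ptrdiff_t  disp;  /* byte displacement from the base address */
} typemap_entry;

/* Size in bytes of one basic element. */
static size_t basic_size(basic_type t)
{
    switch (t) {
    case T_INT:    return sizeof(int);
    case T_DOUBLE: return sizeof(double);
    default:       return sizeof(char);
    }
}

/* Copy the entries described by the type map, starting at base address
   buf, into the contiguous buffer out. Returns the bytes copied. */
size_t gather_typemap(const void *buf, const typemap_entry *map, size_t n,
                      void *out)
{
    size_t used = 0;
    for (size_t k = 0; k < n; k++) {
        const char *entry = (const char *)buf + map[k].disp; /* buf + disp_k */
        size_t len = basic_size(map[k].type);
        memcpy((char *)out + used, entry, len);
        used += len;
    }
    return used;
}
```

For double A[2], the map {(T_DOUBLE,0), (T_DOUBLE,sizeof(double))} makes gather_typemap copy A[0] and then A[1].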

5

Type Constructor

int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR)
    integer COUNT, OLDTYPE, NEWTYPE, IERROR

newtype is a concatenation of count copies of oldtype

oldtype can be a basic data type or a derived data type

Example: surface A[0][:][:] of A[100][80][50] (axes i, j, k)

    MPI_Datatype face_jk;
    MPI_Type_contiguous(80*50, MPI_DOUBLE, &face_jk);
    MPI_Type_commit(&face_jk);
    MPI_Send(&A[0][0][0], 1, face_jk, rank, tag, comm);
    // equivalent: MPI_Send(&A[0][0][0], 80*50, MPI_DOUBLE, rank, tag, comm);
    MPI_Send(&A[99][0][0], 1, face_jk, rank, tag, comm);
    ...
    MPI_Type_free(&face_jk);

6

Type Constructor

int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

count – number of blocks
blocklength – number of elements in each block, in terms of oldtype
stride – number of elements between the start of each block, in terms of oldtype
oldtype – old data type, can be basic or derived data type
newtype – created new data type

Data consists of equally spaced blocks: same oldtype, same block length, same spacing in terms of oldtype
    Each block is a concatenation of blocklength copies of oldtype
    Spacing between the starts of consecutive blocks is stride copies of oldtype

(Figure: equally spaced blocks, with blocklength and stride marked.)

7

Example

Column of A[4][4]:

    double A[4][4];
    MPI_Datatype column;

    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&A[0][1], 1, column, rank, tag, comm);   // column 1
    MPI_Send(&A[0][3], 1, column, rank, tag, comm);   // column 3
    ...

Surface A[:][0][:] of A[100][80][50] (axes i, j, k):

    double A[100][80][50];
    MPI_Datatype face_ik;

    MPI_Type_vector(100, 50, 80*50, MPI_DOUBLE, &face_ik);
    MPI_Type_commit(&face_ik);
    MPI_Send(&A[0][0][0], 1, face_ik, rank, tag, comm);    // A[:][0][:]
    MPI_Send(&A[0][1][0], 1, face_ik, rank, tag, comm);    // A[:][1][:]
    MPI_Send(&A[0][79][0], 1, face_ik, rank, tag, comm);   // A[:][79][:]
    ...
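The selection made by count, blocklength and stride can be mimicked with a small plain-C packing loop. This is only a sketch of which elements a vector type picks out; `pack_vector` is a hypothetical helper, not an MPI routine, and it makes the very copy that a derived data type is meant to avoid.

```c
#include <stddef.h>

/* Copy count blocks of blocklength doubles, whose starts are stride
   doubles apart, from src into the contiguous buffer dst.
   Returns the number of doubles copied. */
size_t pack_vector(int count, int blocklength, int stride,
                   const double *src, double *dst)
{
    size_t n = 0;
    for (int b = 0; b < count; b++)
        for (int e = 0; e < blocklength; e++)
            dst[n++] = src[(size_t)b * stride + e];
    return n;
}
```

Calling pack_vector(4, 1, 4, &A[0][1], col) on a double A[4][4] collects column 1, the same elements the column type above describes when the send starts at &A[0][1].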

8

Type Constructor

int MPI_Type_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

Same as MPI_Type_vector, except that stride is given in bytes, not in number of elements of oldtype.

blocklength is still given in number of elements of oldtype.

Same oldtype in different blocks; same block lengths; same spacing between neighboring blocks, but in terms of bytes (not in terms of oldtype)

(MPI_Type_hvector is the MPI-1 name. MPI-2 introduced MPI_Type_create_hvector as its replacement, and MPI-3.0 removed the old name.)

(Figure: equally spaced blocks, with blocklength and the byte stride marked.)

9

Example

Column of A[4][4]:

    double A[4][4];
    MPI_Datatype column;

    MPI_Type_hvector(4, 1, 4*sizeof(double), MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&A[0][1], 1, column, rank, tag, comm);
    ...

Surface A[:][:][49] of A[100][80][50] (axes i, j, k):

    double A[100][80][50];
    MPI_Datatype face_ij, line_j;

    MPI_Type_vector(80, 1, 50, MPI_DOUBLE, &line_j);                      // A[0][:][49]
    MPI_Type_hvector(100, 1, 80*50*sizeof(double), line_j, &face_ij);     // one line_j per i
    MPI_Type_commit(&face_ij);
    MPI_Send(&A[0][0][49], 1, face_ij, rank, tag, comm);
    ...

10

Type Constructor

int MPI_Type_indexed(int count, int *array_blocklen, int *array_disp, MPI_Datatype oldtype, MPI_Datatype *newtype)

count – number of blocks
array_blocklen – number of elements per block, in terms of oldtype; dimension: count
array_disp – displacement of each block, in terms of number of elements of oldtype; dimension: count
oldtype – old data type
newtype – new data type

Data consists of count blocks of oldtype: same oldtype; different block lengths; different spacing between blocks
    Block i has length array_blocklen[i]
    Block i has displacement array_disp[i], in terms of number of oldtype elements

(Figure: blocks of different lengths, with blocklen[i] and disp[i] marked.)

11

Example: upper triangle of matrix A[4][4]

     1  2  3  4
     5  6  7  8
     9 10 11 12
    13 14 15 16

    double A[4][4];
    MPI_Datatype upper_tri;
    int blocklen[4], disp[4];
    int i;
    for(i=0;i<4;i++) {
        blocklen[i] = 4-i;        // row i keeps 4-i elements
        disp[i] = (4+1)*i;        // starting at the diagonal element
    }
    MPI_Type_indexed(4, blocklen, disp, MPI_DOUBLE, &upper_tri);
    MPI_Type_commit(&upper_tri);
    MPI_Send(&A[0][0], 1, upper_tri, rank, tag, comm);
    ...

    // Strict lower triangle
    MPI_Datatype lower_tri;
    for(i=0;i<3;i++) {
        blocklen[i] = i+1;        // row i+1 keeps i+1 elements
        disp[i] = (i+1)*4;        // starting at the first element of the row
    }
    MPI_Type_indexed(3, blocklen, disp, MPI_DOUBLE, &lower_tri);
    ...
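As a check on the block lengths and displacements, the following plain-C sketch builds the same two arrays for an n x n row-major matrix and gathers the elements they describe. It does not use MPI; `upper_triangle_layout` and `pack_indexed` are hypothetical helper names.

```c
#include <stddef.h>

/* Fill blocklen/disp (each of length n) for the upper triangle of a
   row-major n x n matrix: row i contributes n-i elements starting at
   element offset i*(n+1), the diagonal entry of that row. */
void upper_triangle_layout(int n, int *blocklen, int *disp)
{
    for (int i = 0; i < n; i++) {
        blocklen[i] = n - i;
        disp[i] = (n + 1) * i;
    }
}

/* Gather the blocks described by blocklen/disp (in units of doubles)
   from src into dst. Returns the number of doubles copied. */
size_t pack_indexed(int count, const int *blocklen, const int *disp,
                    const double *src, double *dst)
{
    size_t n = 0;
    for (int b = 0; b < count; b++)
        for (int e = 0; e < blocklen[b]; e++)
            dst[n++] = src[disp[b] + e];
    return n;
}
```

For the 4x4 matrix holding 1 to 16, the gathered values are 1 2 3 4, 6 7 8, 11 12, 16.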

12

Type Constructor

int MPI_Type_hindexed(int count, int *array_blocklen, MPI_Aint *array_disp, MPI_Datatype oldtype, MPI_Datatype *newtype)

Same as MPI_Type_indexed, except that array_disp is specified in bytes instead of in number of oldtype elements.

Same oldtype; different block lengths; different spacing between blocks, with displacements in bytes instead of number of oldtype elements

(Figure: blocks of different lengths, with blocklen[i] and the byte displacement disp[i] marked.)

13

Example: upper triangle of matrix A[4][4]

     1  2  3  4
     5  6  7  8
     9 10 11 12
    13 14 15 16

    double A[4][4];
    MPI_Datatype upper_tri;
    int blocklen[4];
    MPI_Aint disp[4];
    int i;
    for(i=0;i<4;i++) {
        blocklen[i] = 4-i;
        disp[i] = (4+1)*i*sizeof(double);   // displacement in bytes
    }
    MPI_Type_hindexed(4, blocklen, disp, MPI_DOUBLE, &upper_tri);
    MPI_Type_commit(&upper_tri);
    MPI_Send(A, 1, upper_tri, rank, tag, comm);
    ...

14

Address Calculation

int MPI_Address(void *location, MPI_Aint *address)
MPI_ADDRESS(location, address)
    <type> location(*)
    integer address

Returns the address of the memory location (or variable)
The difference between two addresses gives the number of bytes between these two memory locations.
Addresses are different from pointers in C/C++
    Subtracting two pointers does not give a byte count; the result is in units of the pointed-to data type
    Pointer + (or -) an integer n gives a new location n*sizeof(data-type) bytes away

(MPI_Address is the MPI-1 name; MPI-2 introduced MPI_Get_address as its replacement.)

    struct _tagStudent {
        int id;
        double grade;
        char note[100];
    } A_Student;

    MPI_Aint addr1, addr2, disp;
    MPI_Address(&A_Student.id, &addr1);
    MPI_Address(&A_Student.grade, &addr2);
    disp = addr2 - addr1;
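The difference between a byte displacement and C pointer arithmetic can be seen without MPI. In the sketch below, `struct student` mirrors the layout of _tagStudent, and `byte_displacement` is a hypothetical helper that measures the distance in bytes by going through char pointers.

```c
#include <stddef.h>

struct student {          /* same members as _tagStudent in the slides */
    int    id;
    double grade;
    char   note[100];
};

/* Byte distance between two locations inside the same object,
   measured in bytes as an address difference would be. */
ptrdiff_t byte_displacement(const void *from, const void *to)
{
    return (const char *)to - (const char *)from;
}
```

For a double a[3], the pointer difference &a[2] - &a[0] is 2 (elements), while byte_displacement(&a[0], &a[2]) is 2*sizeof(double). For a struct student s, byte_displacement(&s.id, &s.grade) equals offsetof(struct student, grade).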

15

Type Constructor

int MPI_Type_struct(int count, int *array_blocklen, MPI_Aint *array_disp, MPI_Datatype *array_types, MPI_Datatype *newtype)

count – number of blocks
array_blocklen – array, number of elements in each block, in terms of the block's data type; dimension: count
array_disp – array, displacement of each block, in bytes; dimension: count
array_types – array, data type of each block; dimension: count
newtype – new data type

Different data types; different block lengths; different spacing between blocks, with displacements in bytes
    Each block may have a different data type
    Most general type constructor

(MPI_Type_struct is the MPI-1 name; MPI-2 introduced MPI_Type_create_struct as its replacement.)

(Figure: blocks of different types and lengths, with blocklen[i], type[i] and disp[i] marked.)

16

Example

    struct _tagStudent {
        int id;
        double grade;
        char note[100];
    };
    struct _tagStudent Students[25];

    MPI_Datatype one_student, all_students;
    int block_len[3];
    MPI_Datatype types[3];
    MPI_Aint disp[3];
    block_len[0] = block_len[1] = 1;
    block_len[2] = 100;
    types[0] = MPI_INT;
    types[1] = MPI_DOUBLE;
    types[2] = MPI_CHAR;
    MPI_Address(&Students[0].id, &disp[0]);      // memory address
    MPI_Address(&Students[0].grade, &disp[1]);
    MPI_Address(&Students[0].note[0], &disp[2]);
    disp[1] = disp[1]-disp[0];                   // displacements relative to id
    disp[2] = disp[2]-disp[0];
    disp[0] = 0;
    MPI_Type_struct(3, block_len, disp, types, &one_student);
    MPI_Type_contiguous(25, one_student, &all_students);
    MPI_Type_commit(&all_students);
    MPI_Send(Students, 1, all_students, rank, tag, comm);
    // Alternative:
    // MPI_Type_commit(&one_student);
    // MPI_Send(Students, 25, one_student, rank, tag, comm);
    ...

17

Type Extent

“Length” of a data type in terms of bytes
    E.g. double – MPI_DOUBLE – extent is 8, or sizeof(double)
    int – MPI_INT – extent is typically 4, or sizeof(int)

Situation more complex for derived data types; there are two cases

Case 1: derived data types encountered so far (no boundary markers MPI_UB or MPI_LB)
    Extent is the distance between the first byte and the last byte of the data type, plus some increment for memory alignment.
    • Memory alignment: a basic data type of length n will only be allocated in memory starting from an address that is a multiple of n

E.g. {(MPI_DOUBLE,0), (MPI_CHAR,8)}
    Double – 8 bytes, bytes 0-7
    Char – 1 byte, byte 8
    Increment – 7 bytes, to round up to the next multiple of 8
    Extent is: 8+1+7 = 16
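The Case 1 rule can be written down as a short plain-C function. This is a sketch under one stated assumption: the alignment requirement is taken to be the size of the largest basic type in the map, which matches the slide's example but is ultimately platform-dependent. `entry` and `extent_no_markers` are hypothetical names, and no MPI routine is called.

```c
#include <stddef.h>

typedef struct {
    ptrdiff_t disp;   /* byte displacement of the entry */
    size_t    size;   /* size of its basic type in bytes */
} entry;

/* Extent without boundary markers: span from the lowest byte to the
   byte past the highest one, rounded up to a multiple of the largest
   basic-type size (assumed here to be the alignment requirement).
   Requires n >= 1. */
size_t extent_no_markers(const entry *map, size_t n)
{
    ptrdiff_t lb = map[0].disp;
    ptrdiff_t ub = map[0].disp + (ptrdiff_t)map[0].size;
    size_t align = map[0].size;
    for (size_t k = 1; k < n; k++) {
        ptrdiff_t end = map[k].disp + (ptrdiff_t)map[k].size;
        if (map[k].disp < lb) lb = map[k].disp;
        if (end > ub) ub = end;
        if (map[k].size > align) align = map[k].size;
    }
    size_t span = (size_t)(ub - lb);
    return (span + align - 1) / align * align;  /* round up */
}
```

For the map {(8-byte double at 0), (1-byte char at 8)} the span is 9 bytes and the result is 16, as in the worked example.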

18

Type Extent

Case 2: boundary marker(s) appear in the data type definition
    Pre-defined type MPI_LB marks the lower boundary of the data type; MPI_UB marks the upper boundary.
    Length of MPI_LB and MPI_UB is zero.
    Extent: distance between the boundary markers
    If only MPI_UB appears, extent is the distance between the first byte and MPI_UB
    If only MPI_LB appears, extent is the distance between MPI_LB and the last byte, plus an increment for memory alignment

{(MPI_DOUBLE,0), (MPI_CHAR,8), (MPI_UB,8)}
    Extent of data type is 8 instead of 16.
{(MPI_LB,-8), (MPI_DOUBLE,0), (MPI_CHAR,8)}
    Extent is: 8+8+1+7 = 24
{(MPI_LB,-8), (MPI_DOUBLE,0), (MPI_CHAR,8), (MPI_UB,9)}
    Extent is: 9+8 = 17

Can use MPI_LB and MPI_UB to modify the extent to suit one’s needs

(MPI_LB and MPI_UB belong to MPI-1; MPI-2 introduced MPI_Type_create_resized for setting the lower bound and extent, and MPI-3.0 removed the markers.)

19

Example

    double A[4][4];
    MPI_Datatype column;

    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    // Extent of column is 13*sizeof(double)=104 bytes

    // Now modify extent of column to be sizeof(double)=8 using MPI_LB, MPI_UB
    // Create a new type, same as column, but with extent 8
    // {(column, 0) (MPI_UB, 8)}
    MPI_Datatype modified_column;
    MPI_Datatype types[2];
    MPI_Aint disp[2];
    int block_len[2];
    types[0] = column;
    types[1] = MPI_UB;
    block_len[0] = block_len[1] = 1;
    disp[0] = 0;
    disp[1] = sizeof(double);
    MPI_Type_struct(2, block_len, disp, types, &modified_column);
    // Now modified_column is same as column, but extent is sizeof(double)=8.

20

Type Extent is Important

Concatenation of derived data types is based on their type extent

    MPI_Send(buf, 2, A_type, …);
or
    MPI_Type_contiguous(2, A_type, &B_type);

(Figure: B_type is two copies of A_type, the second starting one extent of A_type after the first. Modifying the extent of A_type using MPI_UB, MPI_LB changes where the second copy starts in B_type.)

21

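Example

buf: 1.0 2.0 3.0 4.0 5.0 6.0 …

(Figure: A_type drawn over buf with two different extents. The results below correspond to an A_type that selects the 1st and 3rd doubles, with an extent of three doubles in the first case and one double in the second.)

MPI_Send(buf,2,A_type,...), first extent
    Actual data sent out: 4 numbers: 1.0, 3.0, 4.0, 6.0

MPI_Send(buf,2,A_type,...), second extent
    Actual data sent out: 4 numbers: 1.0, 3.0, 2.0, 4.0

MPI_Send(buf, 4, MPI_DOUBLE, ...)
    Actual data sent out: 4 numbers: 1.0, 2.0, 3.0, 4.0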
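The effect of the extent on a count greater than one can be reproduced with a few lines of plain C. The sketch assumes a type that selects the doubles at element offsets 0 and 2, which is consistent with the numbers listed for the two MPI_Send calls with A_type; `gather_copies` is a hypothetical helper and no MPI call is made.

```c
#include <stddef.h>

/* Gather count copies of a type whose entries sit at element offsets
   disp[0..n-1]; copy c starts c*extent elements after buf.
   All offsets and the extent are in units of doubles. */
size_t gather_copies(const double *buf, int count,
                     const int *disp, int n, int extent, double *out)
{
    size_t used = 0;
    for (int c = 0; c < count; c++)
        for (int k = 0; k < n; k++)
            out[used++] = buf[c * extent + disp[k]];
    return used;
}
```

With buf holding 1.0 to 6.0 and disp = {0, 2}, an extent of 3 yields 1.0, 3.0, 4.0, 6.0 and an extent of 1 yields 1.0, 3.0, 2.0, 4.0.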

22

Example

Data arrived: 4 numbers: 1.0, 2.0, 3.0, 4.0

MPI_Recv(buf,2,A_type,...), first extent
MPI_Recv(buf,2,A_type,...), second extent
MPI_Recv(buf, 4, MPI_DOUBLE, ...)

(Figure: the resulting contents of buf for each of the three receive calls; where the four values land depends on the extent of A_type.)

23

Type Commit & Free

A derived data type must be committed before being used in communication.

Once committed, it can be used in communication routines in the same way as pre-defined data types.

When it is no longer needed, the derived data type should be freed.

int MPI_Type_commit(MPI_Datatype *datatype)
int MPI_Type_free(MPI_Datatype *datatype)

24

Type Matching

Type matching rules need to be generalized with derived data types

New rule: the type signature of the data sent must match the type signature specified in the receive routine
    The sequences of basic data types must match
    The message sent may contain fewer basic elements than the receive specifies, but the elements it does contain must match the leading part of the receive signature

25

Example

    double A[4], B[8];
    double C[2], D[8];
    int my_rank;
    MPI_Status stat;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Datatype recv_type;
    if(my_rank==1) {
        MPI_Type_vector(4, 1, 2, MPI_DOUBLE, &recv_type);
        MPI_Type_commit(&recv_type);
        MPI_Recv(B, 1, recv_type, 0, tag, MPI_COMM_WORLD, &stat);
        MPI_Recv(D, 1, recv_type, 0, tag, MPI_COMM_WORLD, &stat);
        MPI_Type_free(&recv_type);
    }
    else if (my_rank==0) {
        MPI_Send(A, 4, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        MPI_Send(C, 2, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    }

CPU 0 sends A (1.0 2.0 3.0 4.0); CPU 1 receives the four values into every second element of B.
CPU 0 sends C (1.0 2.0); CPU 1 receives the two values into D[0] and D[2]. The message is shorter than recv_type, which is allowed.

26

Example

    double A[N][N], B[N][N], C[N];
    MPI_Datatype diag;
    ...
    MPI_Type_vector(N, 1, N+1, MPI_DOUBLE, &diag);   // diagonal of an NxN matrix
    MPI_Type_commit(&diag);
    if(my_rank==0) {
        MPI_Send(&A[0][0], 1, diag, 1, tag, MPI_COMM_WORLD);
        MPI_Send(&A[0][0], 1, diag, 1, tag, MPI_COMM_WORLD);
    }
    else if(my_rank==1) {
        MPI_Recv(&B[0][0], 1, diag, 0, tag, MPI_COMM_WORLD, &stat);
        MPI_Recv(&C[0], N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &stat);
    }
    MPI_Type_free(&diag);

(Figure: the diagonal 1.0, 2.0, 3.0 of A arrives on the diagonal of B, and as the contiguous array C = 1.0 2.0 3.0.)

27

Example

    A:  1.0 2.0 3.0        B:  1.0 4.0 7.0
        4.0 5.0 6.0            2.0 5.0 8.0
        7.0 8.0 9.0            3.0 6.0 9.0

    double A[N][N], B[N][N];
    MPI_Datatype column, mat_transpose;
    ...
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_hvector(N, 1, sizeof(double), column, &mat_transpose);
    // Alternative, using MPI_UB to shrink the extent of column:
    // MPI_Datatype column_modified, types[2];
    // int block_len[2];
    // MPI_Aint disp[2];
    // types[0] = column; types[1] = MPI_UB;
    // block_len[0] = block_len[1] = 1;
    // disp[0] = 0; disp[1] = sizeof(double);
    // MPI_Type_struct(2,block_len,disp,types,&column_modified);
    // MPI_Type_contiguous(N, column_modified, &mat_transpose);
    MPI_Type_commit(&mat_transpose);

    if(my_rank==0) {
        MPI_Send(&A[0][0], N*N, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    }
    else if(my_rank==1) {
        MPI_Recv(&B[0][0], 1, mat_transpose, 0, tag, MPI_COMM_WORLD, &stat);
    }
    MPI_Type_free(&mat_transpose);

CPU 0 sends A; CPU 1 receives it as B = A^T
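Why receiving with mat_transpose produces the transpose can be traced in plain C: the incoming values arrive in row order of A, and each copy of the column type starts one double after the previous one. The sketch below only imitates that placement; `scatter_transposed` is a hypothetical helper and not an MPI routine.

```c
#include <stddef.h>

/* Place an incoming stream of n*n doubles into the row-major matrix b
   the way the mat_transpose layout would: copy c of the column type
   starts c doubles after b (extent of one double), and its k-th entry
   lies k*n doubles further on. */
void scatter_transposed(const double *stream, int n, double *b)
{
    size_t next = 0;
    for (int c = 0; c < n; c++)          /* c-th copy of the column type */
        for (int k = 0; k < n; k++)      /* k-th entry of that column */
            b[(size_t)k * n + c] = stream[next++];
}
```

Feeding the stream 1.0 to 9.0 with n = 3 fills b with the rows 1 4 7, 2 5 8 and 3 6 9.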

28

Matrix Transpose Revisited

A – NxN matrix, distributed on P CPUs, row-wise decomposition
B = A^T, also distributed on P CPUs, row-wise decomposition
Aij – (N/P)x(N/P) matrices; Bij = Aji^T
Input: A[i][j] = 2*i+j

A:
    A11 A12 A13
    A21 A22 A23
    A31 A32 A33

After local transpose of each block:
    A11^T A12^T A13^T
    A21^T A22^T A23^T
    A31^T A32^T A33^T

After all-to-all (B):
    A11^T A21^T A31^T
    A12^T A22^T A32^T
    A13^T A23^T A33^T

29

Example: Matrix Transpose

(Figure: two 2x4 matrices, each with rows 0 1 2 3 and 4 5 6 7, and the 4x4 result with rows 0 4 0 4, 1 5 1 5, 2 6 2 6, 3 7 3 7.)

On each CPU, A is an (N/P)xN matrix; it first needs to be rewritten as P blocks of (N/P)x(N/P) matrices, and then each block can be transposed locally
    A (2x4), as stored: 0 1 2 3 4 5 6 7
    As two 2x2 blocks: 0 1 4 5 | 2 3 6 7

After the all-to-all communication, each CPU has P blocks of (N/P)x(N/P) matrices; these need to be merged into an (N/P)xN matrix

Four steps:
    1. Divide A into blocks;
    2. Transpose each block locally;
    3. All-to-all communication;
    4. Merge blocks locally;
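Steps 1 and 2 (dividing the local (N/P)xN matrix into blocks and transposing each block) can be sketched in plain C as below. This covers only the local rearrangement, not the all-to-all exchange or the final merge; `split_and_transpose` is a hypothetical helper name.

```c
#include <stddef.h>

/* a is a rows x (rows*p) row-major matrix. Write its p square blocks
   of size rows x rows one after another into out, each block
   transposed. */
void split_and_transpose(const double *a, int rows, int p, double *out)
{
    int cols = rows * p;
    size_t next = 0;
    for (int blk = 0; blk < p; blk++)
        for (int c = 0; c < rows; c++)        /* row of the transposed block */
            for (int r = 0; r < rows; r++)    /* column of the transposed block */
                out[next++] = a[(size_t)r * cols + blk * rows + c];
}
```

For the 2x4 matrix 0 1 2 3 / 4 5 6 7 with p = 2, out holds 0 4 1 5 for the first block and 2 6 3 7 for the second.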

30

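Transpose with derived data types

(Figure: A on the left, B on the right, connected by an all-to-all exchange, with the extent of the send and receive types marked.)

Send side (A): read data column by column; need to be careful about extent
Receive side (B): receive data block by block; need to be careful about extent

Create derived data types for send and receive; no additional local manipulations are needed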

31

Matrix Transposition

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>
    #include "dmath.h"

    #define DIM 1000 // global A[DIM][DIM], B[DIM][DIM]

    int main(int argc, char **argv)
    {
        int ncpus, my_rank, i, j, iblock;
        int Nx, Ny; // Nx=DIM/ncpus, Ny=DIM, local array: A[Nx][Ny], B[Nx][Ny]
        double **A, **B;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

        if(DIM%ncpus != 0) { // make sure DIM can be divided by ncpus
            if(my_rank==0) printf("ERROR: DIM cannot be divided by ncpus!\n");
            MPI_Finalize();
            return -1;
        }
        Nx = DIM/ncpus;
        Ny = DIM;

        A = DMath::newD(Nx, Ny); // allocate memory
        B = DMath::newD(Nx, Ny);
        for(i=0;i<Nx;i++)
            for(j=0;j<Ny;j++)
                A[i][j] = 2*(my_rank*Nx+i) + j;

        memset(&B[0][0], '\0', sizeof(double)*Nx*Ny); // zero out B

32

        // Create derived data types
        MPI_Datatype type_send, type_recv;
        MPI_Datatype type_line1, type_block;
        MPI_Aint displ[2];
        MPI_Datatype types[2];
        int block_len[2];

        MPI_Type_vector(Nx, 1, Ny, MPI_DOUBLE, &type_line1); // a column in A
        types[0] = type_line1;
        types[1] = MPI_UB; // modify the extent of column to be 1 double
        block_len[0] = block_len[1] = 1;
        displ[0] = 0;
        displ[1] = sizeof(double);
        MPI_Type_struct(2, block_len, displ, types, &type_send); // modified column
        MPI_Type_commit(&type_send); // Now A is a concatenation of type_send

        MPI_Type_vector(Nx, Nx, Ny, MPI_DOUBLE, &type_block); // submatrix block
        types[0] = type_block;
        types[1] = MPI_UB; // modify extent of type_block
        block_len[0] = block_len[1] = 1;
        displ[0] = 0;
        displ[1] = Nx*sizeof(double);
        MPI_Type_struct(2, block_len, displ, types, &type_recv); // modified block
        MPI_Type_commit(&type_recv); // Now B is a concatenation of type_recv

        // send/recv data: Nx columns go to each CPU, one block comes back from each
        MPI_Alltoall(&A[0][0], Nx, type_send, &B[0][0], 1, type_recv, MPI_COMM_WORLD);

        // clean up
        MPI_Type_free(&type_send);
        MPI_Type_free(&type_recv);
        DMath::del(A);
        DMath::del(B);
        MPI_Finalize();
        return 0;
    }