Programming the CoW!

Tools to start with on the new cluster.

Uploaded by iokina on 25-Feb-2016.


TRANSCRIPT

Page 1: Programming the CoW!

Programming the CoW!

Tools to start with on the new cluster.

Page 2: Programming the CoW!

What’s it good for?

Net DOOM?

It should be good for computation, and to a lesser extent visualization.

Page 3: Programming the CoW!

It’s a shame about Ray.

RAY:
$1.5 million.
32 R12K 400MHz, 8MB $.
SPEC(base) INT=328 FP=382
16GB RAM total
InfiniteReality2E GFX pipes.
IRIX64 6.5

CoW:
$150 thousand.
64 Xeon 1.7GHz, 256K $.
SPEC(base) INT=579 FP=656
32GB RAM total
32 GeForce3 GFX cards.
Linux 2.4.18

Page 4: Programming the CoW!

So the CoW is ~4x faster right?

Page 5: Programming the CoW!

Fortunately for SGI, no!

RAY:
Uses the system bus.
335ns avg latency from processor to remote memory.
10GB/sec sustained bandwidth.

CoW:
Uses 1G and 100M ethernet cards.
~59us latency onto the net.
.5GB/sec sustained bandwidth.

Page 6: Programming the CoW!

So it should be great for coarse-grained computations.

That is, design your programs to have long processing cycles and infrequent inter-node communication needs, and you should be just fine.

Page 7: Programming the CoW!

How do we program it?

Shared Memory – A global memory space is available to all nodes. Nodes use synchronization primitives to avoid contention.

Message Passing – Every node has only private memory space. All communications between nodes have to be explicitly directed.

[Diagram: shared memory, five NODEs all attached to one global MEMORY; message passing, five NODEs each with its own private MEM.]

Page 8: Programming the CoW!

L x R = RES

Thread Matrix Multiply

Workers split L, and each multiplies with all of R to get a part of RES.
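The splitting scheme above can be sketched with POSIX threads. This is a minimal sketch, not code from the slides: the matrix size, the worker count, and the helper name `threaded_matmul_ok()` are all made up for illustration. Each worker owns a band of rows of L and computes the matching band of RES.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define N 4            /* small square matrices, for illustration */
#define NWORKERS 2

static double L[N][N], R[N][N], RES[N][N];

/* Each worker owns a band of rows of L and computes the
   corresponding band of RES = L x R. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    int lo = id * N / NWORKERS, hi = (id + 1) * N / NWORKERS;
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += L[i][k] * R[k][j];
            RES[i][j] = s;
        }
    return NULL;
}

/* Multiply L (set to the identity) by R; returns 1 if RES == R. */
int threaded_matmul_ok(void) {
    pthread_t tid[NWORKERS];
    int ids[NWORKERS];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            L[i][j] = (i == j) ? 1.0 : 0.0;   /* identity */
            R[i][j] = i * N + j;
            RES[i][j] = 0.0;
        }
    for (int t = 0; t < NWORKERS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < NWORKERS; t++)
        pthread_join(tid[t], NULL);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (RES[i][j] != R[i][j]) return 0;
    return 1;
}
```

Note that the workers never need to synchronize while computing: each writes a disjoint band of RES, so the only coordination is the final join.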

Page 9: Programming the CoW!

Thread Matrix Multiply Example

Page 10: Programming the CoW!

On the cluster we have no hardware support for SM, so MP is the natural alternative.

Unix supports sockets for MP.

People have built higher level MP libraries out of sockets that make life easier.

Two that I am familiar with are PVM and MPI.

Page 11: Programming the CoW!

PVM: Parallel Virtual Machine

Started in 1989. http://www.csm.ornl.gov/pvm

A PVM is a virtual machine made of a collection of independent nodes.

It has a lot of support for heterogeneous clusters.

It’s easy to use, and maybe lower performing than MPI.

Page 12: Programming the CoW!

PVM

Each node runs one pvmd daemon.

Each node can run one or more tasks.

Tasks use the pvmd to communicate with other tasks.

Tasks can start new tasks, stop tasks, or delete nodes from the PVM at will.

Tasks can be grouped.

PVM comes with a console program that lets you control the PVM easily.

Page 13: Programming the CoW!

PVM: Setup

#Where PVM is installed.
setenv PVM_ROOT /home/demarle/sci/distrib/mps/pvm3
#What type of machine this node is.
setenv PVM_ARCH LINUX
#Where the ssh command is.
setenv PVM_RSH /local/bin/ssh
#Where your PVM applications are.
setenv PVMBIN $PVM_ROOT/bin/LINUX
#Where the pvm executables are.
setenv PATH ${PATH}:$PVM_ROOT/lib
setenv PATH ${PATH}:$PVM_ROOT/bin/LINUX

Page 14: Programming the CoW!

PVM CONSOLE:

[demarle@labnix13 scisem]$ pvm
pvm> add labnix14
add labnix14
1 successful
                    HOST     DTID
                labnix14    80000
pvm> conf
conf
2 hosts, 1 data format
                    HOST     DTID     ARCH   SPEED       DSIG
                labnix14    40000    LINUX    1000 0x00408841
                labnix13    80000    LINUX    1000 0x00408841
pvm> quit
quit
Console: exit handler called
pvmd still running.
[demarle@labnix13 scisem]$

Page 15: Programming the CoW!

PVM CONSOLE, continued:

[demarle@labnix13 scisem]$ cord_racer
Suspended
[demarle@labnix13 scisem]$ pvm
pvmd already running.
pvm> ps
ps
                    HOST      TID   FLAG 0x   COMMAND
                labnix13    40016      4/c         -
                labnix13    40017    6/c,f     adsmd

use "pvm> help" to get a list of commands.
use "pvm> kill" to kill tasks.
use "pvm> delete" to delete nodes from the PVM.
use "pvm> halt" to stop every pvm task and daemon.

Page 16: Programming the CoW!

PVM: IMPORTANT LIBRARY CALLS

pvm_spawn() - Task starts children. The PvmTaskDebug argument starts them under gdb.

pvm_catchout() - Task outputs the terminal output of all children.

pvm_mytid() - What is my task id?

pvm_parent() - What is my parent's task id?

Page 17: Programming the CoW!

PVM: IMPORTANT LIBRARY CALLS

pvm_initsend() - Clear the default buffer and prepare it to send.

pvm_packf() - Put data into a send buffer.

pvm_send() - Transmit a buffer.

pvm_recv() - Receive data into a buffer.

pvm_nrecv() - Non-blocking receive.

pvm_unpackf() - Move data from a buffer into variables.

Page 18: Programming the CoW!

PVM: IMPORTANT LIBRARY CALLS

pvm_joingroup() - Add this task to a group.

pvm_lvgroup() - Remove this task from a group.

pvm_bcast() - Broadcast a buffer to all members of a group.

pvm_barrier() - Wait here until all other tasks in the group are also here.

pvm_reduce() - Perform a global operation, e.g. max: each task gives a number, and one task reduces the values to the max value.

Page 19: Programming the CoW!

PVM: IMPORTANT LIBRARY CALLS

pvm_config() - What machines are in the PVM?

pvm_addhosts() - Add a machine to the PVM.

pvm_delhosts() - Remove a machine from the PVM.

pvm_tasks() - What tasks are running on the PVM?

pvm_exit() - Remove this task from the PVM.

Page 20: Programming the CoW!

L x R = RES

Message Passing Matrix Multiply

Workers split L and R. They always multiply their L’, and take turns broadcasting their R’.
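The take-turns-broadcasting scheme can be checked without a cluster by simulating the rounds in a single process. In this sketch (not from the slides; `mp_matmul_ok()` and the sizes are illustrative), the loop over `w` stands in for the real workers and the round loop stands in for the broadcasts: in round b, worker b's band of R rows is "broadcast", and every worker multiplies the matching columns of its L band against it.

```c
#include <assert.h>

#define N 4            /* matrix size, for illustration */
#define P 2            /* "workers", simulated in one process */

static double L[N][N], R[N][N], RES[N][N];

/* Worker w owns rows [w*N/P, (w+1)*N/P) of L, R, and RES.
   In round b, worker b "broadcasts" its band of R; every worker
   accumulates the matching partial product into its RES band. */
int mp_matmul_ok(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            L[i][j] = i + j;
            R[i][j] = (i == j);   /* identity, so RES should equal L */
            RES[i][j] = 0.0;
        }
    for (int b = 0; b < P; b++) {             /* broadcast round b */
        int klo = b * N / P, khi = (b + 1) * N / P;
        for (int w = 0; w < P; w++) {         /* each worker accumulates */
            int ilo = w * N / P, ihi = (w + 1) * N / P;
            for (int i = ilo; i < ihi; i++)
                for (int j = 0; j < N; j++)
                    for (int k = klo; k < khi; k++)
                        RES[i][j] += L[i][k] * R[k][j];
        }
    }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (RES[i][j] != L[i][j]) return 0;
    return 1;
}
```

The point of the rotation is memory: no worker ever holds more than its own band of R plus the one band currently being broadcast, instead of all of R as in the threaded version.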

Page 21: Programming the CoW!

PVM Matrix Multiply Example

Page 22: Programming the CoW!

MPI: Message Passing Interface

Started in 1992. http://www-unix.mcs.anl.gov/mpi/index.html

Goal - to standardize message passing so that parallel code can be portable.

Unlike PVM, it does not specify the virtual machine environment. For instance, it does not say how to start a program.

It has more basic operations than PVM.

It's supposed to be lower level and faster.

Page 23: Programming the CoW!

MPICH

A free implementation of the MPI standard. http://www-unix.mcs.anl.gov/mpi/mpich

+ it comes with some extras, like scripts that give you some of PVM’s niceties.

mpirun - a script to start your programs with.
mpicc, mpiCC, mpif77, and mpif90.
MPE - a set of performance analysis and program visualization tools.

Page 24: Programming the CoW!

MPI: Setup

#Where MPI is installed.
setenv MYMPI /home/demarle/sci/distrib/mps/mpi/mpich-1.2.3
#Where the ssh command is.
setenv RSHCOMMAND /local/bin/ssh
#Where the executables are.
setenv PATH ${PATH}:${MYMPI}/bin

Uses a file to specify which machines you can use:
${MYMPI}/util/machines/machines.LINUX

To start an executable:
mpirun <-dbg=gdb> -np # filename

Page 25: Programming the CoW!

MPI: IMPORTANT LIBRARY CALLS

MPI_Init() - Begin the MPI session for this task.

MPI_Finalize() - Leave MPI.

MPI_Comm_create() - Creates a Communicator, something like a group of groups.

MPI_Comm_size() - How many tasks are in the Communicator?

MPI_Comm_rank() - Which task am I?

MPI_Comm_group() - Access a specific group in the Communicator.

MPI_Graph_get() - Query the topology of a Communicator.

Page 26: Programming the CoW!

MPI: IMPORTANT LIBRARY CALLS

MPI_Group_size() - What is the size of a group?

MPI_Group_rank() - What is this task’s place in the group?

MPI_Barrier() - Wait for all tasks in the group to catch up.

MPI_Bcast() - Broadcast a message to all others in the group.

MPI_Reduce() - Performs an operation across a group’s values.

MPI_File_*() - Group-wide file operations.

Page 27: Programming the CoW!

MPI: IMPORTANT LIBRARY CALLS

MPI_Pack() - Puts data into a buffer for a later send.

MPI_Send() - Send a buffer.

MPI_Isend() - Non-blocking send.

MPI_Probe() - Test for an incoming buffer.

MPI_Iprobe() - Non-blocking test.

MPI_Recv() - Receive a buffer.

MPI_Irecv() - Non-blocking receive.

MPI_Unpack() - Gets data from a received buffer.

Page 28: Programming the CoW!

If you don't want the overhead of the PVM and MPI libraries and daemons, you can do essentially the same thing with sockets.

Sockets will be faster, but also harder to use. They don’t come with groups, barriers, reductions, etc. You have to create these yourself.

Page 29: Programming the CoW!

SOCKETS

Think of file descriptors: sock = socket() ~ fd = fopen()

int sock = socket(Domain, Type, Protocol);

Domain
  AF_INET - over the net
  AF_UNIX - local to a node

Type
  SOCK_STREAM - 2-ended connections, reliable, no size limit. i.e. TCP
  SOCK_DGRAM - connectionless, unreliable, ~1500 bytes. i.e. UDP

Protocol - like a flavor of the domain; these two just take 0.

Page 30: Programming the CoW!

Basic Process for a Master Task

//open a socket, like a file descriptor
sock = socket(AF_INET, SOCK_STREAM, 0);

//bind your end to this machine's IP address and this program's PORT
int ret = bind(sock, (struct sockaddr *) &servAddr, sizeof(servAddr));

//let the socket listen for connections from remote machines
ret = listen(sock, BACKLOG);

//start remote programs
system("ssh labnix14 worker.exe");

TO BE CONTINUED ...

Page 31: Programming the CoW!

Basic Process for a Worker

//put yourself in background and nohup, to let the master continue
ret = daemon(1, 0);

//open a socket
int sock = socket(AF_INET, SOCK_STREAM, 0);

//bind your end with this machine's IP address and this program's PORT
ret = bind(sock, (struct sockaddr *) &cliAddr, sizeof(cliAddr));

//connect this socket to the listening one in the master
ret = connect(sock, (struct sockaddr *) &servAddr, sizeof(servAddr));

TO BE CONTINUED ...

Page 32: Programming the CoW!

Basic Process for a Master Task, cont.

//accept each worker's connection to finish a new two-ended socket
children[c].sock = accept(sock,
        (struct sockaddr *) &children[c].cliAddr,
        &children[c].cliAddrLen);

//send and receive over the socket as you like
ret = send(children[c].sock, parms, 8*sizeof(double), 0);
ret = recv(children[c].sock, RES+rr*rsc, rpr*rpc, MSG_WAITALL);

//close the sockets when you are done with them
close(children[c].sock);

Page 33: Programming the CoW!

Basic Process for a Worker, cont.

//send and receive data as you please
ret = recv(sock, parms, 7*sizeof(int), 0);
ret = send(sock, (void *)RET, len2, 0);

//close the socket when you are done with it
close(sock);
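Putting the master and worker halves together: the same calls can be exercised on a single machine by forking the worker and connecting over the loopback interface instead of ssh-ing to another node. A minimal self-contained sketch, with error checking omitted for brevity (real code should check every return value); `socket_roundtrip()` is an illustrative name:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Master listens on an ephemeral loopback port, forks a "worker"
   that connects back, then the two exchange one integer. */
int socket_roundtrip(void) {
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in servAddr;
    memset(&servAddr, 0, sizeof servAddr);
    servAddr.sin_family = AF_INET;
    servAddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    servAddr.sin_port = 0;                    /* kernel picks a free port */
    bind(lsock, (struct sockaddr *) &servAddr, sizeof servAddr);
    socklen_t len = sizeof servAddr;
    getsockname(lsock, (struct sockaddr *) &servAddr, &len);
    listen(lsock, 5);

    if (fork() == 0) {                        /* worker side */
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        connect(sock, (struct sockaddr *) &servAddr, sizeof servAddr);
        int parm = 0;
        recv(sock, &parm, sizeof parm, MSG_WAITALL);
        parm *= 2;                            /* the "computation" */
        send(sock, &parm, sizeof parm, 0);
        close(sock);
        _exit(0);
    }

    /* master side */
    int wsock = accept(lsock, NULL, NULL);
    int parm = 21, res = 0;
    send(wsock, &parm, sizeof parm, 0);
    recv(wsock, &res, sizeof res, MSG_WAITALL);
    close(wsock);
    close(lsock);
    wait(NULL);
    return res;
}
```

Binding to port 0 and recovering the real port with getsockname() sidesteps the usual hard-coded-PORT collision problem when testing on one machine.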

Page 34: Programming the CoW!

Shared Memory on cluster?

SM code was so much simpler.

So a lot of people have built DSM systems:

Adsmith, CRL, CVM, DIPC, DSM-PM2, PVMSYNC, Quarks, SENSE, TreadMarks, to name a few...

Page 35: Programming the CoW!

Two types of Software DSMs

Page 36: Programming the CoW!

PAGE Based DSMs

Use the Virtual Memory Manager: install a signal handler to catch segfaults, and use mprotect to protect virtual memory pages assigned to remote nodes.

On a segfault, the process blocks, the segfault handler gets the page from a remote node, and control returns to the process.

This suffers when two or more nodes want to write to different and unrelated places on the same memory page.

Page 37: Programming the CoW!

Object Based DSMs

Let the programmer define the unit of sharing, and then provide each shared object with something like load, modify, and save methods.

They can eliminate false sharing, but they often aren’t as easy to use.

Page 38: Programming the CoW!

DIPC

Distributed Inter Process Communication.

Page based.

It’s an extension to the Linux kernel; specifically, it extends SYSTEM V IPC.

Page 39: Programming the CoW!

SYSTEM V IPC?

Like an alternative to threads, it lets arbitrary unrelated processes work together.

Threads share the program's entire global space.

For shmem, processes explicitly declare what is shared.

SYSTEM V IPC also means messages and semaphores.

Page 40: Programming the CoW!

Basic idea

//create an object to share
volatile struct shared { int i; } *shared;

//make the object shareable
shmid = shmget(IPC_PRIVATE,
               sizeof(struct shared),
               (IPC_CREAT | 0600));
shared = (volatile struct shared *) shmat(shmid, 0, 0);
shmctl(shmid, IPC_RMID, 0);

//start children; now they don't have copies of "shared", they all actually access the original one
fork();

//all children can access the shared object whenever they want
shared->i = 0;
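The same shmget/shmat/fork flow, made self-contained so it can actually be run; `shmem_demo()` is an illustrative name. The parent simply waits for the child before reading, so no semaphore is needed here, though real code with concurrent access would need one:

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared { int i; };

/* Parent and child share one struct through System V shmem:
   the child stores a value, and the parent sees it after wait(). */
int shmem_demo(void) {
    int shmid = shmget(IPC_PRIVATE, sizeof(struct shared),
                       IPC_CREAT | 0600);
    volatile struct shared *shared =
        (volatile struct shared *) shmat(shmid, 0, 0);
    shmctl(shmid, IPC_RMID, 0);     /* segment dies with the last detach */
    shared->i = 0;

    if (fork() == 0) {              /* child writes into shared memory */
        shared->i = 42;
        _exit(0);
    }
    wait(NULL);                     /* parent waits, then reads */
    int got = shared->i;
    shmdt((const void *) shared);
    return got;
}
```

Marking the segment IPC_RMID right after attaching is the usual trick to avoid leaking the segment if the program crashes: it stays usable until both processes detach.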

Page 41: Programming the CoW!

How would this change for DIPC?

#define IPC_DIPC 00010000

shmid = shmget(IPC_PRIVATE,
               sizeof(struct shared),
               (IPC_CREAT | IPC_DIPC | 0600));

//The same thing applies for semget and msgget.

Page 42: Programming the CoW!

DIPC works by adding a small modification to the Linux kernel.

The kernel looks for IPC_DIPC structures and bumps them out to a user-level daemon. Structures without the flag are treated normally.

The daemon satisfies the request over the network and then returns the data to the kernel, which in turn returns the data to the user process.

Page 43: Programming the CoW!

The great thing about DIPC is that it is very compatible with normal Linux.

A DIPC program will run just fine on an isolated machine without DIPC; the flag will just be ignored.

This means you can develop your software off the cluster and then just throw it on to make use of all the CPUs.

Page 44: Programming the CoW!

DIPC Problems?

It does strict sequential consistency, which is very easy to use but wastes a lot of network traffic.

The version for the 2.4.X kernel isn't finished yet.

Page 45: Programming the CoW!

Summary

CPU work up, COMMUNICATIONS down.

MP: PVM, MPI, SOCKETS

DSM: DIPC?, Quarks?, ...

Page 46: Programming the CoW!

REFERENCES

PVM: http://www.csm.ornl.gov/pvm
MPI: http://www-unix.mcs.anl.gov/mpi/index.html
MPICH: http://www-unix.mcs.anl.gov/mpi/mpich
DIPC: http://wallybox.cei.net/dipc