China MCP
1
Open MPI
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
4
What is MPI?
• Message Passing Interface
  – "De facto" standard
  – Not an "official" standard (IEEE, IETF)
• Written and ratified by the MPI Forum
  – A body of academic, research, and industry representatives
• MPI spec
  – MPI-1 published in 1994
  – MPI-2 published in 1997
  – MPI-3 published in 2012
  – Specifies interfaces in C, C++, and Fortran 77/90
MPI High-Level View
  User Application
  MPI API
  Operating System
6
MPI Goal
• High-level network API
  – Abstracts away the underlying transport
  – Easy for customers to use
• API designed to be "friendly" to high-performance networks
  – Ultra-low latency (nanoseconds matter)
  – Rapid ascent to wire-rate bandwidth
• Typically used in High Performance Computing (HPC) environments
  – Has a bias toward large compute jobs
• The definition of "HPC" is evolving
  – MPI is starting to be used outside of HPC
  – MPI is a good network IPC API
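As a small illustration of this "high-level API" point (not part of the original slides), the sketch below exchanges one integer between two ranks; the application never names the underlying transport, which the MPI run-time selects on its own:

  /* Minimal point-to-point sketch: rank 0 sends an integer to rank 1.
   * The underlying transport (TCP, shared memory, ...) is chosen by the
   * MPI run-time, so the same code runs unchanged over any BTL. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);                           /* from rank 0 */
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }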
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
8
Open MPI Overview
• Open MPI is an open-source, high-performance implementation of MPI
  – Open MPI represents the union of four research/academic, open-source MPI implementations: LAM/MPI (Local Area Multicomputer), LA-MPI (Los Alamos), FT-MPI (Fault-Tolerant MPI), and PACX-MPI (Parallel Computer eXtension MPI)
• Open MPI has three main abstraction (project) layers
  – Open Portable Access Layer (OPAL): Open MPI's core portability across different operating systems, plus basic utilities
  – Open MPI Run-Time Environment (ORTE): launches and monitors individual processes, and groups individual processes into "jobs"
  – Open MPI (OMPI): the public MPI API and the only layer exposed to applications
Open MPI High-Level View (layered stack)
  MPI Application
  Open MPI (OMPI) Project
  Open MPI Run-Time Environment (ORTE) Project
  Open Portable Access Layer (OPAL) Project
  Operating System
  Hardware
10
Project Separation (one library per project layer)
  MPI Application
  libompi
  libopen-rte
  libopen-pal
  Operating System
  Hardware
11
Library dependencies
  The MPI Application links against libompi, which depends on libopen-rte, which in turn depends on libopen-pal; all of these sit on the Operating System and Hardware.
12
Plugin Architecture
• Open MPI architecture design
  – Portable, high-performance implementation of the MPI standard
  – Shares a common code base while meeting widely different requirements
  – Run-time loadable components were the natural choice: the same interface behavior can be implemented in multiple different ways, and users can then choose, at run time, which plugin(s) to use
• Plugin architecture
  – Each project is structured similarly:
    • Main / core code
    • Components (plugins)
    • Frameworks
  – Governed by the Modular Component Architecture (MCA)
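For intuition only: the fragment below is not Open MPI source code, just a generic sketch of the run-time loadable plugin idea using dlopen, with hypothetical names (transport_tcp.so, transport_component):

  /* Illustrative sketch only (not Open MPI's actual MCA code): one interface,
   * multiple plugins, with the choice of plugin made at run time. */
  #include <dlfcn.h>
  #include <stdio.h>

  /* Hypothetical component interface that every plugin exports. */
  struct transport_component {
      const char *name;
      int (*send)(const void *buf, int len);
  };

  int main(void)
  {
      /* Plugin file name chosen at run time (e.g. from a user parameter). */
      void *handle = dlopen("./transport_tcp.so", RTLD_NOW);
      if (!handle) {
          fprintf(stderr, "dlopen failed: %s\n", dlerror());
          return 1;
      }

      /* Each plugin exports a well-known symbol describing itself. */
      struct transport_component *comp =
          (struct transport_component *) dlsym(handle, "transport_component");
      if (comp)
          printf("loaded component: %s\n", comp->name);

      dlclose(handle);
      return 0;
  }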
13
MCA Architecture Overview
User Application
MPI API
Modular Component Architecture (MCA)
  Frameworks (Framework … Framework), each containing multiple components (plugins)
14
MCA Layout
• MCA
  – Top-level architecture for component services
  – Finds, loads, and unloads components
• Frameworks
  – Targeted set of functionality with defined interfaces
  – Essentially a group of one type of plugin
  – E.g., MPI point-to-point, high-resolution timers
• Components
  – Code that exports a specific interface
  – Loaded/unloaded at run time
  – "Plugins"
• Modules
  – A component paired with resources
  – E.g., the TCP component is loaded, finds 2 IP interfaces (eth0, eth1), and creates 2 TCP modules
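As a purely hypothetical sketch of the component-versus-module distinction above (none of these types exist in Open MPI), one loaded "tcp" component creates one module per discovered IP interface:

  /* Hypothetical sketch: one component, one module per resource it finds. */
  #include <stdio.h>
  #include <string.h>

  struct tcp_module {              /* the component paired with one resource */
      char ifname[16];
  };

  struct tcp_component {           /* the loaded plugin itself */
      const char *name;
      struct tcp_module modules[8];
      int num_modules;
  };

  int main(void)
  {
      const char *found[] = { "eth0", "eth1" };   /* pretend discovery result */
      struct tcp_component tcp = { .name = "tcp", .num_modules = 0 };

      for (int i = 0; i < 2; i++) {
          strncpy(tcp.modules[tcp.num_modules].ifname, found[i], 15);
          tcp.modules[tcp.num_modules].ifname[15] = '\0';
          tcp.num_modules++;
      }
      printf("component %s created %d modules\n", tcp.name, tcp.num_modules);
      return 0;
  }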
15
OMPI Architecture Overview
  OMPI-layer frameworks (example components in parentheses), each with its own base code:
  – MPI Byte Transfer Layer (btl): tcp, sm, …
  – MPI collective operations (coll): sm, tuned, …
  – MPI one-sided communication interface (osc): pt2pt, rdma, …
  – Memory pool framework (mpool): grdma, rgpusm, …
  – Additional frameworks and components …
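The coll framework listed above provides the implementations behind MPI's collective calls; as a minimal illustration (not part of the original slides), the reduction below is executed by whichever coll component (e.g. tuned or sm) the run-time selects:

  /* Each rank contributes its rank number; rank 0 receives the sum.
   * Which coll component implements the reduction is decided by the
   * run-time, not by this code. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, sum = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("sum of ranks 0..%d = %d\n", size - 1, sum);

      MPI_Finalize();
      return 0;
  }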
16
ORTE Architecture Overview
  ORTE-layer frameworks (example components in parentheses), each with its own base code:
  – Process Lifecycle Management (plm): tm, slurm, …
  – I/O forwarding service (iof): hnp, tool, …
  – Routing table for the RML (routed): radix, direct, …
  – OpenRTE group communication (grpcomm): pmi, bad, …
  – Additional frameworks and components …
17
OPAL Architecture Overview
  OPAL-layer frameworks (example components in parentheses), each with its own base code:
  – IP interface (if): posix_ipv4, linux_ipv6, …
  – High-resolution timer (timer): linux, darwin, …
  – Hardware locality (hwloc): external, hwloc151, …
  – Compression framework (compress): bzip, gzip, …
  – Additional frameworks and components …
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
19
Open MPI TI Implementation
• Open MPI on the K2H platform
  – All components in 1.7.1 are supported
  – Launching and initial interfacing is done over SSH
  – BTLs added for the SRIO and Hyperlink transports
  [Diagram: two K2H nodes (Node 0, Node 1). On each node, the MPI application and OpenCL run on A15 SMP Linux, with an OpenMP run-time and kernels on the C66x subsystem reached via IPC over shared memory/Navigator. The nodes are interconnected by Ethernet, Hyperlink, and SRIO.]
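The "launching and initial interfacing over SSH" bullet above means that mpirun starts the remote Open MPI daemons via ssh. A minimal multi-node launch sketch, assuming passwordless ssh between the nodes (the hostfile name and slot counts below are illustrative, not from the original deck):

  # hosts: one line per K2H node; slots = number of ranks to place on that node
  c1n1 slots=4
  c1n2 slots=4

  /opt/ti-openmpi/bin/mpirun --hostfile hosts -np 8 ./testmpi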
20
OMPI TI Added Components
  OMPI-layer frameworks with the TI additions (example components in parentheses), each with its own base code:
  – MPI Byte Transfer Layer (btl): hlink, srio, …
  – MPI collective operations (coll): sm, tuned, …
  – MPI one-sided communication interface (osc): pt2pt, rdma, …
  – Memory pool framework (mpool): grdma, rgpusm, …
  – Additional frameworks and components …
21
OpenMPI Hyperlink BTL
• Hyperlink is a TI-proprietary, high-speed, point-to-point interface with 4 lanes at up to 12.5 Gbps (maximum transfer rate of 5.5-6 GB/s).
• A new BTL module has been added to ti-openmpi (based on openmpi 1.7.1) to support transport over Hyperlink. MPI Hyperlink communication is driven by the A15 only.
• The K2H device has 2 Hyperlink ports (0 and 1), allowing one SoC to connect directly to two neighboring SoCs.
  – Daisy chaining is not supported.
  – Additional connectivity can be obtained by mapping a common memory region in an intermediate node.
  – Data transfers are performed by EDMA.
• Hyperlink BTL support is seamlessly integrated into the Open MPI run-time:
  – Example command to run mpptest on 2 nodes over Hyperlink:
    /opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 2 -host c1n1,c1n2 ./mpptest -sync logscale
  – Example command to run nbody on 4 nodes over Hyperlink:
    /opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 4 -host c1n1,c1n2,c1n3,c1n4 ./nbody 1000
  [Diagram: 3-node and 4-node Hyperlink topologies, with each K2H connected to its neighbors through Hyperlink ports HL0 and HL1.]
22
OpenMPI Hyperlink BTL – connection types
  [Diagram: four nodes (Node 1-4) connected through Hyperlink ports HL0/HL1.
  Adjacent connections: the sender writes a fragment directly into the neighbor's mapped memory (Node 2 writes to Node 3, Node 3 writes to Node 2) and the receiver completes the transfer with a local read from src to dst.
  Diagonal connections: a fragment sent from Node 1 to Node 3 is written by Node 1 into Node 2 and then read by Node 3 from Node 2; in the other direction, Node 3 writes into Node 4 and Node 1 reads from Node 4. The same memory block is mapped via both Hyperlink ports (to different nodes) and is used only for the diagonal, uni-directional connection.]
23
OpenMPI SRIO BTL
• Serial RapidIO connections are high-speed, low-latency connections that can be switched via an external switching fabric (SRIO switches) or by the K2H on-chip packet forwarding tables (when an SRIO switch is not available).
• The K2H device has 4 SRIO lanes that can be configured as 4 x 1-lane links or 1 x 4-lane link. Wire speed can be up to 5 Gbps, with a data-link speed of 4 Gbps (due to 8b/10b encoding).
• Texas Instruments ti-openmpi (based on openmpi 1.7.1) includes an SRIO BTL based on SRIO DIO transport, using the Linux rio_mport device driver. MPI SRIO communication is driven by the A15 only.
• SRIO nodes are statically enumerated (current support), and the packet forwarding tables are programmed inside the MPI run-time based on the list of participating nodes. The HW topology is specified by a JSON file.
• Programming of the packet forwarding tables is static and allows HW-assisted routing of packets without any SW intervention in the forwarding nodes.
  – The packet forwarding table has 8 entries (some limitations may be encountered depending on topology and traffic patterns).
  – Each entry specifies a minimum SRIO ID, a maximum SRIO ID, and an outgoing port.
  – An external SRIO fabric typically provides non-blocking switching capability and may be preferable for certain applications and HW designs.
• Based on the destination hostname, the SRIO BTL determines the outgoing port and destination ID. The previously programmed packet forwarding tables in all nodes ensure deterministic routability to the destination node.
• SRIO BTL support is seamlessly integrated into the Open MPI run-time:
  – Example command to run mpptest on 2 nodes over SRIO:
    /opt/ti-openmpi/bin/mpirun --mca btl self,srio -np 2 -host c1n1,c1n2 ./mpptest -sync logscale
  – Example command to run nbody on 12 nodes over SRIO:
    /opt/ti-openmpi/bin/mpirun --mca btl self,srio -np 12 -host c1n1,c1n2,c1n3,c1n4,c4n1,c4n2,c4n3,c4n4,c7n1,c7n2,c7n3,c7n4 ./nbody 1000
24
OpenSRIO BTL – possible topologies
  [Diagram: full connectivity of 4 nodes with 1 lane per link; connections with 4 lanes per link; a 2-D torus of 16 nodes; and an SRIO-switch star topology. The packet forwarding capability allows creation of HW virtual links (no SW operation needed).]
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
26
Open MPI Run-time Parameters
• MCA parameters are the basic unit of run-time tuning for Open MPI.
  – The system is a flexible mechanism that allows users to change internal Open MPI parameter values at run time.
  – If a task can be implemented in multiple, user-discernible ways, implement as many as possible and make choosing between them an MCA parameter.
• A service provided by the MCA base
  – This does not mean that parameters are restricted to MCA components or frameworks.
  – The OPAL, ORTE, and OMPI projects all have "base" parameters.
  – Allows users to be proactive and tweak Open MPI's behavior for their environment, and to experiment with the parameter space to find the best configuration for their specific system.
27
MCA parameter lookup order
1. mpirun command line
   mpirun --mca <name> <value>
2. Environment variable
   export OMPI_MCA_<name>=<value>
3. File (these locations are themselves tunable)
   – $HOME/.openmpi/mca-params.conf
   – $prefix/etc/openmpi-mca-params.conf
4. Default value
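For reference, entries in the mca-params.conf files listed above are plain "name = value" lines; a short illustrative example (the values shown are examples, not tuning recommendations):

  # $HOME/.openmpi/mca-params.conf
  btl = self,sm,tcp
  btl_base_verbose = 0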
28
MCA run-time parameter usage
• Get the MCA information
  – The ompi_info command can list the parameters for a given component, all the parameters for a specific framework, or all parameters.

  /opt/ti-openmpi/bin/ompi_info --param all all
    Shows all the MCA parameters for all components that ompi_info finds.
  /opt/ti-openmpi/bin/ompi_info --param btl all
    Shows all the MCA parameters for all BTL components.
  /opt/ti-openmpi/bin/ompi_info --param btl tcp
    Shows all the MCA parameters for the TCP BTL component.

• MCA usage
  – The mpirun command executes serial and parallel jobs in Open MPI.

  /opt/ti-openmpi/bin/mpirun --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 --mca btl self,tcp -np 2 -host k2node1,k2node2 /home/mpiuser/nbody 1000
    Sets btl_base_verbose (with help aggregation disabled) and uses TCP for transport.
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
30
Open MPI API Usage
• The Open MPI API is the standard MPI API; refer to the following link for more information: http://www.open-mpi.org/doc/
• This example project is located at <mcsdk-hpc_install_path>/demos/testmpi

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);               /* Startup: starts MPI */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* Who am I? Get current process id */
      MPI_Comm_size(MPI_COMM_WORLD, &size); /* How many peers? Get number of processes */

      {
          /* Get the name of the processor */
          char processor_name[320];
          int  name_len;

          MPI_Get_processor_name(processor_name, &name_len);
          printf("Hello world from processor %s, rank %d out of %d processors\n",
                 processor_name, rank, size);

          gethostname(processor_name, 320);
          printf("locally obtained hostname %s\n", processor_name);
      }

      MPI_Finalize();                       /* Finish the MPI application and release resources */
      return 0;
  }
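Before running, the example is typically built with the mpicc wrapper that Open MPI installations ship alongside mpirun; the path and source file name below are illustrative, assuming the same /opt/ti-openmpi prefix used elsewhere in this deck:

  /opt/ti-openmpi/bin/mpicc -o testmpi testmpi.c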
31
Run the Open MPI example
• Use mpirun with MCA parameters to run the example:

  /opt/ti-openmpi/bin/mpirun --mca btl self,sm,tcp -np 8 --host k2node1,k2node2 ./testmpi

• Output messages:

  >>>
  Hello world from processor k2hnode1, rank 3 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode1, rank 0 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode2, rank 5 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 4 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 7 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 6 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode1, rank 1 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode1, rank 2 out of 8 processors
  locally obtained hostname k2hnode1
  <<<
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
33
Getting Started
  Bookmarks and URLs:
  – Download: http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_hpc/latest/index_FDS.html
  – Getting Started Guide: http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_Getting_Started_Guide
  – TI OpenMPI User Guide: http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_OpenMPI
  – Open MPI: Open Source High Performance Computing, Message Passing Interface (http://www.open-mpi.org/)
  – Open MPI Training Documents: http://www.open-mpi.org/video/
  – Support: http://e2e.ti.com/support/applications/high-performance-computing/f/952.aspx