China MCP
1
Open MPI
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
4
What is MPI?
• Message Passing Interface
  – "De facto" standard
  – Not an "official" standard (IEEE, IETF)
• Written and ratified by the MPI Forum
  – A body of academic, research, and industry representatives
• MPI spec
  – MPI-1 published in 1994
  – MPI-2 published in 1997
  – MPI-3 published in 2012
  – Specifies interfaces in C, C++, and Fortran 77/90
MPI High-Level View
  User Application
  MPI API
  Operating System
6
MPI Goal
• High-level network API
  – Abstracts away the underlying transport
  – Easy for customers to use
• API designed to be "friendly" to high-performance networks
  – Ultra-low latency (nanoseconds matter)
  – Rapid ascent to wire-rate bandwidth
• Typically used in High Performance Computing (HPC) environments
  – Has a bias toward large compute jobs
• The definition of "HPC" is evolving
  – MPI is starting to be used outside of HPC
  – MPI is a good network IPC API
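As a small illustration of this "high-level API" point (not part of the original slides), the sketch below exchanges one integer between two ranks; the application never names the underlying transport, which the MPI run-time selects on its own:

  /* Minimal point-to-point sketch: rank 0 sends an integer to rank 1.
   * The underlying transport (TCP, shared memory, ...) is chosen by the
   * MPI run-time, so the same code runs unchanged over any BTL. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);                           /* from rank 0 */
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }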
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
8
Open MPI Overview
• Open MPI is an open-source, high-performance implementation of MPI
  – Open MPI represents the union of four research/academic, open-source MPI implementations: LAM/MPI (Local Area Multicomputer), LA-MPI (Los Alamos), FT-MPI (Fault-Tolerant MPI), and PACX-MPI (Parallel Computer eXtension MPI)
• Open MPI has three main abstraction (project) layers
  – Open Portable Access Layer (OPAL): Open MPI's core portability across different operating systems, plus basic utilities
  – Open MPI Run-Time Environment (ORTE): launches and monitors individual processes, and groups individual processes into "jobs"
  – Open MPI (OMPI): the public MPI API and the only layer exposed to applications
Open MPI High-Level View (layered stack)
  MPI Application
  Open MPI (OMPI) Project
  Open MPI Run-Time Environment (ORTE) Project
  Open Portable Access Layer (OPAL) Project
  Operating System
  Hardware
10
Project Separation (one library per project layer)
  MPI Application
  libompi
  libopen-rte
  libopen-pal
  Operating System
  Hardware
11
Library dependencies
  The MPI Application links against libompi, which depends on libopen-rte, which in turn depends on libopen-pal; all of these sit on the Operating System and Hardware.
12
Plugin Architecture
• Open MPI architecture design
  – Portable, high-performance implementation of the MPI standard
  – Shares a common code base while meeting widely different requirements
  – Run-time loadable components were the natural choice: the same interface behavior can be implemented in multiple different ways, and users can then choose, at run time, which plugin(s) to use
• Plugin architecture
  – Each project is structured similarly:
    • Main / core code
    • Components (plugins)
    • Frameworks
  – Governed by the Modular Component Architecture (MCA)
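For intuition only: the fragment below is not Open MPI source code, just a generic sketch of the run-time loadable plugin idea using dlopen, with hypothetical names (transport_tcp.so, transport_component):

  /* Illustrative sketch only (not Open MPI's actual MCA code): one interface,
   * multiple plugins, with the choice of plugin made at run time. */
  #include <dlfcn.h>
  #include <stdio.h>

  /* Hypothetical component interface that every plugin exports. */
  struct transport_component {
      const char *name;
      int (*send)(const void *buf, int len);
  };

  int main(void)
  {
      /* Plugin file name chosen at run time (e.g. from a user parameter). */
      void *handle = dlopen("./transport_tcp.so", RTLD_NOW);
      if (!handle) {
          fprintf(stderr, "dlopen failed: %s\n", dlerror());
          return 1;
      }

      /* Each plugin exports a well-known symbol describing itself. */
      struct transport_component *comp =
          (struct transport_component *) dlsym(handle, "transport_component");
      if (comp)
          printf("loaded component: %s\n", comp->name);

      dlclose(handle);
      return 0;
  }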
13
MCA Architecture Overview
User Application
MPI API
Modular Component Architecture (MCA)
  Frameworks (Framework … Framework), each containing multiple components (plugins)
14
MCA Layout
• MCA
  – Top-level architecture for component services
  – Finds, loads, and unloads components
• Frameworks
  – Targeted set of functionality with defined interfaces
  – Essentially a group of one type of plugin
  – E.g., MPI point-to-point, high-resolution timers
• Components
  – Code that exports a specific interface
  – Loaded/unloaded at run time
  – "Plugins"
• Modules
  – A component paired with resources
  – E.g., the TCP component is loaded, finds 2 IP interfaces (eth0, eth1), and creates 2 TCP modules
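As a purely hypothetical sketch of the component-versus-module distinction above (none of these types exist in Open MPI), one loaded "tcp" component creates one module per discovered IP interface:

  /* Hypothetical sketch: one component, one module per resource it finds. */
  #include <stdio.h>
  #include <string.h>

  struct tcp_module {              /* the component paired with one resource */
      char ifname[16];
  };

  struct tcp_component {           /* the loaded plugin itself */
      const char *name;
      struct tcp_module modules[8];
      int num_modules;
  };

  int main(void)
  {
      const char *found[] = { "eth0", "eth1" };   /* pretend discovery result */
      struct tcp_component tcp = { .name = "tcp", .num_modules = 0 };

      for (int i = 0; i < 2; i++) {
          strncpy(tcp.modules[tcp.num_modules].ifname, found[i], 15);
          tcp.modules[tcp.num_modules].ifname[15] = '\0';
          tcp.num_modules++;
      }
      printf("component %s created %d modules\n", tcp.name, tcp.num_modules);
      return 0;
  }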
15
OMPI Architecture Overview
  OMPI-layer frameworks (example components in parentheses), each with its own base code:
  – MPI Byte Transfer Layer (btl): tcp, sm, …
  – MPI collective operations (coll): sm, tuned, …
  – MPI one-sided communication interface (osc): pt2pt, rdma, …
  – Memory pool framework (mpool): grdma, rgpusm, …
  – Additional frameworks and components …
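The coll framework listed above provides the implementations behind MPI's collective calls; as a minimal illustration (not part of the original slides), the reduction below is executed by whichever coll component (e.g. tuned or sm) the run-time selects:

  /* Each rank contributes its rank number; rank 0 receives the sum.
   * Which coll component implements the reduction is decided by the
   * run-time, not by this code. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, sum = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("sum of ranks 0..%d = %d\n", size - 1, sum);

      MPI_Finalize();
      return 0;
  }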
16
ORTE Architecture Overview
  ORTE-layer frameworks (example components in parentheses), each with its own base code:
  – Process Lifecycle Management (plm): tm, slurm, …
  – I/O forwarding service (iof): hnp, tool, …
  – Routing table for the RML (routed): radix, direct, …
  – OpenRTE group communication (grpcomm): pmi, bad, …
  – Additional frameworks and components …
17
OPAL Architecture Overview
  OPAL-layer frameworks (example components in parentheses), each with its own base code:
  – IP interface (if): posix_ipv4, linux_ipv6, …
  – High-resolution timer (timer): linux, darwin, …
  – Hardware locality (hwloc): external, hwloc151, …
  – Compression framework (compress): bzip, gzip, …
  – Additional frameworks and components …
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
19
Open MPI TI Implementation
• Open MPI on the K2H platform
  – All components in 1.7.1 are supported
  – Launching and initial interfacing is done over SSH
  – BTLs added for the SRIO and Hyperlink transports
  [Diagram: two K2H nodes (Node 0, Node 1). On each node, the MPI application and OpenCL run on A15 SMP Linux, with an OpenMP run-time and kernels on the C66x subsystem reached via IPC over shared memory/Navigator. The nodes are interconnected by Ethernet, Hyperlink, and SRIO.]
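The "launching and initial interfacing over SSH" bullet above means that mpirun starts the remote Open MPI daemons via ssh. A minimal multi-node launch sketch, assuming passwordless ssh between the nodes (the hostfile name and slot counts below are illustrative, not from the original deck):

  # hosts: one line per K2H node; slots = number of ranks to place on that node
  c1n1 slots=4
  c1n2 slots=4

  /opt/ti-openmpi/bin/mpirun --hostfile hosts -np 8 ./testmpi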
20
OMPI TI Added Components
  OMPI-layer frameworks with the TI additions (example components in parentheses), each with its own base code:
  – MPI Byte Transfer Layer (btl): hlink, srio, …
  – MPI collective operations (coll): sm, tuned, …
  – MPI one-sided communication interface (osc): pt2pt, rdma, …
  – Memory pool framework (mpool): grdma, rgpusm, …
  – Additional frameworks and components …
21
OpenMPI Hyperlink BTL
• Hyperlink is a TI-proprietary, high-speed, point-to-point interface with 4 lanes at up to 12.5 Gbps (maximum transfer rate of 5.5-6 GB/s).
• A new BTL module has been added to ti-openmpi (based on openmpi 1.7.1) to support transport over Hyperlink. MPI Hyperlink communication is driven by the A15 only.
• The K2H device has 2 Hyperlink ports (0 and 1), allowing one SoC to connect directly to two neighboring SoCs.
  – Daisy chaining is not supported.
  – Additional connectivity can be obtained by mapping a common memory region in an intermediate node.
  – Data transfers are performed by EDMA.
• Hyperlink BTL support is seamlessly integrated into the Open MPI run-time:
  – Example command to run mpptest on 2 nodes over Hyperlink:
    /opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 2 -host c1n1,c1n2 ./mpptest -sync logscale
  – Example command to run nbody on 4 nodes over Hyperlink:
    /opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 4 -host c1n1,c1n2,c1n3,c1n4 ./nbody 1000
  [Diagram: 3-node and 4-node Hyperlink topologies, with each K2H connected to its neighbors through Hyperlink ports HL0 and HL1.]
22
OpenMPI Hyperlink BTL – connection types
  [Diagram: four nodes (Node 1-4) connected through Hyperlink ports HL0/HL1.
  Adjacent connections: the sender writes a fragment directly into the neighbor's mapped memory (Node 2 writes to Node 3, Node 3 writes to Node 2) and the receiver completes the transfer with a local read from src to dst.
  Diagonal connections: a fragment sent from Node 1 to Node 3 is written by Node 1 into Node 2 and then read by Node 3 from Node 2; in the other direction, Node 3 writes into Node 4 and Node 1 reads from Node 4. The same memory block is mapped via both Hyperlink ports (to different nodes) and is used only for the diagonal, uni-directional connection.]
23
OpenMPI SRIO BTL
• Serial RapidIO connections are high-speed, low-latency connections that can be switched via an external switching fabric (SRIO switches) or by the K2H on-chip packet forwarding tables (when an SRIO switch is not available).
• The K2H device has 4 SRIO lanes that can be configured as 4 x 1-lane links or 1 x 4-lane link. Wire speed can be up to 5 Gbps, with a data-link speed of 4 Gbps (due to 8b/10b encoding).
• Texas Instruments ti-openmpi (based on openmpi 1.7.1) includes an SRIO BTL based on SRIO DIO transport, using the Linux rio_mport device driver. MPI SRIO communication is driven by the A15 only.
• SRIO nodes are statically enumerated (current support), and the packet forwarding tables are programmed inside the MPI run-time based on the list of participating nodes. The HW topology is specified by a JSON file.
• Programming of the packet forwarding tables is static and allows HW-assisted routing of packets without any SW intervention in the forwarding nodes.
  – The packet forwarding table has 8 entries (some limitations may be encountered depending on topology and traffic patterns).
  – Each entry specifies a minimum SRIO ID, a maximum SRIO ID, and an outgoing port.
  – An external SRIO fabric typically provides non-blocking switching capability and may be preferable for certain applications and HW designs.
• Based on the destination hostname, the SRIO BTL determines the outgoing port and destination ID. The previously programmed packet forwarding tables in all nodes ensure deterministic routability to the destination node.
• SRIO BTL support is seamlessly integrated into the Open MPI run-time:
  – Example command to run mpptest on 2 nodes over SRIO:
    /opt/ti-openmpi/bin/mpirun --mca btl self,srio -np 2 -host c1n1,c1n2 ./mpptest -sync logscale
  – Example command to run nbody on 12 nodes over SRIO:
    /opt/ti-openmpi/bin/mpirun --mca btl self,srio -np 12 -host c1n1,c1n2,c1n3,c1n4,c4n1,c4n2,c4n3,c4n4,c7n1,c7n2,c7n3,c7n4 ./nbody 1000
24
OpenSRIO BTL – possible topologies
  [Diagram: full connectivity of 4 nodes with 1 lane per link; connections with 4 lanes per link; a 2-D torus of 16 nodes; and an SRIO-switch star topology. The packet forwarding capability allows creation of HW virtual links (no SW operation needed).]
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
26
Open MPI Run-time Parameters
• MCA parameters are the basic unit of run-time tuning for Open MPI.
  – The system is a flexible mechanism that allows users to change internal Open MPI parameter values at run time.
  – If a task can be implemented in multiple, user-discernible ways, implement as many as possible and make choosing between them an MCA parameter.
• A service provided by the MCA base
  – This does not mean that parameters are restricted to MCA components or frameworks.
  – The OPAL, ORTE, and OMPI projects all have "base" parameters.
  – Allows users to be proactive and tweak Open MPI's behavior for their environment, and to experiment with the parameter space to find the best configuration for their specific system.
27
MCA parameter lookup order
1. mpirun command line
   mpirun --mca <name> <value>
2. Environment variable
   export OMPI_MCA_<name>=<value>
3. File (these locations are themselves tunable)
   – $HOME/.openmpi/mca-params.conf
   – $prefix/etc/openmpi-mca-params.conf
4. Default value
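For reference, entries in the mca-params.conf files listed above are plain "name = value" lines; a short illustrative example (the values shown are examples, not tuning recommendations):

  # $HOME/.openmpi/mca-params.conf
  btl = self,sm,tcp
  btl_base_verbose = 0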
28
MCA run-time parameter usage
• Get the MCA information
  – The ompi_info command can list the parameters for a given component, all the parameters for a specific framework, or all parameters.

  /opt/ti-openmpi/bin/ompi_info --param all all
    Shows all the MCA parameters for all components that ompi_info finds.
  /opt/ti-openmpi/bin/ompi_info --param btl all
    Shows all the MCA parameters for all BTL components.
  /opt/ti-openmpi/bin/ompi_info --param btl tcp
    Shows all the MCA parameters for the TCP BTL component.

• MCA usage
  – The mpirun command executes serial and parallel jobs in Open MPI.

  /opt/ti-openmpi/bin/mpirun --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 --mca btl self,tcp -np 2 -host k2node1,k2node2 /home/mpiuser/nbody 1000
    Sets btl_base_verbose (with help aggregation disabled) and uses TCP for transport.
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
30
Open MPI API Usage
• The Open MPI API is the standard MPI API; refer to the following link for more information: http://www.open-mpi.org/doc/
• This example project is located at <mcsdk-hpc_install_path>/demos/testmpi

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);               /* Startup: starts MPI */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* Who am I? Get current process id */
      MPI_Comm_size(MPI_COMM_WORLD, &size); /* How many peers? Get number of processes */

      {
          /* Get the name of the processor */
          char processor_name[320];
          int  name_len;

          MPI_Get_processor_name(processor_name, &name_len);
          printf("Hello world from processor %s, rank %d out of %d processors\n",
                 processor_name, rank, size);

          gethostname(processor_name, 320);
          printf("locally obtained hostname %s\n", processor_name);
      }

      MPI_Finalize();                       /* Finish the MPI application and release resources */
      return 0;
  }
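Before running, the example is typically built with the mpicc wrapper that Open MPI installations ship alongside mpirun; the path and source file name below are illustrative, assuming the same /opt/ti-openmpi prefix used elsewhere in this deck:

  /opt/ti-openmpi/bin/mpicc -o testmpi testmpi.c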
31
Run the Open MPI example
• Use mpirun with MCA parameters to run the example:

  /opt/ti-openmpi/bin/mpirun --mca btl self,sm,tcp -np 8 --host k2node1,k2node2 ./testmpi

• Output messages:

  >>>
  Hello world from processor k2hnode1, rank 3 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode1, rank 0 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode2, rank 5 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 4 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 7 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 6 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode1, rank 1 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode1, rank 2 out of 8 processors
  locally obtained hostname k2hnode1
  <<<
Agenda
• MPI Overview
• Open MPI Architecture
• Open MPI TI Implementation
• Open MPI Run-time Parameters
• Open MPI Usage Example
• Getting Started
33
Getting Started
  Bookmarks and URLs:
  – Download: http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_hpc/latest/index_FDS.html
  – Getting Started Guide: http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_Getting_Started_Guide
  – TI OpenMPI User Guide: http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_OpenMPI
  – Open MPI: Open Source High Performance Computing, Message Passing Interface (http://www.open-mpi.org/)
  – Open MPI Training Documents: http://www.open-mpi.org/video/
  – Support: http://e2e.ti.com/support/applications/high-performance-computing/f/952.aspx