TRANSCRIPT
A U.S. Department of Energy Office of Science Laboratory Operated by The University of Chicago
Argonne National Laboratory
Office of Science, U.S. Department of Energy
Network Heterogeneity in MPI: Challenges and Perspectives
Guillaume Mercier
www-unix.mcs.anl.gov/~mercierg
Pioneering Science and Technology
Outline
• Problem statement
• Two archetypes
- PACX-MPI
- MPICH-G2
• MPICH-Madeleine
- Architecture
- Usage
- Performance
• Conclusion
Some background…
• February – July 2000: Master's (DEA) internship at LIP, in the INRIA ReMap project (Loïc Prylli)
• November 2000 – February 2002: national service at the Science Office of the French Embassy in Tokyo
• March – August 2002: start of PhD at LIP, in the INRIA ReMap project (Raymond Namyst)
• September 2002 – February 2005: remainder and completion of the PhD in the INRIA RUNTIME project in Bordeaux
• March 2005 – August 2006: postdoc at Argonne National Laboratory, on the MPICH2 development team (William Gropp and Ewing Lusk)
• Research interests:
- Communication in parallel architectures
- High-performance runtime systems
- The MPI standard
Context and problem statement
Not so long ago:
• Clusters were spreading rapidly
• Multiple fast networks were available
- Myrinet, SCI, VIA
• Computing grids were gaining popularity
Logical consequences:
• Emergence of clusters of clusters
- Can be seen as a first step toward grids
• Desire to adapt existing tools
- Many projects involving MPI
Characteristics of clusters of clusters
• Interconnection of clusters
• Hierarchical architectures
• Heterogeneous architectures
[Figure: two clusters of SMP nodes, with intra-cluster communication within each cluster and inter-cluster communication between them]
Relevance of multi-network support
• Clusters of clusters?
- Not as widespread as expected
• The objective may be obsolete, but the underlying problem remains topical:
- Homogeneous or heterogeneous multirail systems
- Multiply-wired clusters
- E.g., the IBM BG/L, #1 on the Top 500, with its Tree and Torus networks
- Computing grids
First example: PACX-MPI
• A software layer on top of MPI
- Transparent to the application
• Gateways for inter-cluster communication
- A potential bottleneck
• Inter-cluster communication based exclusively on TCP/IP
• At least two nodes do not take part in the computation
[Figure: two clusters (nodes 0–3); dedicated gateway nodes relay inter-cluster traffic over TCP]
Second example: MPICH-G2
• Based on MPICH and Globus
- Targets computing grids
• All nodes are interconnected
• Inter-cluster communication based on TCP/IP
• Optimized, topology-aware collective operations
[Figure: nodes 0–3, fully connected across clusters over TCP]
Common characteristics
• An interoperability-oriented scheme
- Several MPI implementations communicating with each other
- An illustration of the "MPI Paradox"
- Intra-cluster communication through a local (optimized) MPI
- Inter-cluster communication over TCP (optimized?)
• Fast inter-cluster links go unused
• Gateways go unused
• Not really multi-network!
- Typical example: handling multiply-wired clusters
- The underlying vendor MPIs must provide the capability themselves
[Figure: two clusters, one on Myrinet and one on SCI, linked over TCP]
Objectives and assumptions
Objectives:
• Optimal exploitation of the underlying hardware
• Application-level control over the hardware configuration
- The intelligence lies with the programmer, not the middleware
- The software is not forbidden from making decisions
- We want something user-friendly
Assumptions:
• Target scale: clusters of clusters
• The following issues are out of scope:
- Resource reservation
- Deployment
- Authentication and security
MPICH-Madeleine
• An original approach
- A unified stack (stand-alone software)
- Original at the time, later adopted notably by VMI
- A multithreaded progress engine
- Previous attempts had failed (except MPI/Pro?)
- Partially multithreaded engines exist (SCI-MPICH, Open MPI)
• A flavor of MPI supporting complex configurations
- Exploits fast intra-cluster networks
- Exploits gateways and high-bandwidth inter-cluster links
• An efficient MPI implementation for multi-protocol configurations
- Homogeneous clusters of multiprocessor machines
- Heterogeneous clusters of clusters
- Grids excluded from the scope of the study (a priori; see the assumptions)
MPICH-Madeleine architecture
[Figure: modular, layered architecture — parallel application / MPI implementation (data structures, message queues) / a single communication module with an SMP sub-module and a multithreaded network sub-module / the generic interface (Madeleine library, lightweight threads) / network drivers: SCI, GM, TCP, shared memory]
• A modular architecture
Architectural principles
• Use of lightweight threads
- Implements a communication progress engine
- Multi-protocol management:
- Simplified
- Extensible
• Use of a generic communication interface
- An opaque interface
- Efficient exploitation of the underlying networks
- Low-level services for managing clusters of clusters (gateways, routing)
Advantages of multithreading
• Unified polling mechanisms
• Portability
- "engine-driven" communication on any kind of hardware
• Extensibility
- New protocols are easy to add
• Improved communication progress
- Asynchronous non-blocking calls
- Computation/communication overlap
- Communication progress decoupled from the application
• Lightweight threads available at the application level
- MPI_THREAD_MULTIPLE supported
The Madeleine library
• A high-performance communication library
- The communication subsystem of PM2
• A message-passing-oriented interface
- Incremental message construction
- A small interface
• Interesting properties
- Usable in multithreaded environments
- Dynamic selection of the best transfer method
- Support for clusters of clusters (routing, retransmission)
• The basic communication objects: channels
- A set of connections
- Similar to an MPI communicator
[Figure: a channel is a set of connections between processes]
Channels in Madeleine
• Two main families of channels
- Physical channels
- Virtual channels
- built exclusively on top of physical channels
• Creation of heterogeneous virtual networks
- Exploits the gateways
• Managed through configuration files
- Flexibility
- Networks can be switched without recompiling
[Figure: a virtual channel spanning a Myrinet channel and an SCI channel]
Multi-protocol support
• A lightweight thread is associated with each communication channel
- Bonus: threads can also be used at the application level
- Special case: the default channel
• A channel is associated with a communicator
- Topology can be manipulated at the application level
• A small interface:
- MPI_COMM_NUMBER: the number of available channels
- MPI_USER_COM[i]: the communicator associated with channel #i
- MPI_GetCommFromName: access to a communicator by its channel name
• The interface is optional
- Needed only to take advantage of the underlying topology
• Portability is preserved
- No change to MPI communicator semantics
- All existing programs can use MPICH-Madeleine without modification
- MPI_GetCommFromName is implemented with calls to the MPI library
- Use attributes instead?
Using MPICH/Madeleine
• All the information is given to MPI through configuration files
- A first file describes the physical layout of the networks (machine/network mapping)
- Written once and for all at installation time (and whenever the cluster's physical configuration changes)
- A second file describes the channel/network mapping
- A different file for each run
Using MPICH/Madeleine
• Example configuration:
- 2 nodes with SCI
- 2 nodes with GM/Myrinet
- Gigabit Ethernet between the two clusters
[Figure: nodes 0–1 on SCI, nodes 2–3 on GM, all four linked by Gigabit Ethernet]
The networks file
networks : ({ name  : tcp_net;
              hosts : (node0,node1,node2,node3);
              dev   : tcp; },
            { name  : sci_net;
              hosts : (node0,node1);
              dev   : sisci; },
            { name  : gm_net;
              hosts : (node2,node3);
              dev   : gm; });
The channels file
channels : ({ name  : my_sci;
              net   : sci_net;
              hosts : (node0,node1); },
            { name  : my_gm;
              net   : gm_net;
              hosts : (node2,node3); },
            { name  : my_tcp;
              net   : tcp_net;
              hosts : (node0,node1,node2,node3); });

vchannels : { name     : default;
              channels : (my_sci,my_gm,my_tcp); };
Channel construction
• Networks can be switched without recompiling
- The library must be compiled with support for all the networks
• Several physical channels can be declared on top of the same network
• A virtual channel "consumes" the physical channels it is built on
- The application can no longer access them directly
- Important: declaration order = priority order
Comparison with MPICH-G2
• Test environment
- An SCI cluster connected to a Myrinet cluster
• Direct inter-cluster communication
• MPICH-Madeleine is used as the vendor MPI for MPICH-G2
• G2's inter-cluster mechanisms are compared with Madeleine's virtual channels
[Figure: nodes 0–1 on SCI, nodes 2–3 on GM, linked by Gigabit Ethernet]
Why MPICH-Madeleine as the vendor MPI?
[Figure: throughput (MB/s, up to ~80) vs. message size (bytes) for four configurations: G2/MAD/GM => G2/MAD/GBE; G2/MPICH-GM => G2/MPICH-P4/GBE; G2/MPICH-VMI/GM => G2/MPICH-P4/GBE; G2/MPICH-GM => G2/MPICH-VMI/GBE]
Comparison with MPICH-G2: results
[Figure: throughput (MB/s, 0–120) vs. message size (bytes, 1 B to 4 MB) for MPICH-G2, MPICH-Madeleine (GBE), and the SCI-GM gateway]
Comparison with MPICH-G2: conclusions
• The results depend on the vendor MPIs
- Plugging in MPICH-Madeleine removes that bias
• A comparison between the unified stack and the interoperable approach
- The unified stack wins
- But: these are micro-benchmarks
- But: this is a short-distance context (clusters of clusters)
Comparison with PACX-MPI
• Test environment
- Two SCI clusters connected to each other
- Gateways: Gigabit Ethernet, Myrinet, direct SCI-SCI
• Inter-cluster communication with retransmission
• PACX-MPI's gateways are compared with Madeleine's
[Figure: two SCI clusters of two nodes each, connected through GBE/GM gateway links]
Comparison with PACX-MPI: results
[Figure: throughput (MB/s, 0–120) vs. message size (bytes, 1 B to 4 MB) for PACX-MPI (SCI-GBE-SCI) and the Madeleine gateways: SCI-GBE-SCI, SCI-GM-SCI, and direct SCI-SCI]
Comparison with PACX-MPI: conclusions
• One gateway is better than two (obviously)
• The bottleneck is the use of TCP
- PACX-MPI and MPICH-Madeleine get similar results
- But MPICH-Madeleine can use other protocols
• The TCP tested was the "standard" version
- What about an optimized one?
- What constraints does long distance impose?
• This is a micro-benchmark (ping-pong)
- What about tests in real conditions?
- Which applications should be used?
High-Performance Linpack
[Figure: Gflop/s (0–14) vs. problem size (8000, 10000, 12000) for MPICH-P4, MPICH-MAD(GBE), MPICH-GM, MPICH-MAD(GM), and MPICH-MAD(SCI)]
HPL on two dual-processor nodes with HyperThreading
High-Performance Linpack (2)
[Figure: Gflop/s (0–14) vs. problem size (8000, 10000, 12000) for MPICH-P4, MPICH-MAD(GBE), MPICH-GM, MPICH-MAD(GM), SCI-MPICH, and MPICH-MAD(SCI)]
HPL on two dual-processor nodes without HyperThreading
High-Performance Linpack (3)
[Figure: Gflop/s (0–22) vs. problem size (8000, 12000, 16000, 20000) for MPICH-P4, MPICH-MAD(GBE), and MPICH-MAD(SCI)]
HPL on four dual-processor nodes with HyperThreading
Independent communication progress
• Example: a ring of processes
• Non-blocking calls: Isend
• The rendezvous protocol case
[Figure: message flow around the ring, comparing a non-blocking but non-asynchronous implementation with MPICH-Madeleine]
Results

MPI version            Total time (eager)   Total time (rendezvous)
MPICH-GM               11.5 s               130.8 s
MPICH-Madeleine/GM     11.8 s               19.2 s

MPI version            Total time (eager)   Total time (rendezvous)
MPICH-P4               5.6 s                45.5 s
MPICH-Madeleine/TCP    8.1 s                9.5 s

• A ring of eight processes
• 2 KB messages (eager) and 1 MB messages (rendezvous)
Conclusion
• The unified stack and the interoperable approach are easy to compare
- Different architectures and implementations, yet comparable
• No comparison between models
- Gateways vs. fully connected
- Too many possible biases
- A lack of tools
- MPICH-Madeleine supports both models
- So does MetaMPICH (recently)
• No studies on the subject
- Too dependent on the (application, hardware architecture) pair
- Belongs in a long-distance setting
Perspectives: grid support in MPICH-Madeleine
• Grids were a priori excluded from the scope of the study
- Multi-network management is independent of scale
- MPICH-Madeleine flattens the hierarchy
- A valid strategy for clusters of clusters
- Not a viable strategy for grids
• Options
- Use MPICH-Madeleine in a broader environment
- The MPICH-G2 port is already operational
- Tools for Grid'5000
- Adapt MPICH-Madeleine
- To revisit: the SAN/WAN articulation
- Study the programming models
- Revisit deployment with respect to topology
New Communication Core : Nemesis
• Provide an infrastructure to answer basic questions about scaling MPI implementations
- What is the overhead of MPI?
- Typically, one measures some MPI implementation, then claims that is the overhead of MPI; this confuses an implementation with a specification
• Our goal: develop a fast, well-instrumented, and carefully analyzed communication core
- Answer questions about the overhead and cost of MPI
- Provide a higher-performance, lower-latency open MPI
New MPICH2 Communication Device
• Current work is developing a "channel" for the ch3 device
• Key features
- Shared memory is treated as a first-class citizen
- Lock-free queues
- Low latency
- Extremely scalable
- Multi-method
- New networks are easy to add
- 5 required functions: init, finalize, direct_send, poll_send, poll_recv
- Optional functions for RMA and collectives, for enhanced performance
- Follows the standard MPICH approach, which allows easy initial ports followed by performance tuning
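To make the porting surface concrete, here is a hedged C sketch of such a five-function module table. The function names come from the slide; the struct layout, the dummy module, and netmod_run_once are illustrative assumptions, not the actual MPICH2 source.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-network function table (the five names are from the
   slide; the struct itself is hypothetical, not the MPICH2 definition). */
typedef struct netmod_funcs {
    int (*init)(void);
    int (*finalize)(void);
    int (*direct_send)(int dest, const void *buf, size_t len);
    int (*poll_send)(void);
    int (*poll_recv)(void);
} netmod_funcs_t;

/* A do-nothing module that counts calls, standing in for a real network:
   porting a new network amounts to filling in these five slots. */
static int netmod_calls = 0;
static int dummy_init(void)     { netmod_calls++; return 0; }
static int dummy_finalize(void) { netmod_calls++; return 0; }
static int dummy_send(int dest, const void *buf, size_t len)
{ (void)dest; (void)buf; (void)len; netmod_calls++; return 0; }
static int dummy_poll(void)     { netmod_calls++; return 0; }

static const netmod_funcs_t dummy_mod = {
    dummy_init, dummy_finalize, dummy_send, dummy_poll, dummy_poll
};

/* The progress engine only ever reaches a module through its table. */
static int netmod_run_once(const netmod_funcs_t *mod)
{
    int rc = 0;
    rc |= mod->init();
    rc |= mod->direct_send(1, "hi", 2);
    rc |= mod->poll_send();
    rc |= mod->poll_recv();
    rc |= mod->finalize();
    return rc;
}
```

Keeping the required surface this small is what makes an initial port cheap; the optional RMA and collective hooks can then be added for tuning.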
Basic Design
• Each process features several queues in shared memory that handle most of the communication:
- One free queue
- One recv queue
• Each process also features a pair of "fast boxes" for each destination process
- Used for small messages
- Best achievable latency
- Lowest instruction count
• Plus network modules for non-shared-memory communication
- They use the processes' free and recv queues
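A fast box can be pictured as a single fixed-size slot per (sender, receiver) pair, guarded by a full/empty flag that the receiver polls; when the box is busy or the message is too large, the sender falls back to the shared queues. A minimal sketch, assuming GCC's `__atomic` builtins; the type names and the 64-byte payload size are illustrative, not the MPICH2 values.

```c
#include <assert.h>
#include <string.h>

#define FBOX_PAYLOAD 64   /* hypothetical small-message cutoff */

typedef struct {
    int  full;            /* 0 = empty, 1 = holds a message */
    int  len;
    char data[FBOX_PAYLOAD];
} fastbox_t;

/* Sender side: succeeds only if the box is free; otherwise the caller
   would fall back to the lock-free recv queue. */
static int fbox_try_send(fastbox_t *fb, const void *buf, int len)
{
    if (len > FBOX_PAYLOAD || __atomic_load_n(&fb->full, __ATOMIC_ACQUIRE))
        return 0;
    memcpy(fb->data, buf, len);
    fb->len = len;
    __atomic_store_n(&fb->full, 1, __ATOMIC_RELEASE);  /* publish */
    return 1;
}

/* Receiver side: polled by the progress engine; returns the message
   length, or -1 when the box is empty. */
static int fbox_try_recv(fastbox_t *fb, void *buf)
{
    int len;
    if (!__atomic_load_n(&fb->full, __ATOMIC_ACQUIRE))
        return -1;
    len = fb->len;
    memcpy(buf, fb->data, len);
    __atomic_store_n(&fb->full, 0, __ATOMIC_RELEASE);  /* release slot */
    return len;
}
```

The appeal is the instruction count: the fast path is one flag check, one memcpy, and one store, with no queue manipulation at all.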
Lock-Free Queues
• Low latency
- No locks
- Uses compare-and-swap and swap atomic instructions
- Simple implementation
- Enqueue: 6 instructions, 1 L2 cache miss
- Dequeue: 11 instructions, 1–2 L2 cache misses
- The progress engine has only one queue to poll
• Extremely scalable
- Each process needs only two queues, regardless of the number of processes
- A recv queue
- A free queue
• The same queue mechanism is used for networks
- Messages received from the networks are enqueued on the recv queue
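The swap-based enqueue and CAS-based dequeue described above can be sketched as follows, again assuming GCC's `__atomic` builtins. The real Nemesis queue lives in shared memory and works on offsets rather than raw pointers; this pointer-based, single-consumer version only illustrates the mechanism.

```c
#include <assert.h>
#include <stddef.h>

typedef struct cell {
    struct cell *next;
    int payload;
} cell_t;

typedef struct {
    cell_t *head;   /* dequeue end (single consumer assumed) */
    cell_t *tail;   /* enqueue end (multiple producers) */
} lf_queue_t;

static void lf_init(lf_queue_t *q) { q->head = q->tail = NULL; }

/* Enqueue: one atomic swap of the tail, then link the old tail to the
   new cell -- the short instruction path the slide refers to. */
static void lf_enqueue(lf_queue_t *q, cell_t *c)
{
    cell_t *prev;
    c->next = NULL;
    prev = __atomic_exchange_n(&q->tail, c, __ATOMIC_ACQ_REL);
    if (prev == NULL)
        __atomic_store_n(&q->head, c, __ATOMIC_RELEASE);  /* was empty */
    else
        __atomic_store_n(&prev->next, c, __ATOMIC_RELEASE);
}

/* Dequeue: CAS is only needed when removing the last element, to race
   correctly against a concurrent enqueue swapping the tail. */
static cell_t *lf_dequeue(lf_queue_t *q)
{
    cell_t *c = __atomic_load_n(&q->head, __ATOMIC_ACQUIRE);
    cell_t *next, *expected;
    if (c == NULL)
        return NULL;
    next = __atomic_load_n(&c->next, __ATOMIC_ACQUIRE);
    if (next != NULL) {
        __atomic_store_n(&q->head, next, __ATOMIC_RELEASE);
    } else {
        q->head = NULL;
        expected = c;
        if (!__atomic_compare_exchange_n(&q->tail, &expected, NULL, 0,
                                         __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
            /* An enqueue won the race: wait for it to link the new cell. */
            while ((next = __atomic_load_n(&c->next, __ATOMIC_ACQUIRE)) == NULL)
                ;
            __atomic_store_n(&q->head, next, __ATOMIC_RELEASE);
        }
    }
    return c;
}
```

Because a process owns exactly one recv queue and one free queue, polling cost stays constant as the job grows, which is the scalability claim above.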
Network Modules and Shared Queues
[Figure: two processes, each surrounded by several network modules feeding into its lock-free queues]
Multi-method Receive
• A network module receives a message and queues it on the recv queue
• Where does the network module get the queue entry?
- The entry must be in shared memory, because other SMP processes need to enqueue behind it
- Each network module has its own free queue
• How do we avoid memcopies?
- Register all queue entries
- The network needs to be able to register shared memory
- No data needs to be copied: simply enqueue the received packet
Multiple Network Multi-method Issues
• For networks to send and receive directly from/into queue elements, the elements must be registered with each network
- Can multiple networks register the same memory region? (probably not)
• Solution:
- Only one network will be able to register
- The others will have to copy
• Motivation:
- There is probably only one high-speed local network
- Inter-cluster networks can afford an extra copy
Multiple Networks
[Figure: several network modules; one shares the process's recv and free queues directly, while the others must memcpy into them]
Performance on 32-bit processors
• Communication over shared memory
• Intel Pentium 4 Xeon processors, 2 GHz
• Netpipe test program:
- Minimum latency: 0.68 µs
- Maximum throughput: 650 MB/s
Performance on 64-bit processors
• Communication over shared memory
• AMD Opteron 280 processors, 2.4 GHz, dual-core
• Netpipe test program:
- Minimum latency: 0.34 µs
- Maximum throughput: 1.37 GB/s
Performance on 32-bit processors
• Communication over Myrinet/GM
• Intel Pentium 4 Xeon processors, 2 GHz
• Netpipe test program
Performance on 32-bit processors
• Communication over TCP/Gigabit Ethernet
• Intel Pentium 4 Xeon processors, 2 GHz
• Netpipe test program