Towards an Exa-scale Operating System*
Ely Levy, The Hebrew University
*Work supported in part by a grant from the DFG
program SPPEXA, project FFMK
Copyright © Amnon Barak 2014
The eXascale challenge
1 exaflop = 1,000,000 (1M) teraflops (double precision).
Low power: ~20 MW (~10% of current systems).
Current top CPUs/GPUs/Phi: ~O(1) teraflops, ~O(10) teraops for integer operations.
To obtain 1M teraflops we need:
~1M CPUs for traditional applications.
~100K processors for integer applications.
The project in a Nutshell
Exascale challenges include:
Scaling
Failures
Load imbalances
Heat and power management
Information collection
Self-organizing platform and applications
Adapt and combine 4 mature technologies:
L4 (microkernel), XtreemFS, MOSIX and MPI
Hardware assumptions
Large numbers of components
• High failure rates
• Not all cores may be active simultaneously due to heat/energy constraints
• Low-power storage for checkpoint state on each node
Usage: manycore nodes with compute and service cores
State of the Art: Static Assignment
[Figure: an exascale system of many-core nodes, with each application statically assigned to its own set of nodes]

Goal: Dynamic Applications
[Figure: the same exascale system, with applications spanning and moving across many-core nodes dynamically]
The information collection problem
Given a cluster with O(1M) nodes (computing servers or mobile devices):
Each node sends 1 message each unit of time. The message contains information about the state of the node and its relevant resources, e.g., availability, load, free memory, temperature.
One master computer regularly collects information about the state of all the nodes and performs management decisions that require system-wide info, e.g., job allocation as in MPI or SLURM, load balancing, IPC optimization. It can be mirrored for fault tolerance.
The problem: how to collect fresh information without overloading the master computer.
Distributed bulletin board
• Information is circulated continually.
• The unit of time is not too small, which reduces communication congestion.
• Recall that some events, e.g. load balancing, are triggered by cluster nodes, not by the master computer.
• Information is available instantly to client processes, even with “relaxed” circulation.
• Example: a watch works continuously and provides the time instantly.
Possible algorithms
Centralized: every unit of time, each node sends a message to the master computer.
Drawback: does not scale well due to communication congestion at the master computer. May be suitable for medium-size configurations, e.g. with thousands of nodes, but is unlikely to be suitable for configurations with millions of nodes.
Hierarchical tree: each node sends its information to its parent node until all the information arrives at the master computer, in O(log #nodes) units of time.
Drawback: sensitive to node failures; logarithmic delay.
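To make the trade-off concrete, here is a minimal sketch of the cost of each scheme (illustrative Python; the function names are my own, not part of any of the systems mentioned):

```python
import math

def centralized_master_load(n_nodes):
    """Messages hitting the master per time unit when every
    node reports directly to it."""
    return n_nodes

def tree_collection_delay(n_nodes, fanout=2):
    """Time units for information to climb a collection tree:
    one tree level per unit, so about log_fanout(n_nodes)."""
    return math.ceil(math.log(n_nodes, fanout))

print(centralized_master_load(10**6))  # 1000000 messages/unit at the master
print(tree_collection_delay(10**6))    # 20 units of delay before info is complete
```

The centralized scheme has no collection delay but O(n) congestion at the master; the tree spreads the load but adds the logarithmic delay and a dependence on every interior node staying alive.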
Randomized Gossip
Distributed Bulletin Board
• Each node keeps a vector with per-node info (its own + info received from others)
• Once per time step, each node sends a subset of its own vector entries (called a “window”) to 1 other randomly selected node
• The receiving node merges window entries into its local vector (if newer)
MOSIX: Gossip Algorithm
Sender's vector: A:0 B:12 C:2 D:4 E:11 ...
Window sent (entries with age ≤ T): A:0 C:2 D:4 ...
Each time unit:
• Update local info
• Find all vector entries up to age T (called a “window”)
• Send the window to 1 randomly selected node
Upon receiving a window:
• Update the received entries' age (+1 for the transfer)
• Update entries in the local vector where newer information has been received
Receiver's vector: A:5 B:2 C:4 D:3 E:0 ...
Received window (ages +1): A:1 C:3 D:5 ... The newer A:1 and C:3 replace A:5 and C:4; D:5 is older than the local D:3 and is discarded.
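The per-time-unit steps above can be sketched as a small simulation (illustrative Python; the class and parameter names are my own, and the merge rule follows the slides: an entry is replaced only when the received copy has a lower age):

```python
import random

random.seed(0)  # fixed seed so the tiny demo below is repeatable

class Node:
    """One node in a MOSIX-style gossip scheme (sketch only)."""

    def __init__(self, node_id, n_nodes, T=4):
        self.id = node_id
        self.n = n_nodes
        self.T = T                  # age threshold defining the window
        self.vector = {node_id: 0}  # node id -> age of the info we hold

    def tick(self):
        """One time unit: age all entries, refresh own info, build a window."""
        for nid in self.vector:
            self.vector[nid] += 1
        self.vector[self.id] = 0    # update local info
        window = {nid: age for nid, age in self.vector.items()
                  if age <= self.T}
        target = random.choice([i for i in range(self.n) if i != self.id])
        return target, window

    def receive(self, window):
        """Merge a received window; ages gain +1 for the transfer."""
        for nid, age in window.items():
            age += 1
            if nid not in self.vector or age < self.vector[nid]:
                self.vector[nid] = age

# Tiny demo: 8 nodes gossiping for 200 time units.
nodes = [Node(i, 8) for i in range(8)]
for _ in range(200):
    sends = [n.tick() for n in nodes]
    for target, window in sends:
        nodes[target].receive(window)
print(all(len(n.vector) == 8 for n in nodes))
```

Because every node re-injects its own entry at age 0 each time unit, fresh information keeps re-entering the windows, and with high probability every node's vector eventually covers the whole cluster.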
A two-layer gossip algorithm
Compute nodes are divided into colonies, based on some optimization criteria, e.g. network proximity.
Colony nodes exchange local information for performing localized management decisions, such as load balancing.
Any client application in need of up-to-date information about the state of the resources in its colony can obtain it directly, locally.
Master computer: collects information from a few nodes in each colony about all of that colony's nodes, and provides it to client processes, e.g. a scheduler.
Outcome: a distributed bulletin board
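A rough sketch of the two-layer idea, assuming fixed-size colonies and a master that samples a few representative nodes per colony (all names here are hypothetical, for illustration only):

```python
import random

def partition_into_colonies(node_ids, colony_size):
    """Split nodes into fixed-size colonies (a stand-in for, e.g.,
    network-proximity-based grouping)."""
    return [node_ids[i:i + colony_size]
            for i in range(0, len(node_ids), colony_size)]

def master_view(colonies, reps_per_colony=2):
    """The master queries only a few representative nodes per colony.
    We assume colony-level gossip has already given each representative
    information about every node in its colony."""
    view = {}
    for colony in colonies:
        reps = random.sample(colony, min(reps_per_colony, len(colony)))
        for rep in reps:
            for node in colony:
                view[node] = {"reported_by": rep}
    return view

nodes = list(range(1000))
colonies = partition_into_colonies(nodes, colony_size=100)
view = master_view(colonies)
print(len(colonies), len(view))  # 10 1000
```

Note the scaling benefit: the master contacts only `reps_per_colony * #colonies` nodes per collection round (20 here) yet obtains state for all 1000, instead of receiving one message from every node.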
Algorithm Trade-offs
• Colony size: fresh but limited information vs. older but wider information
• Colony topology: spread vs. focused node data
• Dissemination rate: fresh information vs. network overhead
• Threshold value: a bigger threshold results in a bigger window; lower average age vs. more data being sent