Towards an Exa-scale Operating System*
Ely Levy, The Hebrew University
*Work supported in part by a grant from the DFG
program SPPEXA, project FFMK
Copyright © Amnon Barak 2014
The eXascale challenge
1 exaflop = 1,000,000 (1M) teraflops (double precision).
Low power: ~20 MW (~10% of current systems).
Current top CPUs/GPUs/Phi: ~O(1) teraflops, ~O(10) teraops for integer operations.
To obtain 1M teraflops we need:
~1M CPUs for traditional applications.
~100K processors for integer applications.
The project in a Nutshell
Exascale challenges include:
Scaling
Failures
Load imbalances
Heat and power management
Information collection
Self-organizing platform and applications
Adapt and combine 4 mature technologies:
L4 (microkernel), XtreemFS, MOSIX and MPI
Hardware assumptions
Large numbers of components
• High failure rates
• Not all cores may be active simultaneously due to heat/energy constraints
• Low-power storage for checkpoint state on each node
Usage: manycore nodes with compute and service cores
State of the Art: Static Assignment
[Figure: an exascale system of many-core nodes, with each application statically assigned to its own set of nodes]

Goal: Dynamic Applications
[Figure: the same exascale system, with applications spanning and moving across many-core nodes dynamically]
The information collection problem
Given a cluster with O(1M) nodes (computing servers or mobile devices):
Each node sends 1 message each unit of time. The message contains information about the state of the node and its relevant resources, e.g., availability, load, free memory, temperature.
One master computer regularly collects information about the state of all the nodes and performs management decisions that require system-wide info, e.g., job allocation as in MPI or SLURM, load balancing, IPC optimization. It can be mirrored for fault tolerance.
The problem: how to collect fresh information without overloading the master computer.
Distributed bulletin board
• Information is circulated continually.
• The unit of time is not too small, which reduces communication congestion.
• Recall that some events, e.g. load balancing, are triggered by cluster nodes, not by the master computer.
• Information is available instantly to client processes, even with “relaxed” circulation.
• Example: a watch works continuously and provides the time instantly.
Possible algorithms
Centralized: every unit of time, each node sends a message to the master computer.
Drawback: does not scale well due to communication congestion at the master computer. May be suitable for medium-size configurations, e.g. with thousands of nodes, but is unlikely to be suitable for configurations with millions of nodes.
Hierarchical tree: each node sends its information to its parent node until all the information arrives at the master computer, in O(log #nodes) units of time.
Drawback: sensitive to node failures; logarithmic delay.
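To make the trade-off concrete, here is a minimal sketch of the cost of each scheme (illustrative Python; the function names are my own, not part of any of the systems mentioned):

```python
import math

def centralized_master_load(n_nodes):
    """Messages hitting the master per time unit when every
    node reports directly to it."""
    return n_nodes

def tree_collection_delay(n_nodes, fanout=2):
    """Time units for information to climb a collection tree:
    one tree level per unit, so about log_fanout(n_nodes)."""
    return math.ceil(math.log(n_nodes, fanout))

print(centralized_master_load(10**6))  # 1000000 messages/unit at the master
print(tree_collection_delay(10**6))    # 20 units of delay before info is complete
```

The centralized scheme has no collection delay but O(n) congestion at the master; the tree spreads the load but adds the logarithmic delay and a dependence on every interior node staying alive.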
Randomized Gossip
Distributed Bulletin Board
• Each node keeps a vector with per-node info (its own + info received from others)
• Once per time step, each node sends a subset of its own vector entries (called a “window”) to 1 other randomly selected node
• The receiving node merges window entries into its local vector (if newer)
MOSIX: Gossip Algorithm
Sender's vector: A:0 B:12 C:2 D:4 E:11 ...
Window sent (entries with age ≤ T): A:0 C:2 D:4 ...
Each time unit:
• Update local info
• Find all vector entries up to age T (called a “window”)
• Send the window to 1 randomly selected node
Upon receiving a window:
• Update the received entries' age (+1 for the transfer)
• Update entries in the local vector where newer information has been received
Receiver's vector: A:5 B:2 C:4 D:3 E:0 ...
Received window (ages +1): A:1 C:3 D:5 ... The newer A:1 and C:3 replace A:5 and C:4; D:5 is older than the local D:3 and is discarded.
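The per-time-unit steps above can be sketched as a small simulation (illustrative Python; the class and parameter names are my own, and the merge rule follows the slides: an entry is replaced only when the received copy has a lower age):

```python
import random

random.seed(0)  # fixed seed so the tiny demo below is repeatable

class Node:
    """One node in a MOSIX-style gossip scheme (sketch only)."""

    def __init__(self, node_id, n_nodes, T=4):
        self.id = node_id
        self.n = n_nodes
        self.T = T                  # age threshold defining the window
        self.vector = {node_id: 0}  # node id -> age of the info we hold

    def tick(self):
        """One time unit: age all entries, refresh own info, build a window."""
        for nid in self.vector:
            self.vector[nid] += 1
        self.vector[self.id] = 0    # update local info
        window = {nid: age for nid, age in self.vector.items()
                  if age <= self.T}
        target = random.choice([i for i in range(self.n) if i != self.id])
        return target, window

    def receive(self, window):
        """Merge a received window; ages gain +1 for the transfer."""
        for nid, age in window.items():
            age += 1
            if nid not in self.vector or age < self.vector[nid]:
                self.vector[nid] = age

# Tiny demo: 8 nodes gossiping for 200 time units.
nodes = [Node(i, 8) for i in range(8)]
for _ in range(200):
    sends = [n.tick() for n in nodes]
    for target, window in sends:
        nodes[target].receive(window)
print(all(len(n.vector) == 8 for n in nodes))
```

Because every node re-injects its own entry at age 0 each time unit, fresh information keeps re-entering the windows, and with high probability every node's vector eventually covers the whole cluster.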
A two-layer gossip algorithm
Compute nodes are divided into colonies, based on some optimization criteria, e.g. network proximity.
Colony nodes exchange local information for performing localized management decisions, such as load balancing.
Any client application in need of up-to-date information about the state of the resources in its colony can obtain it directly, locally.
Master computer: collects information from a few nodes in each colony about all of that colony's nodes, and provides it to client processes, e.g. a scheduler.
Outcome: a distributed bulletin board
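A rough sketch of the two-layer idea, assuming fixed-size colonies and a master that samples a few representative nodes per colony (all names here are hypothetical, for illustration only):

```python
import random

def partition_into_colonies(node_ids, colony_size):
    """Split nodes into fixed-size colonies (a stand-in for, e.g.,
    network-proximity-based grouping)."""
    return [node_ids[i:i + colony_size]
            for i in range(0, len(node_ids), colony_size)]

def master_view(colonies, reps_per_colony=2):
    """The master queries only a few representative nodes per colony.
    We assume colony-level gossip has already given each representative
    information about every node in its colony."""
    view = {}
    for colony in colonies:
        reps = random.sample(colony, min(reps_per_colony, len(colony)))
        for rep in reps:
            for node in colony:
                view[node] = {"reported_by": rep}
    return view

nodes = list(range(1000))
colonies = partition_into_colonies(nodes, colony_size=100)
view = master_view(colonies)
print(len(colonies), len(view))  # 10 1000
```

Note the scaling benefit: the master contacts only `reps_per_colony * #colonies` nodes per collection round (20 here) yet obtains state for all 1000, instead of receiving one message from every node.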
Algorithm Trade-offs
• Colony size: fresh but limited information vs. older but wider information
• Colony topology: spread vs. focused node data
• Dissemination rate: fresh information vs. network overhead
• Threshold value: a bigger threshold results in a bigger window; lower average age vs. more data being sent