gpu cloud with job scheduler and container

Serverless GPU Cloud with Job scheduler and Container. Andrew Yongjoon Kong, Cloud Computing Cell, kakao. [email protected]

Upload: andrew-yongjoon-kong

Post on 28-Jan-2018


Page 1: GPU cloud with Job scheduler and Container

Serverless GPU Cloud with Job scheduler and Container

Andrew Yongjoon Kong

Cloud Computing Cell, kakao

[email protected]

Page 2

Who am I

Andrew Yongjoon Kong

• Cloud Technical Advisory for Government Broadcast Agency
• Adjunct Prof., Ajou Univ.
• Korea Data Base Agency, Acting Professor for Big Data
• Member of National Information Agency Big Data Advisory Committee
• Kakao → Daum Kakao → Kakao Corp., Cloud Computing Cell lead
• Talks
  • Embrace clouds (2017, OpenStack Days, Korea)
  • Full route based network with Linux (2016, netdev, Tokyo)
  • SDN without SDN (2015, OpenStack, Vancouver)

[Book cover images: supervised, Korean edition; Korean edition. 2nd editions are coming…]

Page 3

Serverless computing is rising

Page 4

Serverless computing, GPU

Page 5

Serverless computing, GPU, Docker

Page 6

Serverless framework

Lots of serverless frameworks:
• Apache OpenWhisk
• Iron.io
• OpenStack’s Picasso
• Gestalt (based on DC/OS)
• Fission (based on Kubernetes)
• Runway (kakao’s private FaaS)

What is the purpose of these frameworks?
• Connecting, mostly
• Flow and automation

Page 7

Serverless framework

Connection is a real virtue in the public cloud
• There is no resource depletion in the public cloud.
• Connection/automation is directly related to cost savings.

In a private cloud
• Engineers scream for resources (especially GPUs).
• The thing is that “the winner takes it all”
  → scheduling needs care

Page 8

Job scheduler

Scheduling users’ jobs based on an algorithm
• FIFO
• Fair Share
• BackFill
• Preemption
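To make the difference between these policies concrete, here is a minimal Python sketch. It is not taken from the talk or from any real scheduler: the `Job` fields, the function names, and the GPU-count-based notion of "share" are all assumptions for illustration. The backfill shown is simplified (it ignores walltime reservations, which real backfill uses to avoid delaying the job at the head of the queue), and preemption is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job record, for illustration only.
    name: str
    user: str
    gpus: int        # GPUs requested
    submitted: int   # submission order (smaller = earlier)


def fifo_order(queue):
    """FIFO: run jobs strictly in submission order."""
    return sorted(queue, key=lambda j: j.submitted)


def fair_share_order(queue, used_gpus=None):
    """Fair share: the user who has consumed the fewest GPUs goes next.

    Ties fall back to submission order. `used_gpus` maps user -> GPUs
    already consumed (an assumed, simplified usage metric).
    """
    usage = defaultdict(int, used_gpus or {})
    remaining = fifo_order(queue)
    ordered = []
    while remaining:
        nxt = min(remaining, key=lambda j: (usage[j.user], j.submitted))
        ordered.append(nxt)
        usage[nxt.user] += nxt.gpus
        remaining.remove(nxt)
    return ordered


def backfill(queue, free_gpus):
    """Simplified backfill: walk the FIFO order and start whatever fits now.

    A small job may start ahead of a larger job that is blocked on GPUs.
    """
    started = []
    for job in fifo_order(queue):
        if job.gpus <= free_gpus:
            started.append(job)
            free_gpus -= job.gpus
    return started
```

With a queue of two 4-GPU jobs from one user followed by a 2-GPU job from another, FIFO keeps submission order, fair share moves the second user's job ahead of the first user's second job, and backfill with 6 free GPUs starts the first and third jobs.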

Page 9

Job

A job comprises two parts
• The resources
  • CPU, compute nodes, memory, disk, and even walltime
  • The job scheduling system manages the quota per queue, user, and user group
• The runnable execution
  • Traditionally, the executable command
  • e.g. saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve
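The two-part structure can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, and the CPU-count quota rule below are hypothetical and do not come from the talk or from any specific scheduler.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Resources:
    # Part 1: what the job asks for (hypothetical field names).
    nodes: int
    cpus_per_node: int
    memory_gb: int
    walltime_s: int


@dataclass
class Job:
    user: str
    queue: str
    resources: Resources
    command: List[str]   # Part 2: the runnable execution


# Quota tables keyed by ("user", name) or ("queue", name); values are CPU counts.
QuotaTable = Dict[Tuple[str, str], int]


def within_quota(job: Job, quota_cpus: QuotaTable, used_cpus: QuotaTable) -> bool:
    """Return True if the job fits every quota that applies to it.

    A missing quota entry is treated as "no limit" (an assumption).
    """
    need = job.resources.nodes * job.resources.cpus_per_node
    for key in (("user", job.user), ("queue", job.queue)):
        limit = quota_cpus.get(key)
        if limit is not None and used_cpus.get(key, 0) + need > limit:
            return False
    return True


# The command from the slide, split into an argument list.
example = Job(
    user="alice",
    queue="default",
    resources=Resources(nodes=1, cpus_per_node=2, memory_gb=8, walltime_s=59),
    command=shlex.split(
        "saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve"
    ),
)
```

Keeping the resource request separate from the command is what lets a scheduler reason about quotas without knowing anything about what the command does.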

Page 10

Job sample

Sample job script

#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
cd /home/rcf-proj3/pv/test/
mkdir /test/test/dir
source /usr/usc/sas/default/setup.sh
sas my.sas

• resource: the #PBS -l lines
• execution: the commands that follow them

The traditional issue is how to distribute the command and the data (you can’t specify a node in a batch system).

Page 11

Job scheduler system layout

A shared file system can handle the file-locating issue.
→ But a shared file system is too expensive.
→ In a modern environment this is much easier with containers.

[Figure: cluster layout, from http://beagle.ci.uchicago.edu/using-beagle/]

Figure annotation: this could be replaced by a container and a registry.

Page 12

Job scheduler system, GPU and Container

• Add the GPU resource to the job script.
• Use NVIDIA Docker for the command.
• …then the scheduler will do the job.

#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
#PBS -l gpus=8
NV_GPU=$NV_GPU nvidia-docker run --net host -e PASSWORD=root -e USERNAME=root -e PORT=$PORT idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev

[Diagram: docker registry; master (scheduler) receiving a job; three compute nodes, each with a GPU]
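Since such a script is just the resource request plus one container command, it can be generated from a few parameters. The Python sketch below is an illustration, not the tooling used at kakao: `render_pbs_script` and its parameters are hypothetical names, and the directive and command spellings are copied from the slide without checking which scheduler versions accept them. Values are inserted without shell quoting so that `$PORT` stays a shell variable.

```python
def render_pbs_script(nodes, ppn, walltime, gpus, image, env):
    """Render a PBS job script that runs one container via nvidia-docker.

    `env` maps environment variable names to values passed with `-e`;
    values are inserted verbatim (no shell quoting).
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        f"#PBS -l walltime={walltime}",
        f"#PBS -l gpus={gpus}",
    ]
    env_flags = " ".join(f"-e {key}={value}" for key, value in env.items())
    lines.append(
        f"NV_GPU=$NV_GPU nvidia-docker run --net host {env_flags} {image}"
    )
    return "\n".join(lines) + "\n"


script = render_pbs_script(
    nodes=1,
    ppn=2,
    walltime="00:00:59",
    gpus=8,
    image="idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev",
    env={"PASSWORD": "root", "USERNAME": "root", "PORT": "$PORT"},
)
```

Printing `script` reproduces the job script shown on the slide; changing only `gpus` or `image` gives a new job without touching the rest.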

Page 13

AI Development Cycle over compute resource

• Develop model on personal env.
• Training model on large scale with massive data
• Inference through the model

Abstract these to “job(resource, executor)”. The output is abstracted to a container.

[Diagram: JOB abstraction; docker registry; master (scheduler); four compute nodes, each with a GPU]

Page 14

AI Service

• The output is abstracted to a container.
• By the way, you also need GPUs and other IT resources to show your work to the public.
• And what about monitoring and alerts?
• The good thing is that if you package your work as a container → kakao cloud can help you.

Page 15

kakao cloud

[Architecture diagram; labels:]
• Service Repo.
• Service catalog
• Notification
• Scheduling
• IaaS: KRANE
• Centralized Measuring System: KEMI
• Centralized Deploying System: DKOS
• Management Plane
• Data Center Control/Data plane
• Event / Alert
• Initial Setup
• Change
• IT operations. IT services.

Page 16

Some numbers about kakao cloud

2016.8
• 1563 projects
• 632 pull requests since 2014.9
• About 88 VMs created/deleted per day
• 8703 VMs

2017.9
• 2,xxx projects
• 913 pull requests since 2014.9
• About 100 VMs created/deleted per day
• 17,xxx VMs

9x,xxx active cores

Page 17

Some information about kakao cloud

2016.8
• From Grizzly to Kilo: upgraded 5 times
• 4 regions in total
• Additional services: Heat/Trove/Sahara

2017.10
• From Grizzly to Mitaka: upgraded 7 times
• 4 regions in total
• Additional services: Heat/Trove/Octavia/Barbican

Page 18

Event monitoring/alert platform in kakao, KEMI

[Architecture diagram; labels:]
• Monitoring sources: physical servers, virtual instances, containers, others (switches, logs)
• KEMI
• KEMI stats, KEMI log
• IMS (kakao CMDB API)
• SB
• Rule Engine
• Notification
• ETL
• Data Center Information abstraction layer
• API
• Data Center (or Service) Management Activity: predicting, scheduling, control
• OpenStack Heat
• Other Service API

Page 19

Deployment abstraction in Kakao, DKOS

[Diagram; labels:]
• User: defines resource
• Service Catalogue
• Centralized Deploying System (DKOS)
• Resource Pool
• Queue
• Scheduler
• Manager
• Data Center: VM, PM, container

Page 20

DKOS Architecture

Page 21

Services over DKOS

Page 22

DKOS Situation

• Active clusters: 3 digits

• Total compute nodes: 4 digits (VM + PM)

• Container count: 5 digits

• Managed by?

Page 23

DKOS Situation

• Why use DKOS (containers)?
  • Containers are easy
  • Containers are cool
  • DC/OS is great

• Nope!
  • It is the summit of the integrated/automated infra service API
    • Authentication, authorization, compute resources, network, volumes
    • Metering, logging
    • Monitoring, notifications

Page 24

kakao cloud now supports GPU as well

[Architecture diagram; labels:]
• Service Repo.
• Service catalog
• Notification
• Scheduling
• IaaS: KRANE
• Centralized Measuring System: KEMI
• Centralized Deploying System: DKOS
• Management Plane
• Data Center Control/Data plane
• Event / Alert
• Initial Setup
• Change
• IT operations. IT services.

Page 25

Thanks

Page 26

Where are you from a CMMI-Cloud perspective?

For CMM4, time to embrace Clouds, not a Cloud

• CMM0: legacy. Output: cloud TF
• CMM1: self-service dev resources. Output: krane (OpenStack cloud)
• CMM2: limited prod resources. Output: kemi (MaaS)
• CMM3: automated cloud usage. Output: DKOS (CaaS)
• CMM4: manual cloud usage. Output: --
• CMM5: federated cloud usage. Output: --