Integration of OpenStack and Amazon Web Service into local batch job system

Wataru Takase, Tomoaki Nakamura, Koichi Murakami, Takashi Sasaki

Computing Research Center, KEK, Japan

Post on 08-Jan-2022

Page 2: Integration of OpenStack and Amazon Web Service into

Background: KEK Batch Job System

• 10000 CPU cores

• Scientific Linux 6

• IBM Spectrum LSF

[Figure: KEK batch job system. Users log in remotely to the work servers for interactive work and job submission. The LSF batch job scheduler holds submitted jobs in job queues and dispatches them to the calc. servers, which provide the batch service.]

Page 3: Integration of OpenStack and Amazon Web Service into

Background: Challenges for the Batch Job System

• Requirements for specific systems from experiment groups

• Pending jobs pile up due to resource shortage

• Take advantage of cloud computing

  • Provide heterogeneous clusters

  • Expand computing resources to clouds

Page 4: Integration of OpenStack and Amazon Web Service into

Overview of Cloud-integrated Batch Job System (Test Phase)

• Use cloud resources via the batch job submission command.

$ bsub -q aws /bin/hostname
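As a minimal sketch of queue-based selection from the user's side, the submission command above can be assembled in Python so that only the queue name changes. The helper name `build_bsub_command` is hypothetical, and `aws` is the only queue name taken from the slide; `bsub -q` is the standard LSF option for choosing a queue.

```python
import shlex


def build_bsub_command(queue, job_command):
    """Return the argument list for submitting job_command to an LSF queue."""
    if not queue:
        raise ValueError("queue name must not be empty")
    # "bsub -q <queue> <command...>": the queue name selects the target resource.
    return ["bsub", "-q", queue, *job_command]


if __name__ == "__main__":
    cmd = build_bsub_command("aws", ["/bin/hostname"])
    # Print the command line instead of running it, so the sketch works without LSF.
    print(shlex.join(cmd))
```

To actually submit, the list could be passed to `subprocess.run(cmd, check=True)` on a host where LSF is installed.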

[Figure: The LSF Resource Connector [1] selects resources by queue. On-premise resources: SL6 cluster and OpenStack. Off-premise resources: AWS and other clouds.]

Page 5: Integration of OpenStack and Amazon Web Service into

Integration with OpenStack

[Figure: Workflow for running jobs on OpenStack through the Resource Connector. Normal jobs are dispatched to calc. servers on physical machines (SL6) in the batch service.]

1. Create image (group manager: base image to custom image on OpenStack)

2. Create Resource Connector template (cloud admin)

3. Submit job (end user to LSF)

4. Launch instance (calc. server VM on OpenStack)

5. Dispatch (LSF to the calc. server VM)

Resource Connector template:

    {
      "Name": "CentOS7_01",
      "Attributes": {
        "type": ["String", "X86_64"],
        "openstackhost": ["Numeric", "1"],
        "template": ["CentOS7_01"]
      },
      "Image": "generic-cent7-01",
      "Flavor": "c04-m016G"
    }
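A minimal sketch of generating a template entry shaped like the one shown on this slide. The function names `make_template` and `missing_keys` are hypothetical, and the checked keys are only those visible on the slide; this does not claim to cover the full Resource Connector template schema.

```python
import json

# Keys that appear in the template shown on the slide (not a complete schema).
SLIDE_KEYS = ("Name", "Attributes", "Image", "Flavor")


def make_template(name, image, flavor):
    """Build one template entry with the same shape as the slide's example."""
    return {
        "Name": name,
        "Attributes": {
            "type": ["String", "X86_64"],
            "openstackhost": ["Numeric", "1"],
            "template": [name],
        },
        "Image": image,
        "Flavor": flavor,
    }


def missing_keys(template):
    """List the slide's top-level keys that are absent from a template."""
    return [key for key in SLIDE_KEYS if key not in template]


if __name__ == "__main__":
    template = make_template("CentOS7_01", "generic-cent7-01", "c04-m016G")
    print(json.dumps(template, indent=2))
```

A check such as `missing_keys` is a cheap way to catch a mistyped template before a group's custom image is put into service.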

Page 6: Integration of OpenStack and Amazon Web Service into

Share GPFS between Local Batch and OpenStack

• Each compute node mounts GPFS and exposes the directories to the VMs via NFS.

[Figure: GPFS is mounted by the OpenStack compute node (GPFS mount) and exported over NFS to the calc. server VMs running on it (NFS mount). The calc. servers of the batch service are shown accessing the same GPFS.]

Page 7: Integration of OpenStack and Amazon Web Service into

Integration with AWS

[Figure: KEK (with NFS) and AWS are linked by a VPN connection. At KEK, the LSF work server submits to the AWS queue, the OpenStack queue, and the other queues; LSF calc. servers run on physical machines (SL6) and on OpenStack. On AWS, LSF calc. servers are launched on demand on EC2.]

• S3 object storage: for sharing input/output data between KEK and AWS

• The filesystem is not shared with the KEK batch system

Page 8: Integration of OpenStack and Amazon Web Service into

Use AWS S3 Object Storage for Sharing Data between KEK and AWS

[Figure: Input and output data move between NFS at KEK (LSF work server) and an S3 bucket on AWS used by the calc. servers.]

1. Put input data

2. Copy input data

3. Submit job

4. Copy output data

5. Get output data

• The KEK batch system and OpenStack share the GPFS filesystem at KEK.

• The AWS environment is independent of the KEK system.

• S3FS [2] allows Linux to mount an AWS S3 bucket via FUSE.
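A minimal sketch of the user-side "put input data" and "get output data" steps, assuming the S3 bucket is already mounted as an ordinary directory (for example with S3FS). The function names and the `input`/`output` subdirectory names are hypothetical, and the slide does not state which host performs each copy; a temporary directory stands in for the mount point so the sketch runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def copy_into(src: Path, dst_dir: Path) -> Path:
    """Copy one file into dst_dir (created if missing) and return the new path."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dst_dir / src.name))


def put_input(local_file: Path, bucket_mount: Path) -> Path:
    """Place an input file under the mounted bucket (hypothetical 'input' prefix)."""
    return copy_into(local_file, bucket_mount / "input")


def get_output(bucket_mount: Path, name: str, local_dir: Path) -> Path:
    """Fetch a result file from the mounted bucket (hypothetical 'output' prefix)."""
    return copy_into(bucket_mount / "output" / name, local_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        bucket = root / "bucket"  # stand-in for the S3FS mount point
        data = root / "events.dat"
        data.write_text("input payload")
        print(put_input(data, bucket))
```

Because the mounted bucket behaves like a directory, the same file operations work on both the KEK side and the AWS side.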

Page 9: Integration of OpenStack and Amazon Web Service into

Available Resource Transition on AWS

[Figure: Transition of the number of instances and the total number of cores on AWS after jobs are submitted.]

Page 10: Integration of OpenStack and Amazon Web Service into

Scalability Test: Submission of Geant4 based Particle Therapy Monte Carlo Simulation Jobs to AWS

[Figure: Mass density distribution generated from input CT data, and the simulated dose distribution, with the particle beam direction indicated.]

Monte Carlo simulation of shooting 2,000,000 protons on N CPU cores (N jobs).

For example, if N = 10, each of the 10 jobs simulates 200,000 events.
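The split of the 2,000,000 events across N jobs can be sketched as follows. The function name `split_events` is hypothetical, and the handling of a remainder (giving one extra event to the first few jobs) is an assumption; the slide only shows the evenly divisible case.

```python
def split_events(total_events, n_jobs):
    """Divide total_events across n_jobs as evenly as possible."""
    if n_jobs <= 0:
        raise ValueError("n_jobs must be positive")
    base, extra = divmod(total_events, n_jobs)
    # Assumption: the first `extra` jobs each take one additional event.
    return [base + 1 if i < extra else base for i in range(n_jobs)]


if __name__ == "__main__":
    # The slide's example: N = 10 jobs share 2,000,000 events.
    print(split_events(2_000_000, 10))
```

Each job would then be submitted with its own event count, so that the total stays fixed while N varies in the scalability test.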

Page 11: Integration of OpenStack and Amazon Web Service into

Scalability Test: Submission of Geant4 based Particle Therapy Monte Carlo Simulation Jobs

• Scalability comparison between KEK and AWS

[Figure: Scalability plot with two series: AWS, and KEK (better CPU and file system).]

The AWS result shows the same tendency as the KEK result.

Page 12: Integration of OpenStack and Amazon Web Service into

Scalability Test: Image Classification by Deep Learning on AWS

• Classify CIFAR-10 images [3] into 10 categories.

• We built a convolutional neural network (CNN) and trained it for the classification using TensorFlow [4].

[Figure: Convolutional neural network: conv1 layer, pool1 layer, conv2 layer, pool2 layer, FC1 layer, FC2 layer, then the predicted category (e.g. automobile), with feedback.]

[3] https://www.cs.toronto.edu/~kriz/cifar.html

[4] https://www.tensorflow.org/tutorials/deep_cnn

Page 13: Integration of OpenStack and Amazon Web Service into

Scalability Test: Image Classification Multi-node Deep Learning on AWS

• Submitted TensorFlow jobs to the AWS queue and measured scalability by changing the number of workers.

[Figure: TensorFlow cluster. The parameter server stores and updates parameters; each worker calculates the loss.]

[Figure: CIFAR-10 Training Time of 500 epochs for CNN. x-axis: Number of used cores (10 to 10,000, log scale); left y-axis: Training Time [sec] (100 to 100,000, log scale); right y-axis: Accuracy (0 to 1). Series: Training Time, Training Accuracy, Test Accuracy. Labeled points: 1 worker (64 cores), 30 workers (1920 cores), 57 workers (3648 cores). Labeled training times: 23,000 sec (6.5 hours) and 1,000 sec. Annotation: "Network bandwidth bottleneck?"]
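The core counts on the plot are consistent with 64 cores per worker, and the two labeled training times differ by a factor of 23. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the slide text does not state which worker count the 1,000 sec label belongs to, so the speedup is computed only between the two labeled times.

```python
CORES_PER_WORKER = 64  # from the label "1 worker (64 cores)"


def total_cores(workers, cores_per_worker=CORES_PER_WORKER):
    """Total cores used by a given number of workers."""
    return workers * cores_per_worker


def speedup(baseline_sec, measured_sec):
    """Ratio of the baseline training time to a measured training time."""
    if measured_sec <= 0:
        raise ValueError("measured_sec must be positive")
    return baseline_sec / measured_sec


if __name__ == "__main__":
    for workers in (1, 30, 57):
        print(workers, "workers:", total_cores(workers), "cores")
    print("speedup between labeled times:", speedup(23_000, 1_000))
```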

Page 14: Integration of OpenStack and Amazon Web Service into

Another Use Case: Automatic Offloading to Cloud

Submit 3000 jobs to the mixed-resources (KEK and AWS) queue.

1. Some jobs are dispatched to KEK servers.

2. No more free resources on KEK: AWS instances are launched, and some jobs are dispatched to the AWS instances.

3. Free resources are found on KEK: some jobs are dispatched to KEK servers.

4. Some jobs are dispatched to AWS servers.

[Figure: Status of each job (RUN or PEND) over time.]
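The offloading behavior described above can be illustrated with a toy model: fill free KEK slots first, send the overflow to AWS up to a limit, and leave the rest pending. This is a simplified sketch, not LSF's scheduling algorithm; the function name and the slot counts in the example are hypothetical, and only the 3000-job figure comes from the slide.

```python
def dispatch(n_jobs, kek_free_slots, aws_max_slots):
    """Toy model: prefer KEK slots, overflow to AWS, keep the rest pending."""
    if min(n_jobs, kek_free_slots, aws_max_slots) < 0:
        raise ValueError("arguments must be non-negative")
    to_kek = min(n_jobs, kek_free_slots)
    to_aws = min(n_jobs - to_kek, aws_max_slots)
    pending = n_jobs - to_kek - to_aws
    return {"KEK": to_kek, "AWS": to_aws, "PEND": pending}


if __name__ == "__main__":
    # 3000 jobs as on the slide; the slot counts are made-up illustration values.
    print(dispatch(3000, kek_free_slots=1000, aws_max_slots=500))
```

Calling `dispatch` again as slots free up mirrors steps 3 and 4, where pending jobs go to whichever side has capacity.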

Page 15: Integration of OpenStack and Amazon Web Service into

Summary

• We have integrated OpenStack and AWS clouds with the LSF batch job system by using Resource Connector.

  • We are in the test phase.

• We have succeeded in offloading some batch workloads to the cloud.

  • Cloud resources used in this work were provided in the Demonstration Experiment of Cloud Use conducted by the National Institute of Informatics (NII), Japan (FY2017).