Integration of OpenStack and Amazon Web Service into Local Batch Job System
Wataru Takase, Tomoaki Nakamura, Koichi Murakami, Takashi Sasaki
Computing Research Center, KEK, Japan
Background: KEK Batch Job System
• 10000 CPU cores
• Scientific Linux 6
• IBM Spectrum LSF
[Figure: users log in remotely to work servers for interactive work and job submission; the LSF batch job scheduler dispatches jobs from job queues to the calc. servers of the batch service.]
Background: Challenges for the Batch Job System
• Requirements for specialized systems from experiment groups
• Piled-up pending jobs due to resource shortage
• Take advantage of cloud computing:
  • Provide heterogeneous clusters
  • Expand computing resources to clouds
Overview of Cloud-Integrated Batch Job System (Test Phase)
• Use cloud resources via the standard batch job submission command:
  $ bsub -q aws /bin/hostname
[Figure: LSF Resource Connector [1] performs queue-based resource selection between the on-premise resource (OpenStack SL6 cluster) and off-premise resources (AWS and other clouds).]
Integration with OpenStack
1. The group manager creates a custom image from the base image.
2. The cloud admin creates a Resource Connector template:

{
  "Name": "CentOS7_01",
  "Attributes": {
    "type": ["String", "X86_64"],
    "openstackhost": ["Numeric", "1"],
    "template": ["CentOS7_01"]
  },
  "Image": "generic-cent7-01",
  "Flavor": "c04-m016G"
}

3. The end user submits a job to LSF.
4. Resource Connector launches an OpenStack instance (calc. server VM).
5. LSF dispatches the job to the VM; normal jobs are still dispatched to the physical machines (SL6) of the batch service.
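The Resource Connector template is plain JSON, so it can be sanity-checked before registration. A minimal sketch, assuming the fields shown on this slide; the `check_template` helper and its required-field list are illustrative, not part of LSF:

```python
import json

# The template from the slide, embedded as a string for checking.
TEMPLATE = """
{
  "Name": "CentOS7_01",
  "Attributes": {
    "type": ["String", "X86_64"],
    "openstackhost": ["Numeric", "1"],
    "template": ["CentOS7_01"]
  },
  "Image": "generic-cent7-01",
  "Flavor": "c04-m016G"
}
"""

def check_template(text: str) -> dict:
    """Parse the template and ensure the top-level fields are present.
    (The required-field list here is an illustrative assumption.)"""
    t = json.loads(text)
    for key in ("Name", "Attributes", "Image", "Flavor"):
        if key not in t:
            raise ValueError(f"missing field: {key}")
    return t

t = check_template(TEMPLATE)
print(t["Flavor"])  # → c04-m016G
```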
Share GPFS between Local Batch and OpenStack
• Each compute node mounts GPFS and exposes the directories to its VMs via NFS.
[Figure: OpenStack compute nodes mount GPFS and re-export it over NFS to the calc. server VMs, so the VMs share the same filesystem as the physical calc. servers of the batch service.]
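The GPFS-over-NFS re-export could look roughly like the following on a compute node; the path, subnet, and NFS options are illustrative assumptions, not KEK's actual configuration:

```shell
# /etc/exports on the compute node: re-export the GPFS mount point
# to the VM network (example path and subnet).
/gpfs 192.168.100.0/24(rw,sync,no_root_squash)

# Inside each calc. server VM: mount the compute node's export at the
# same path, so jobs see identical paths on physical and virtual nodes.
mount -t nfs 192.168.100.1:/gpfs /gpfs
```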
Integration with AWS
• The AWS queue launches LSF calc. servers (EC2 instances) on demand; KEK and AWS are connected via VPN.
• The AWS filesystem is not shared with the KEK batch system; S3 object storage is used for sharing input/output data between KEK and AWS.
[Figure: LSF work servers at KEK dispatch OpenStack-queue and other-queue jobs to on-premise resources (physical machines (SL6), NFS-shared), and AWS-queue jobs over the VPN connection to EC2-based LSF calc. servers launched on demand, with S3 object storage bridging the data.]
Use AWS S3 Object Storage for Sharing Data between KEK and AWS
• The KEK batch system and OpenStack share the GPFS filesystem inside KEK.
• The AWS environment is independent from the KEK system.
• S3FS [2] allows Linux to mount an AWS S3 bucket via FUSE.
Workflow:
1. Put input data into the S3 bucket.
2. Copy input data to the AWS calc. server.
3. Submit the job.
4. Copy output data to the S3 bucket.
5. Get output data at KEK.
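From the KEK side, the five steps above could be driven with the AWS CLI and LSF commands roughly as follows; the bucket name, file names, and job script are made-up examples, not the actual setup:

```shell
# 1. Put input data into the S3 bucket (bucket name is hypothetical).
aws s3 cp input.dat s3://example-kek-bucket/input.dat

# 3. Submit the job to the AWS queue; the job script on the EC2 side
#    performs steps 2 and 4 (copy input in, copy output back to S3).
bsub -q aws ./run_job.sh

# 5. Once the job has finished, fetch the output data back at KEK.
aws s3 cp s3://example-kek-bucket/output.dat output.dat
```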
Available Resource Transition on AWS
[Figure: after job submission, the number of instances on AWS and the total number of cores on AWS grow over time as instances are launched on demand.]
Scalability Test: Submission of Geant4-based Particle Therapy Monte Carlo Simulation Jobs to AWS
• Monte Carlo simulation shooting 2,000,000 protons on N CPU cores (N jobs).
• If N = 10, each of the 10 jobs simulates 200,000 events.
[Figure: mass density distribution generated from input CT data, the particle beam direction, and the simulated dose distribution.]
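The job-splitting arithmetic above can be made concrete. A sketch; `events_per_job` is an illustrative helper, not the authors' submission script:

```python
# Split the 2,000,000-proton simulation into N equal, independent jobs.
TOTAL_EVENTS = 2_000_000

def events_per_job(n_jobs: int) -> int:
    """Events each of the n_jobs independent jobs must simulate."""
    assert TOTAL_EVENTS % n_jobs == 0, "choose N dividing the total"
    return TOTAL_EVENTS // n_jobs

# The example from the slide: N = 10 gives 200,000 events per job.
print(events_per_job(10))  # → 200000
```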
Scalability Test: Submission of Geant4-based Particle Therapy Monte Carlo Simulation Jobs
• Scalability comparison between KEK and AWS.
• Although KEK has better CPUs and a faster file system, the AWS result shows the same scaling tendency as the KEK one.
Scalability Test: Image Classification by Deep Learning on AWS
• Classify CIFAR-10 images [3] into 10 categories.
• We built a convolutional neural network, then trained it for the classification using TensorFlow [4].
[Figure: the CNN pipeline of conv1, pool1, conv2, pool2, FC1, and FC2 layers classifies images (e.g. "automobile"), with feedback during training.]
[3] https://www.cs.toronto.edu/~kriz/cifar.html  [4] https://www.tensorflow.org/tutorials/deep_cnn
[Figure: CIFAR-10 training time of 500 epochs for the CNN: training time [sec] (100 to 100,000) and training/test accuracy (0 to 1) versus the number of used cores (10 to 10,000).]
Scalability Test: Image Classification by Multi-node Deep Learning on AWS
• Submitted TensorFlow jobs to the AWS queue and measured scalability by changing the number of workers.
• A TensorFlow cluster consists of a parameter server, which stores and updates the parameters, and workers, which calculate the loss.
[Figure: training time drops from 23,000 sec (6.5 hours) with 1 worker (64 cores) to about 1,000 sec with 30 workers (1,920 cores); at 57 workers (3,648 cores) it improves no further, possibly a network bandwidth bottleneck.]
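The parameter-server pattern described on this slide can be sketched in plain Python. This is a toy stand-in for the distributed TensorFlow cluster, not the authors' code: the server stores the parameter, each worker computes a gradient on its data shard, and the server applies the averaged update.

```python
# Toy parameter-server loop for least-squares fitting of y = w*x
# (an illustrative stand-in for the distributed TensorFlow setup).
class ParameterServer:
    def __init__(self, w0: float):
        self.w = w0                      # the shared model parameter

    def apply(self, grads, lr: float):
        # Average the workers' gradients and take one SGD step.
        self.w -= lr * sum(grads) / len(grads)

def worker_gradient(w: float, shard):
    # Gradient of mean squared error 0.5*(w*x - y)^2 over this shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

# Data generated from the true model y = 3*x, split across 3 workers.
data = [(x, 3.0 * x) for x in range(1, 13)]
shards = [data[i::3] for i in range(3)]

ps = ParameterServer(w0=0.0)
for _ in range(200):
    grads = [worker_gradient(ps.w, s) for s in shards]
    ps.apply(grads, lr=0.01)

print(round(ps.w, 3))  # converges toward 3.0
```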
Another Use Case: Automatic Offloading to Cloud
Submit 3,000 jobs to the mixed-resource (KEK and AWS) queue:
1. Some jobs are dispatched to KEK servers; once there is no more free resource on KEK, the rest stay pending.
2. AWS instances are launched, and some jobs are dispatched to the AWS instances.
3. When free resource is found on KEK again, some jobs are dispatched to KEK servers.
4. Some jobs are dispatched to AWS servers.
[Figure: status (PEND/RUN) of each job over time.]
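The offloading behaviour above amounts to a simple policy: fill free on-premise slots first, then overflow to cloud instances launched on demand. A toy sketch; the capacities and the `dispatch` helper are illustrative assumptions, not LSF's actual scheduling algorithm:

```python
def dispatch(n_jobs: int, kek_free: int) -> tuple[int, int]:
    """Split a burst of jobs between free KEK slots and AWS overflow."""
    on_kek = min(n_jobs, kek_free)   # use local free resources first
    on_aws = n_jobs - on_kek         # offload the remainder to AWS
    return on_kek, on_aws

# 3,000 jobs against e.g. 1,000 free KEK slots: 2,000 jobs go to AWS.
print(dispatch(3000, kek_free=1000))  # → (1000, 2000)
```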
Summary
• We have integrated the OpenStack and AWS clouds with the LSF batch job system using Resource Connector.
• The system is in a test phase.
• We have succeeded in offloading some batch workloads to the cloud.
• The cloud resources used in this work were provided through the Demonstration Experiment of Cloud Use conducted by the National Institute of Informatics (NII), Japan (FY2017).