hep cloud production using the cloudscheduler/htcondor

13
HEP cloud production using the CloudScheduler/HTCondor Architecture CHEP 2015 Okinawa, 2015-04-15 University of Victoria Ian Gable on behalf of many people and groups

Upload: dinhthuan

Post on 31-Dec-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HEP cloud production using the CloudScheduler/HTCondor

HEP cloud production using the

CloudScheduler/HTCondor Architecture

CHEP 2015 Okinawa, 2015-04-15

University of VictoriaIan Gable

on behalf of many people and groups

Page 2: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Outline

Overview of the Architecture CloudScheduler/HTCondor

Key enabling technologies:

CVMFS + µCernVM 3

Shoal (web cache discovery)

Glint (OpenStack plugin)

Current cloud resources, and pointers to other talks and posters.

2

Page 3: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Historical Perspective

3

Ian Gable University of Victoria 7

VM: Xen is Useful for HEP •  Xen is a Virtual Machine technology that offers negligible performance penalties

unlike more familiar VM systems like VMware. •  Xen uses a technique called �paravirtualization� to to allow most instructions to run

at their native speed. –  The penalty is that you must run a modified OS kernel –  Linus says Xen included in Linux Kernel mainline as of 2.6.23.

•  �Evaluation of Virtual Machines for HEP Grids�, Proceedings of CHEP 2006, Mumbai India.

CHEP 2007 in Victoria:

We had already started down the path to IaaS

Ian Gable University of Victoria 1

Deploying HEP Applications Using Xen and Globus Virtual Workspaces

A. Agarwal, A. Charbonneau, R. Desmarais, R. Enge, I. Gable, D. Grundy, A. Norton, D. Penefold-Brown, R. Seuster, R.J. Sobie, D. C. Vanderster

Institute of Particle Physics of Canada National Research Council, Ottawa, Ontario, Canada

University of Victoria, Victoria, British Columbia, Canada

Page 4: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria 4

HTCondor

Amaz

on E

C2

Cloud Interface(EC2)

VM Node

uCernVM

VM Node

CloudSchedulerSc

hedu

ler s

tatu

s co

mm

unic

atio

n

Ope

nSta

ck C

loud

1 Cloud Interface(OpenStack)

VM Node

uCernVM

VM Node

Ope

nSta

ck C

loud

N Cloud Interface(OpenStack)

VM Node

uCernVM

VM Node

...

Goo

gle

Com

pute

Eng

ine Cloud Interface

(GCE)

VM Node

customimage

VM Node

Create, Destroy,and Status IaaSCalls

VM registrationand Job submission

Batch Jobs

Batch Systems CallsIaaS Calls

Page 5: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Integrating with Pilot Frameworks

5

Panda queue

User

U

Job Submission

P

U

Pilot Job

User Job

Condor

Cloud Scheduler

P

Cloud Interface

IaaS Service API Request

uCernVM Batch

P

P

Pilot Job Submiss

ion

U U

Pilot Factory

check queue sta

tus

Page 6: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Using multiple instances

6

HTCondor

CloudSchedulerProject X

HTCondor

CloudSchedulerATLAS

HTCondor

CloudSchedulerBelle II

HTCondor

CloudSchedulerCANFAR

Astronomy

Page 7: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Support System: CVMFS + µCernVM 3

Without CernVM, completely unmanageable number of VMs image types.

All the well known CVMFS advantages

µCernVM brings the image size down to 20 MB, trivial to deploy everywhere

Complete operating system is staged in, as well as experiment software at Run time.

Requires good squid caching

7

Page 8: HEP cloud production using the CloudScheduler/HTCondor

The caching challenge on IaaS cloud

When booting VMs on arbitrary clouds they don’t know which squid they should use

In order to work well VMs need to able to access a local web cache (squid) to be able to efficiently download all the experiment software and now OS libraries they need to run

If a VM is statically configured to access a particular cache it can be slow (Geneva Vancouver for example) and it can overloaded

8

Page 9: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Shoal: web cache publishing

9

ShoalServer

shoal_agent

REST

AMQP

Cloud Y

shoal_agent

Regular Interval AMQP messagePublish: IP, load, etc

REST callGet nearest squids

shoal_client

VM Node

Cloud X

shoal_client

VM Node

Squid Cache

shoal_agent

Squid Cache Squid Cache

use the highly Scalable AMQP protocol to advertise Squid servers to shoal

use GeoIP information to determine which is the closest to each VM

https://github.com/hep-gc/shoal

Squids advertise every 30 seconds, server ‘verifies’ if the squid is functional

Page 10: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Available Shoal Interfaces

10

JSON

Human readable

WPAD

Page 11: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Glint: OpenStack image distribution

Distributing VM images by hand to 5 clouds is manageable but error prone

Distributing VM images by hand to 20+ clouds never happens right the first time and is hugely time consuming

Glint is the solution to this problem for OpenStack

Glint automates the deployment of images to multiple clouds

11

keystone

glan

ce

GLINT

keystone

glan

ce

Image A

Image B

Image A

Image A

SITE 2

SITE 1

keystone

glan

ce

ImageC

Image A

SITE 3

Copy: Image AFrom: Site 1To: Site 2 and Site 3

https://github.com/hep-gc/glint

Page 12: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Clouds in use

12

Site HEP  Cloud? ATLAS BelleII

ComputeCanada-­‐West no ✔ ✔

ComputeCanada-­‐East no ✔ ✔

CERN yes ✔

GridPP-­‐Datacentered yes ✔

CANARIE-­‐West no ✔

CANARIE-­‐East no ✔

Amazon  EC2 no ✔

ChameleonCloud-­‐U  Texas no ✔

Google  Compute  Engine no ✔

Cloud can come and go with time.

Past clouds include:

Currently Operating

NECTAR Australia, National Research Council Ottawa, FutureGrid - U Chicago, Elephant Coud UVic

~ 4000 cores operating today

For comparison, Canadian T2+T1 running ~5000 today

Page 13: HEP cloud production using the CloudScheduler/HTCondor

Ian Gable, University of Victoria

Summary

“Evolution of Cloud Computing in ATLAS” Ryan Taylor, 17:00, this room

“Utilizing cloud computing resources for Belle II”, Randall Sobie, this room

“Glint: VM image distribution in a multi-cloud environment” Poster Session B

13

CloudScheduler/HTCondor flexible multi experiment way to run Batch Jobs on Clouds.

Key enabling technologies for this:

CVMFS + µCernVM 3

Shoal: dynamic Squid cache Publishing

Glint: VM Image Distribution

Current users ATLAS, Belle II, CANFAR, Compute Canada HPC consortium