large-scale computing with grids

32
Large-Scale Computing with Grids Jeff Templon KVI Seminar 17 February 2004

Upload: callum-malone

Post on 13-Mar-2016

56 views

Category:

Documents


1 download

DESCRIPTION

Large-Scale Computing with Grids. Jeff Templon. KVI Seminar 17 February 2004. Information I intend to transfer. Why are Grids interesting? Grids are solutions so I will spend some time talking about the problem and show how Grids are relevant. Solutions should solve a problem. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Large-Scale Computing with Grids

Large-Scale Computing with Grids

Jeff Templon

KVI Seminar17 February 2004

Page 2: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 2

Information I intend to transfer

Why are Grids interesting? Grids are solutions so I will spend some time talking about the problem and show how Grids are relevant. Solutions should solve a problem.

What are Computational Grids? How are Grids being used, in HEP as well as in other

fields? What are some examples of current Grid “research

topics”?

Page 3: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 3

Our Problem

Place event info on 3D mapTrace trajectories through hitsAssign type to each trackFind particles you wantNeedle in a haystack!This is “relatively easy” case

Page 4: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 4

More complex example

Page 5: Large-Scale Computing with Grids

interactivephysicsanalysis

batchphysicsanalysis

detector

event summary data

rawdata

eventreprocessing

eventsimulation

analysis objects(extracted by physics topic)

Data Handling and Computation for

Physics Analysisevent filter(selection &

reconstruction)

processeddata

Page 6: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 6

Computational Aspects

To reconstruct and analyze 1 event takes about 90 seconds

Maybe only a few out of a million are interesting. But we have to check them all!

Analysis program needs lots of calibration; determined from inspecting results of first pass.

Each event will be analyzed several times!

Page 7: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 7

online systemmulti-level triggerfilter out backgroundreduce data volume

level 1 - special hardware

40 MHz (40 TB/sec)level 2 - embedded processorslevel 3 - PCs

75 KHz (75 GB/sec)5 KHz (5 GB/sec)100 Hz(100 MB/sec)data recording &offline analysis

One of the four LHC detectors

Page 8: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 8

Computational Implications (2)

90 seconds per event to reconstruct and analyze 100 incoming events per second To keep up, need either:

A computer that is nine thousand times faster, or nine thousand computers working together

Moore’s Law: wait 20 years and computers will be 9000 times faster (we need them in 2007!)

Page 9: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 9

More Computational Issues

Four LHC experiments – roughly 36k CPUs needed BUT: accelerator not always “on” – need fewer BUT: multiple passes per event – need more! BUT: haven’t accounted for Monte Carlo production –

more!! AND: haven’t addressed the needs of “physics users” at

all!

Page 10: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 10

Large Dynamic Range Computing

Clear that we have enough work to keep somewhere between 50 and 100 k CPUs busy

Most of the activities are “bursty”, so doing them with dedicated facilities is inefficient

Need some meta-facility that can serve all the activities Reconstruction & reprocessing Analysis Monte-Carlo All four LHC experiments All the LHC users

Page 11: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 11

LHC User Distribution

•Putting all computers in one spot leads to traffic jams•Which spot is willing to pay for & maintain 100k CPUs?

Page 12: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 12

Classic Motivation for Grids

Large Scales: 50k CPUs, petabytes of data (if we’re only talking ten machines, who cares?)

Large Dynamic Range: bursty usage patterns Why buy 25k CPUs if 60% of the time you only need 900

CPUs? Multiple user groups on single system

Can’t “hard-wire” the system for your purposes Wide-area access requirements

Users not in same lab or even continent

Page 13: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 13

Solution using Grids

Large Scales: 50k CPUs, petabytes of data Assemble 50k+ CPUs and petabytes of mass storage Don’t need to be in the same place!

Large Dynamic Range: bursty usage patterns When you need less than you have, others use excess capacity When you need more, use others’ excess capacities

Multiple user groups on single system “Generic” grid software services (think web server here)

Wide-area access requirements Public Key Infrastructure for authentication & authorization

Page 14: Large-Scale Computing with Grids

How do Grids Work?

Public Key Infrastructure&

Generic Grid Software Services

Page 15: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 15

Public Key Infrastructure – Single Login

Based on asymmetric cryptography: public/private key pairs Need network of trusted Certificate Authorities (e.g. Verisign) CAs issue certificates to users, computers, or services

running on computers (typically valid for two years) Certificates carry

an “identity” /O=dutchgrid/O=users/O=kvi/CN=Nasser Kalantar a public key identity of the issuing CA a digital signature (transformation on cert text using CA private

key) Need to tighten this part up, takes too long

Page 16: Large-Scale Computing with Grids

What is a Grid?

(what are these Generic Grid Software Services?)

Page 17: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 17

C = DS = =Grid software service(like http server)

InformationSystem

CC

C

C

C

C

C

Information System is Central Nervous System of Grid

Info system defines grid

Page 18: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 18

C = DS = =Grid software service

I.S.

CC

C

C

C

C

C

DS

DS

DSDSDS

DS

D.M.S

Data Grid

Page 19: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 19

C = DS = =Grid software service

I.S.

CC

C

C

C

C

C

DS

DS

DSDSDS

DS

D.M.S

Computing Task Submission

W.M.S.

proxy + command;(data);

Get fresh, detailed info

Coarse Requirements

Candidate Clusters

Page 20: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 20

DS

DS

DSDSDS

DS

List of best locations

C = DS = =Grid software service

I.S.

C

D.M.S

Computing Task Execution

W.M.S.

proxy + command;(data);

logger

Where is m

y data?

proxy

Find DMS

Page 21: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 21

DSDSDS

C = DS = =Grid software service

I.S.

C

D.M.S

Computing Task Execution

W.M.S.

logger

How to contact O.D.S.?

Where do I put the data?

proxy + data

Register outputDone

Page 22: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 22

Can do a bit more with data…

Page 23: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 23

C = DS = =Grid software service

I.S.

CC

C

C

C

C

C

DS

DS

DSDSDS

DS

D.M.S

Computing Task Submission with data

W.M.S.

proxy + command;(data file spec);

Get fresh, detailed info

Coarse Requirements

Candidate Clusters

Find copies of data

Page 24: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 24

DS

DS

DSDSDS

DS

List of best locations

C = DS = =Grid software service

I.S.

C

D.M.S

Computing Task Execution

W.M.S.

proxy + command;(data file spec);

logger

Where is m

y data?

Find DMS

Local access

Page 25: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 25

Grids are working now

See it in action

Page 26: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 26Applications

D0 Run II Data Reprocessing

Biomedical Image Comparison

Ozone Profile Computation, Storage, & Access

Page 27: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 27

Grid Research Topics “push” vs. “pull” model of operation (you just saw “push”)

We know how to do fine-grained priority & access control in “push” We think “pull” is more robust operationally

Software distribution HEP has BIG programs with many versions … “pre” vs. “zero”

install Data distribution

Similar issues (software is just executable data!) A definite bottleneck in D0 reprocessing

(1 Gbit/100/2/2=modem!) Speedup limited to 92% of theoretical About 1 hr CPU lost per job

Optimization in general is hard, and is it worth it????

Page 28: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 28

Page 29: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 29

Didn’t talk about

Grid projects European Data Grid – ends this month, 10 M€ / 3y VL-E – ongoing, 20 M€ / 5y EGEE – starts in April, 30 M€ / 2y

Gridifying HEP software & computing activities Transition from supercomputing to Grid computing

(funding) Integration with other technologies

Submit, monitor, control computing work from your PDA or mobile!

Page 30: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 30

Conclusions

Grids are working now for workload management in HEP Big developments in next two years with large-scale

deployments of LCG & EGEE projects Seems Grids are catching on (IBM / Globus alliance) Lots of interesting stuff to try out!!!

Page 31: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 31

What’s There Now? Job Submission

Marriage of Globus, Condor-G, EDG Workload Manager Latest version reliable at 99% level

Information System New System (R-GMA) – very good information model, implementation

still evolving Old System (MDS) – poor information model, poor architecture, but

hacks allow 99% “uptime” Data Management

Replica Location Service – convenient and powerful system for locating and manipulating distributed data; mostly still user-driven (no heuristics)

Data Storage “Bare gridFTP server” – reliable but mostly suited to disk-only mass

store SRM – no mature implementations

Page 32: Large-Scale Computing with Grids

Jeff Templon – Seminar, KVI, 2004.02.17 - 32

Grids are InformationInformation

System

InformationSystem

A-Grid

B-Grid