Open Science Grid and Applications Bockjoo Kim U of Florida @ KISTI on July 5, 2007


Open Science Grid and Applications

Bockjoo Kim, U of Florida

@ KISTI on July 5, 2007

2 07/05/2007

An Overview of OSG

3 07/05/2007

What is an OSG?

• A scientific grid consortium and project

• Relies on the commitments of the participants

• Shares common goals and vision with other projects

• An evolution of Grid3

• Provides benefit to large scale science in the US

4 07/05/2007

Driving Principles for OSG

Simple and flexible

Built from the bottom up

Coherent but heterogeneous

Performing and persistent

Maximize eventual commonality

Principles apply end-to-end

5 07/05/2007

Virtual Organization in OSG

Self-Operated Research VOs: 15

Collider Detector at Fermilab (CDF)

Compact Muon Solenoid (CMS)

CompBioGrid (CompBioGrid)

D0 Experiment at Fermilab (DZero)

Dark Energy Survey (DES)

Functional Magnetic Resonance Imaging (fMRI)

Geant4 Software Toolkit (geant4)

Genome Analysis and Database Update (GADU)

International Linear Collider (ILC)

Laser Interferometer Gravitational-Wave Observatory (LIGO)

nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB)

Sloan Digital Sky Survey (SDSS)

Solenoidal Tracker at RHIC (STAR)

Structural Biology Grid (SBGrid)

United States ATLAS Collaboration (USATLAS)

Campus Grids: 5

Georgetown University Grid (GUGrid)

Grid Laboratory of Wisconsin (GLOW)

Grid Research and Education Group at Iowa (GROW)

University of New York at Buffalo (GRASE)

Fermi National Accelerator Laboratory (Fermilab)

Regional Grids: 4

NYSGRID

Distributed Organization for Scientific and Academic Research (DOSAR)

Great Plains Network (GPN)

Northwest Indiana Computational Grid (NWICG)

OSG Operated VOs: 4

Engagement (Engage)

Open Science Grid (OSG)

OSG Education Activity (OSGEDU)

OSG Monitoring & Operations

6 07/05/2007

Timeline

[Figure: timeline, 1999 to 2009. Project tracks: PPDG (DOE), GriPhyN (NSF) and iVDGL (NSF), converging through Trillium and Grid3 into OSG (DOE+NSF). Parallel tracks: campus and regional grids; LHC construction and preparation, then LHC operations; LIGO preparation, then LIGO operation; European Grid + Worldwide LHC Computing Grid; OSG Consortium.]

7 07/05/2007

Levels of Participation

Participating in the OSG Consortium
- Using the OSG
- Sharing resources on OSG
=> Either or both, with minimal entry threshold

Becoming a Stakeholder
- All (large-scale) users & providers are stakeholders

Determining the Future of OSG
- Council members determine the future

Taking on Responsibility for OSG Operations
- OSG Project is responsible for OSG operations

8 07/05/2007

OSG Architecture and How to Use OSG

9 07/05/2007

OSG: A Grid of Sites/Facilities

IT departments at universities & national labs make their hardware resources available via OSG interfaces.
- CE: (modified) pre-WS GRAM
- SE: SRM for large volume, gftp & (N)FS for small volume

Today's scale:
- 20-50 "active" sites (depending on definition of "active")
- ~5000 batch slots
- ~1000 TB storage
- ~10 "active" sites with shared 10 Gbps or better connectivity

Expected scale for end of 2008:
- ~50 "active" sites
- ~30,000-50,000 batch slots
- A few PB of storage
- ~25-50% of sites with shared 10 Gbps or better connectivity

10 07/05/2007

OSG Components: Compute Element

From ~20-CPU department computers to 10,000-CPU supercomputers

[Diagram: jobs run under any local batch system, behind an OSG gateway machine + services, connected to the network & other OSG resources. Boxes: OSG Base (OSG 0.6.0), OSG Environment/Publication, OSG Monitoring/Accounting, EGEE Interop.]

• Globus GRAM interface (pre-WS), which supports many different local batch systems

• Priorities and policies: through VO role mapping and batch queue priority settings, according to site policies and priorities.

11 07/05/2007

Disk Areas in an OSG site

Shared filesystem as applications area at site.
- Read only from compute cluster.
- Role-based installation via GRAM.

Batch-slot-specific local work space.
- No persistency beyond batch slot lease.
- Not shared across batch slots.
- Read & write access (of course).

SRM/gftp controlled data area.
- "Persistent" data store beyond job boundaries.
- Job-related stage in/out.
- SRM v1.1 today.
- SRM v2.2 expected in Q2 2007 (space reservation).

12 07/05/2007

OSG Components: Storage Element

From 20 GByte disk caches to 4 Petabyte robotic tape systems

[Diagram: any shared storage (e.g., dCache), behind an OSG SE gateway, connected to the network & other OSG resources.]

• Storage services: access storage through the Storage Resource Manager (SRM) interface and GridFTP

• (Typically) VO oriented: allocation of shared storage through agreements between site and VO(s), facilitated by OSG

Example endpoints:
gsiftp://mygridftp.nowhere.edu
srm://myse.nowhere.edu
(srm protocol ~ https protocol)
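As a rough illustration of how a client could tell the two kinds of storage endpoint apart, the sketch below splits the example URLs from this slide with Python's standard `urllib.parse`. The `classify_endpoint` helper and its labels are hypothetical names introduced for illustration only; they are not part of any OSG tool.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URL scheme to the access method named on the slide.
ACCESS_METHODS = {
    "gsiftp": "GridFTP transfer",
    "srm": "Storage Resource Manager (SRM) request",
}


def classify_endpoint(url):
    """Return (host, access method) for a storage URL, or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ACCESS_METHODS:
        raise ValueError(f"unsupported storage scheme: {parts.scheme!r}")
    return parts.hostname, ACCESS_METHODS[parts.scheme]


if __name__ == "__main__":
    for url in ("gsiftp://mygridftp.nowhere.edu", "srm://myse.nowhere.edu"):
        print(url, "->", classify_endpoint(url))
```

Anything with a scheme other than `gsiftp` or `srm` is rejected here, which mirrors the slide's point that storage is reached only through those two interfaces.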

13 07/05/2007

Authentication and Authorization

OSG Responsibilities
- X509-based middleware
- Accounts may be dynamic/static, shared/FQAN-specific

VO Responsibilities
- Instantiate VOMS
- Register users & define/manage their roles

Site Responsibilities
- Choose security model (what accounts are supported)
- Choose VOs to allow
- Default is to accept all users in a VO, but individuals or groups within the VO can be denied.

14 07/05/2007

User Management

User obtains a certificate from a CA that is vetted by TAGPMA.

User registers with VO and is added to the VOMS of the VO.
- VO responsible for registration of VOMS with OSG GOC.
- VO responsible for having users sign the AUP.
- VO responsible for VOMS operations.
  - VOMS shared for ops on multiple grids globally by some VOs.
  - Default OSG VO exists for new communities & single PIs.

Sites decide which VOs to support (striving for default admit)
- Site populates GUMS daily from VOMSes of all VOs
- Site chooses uid policy for each VO & role
  - Dynamic vs static vs group accounts

User uses whatever services the VO provides in support of users
- VOs generally hide grid behind portal

Any and all support is responsibility of VO
- Helping its users
- Responding to complaints from grid sites about its users.
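The site-side uid policy (dynamic vs static vs group accounts, with the option to deny individuals) can be sketched as a small lookup. This is a toy model only: `SiteMapper`, its policy tuples and the account names are hypothetical and do not reflect the real GUMS interface or configuration format.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class SiteMapper:
    """Toy stand-in for a site's VO/role-to-account mapping (not the GUMS API)."""
    # (vo, role) -> ("group", account) | ("static", {dn: account}) | ("dynamic", prefix)
    policies: dict
    banned_dns: set = field(default_factory=set)
    _pool: dict = field(default_factory=dict)  # dn -> leased pool account
    _next_id: count = field(default_factory=lambda: count(1))

    def map_user(self, dn, vo, role):
        """Return a local account name, or None if the site does not admit the user."""
        if dn in self.banned_dns or (vo, role) not in self.policies:
            return None
        kind, arg = self.policies[(vo, role)]
        if kind == "group":      # every member of the VO/role shares one account
            return arg
        if kind == "static":     # fixed per-user account chosen by the site
            return arg.get(dn)
        if kind == "dynamic":    # lease a pool account on first use, then reuse it
            if dn not in self._pool:
                self._pool[dn] = f"{arg}{next(self._next_id):03d}"
            return self._pool[dn]
        raise ValueError(f"unknown policy kind: {kind}")


if __name__ == "__main__":
    site = SiteMapper(
        policies={
            ("cms", "production"): ("group", "cmsprod"),
            ("cms", "user"): ("dynamic", "cms"),
        },
        banned_dns={"/DC=org/DC=example/CN=Banned User"},
    )
    print(site.map_user("/DC=org/DC=example/CN=Alice", "cms", "user"))
    print(site.map_user("/DC=org/DC=example/CN=Banned User", "cms", "user"))
```

A VO that the site has not listed maps to `None`, which corresponds to the site choosing not to support that VO.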

15 07/05/2007

Resource Management

• Many resources are owned by or statically allocated to one user community
– The institutions which own resources typically have ongoing relationships with (a few) particular user communities (VOs)

• The remainder of an organization’s available resources can be “used by everyone or anyone else”
– Organization can decide against supporting particular VOs.
– OSG staff are responsible for monitoring and, if needed, managing this usage

• Our challenge is to maximize good - successful - output from the whole system

16 07/05/2007

Applications and Runtime Model

• Condor-G client
• Pre-WS or WS GRAM as site gateway
• Priority through VO role and policy, mitigated by site policy
• User-specific portion that comes with the job
• VO-specific portion is preinstalled and published
• CPU access policies vary from site to site

Ideal runtime ~ O(hours)
- Small enough to not lose too much due to preemption policies.
- Large enough to be efficient despite long scheduling times of grid middleware.
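One way to see why job lengths of a few hours are a reasonable target is a simple model, sketched below under stated assumptions: each attempt pays a fixed scheduling overhead, preemptions arrive at a constant random rate, and a preempted job restarts from scratch. The default values (half an hour of overhead, one preemption per 20 hours on average) are illustrative guesses, not OSG measurements.

```python
import math


def efficiency(runtime_h, sched_overhead_h=0.5, preempt_rate_per_h=0.05):
    """Useful fraction of wall time for a job that restarts from scratch when preempted."""
    # With exponentially distributed preemptions, the expected number of
    # attempts needed to get one uninterrupted run of length runtime_h.
    attempts = math.exp(preempt_rate_per_h * runtime_h)
    # Expected time spent running, including work lost in failed attempts.
    expected_running = (attempts - 1.0) / preempt_rate_per_h
    # Every attempt pays the scheduling overhead once.
    expected_overhead = sched_overhead_h * attempts
    return runtime_h / (expected_running + expected_overhead)


if __name__ == "__main__":
    for hours in (0.1, 1, 4, 12, 40):
        print(f"{hours:>5} h job -> efficiency {efficiency(hours):.2f}")
```

Under these assumptions very short jobs are dominated by scheduling overhead and very long jobs by lost work, so the efficiency peaks somewhere in between.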

17 07/05/2007

Simple Workflow

Install application software at site(s)
- VO admin installs via GRAM.
- VO users have read-only access from batch slots.

"Download" data to site(s)
- VO admin moves data via SRM/gftp.
- VO users have read-only access from batch slots.

Submit job(s) to site(s)
- VO users submit job(s)/DAG via Condor-G.
- Jobs run in batch slots, writing output to local disk.
- Jobs copy output from local disk to SRM/gftp data area.

Collect output from site(s)
- VO users collect output from site(s) via SRM/gftp as part of DAG.
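The "submit job(s)" step can be sketched as a small helper that renders a Condor-G submit description for a pre-WS GRAM gateway. The submit keywords are standard Condor ones; the function name, the gatekeeper host `osg-gw.nowhere.edu`, the script name and the output file names are placeholders chosen for this example.

```python
def condor_g_submit(executable, gatekeeper, jobmanager="condor", arguments=""):
    """Render a minimal Condor-G submit description for a pre-WS GRAM (gt2) gateway."""
    lines = [
        "universe = grid",
        # gt2 selects the pre-WS GRAM protocol; jobmanager-<name> picks the
        # local batch system behind the gateway.
        f"grid_resource = gt2 {gatekeeper}/jobmanager-{jobmanager}",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gateway and script names, for illustration only.
    print(condor_g_submit("analyze.sh", "osg-gw.nowhere.edu", jobmanager="pbs"))
```

The resulting text would be saved to a file and handed to `condor_submit`; staging output to the SRM/gftp data area would still be the job script's own responsibility.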

18 07/05/2007

Late Binding (A Strategy)

Grid is a hostile environment:
- Scheduling policies are unpredictable
- Many sites preempt, and only idle resources are free
- Inherent diversity of Linux variants
- Not everybody is truthful in their advertisement

Submit "pilot" jobs instead of user jobs
- Bind user job to pilot only after a batch slot at a site is successfully leased and "sanity checked".
- Re-bind user jobs to a new pilot upon failure.
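The pilot strategy can be sketched as a short simulation. It assumes a much simplified world in which each pilot either passes or fails its sanity check and the bound job either succeeds or fails; the function and tuple layout are hypothetical and do not describe any particular pilot framework.

```python
from collections import deque


def run_with_pilots(user_jobs, pilots):
    """Bind queued user jobs to pilots that pass a sanity check; re-queue on failure.

    pilots: iterable of (name, slot_is_sane, job_will_succeed) tuples, in the
    order their batch slots are leased. Returns {job: pilot_name} for the
    jobs that completed.
    """
    queue = deque(user_jobs)
    done = {}
    for name, slot_is_sane, job_will_succeed in pilots:
        if not queue:
            break
        if not slot_is_sane:       # bad slot: no user job is ever bound to it
            continue
        job = queue.popleft()      # late binding: the job is chosen only now
        if job_will_succeed:
            done[job] = name
        else:
            queue.appendleft(job)  # re-bind the same job to a later pilot
    return done


if __name__ == "__main__":
    pilots = [
        ("pilot-1", False, True),  # fails the sanity check
        ("pilot-2", True, False),  # job is lost, e.g. to preemption
        ("pilot-3", True, True),
        ("pilot-4", True, True),
    ]
    print(run_with_pilots(["job-a", "job-b"], pilots))
```

The user job never sees the slot that failed its sanity check, and the job lost on the second pilot is simply picked up by the third.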

19 07/05/2007

OSG Activities

20 07/05/2007

OSG Activity Breakdown

• Software (UW) – Provide a software stack that meets the needs of OSG sites, OSG VOs and OSG operations while supporting interoperability with other national and international cyberinfrastructures

• Integration (UC) – Verify, test and evaluate the OSG software

• Operation (IU) – Coordinate the OSG sites, monitor the facility, and maintain and operate centralized services

• Security (FNAL) – Define and evaluate procedures and software stack to prevent unauthorized activities and minimize interruption in service due to security concerns

• Troubleshooting (UIOWA) – Help sites and VOs to identify and resolve unexpected behavior of the OSG software stack

• Engagement (RENCI) – Identify VOs and sites that can benefit from joining the OSG and “hold their hand” while they become productive members of the OSG community

• Resource Management (UF) – Manage resources

• Facility Management (UW) – Overall facility coordination

21 07/05/2007

OSG Facility Management Activities

• Led by Miron Livny (Wisconsin, Condor)

• Help sites join the facility and enable effective guaranteed and opportunistic usage of their resources by remote users

• Help VOs join the facility and enable effective guaranteed and opportunistic harnessing of remote resources

• Identify (through active engagement) new sites and VOs

22 07/05/2007

OSG Software Activities

• Package the Virtual Data Toolkit (led by Wisconsin Condor Team)
– Requires local building and testing of all components
– Tools for incremental installation
– Tools for verification of configuration
– Tools for functional testing

• Integration of the OSG stack
– Verification Testbed (VTB)
– Integration Testbed (ITB)

• Deployment of the OSG stack
– Build and deploy Pacman caches

23 07/05/2007

OSG Software Release Process

[Diagram: release flow] Input from stakeholders and OSG directors → VDT Release → Test on OSG Validation Testbed → OSG Integration Testbed Release → OSG Production Release

24 07/05/2007

How Many Software Components?

[Figure: number of major components in the VDT over time, Jan 2002 to Jan 2007 (y-axis 0 to 50), across the VDT 1.1.x, 1.2.x, 1.3.x, 1.4.0, 1.5.x and 1.6.x series.]

Releases marked on the chart:
- VDT 1.0: Globus 2.0b, Condor-G 6.3.1
- VDT 1.1.8: adopted by LCG
- VDT 1.1.11: Grid2003
- VDT 1.2.0
- VDT 1.3.0
- VDT 1.3.6: for OSG 0.2
- VDT 1.3.9: for OSG 0.4
- VDT 1.6.1: for OSG 0.6.0

Annotations:
- More dev releases
- Both added and removed software
- 15 Linux-like platforms supported
- ~45 components on 8 platforms built

25 07/05/2007

OSG Security Activities

• Infrastructure X509 certificate based

• Operational security a priority

• Exercise incident response

• Prepare signed agreements, template policies

• Audit, assess and train


26 07/05/2007

Operations & Troubleshooting Activities

• Well established Grid Operations Center at Indiana University

• User support is distributed, including [email protected] community support.

• Site coordinator supports team of sites.
– Accounting and Site Validation are required services of sites.

• Troubleshooting (U Iowa) looks at targeted end-to-end problems
– Partnering with LBNL Troubleshooting work for auditing and forensics.


27 07/05/2007

OSG and Related Grids

28 07/05/2007

Campus Grids

• Sharing across compute clusters is a change and a challenge for many Universities.

• OSG, TeraGrid, Internet2, Educause working together


29 07/05/2007

OSG and TeraGrid

Complementary and interoperating infrastructures

TeraGrid: Networks supercomputer centers.
OSG: Includes small to large clusters and organizations.

TeraGrid: Based on Condor & Globus s/w stack built at Wisconsin Build and Test.
OSG: Based on the same versions of Condor & Globus in the Virtual Data Toolkit.

TeraGrid: Development of User Portals/Science Gateways.
OSG: Supports jobs/data from TeraGrid science gateways.

TeraGrid: Currently relies mainly on remote login.
OSG: No login access. Many sites expect VO attributes in the proxy certificate.

Both: Training covers OSG and TeraGrid usage.

30 07/05/2007

International Activities

• Interoperate with Europe for large physics users.
– Deliver the US-based infrastructure for the Worldwide Large Hadron Collider (LHC) Computing Grid Collaboration (WLCG) in support of the LHC experiments.

• Include off-shore sites when approached.

• Help bring common interfaces and best practices to the standards forums.


31 07/05/2007

Applications and Status of Utilization

32 07/05/2007

Particle Physics and Computing

Science Driver
- Event rate = Luminosity x Cross section

LHC Revolution starting in 2008
- Luminosity x 10
- Cross section x 150 (e.g. top-quark)

Computing Challenge
- 20 PB in first year of running
- ~100 MSpecInt2000, close to 100,000 cores
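The two numbers on this slide combine by simple multiplication, which the sketch below spells out. It uses only the figures quoted above; the function name is a hypothetical helper, and the per-core figure is just the quotient of the two quoted totals.

```python
def event_rate_gain(luminosity_gain, cross_section_gain):
    """Event rate = luminosity x cross section, so the two gains multiply."""
    return luminosity_gain * cross_section_gain


# Figures quoted on the slide.
rate_gain = event_rate_gain(10, 150)

# ~100 MSpecInt2000 spread over ~100,000 cores.
specint2000_per_core = 100e6 / 100_000

if __name__ == "__main__":
    print(f"event rate grows by a factor of {rate_gain}")
    print(f"about {specint2000_per_core:.0f} SpecInt2000 per core")
```

So for a process like the top-quark example, a tenfold luminosity increase and a 150-fold cross-section increase together mean 1500 times the event rate.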

33 07/05/2007

CMS Experiment

(P-P Collision Particle Physics Experiment)

Data & jobs moving locally, regionally & globally within CMS grid, transparently across grid boundaries from campus to global.

[Map: CMS sites spanning OSG and EGEE: CERN, USA@FNAL, Caltech, Florida, MIT, Purdue, UCSD, UNL, Wisconsin, France, Germany, Italy, Taiwan, UK.]

34 07/05/2007

CMS Data Analysis

35 07/05/2007

Opportunistic Resource Use

• In Nov ‘06 D0 asked to use 1500-2000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events) for science results for the summer conferences in July ‘07.

• The Executive Board estimated that there were sufficient opportunistically available resources on OSG to meet the request; we also looked into the local storage and I/O needs.

• The Council members agreed to contribute resources to meet this request.
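To get a feel for the size of the D0 request, the sketch below turns "1500-2000 CPUs for 2-4 months" into CPU-hours and an implied processing time per event. It assumes 30-day months and slots that are busy around the clock, which is an idealization; the resulting per-event times are back-of-the-envelope bounds, not D0 measurements.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day months, slots busy 24 hours a day
EVENTS = 500e6             # ~500 million events, as quoted in the request


def cpu_hours(cpus, months):
    """Total CPU-hours delivered by `cpus` slots over `months` months."""
    return cpus * months * HOURS_PER_MONTH


low = cpu_hours(1500, 2)   # smallest request: 1500 CPUs for 2 months
high = cpu_hours(2000, 4)  # largest request: 2000 CPUs for 4 months

if __name__ == "__main__":
    for label, total in (("low", low), ("high", high)):
        seconds_per_event = total * 3600 / EVENTS
        print(f"{label}: {total:,.0f} CPU-hours, ~{seconds_per_event:.0f} CPU-s per event")
```

The request therefore spans roughly two to six million CPU-hours, depending on which end of each range is taken.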


36 07/05/2007

D0 Throughput

[Figure: D0 Event Throughput / D0 OSG CPUHours per Week. x-axis: week in 2007 (1 to 23); y-axis: 0 to 160,000; series by site: CIT_CMS_T2, FNAL_DZEROOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, MIT_CMS, MWT2_IU, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-RCAC, SPRACE, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE.]

37 07/05/2007

Lessons Learned from D0 Case

• Consortium members contributed significant opportunistic resources as promised.
– VOs can use a significant number of sites they “don’t own” to achieve a large effective throughput.

• Combined teams make large production runs effective.

• How does this scale?
– How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.


38 07/05/2007

Use Case by Other Disciplines

• Rosetta@Kuhlman lab (protein research): in production across ~15 sites since April

• Weather Research Forecast: MPI job running on 1 OSG site; more to come

• CHARMM molecular dynamics simulation applied to the problem of water penetration in staphylococcal nuclease

• Genome Analysis and Database Update system (GADU): portal across OSG & TeraGrid. Runs Blast.

• NanoHUB at Purdue: Biomoca and Nanowire production.


39 07/05/2007

OSG Usage By Numbers

39 Virtual Communities

6 VOs with >1000 jobs max. (5 particle physics & 1 campus grid)

4 VOs with 500-1000 max. (two outside physics)

10 VOs with 100-500 max. (campus grids and physics)

40 07/05/2007

Running Jobs During Last Year

41 07/05/2007

Jobs Running at Sites
- >1k max: 5 sites
- >0.5k max: 10 sites
- >100 max: 29 sites
- Total: 47 sites

Many small sites, or with mostly local activity.

42 07/05/2007

CMS Xfer on OSG in June ‘06

All CMS sites exceeded 5 TB per day in June 2006.
Caltech, Purdue, UCSD, UFL and UW exceeded 10 TB/day.

[Figure annotation: 450 MByte/sec]
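The slide mixes per-day volumes with a per-second rate, so a quick conversion helps compare them. The sketch below assumes decimal units (1 TB = 1,000,000 MB); whether the 450 MByte/sec figure is a peak or an aggregate is not stated on the slide, so the comparison is indicative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000  # assumption: decimal units


def mb_per_s_to_tb_per_day(rate_mb_s):
    """Daily volume in TB if the given MB/s rate were sustained for 24 hours."""
    return rate_mb_s * SECONDS_PER_DAY / MB_PER_TB


def tb_per_day_to_mb_per_s(volume_tb):
    """Average MB/s rate needed to move the given TB volume in one day."""
    return volume_tb * MB_PER_TB / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"450 MB/s sustained = {mb_per_s_to_tb_per_day(450):.2f} TB/day")
    for tb in (5, 10):
        print(f"{tb} TB/day = {tb_per_day_to_mb_per_s(tb):.1f} MB/s on average")
```

On these assumptions, 450 MByte/sec sustained for a day is a little under 39 TB, while 5 TB/day corresponds to an average of roughly 58 MB/s.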

43 07/05/2007

CPUHours/Day on OSG During 2007

[Figure: CPU hours per day on OSG, 1/1/07 to 5/28/07 (weekly ticks); y-axis: 0 to 160,000; series by site: AGLT2, ASGC_OSG, BNL_OSG, BNL_PANDA, CIT_CMS_T2, FIU-PG, FNAL_CDFOSG_1, FNAL_CDFOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, GRASE-GENESEO-OSG, GROW-PROD, HEPGRID_UERJ, IPAS_OSG, Lehigh Coral, MIT_CMS, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OCHEP_SWT2, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-Lear, Purdue-RCAC, SPRACE, STAR-BNL, STAR-WSU, TTU-ANTAEUS, UC_ATLAS_MWT2, UCRHEP, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE, USCMS-FNAL-WC1-CE2, UTA_SWT2, UTA-DPCC, UWMilwaukee, Vanderbilt.]

44 07/05/2007

Summary

• OSG facility utilization is steadily increasing
– ~2000-4500 jobs at all times
– HEP, Astro, Nuclear Phys., but also Bio/Eng/Med

• Constant effort and troubleshooting are going into making OSG usable, robust and performant.

• Show use to other sciences.

• Trying to bring campuses into a pervasive distributed infrastructure.

• Bring research into a ubiquitous appreciation of the value of (distributed, opportunistic) computation

• Educate people to utilize the resources


45 07/05/2007

Out of Bound Slides

46 07/05/2007

Principle: Simple and Flexible

The OSG architecture will follow the principles of symmetry and recursion wherever possible.

This principle guides us in our approaches to
- Support for hierarchies of and property inheritance of VOs.
- Federations and interoperability of grids (grids of grids).
- Treatment of policies and resources.

47 07/05/2007

Principle: Coherent but heterogeneous

The OSG architecture is VO based.

Most services are instantiated within the context of a VO.

This principle guides us in our approaches to

- Scope of namespaces & action of services.

- Definition of and services in support of an OSG-wide VO.

- No concept of “global” scope.

- Support for new and dynamic VOs should be light-weight.

48 07/05/2007

OSG Security Activities (continued)

[Diagram: trust relationships among User, VO and Site, covering jobs, VO infrastructure, data storage and the CE.]

- I trust it is the VO (or agent)
- I trust it is the user
- I trust it is the user’s job
- I trust the job is for the VO

49 07/05/2007

Principle: Bottom up/Persistency

All services should function and operate in the local environment when disconnected from the OSG environment.

This principle guides us in our approaches to
- The architecture of services. E.g., services are required to manage their own state, ensure their internal state is consistent and report their state accurately.
- Development and execution of applications in a local context, without an active connection to the distributed services.

50 07/05/2007

Principle: Commonality

OSG will provide baseline services and a reference implementation.

The infrastructure will support incremental upgrades.

The OSG infrastructure should have minimal impact on a site. Services that must run with superuser privileges will be minimized.

Users are not required to interact directly with resource providers.

51 07/05/2007

Scale needed in 2008/2009

• 20-30 Petabytes of tertiary automated tape storage at 12 centers world-wide for physics and other scientific collaborations.

• High availability (365x24x7) and high data access rates (1GByte/sec) locally and remotely.

• Evolving and scaling smoothly to meet evolving requirements.

• E.g. for a single experiment


52 07/05/2007

OSG Software Concerns

• How quickly (and at what FTE cost) can we patch the OSG stack and redeploy it?
– Critical for security patches
– Very important for stability and QoS

• How dependable is our software?
– Focus on testing (at all phases), troubleshooting of deployed software and careful adoption of new software

• Functionality of our software
– Close consultation with stakeholders

• Impact on other cyber-infrastructures
– Critical for interoperability

53 07/05/2007

OSG Software Providers

• OSG doesn’t write software, but gets it from providers
– Condor Project
– Globus Alliance
  • Globus, MyProxy, GSISSH
– EGEE
  • VOMS, CEMon, Fetch-CRL…
– OSG Extensions
  • Gratia
– Various open source projects
  • Apache, MySQL, and many more