Grid Scheduling through Service-Level Agreement. Karl Czajkowski, The Globus Project...


Post on 26-Dec-2015






  • Slide 1
  • Grid Scheduling through Service-Level Agreement
      • Karl Czajkowski, The Globus Project
  • Slide 2
  • Overview
      • Introduction to Grid Environments
      • The Resource Management Problem
          • Cross-domain applications
          • Resource owner goals vs. application goals
      • An Open Architecture to Manage Resources
          • Service-Level Agreement (SLA)
          • GRAM and Managed Services
      • Related and Ongoing Work
  • Slide 3
  • Grid Resource Environment
      • Distributed users and resources
      • Variable resource status
      • Variable grouping and connectivity
      • Decentralized scheduling/policy
      • [Figure: resources (R, some marked ?) grouped into VO-A and VO-B, with dispersed users and a network]
  • Slide 4
  • Social/Policy Conflicts
      • Application Goals
          • Users: deadlines and availability goals
          • Applications: need coordinated resources
      • Localized Resource Owner Goals
          • Policies towards users
          • Optimization goals
      • Community Goals Emerge As:
          • An aggregate user/application?
          • A virtual resource?
          • Both!
  • Slide 5
  • Data-Intensive Example
      • Concurrent resource requirements
          • Large scale storage, computing, network, graphics
      • Datapath involves autonomous domains
  • Slide 6
  • Early Co-Allocation in Grids
      • SF-Express (1997-8)
          • Real-time simulation
          • 12+ supercomputers, 1400 processors
      • Required advance reservation
          • Brokered by telephone!
          • Globus DUROC software to sync startup
          • Over 45 minutes to recover from failure
      • In use today in MPICH-G2 (MPI library)
  • Slide 7
  • Traditional Scheduling
      • Closed-System Model
          • Presumption of global owner/authority
          • Sandboxed applications with no interactions
          • Toss job over the fence and wait
      • Utilization as Primary Metric
          • Deep batch queues allow tighter packing
          • No incentives for matching user schedule
      • Sub-cultures Counter Site Policies
          • Users learn tricks for gaming their site
  • Slide 8
  • An Open Negotiation Model
      • Resources in a Global Context
          • Advertisement and negotiation
          • Normalized remote client interface
          • Resource maintains autonomy
      • Users or Agents Bridge Resources
          • Drive task submission and provisioning
          • Coordinate acts across domains
      • Community-based Mediation
          • Coordination for collective interest
  • Slide 9
  • Community Scheduling Example
      • Individual users
          • Require service
          • Have application goals
      • Community schedulers
          • Broker service
          • Aggregate scheduling
      • Individual resources
          • Provide service
          • Have policy autonomy
          • Serve above clients
  • Slide 10
  • Negotiation Phases
      • Discovery
          • What resources are relevant to interest?
          • Finds service providers
      • Monitoring
          • What's happening to them now?
          • Compare service providers
      • Service-Level Agreement
          • Will they provide what I need?
          • The core Resource Management problem
          • Process can iterate due to adaptation
  • Slide 11
  • Service-Level Agreement
      • Three kinds of SLA
          • Task submission (do something)
          • Resource reservation (pre-agreement)
          • Lazy task/resource Binding (apply resv.)
      • Simple protocol for negotiating SLAs
          • Basic 2-party negotiation
              • Support for basic offer/accept pattern
              • Optional counter-offer patterns
              • Variable commitment phase for stricter promises
          • Client may maintain multiple 2-party SLAs
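The offer/accept pattern with an optional counter-offer can be sketched as a small two-party loop. This is only an illustration under stated assumptions: the `Offer` terms (`cpus`, `hours`), the `Provider` class, and the `negotiate` helper are hypothetical names, not part of GRAM or any Globus API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Offer:
    cpus: int   # hypothetical SLA term: CPUs requested
    hours: int  # hypothetical SLA term: duration in hours


class Provider:
    """Hypothetical provider that accepts, counters, or rejects an offer."""

    def __init__(self, free_cpus: int, max_hours: int):
        self.free_cpus = free_cpus
        self.max_hours = max_hours

    def respond(self, offer: Offer) -> Tuple[str, Optional[Offer]]:
        if offer.cpus <= self.free_cpus and offer.hours <= self.max_hours:
            return "accept", offer
        if self.free_cpus == 0:
            return "reject", None
        # Counter-offer: clip the request to what the provider can promise.
        counter = Offer(min(offer.cpus, self.free_cpus),
                        min(offer.hours, self.max_hours))
        return "counter", counter


def negotiate(provider: Provider, offer: Offer, min_cpus: int,
              max_rounds: int = 3) -> Optional[Offer]:
    """Return the agreed terms, or None if no agreement is reached."""
    for _ in range(max_rounds):
        verdict, terms = provider.respond(offer)
        if verdict == "accept":
            return terms
        if verdict == "reject" or terms.cpus < min_cpus:
            return None
        offer = terms  # client adopts the counter-offer and resubmits
    return None


provider = Provider(free_cpus=8, max_hours=10)
agreed = negotiate(provider, Offer(cpus=16, hours=4), min_cpus=4)
```

A client holding several such 2-party agreements at once would simply run this loop against each provider independently.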
  • Slide 12
  • Many Types of Service
      • Must support service heterogeneity
          • Resources
              • Hardware: disks, CPU, memory, networks, display
              • Logical: accounts, services
              • Capabilities: space, throughput
          • Tasks
              • Data: stored file, data read/write
              • Compute: execution, suspended/swapped job
      • SLAs bear embedded term languages
          • Isolate domain-specific details
  • Slide 13
  • Domain Extension: File Transfer
      • Single goal
          • Reliable deadline transfer
      • Specialized scheduler
          • Brokers basic services
          • Synthesizes new service
              • Fault-handling logic
      • Distributed resources
          • Storage space
          • Storage bandwidth
          • Network bandwidth
  • Slide 14
  • Technical Challenges
      • Complex Security Requirements
      • Global Scalability
          • Similar ideals to Internet
          • Interoperable infrastructure
          • Policy-configurable for social needs
      • Permanence or Evolve in Place
          • Cannot take World off-line for service
          • Over time: upgrade, extend, adapt
          • Accept heterogeneity
  • Slide 15
  • GRAM Architecture
      • [Figure labels: Application, Planner, Coordinator, Information Service, GRAM2, Local resource managers, Job, CPU, Disk, Domain-specific SLA, Concrete SLA, Incremental SLAs, SLA implementation, Monitor & Discover]
  • Slide 16
  • WS-Agreement
      • New standardization effort
      • Generalizes GRAM ideas
          • Service-oriented architecture
          • Resource becomes Service Provider
          • Tasks become Negotiated Services
          • SLAs presented as Agreement services
      • Still supports extensible domain terms
  • Slide 17
  • WS-Agreement Entities
  • Slide 18
  • WS-Agreement Adds Management
  • Slide 19
  • Virtualized Providers
  • Slide 20
  • Agreement-based Jobs
      • Agreement represents queue entry
          • Commitment with job parameters etc.
      • Agreement Provider
          • i.e. Job scheduler/Queuing system
          • Management interface to service provider
      • Service Provider
          • i.e. scheduled resource (compute nodes)
      • Service is the Job computation
  • Slide 21
  • Advance Reservation for Jobs
      • Schedule-based commitment of service
          • Requires schedule-based SLA terms
      • Optional Pre-Agreement (RSLA)
          • Agreement to facilitate future Job Agreement
          • Characterizes virtual resource needed for Job
          • May not need full job terms
      • Job Agreement almost as usual
          • May exploit Pre-Agreement
              • Reference existing promise of resource schedule
          • May get schedule commitment in one shot
              • Directly include schedule terms
              • (Can think of as atomic advance reserve/claim)
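A schedule-based commitment implies that the provider checks a requested time window against promises it has already made. The sketch below shows one plausible admission check; the `ReservationTable` class, its half-open `[start, end)` intervals, and the node-count capacity model are all assumptions made for illustration, not the mechanism described in the talk.

```python
class ReservationTable:
    """Hypothetical provider-side table of schedule-based promises."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reservations = []  # list of (start, end, nodes), end exclusive

    def peak_usage(self, start: int, end: int) -> int:
        # Usage only rises where a reservation starts, so the peak inside
        # the window is reached at `start` or at a start time within it.
        points = [start] + [s for s, _, _ in self.reservations
                            if start < s < end]
        return max(
            sum(n for s, e, n in self.reservations if s <= p < e)
            for p in points
        )

    def reserve(self, start: int, end: int, nodes: int) -> bool:
        """Admit the reservation only if capacity is never exceeded."""
        if start >= end:
            raise ValueError("empty or inverted time window")
        if self.peak_usage(start, end) + nodes > self.capacity:
            return False
        self.reservations.append((start, end, nodes))
        return True


table = ReservationTable(capacity=128)
first = table.reserve(0, 10, 100)    # fits: 100 of 128 nodes
second = table.reserve(5, 15, 28)    # fits: exactly fills the overlap
third = table.reserve(8, 12, 1)      # refused: window is already full
fourth = table.reserve(10, 20, 100)  # fits: the first promise has ended
```

The "one shot" case on the slide corresponds to calling `reserve` as part of job submission, while the pre-agreement case corresponds to calling it earlier and referencing the stored entry later.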
  • Slide 22
  • Need for Complex Description
      • 128 physical nodes
          • Physical topology
              • Interconnect
              • RAM, disk size
          • Subject of RSLA
      • Single MPI job
          • Subject of TSLA
          • May reference RSLAs
      • Quality requirements
          • Real-time parameters
              • CPU, disk performance
          • Subject of BSLA
  • Slide 23
  • MDS Resource Models (History)
  • Slide 24
  • Future Models
      • Service behavioral descriptions
          • Unified service term model
          • Capture user/application requirements
          • Capture provider capabilities
      • Core meta-language
          • Facilitates planner/decision designs
          • Extends with domain concepts
          • Extensible negotiability mark-up
              • Capture range of negotiability for variable terms
              • Capture importance of terms (required/optional)
              • Capture cost of options (fees/penalties)
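The three negotiability annotations (range, importance, cost) could be captured in a small term record. The following is a minimal sketch under assumptions: the `Term` fields, the per-unit fee model, and the `evaluate` function are hypothetical and only illustrate how a planner might use such mark-up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Term:
    name: str
    low: float                  # range of negotiability: lower bound
    high: float                 # range of negotiability: upper bound
    required: bool              # importance: required vs. optional
    fee_per_unit: float = 0.0   # cost of the option (assumed linear)


def evaluate(terms: List[Term], offer: Dict[str, float]) -> Optional[float]:
    """Return the total fee if the offer is acceptable, else None."""
    total = 0.0
    for term in terms:
        if term.name not in offer:
            if term.required:
                return None     # a required term is missing
            continue            # optional terms may be omitted
        value = offer[term.name]
        if not (term.low <= value <= term.high):
            return None         # outside the negotiable range
        total += value * term.fee_per_unit
    return total


terms = [
    Term("cpus", low=1, high=128, required=True, fee_per_unit=2.0),
    Term("scratch_gb", low=0, high=50, required=False, fee_per_unit=0.5),
]
fee = evaluate(terms, {"cpus": 16, "scratch_gb": 30})
```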
  • Slide 25
  • SLA Types in Depth
      • Resource SLA (RSLA), i.e. reservation
          • A promise of resource availability
              • Client must utilize promise in subsequent SLAs
      • Task SLA (TSLA), i.e. execution
          • A promise to perform a task
              • Complex task requirements
              • May reference an RSLA (implicit binding)
      • Binding SLA (BSLA), i.e. claim
          • Binds a resource capability to a TSLA
              • May reference an RSLA (otherwise obtain implicitly)
              • May be created lazily to provision the task
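The relationship between the three SLA types can be modeled as plain records with references between them. This is a sketch only: the field names and the `bind` helper (including its `acquire` callback standing in for "obtain implicitly") are assumptions, not GRAM data structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RSLA:
    resource: str   # hypothetical resource identifier
    capacity: int   # hypothetical promised capacity


@dataclass
class TSLA:
    task: str
    rsla: Optional[RSLA] = None  # optional reference gives implicit binding


@dataclass
class BSLA:
    tsla: TSLA
    rsla: RSLA


def bind(tsla: TSLA, rsla: Optional[RSLA] = None,
         acquire: Optional[Callable[[], RSLA]] = None) -> BSLA:
    """Create a BSLA lazily, choosing the resource promise to bind.

    Preference order (an assumption): an explicit RSLA, then the RSLA the
    TSLA already references, then one obtained implicitly via `acquire`.
    """
    chosen = rsla if rsla is not None else tsla.rsla
    if chosen is None:
        if acquire is None:
            raise ValueError("no RSLA available and no way to obtain one")
        chosen = acquire()
    return BSLA(tsla=tsla, rsla=chosen)


reservation = RSLA(resource="cluster-a", capacity=128)
job = TSLA(task="mpi-job", rsla=reservation)
binding = bind(job)
```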
  • Slide 26
  • Resource Lifecycle
      • S0: Start with no SLAs
      • S1: Create SLAs
          • TSLA or RSLA
      • S2: Bind task/resource
          • Explicit BSLA
          • Implicit provider schedule
      • S3: Active task
          • Resource consumption
      • Backtrack to S0
          • On task completion
          • On expiration
          • On failure
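The lifecycle states S0 to S3 map naturally onto a small state machine. In this sketch the event names are hypothetical, and two rules are assumptions rather than statements from the slide: completion is only allowed from S3, and expiration or failure can backtrack from any state other than S0.

```python
from enum import Enum


class State(Enum):
    S0 = "no SLAs"
    S1 = "SLAs created"
    S2 = "task/resource bound"
    S3 = "task active"


# Forward transitions; event names are illustrative only.
TRANSITIONS = {
    (State.S0, "create_sla"): State.S1,  # TSLA or RSLA
    (State.S1, "bind"): State.S2,        # explicit BSLA or provider schedule
    (State.S2, "start"): State.S3,       # resource consumption begins
}


def step(state: State, event: str) -> State:
    """Apply one event and return the next lifecycle state."""
    if event == "complete" and state is State.S3:
        return State.S0
    if event in ("expire", "fail") and state is not State.S0:
        return State.S0
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} not allowed in state {state.name}"
        ) from None


state = State.S0
for event in ("create_sla", "bind", "start", "complete"):
    state = step(state, event)
```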
  • Slide 27
  • Incremental Negotiation
      • RSLA: reserve resources for future use
      • Resources change state due to SLAs and scheduler decisions
      • TSLA: submit task to scheduler
      • BSLA: bind reservation to task
  • Slide 28
  • Linking SLAs for Complex Case
      • Dependent SLAs nest intrinsically
          • BSLA2 defined in terms of RSLA2 and TSLA4
      • Chained SLAs simplify negotiation
          • Optionally link destruction/reclamation
      • [Figure labels, along a time axis: TSLA1, RSLA1, BSLA1, TSLA2, TSLA3, BSLA2, RSLA2, TSLA4; Stage in, Stage out, Net, Complex job; account tmpuser1; 50 GB in /scratch filesystem; 30 GB for /scratch/tmpuser1/foo/* files]
  • Slide 29
  • Related Work
      • Academic Contemporaries
          • Condor Matchmaking
          • Economy-based Scheduling
          • Work-flow Planning
      • Commercial Scheduler Examples
          • Many examples for traditional sites
              • Several generalized for the enterprise
          • Platform Computing
              • LSF scaled to lots of jobs
              • MultiCluster for site-to-site resource sharing
          • IBM eWLM
              • Goal-based provisioning of transactional flows
  • Slide 30
  • Condor Matchmaking
      • At heart: a scheduling algorithm
      • Heuristics for pairing job with resource
          • Match symmetric Classified Ads
          • Great for bulk/commodity matching
      • Closed system view
          • Subsumes resource through lease
          • Sandboxed job environment
          • Favor vertical integration over generality
          • Tuned high-throughput system
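Symmetric matching means a pairing is made only when each ad satisfies the other's requirements. The sketch below imitates that idea with Python dictionaries and callables; it does not use the real ClassAd language or any Condor API, and the greedy rank-based pairing is an assumed heuristic chosen for brevity.

```python
def matches(job: dict, machine: dict) -> bool:
    """Symmetric test: both sides must accept the other's ad."""
    return job["requirements"](machine) and machine["requirements"](job)


def matchmake(jobs: list, machines: list) -> list:
    """Greedily pair each job with its highest-ranked acceptable machine."""
    free = list(machines)
    pairs = []
    for job in jobs:
        candidates = [m for m in free if matches(job, m)]
        if not candidates:
            continue  # job stays unmatched this round
        best = max(candidates, key=job["rank"])
        free.remove(best)
        pairs.append((job["name"], best["name"]))
    return pairs


# Hypothetical ads: attributes plus a requirements predicate per side.
machines = [
    {"name": "m1", "memory": 4096,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "m2", "memory": 16384,
     "requirements": lambda job: True},
]
jobs = [
    {"name": "j1", "owner": "alice",
     "requirements": lambda m: m["memory"] >= 2048,
     "rank": lambda m: m["memory"]},
    {"name": "j2", "owner": "guest",
     "requirements": lambda m: m["memory"] >= 1024,
     "rank": lambda m: m["memory"]},
]
pairs = matchmake(jobs, machines)
```

Here `j1` takes the larger machine, and `j2` stays unmatched because the only remaining machine refuses guest jobs, which shows the resource side of the symmetry.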
  • Slide 31
  • Condor on GRAM
      • Condor already uses GRAM two ways
          1. GRAM treats Condor as local scheduler
          2. Condor uses GRAM to access resource
      • Condor maps to SLA architecture
          • Advertise resource ClassAd
          • Submit job ClassAd (as TSLA)
          • Matchmaker is a Community Scheduler
          • Need SLA scalability to be practical
  • Slide 32
  • Future Work
      • SLA interaction with policy
          • SLA negotiation subject to policy
              • One SLA affects another, e.g. RSLA subdivision
              • One client more important than another
          • SLA implemented by low-level policies
              • Domain-specific SLA maps to resource SLAs
              • Resource SLAs map to resource control mechanisms
      • Resource characterization
          • Advertisement of resources: options, cost
          • Interoperable capability languages
  • Slide 33
  • Conclusion
      • Generic SLA management Comp