

Service Level Agreements

Service Level Agreements (SLAs) Feb 2015 ICS-ACI


ICS-ACI Goals

Providing high quality advanced computing and storage services for research with:

Open and transparent policies

Clearly defined services, service levels and expectations

Uniform access across the university-wide research community including those with data-intensive processing needs

Measurement-based approach to managing “quality of service”

Faculty governance through ICS-ACI Coordinating Committee

Service Level Agreements (SLAs) are the instrument to explain these services to the Penn State community

Understand that this is a transition, and work to ensure a smoother transition by working with PIs on a case-by-case basis, as needed


Penn State Model Aspects

Provost is making significant investments so that CyberInfrastructure (CI) services can scale university wide and advance our research mission

Model takes into account the entire CI environment which includes high performance computing resources, high-bandwidth redundant network, fast and flexible storage, a comprehensive software stack, and a variety of services including operations, back-up, technical consulting, and training

Model is meant to be sustainable and extensible with approx. 1/5 of the capacity fully subsidized by the provost for exploratory research and free open access

Of the remaining capacity, approx. 65% will be subsidized by the Provost and 35% cost-recovered through funds under faculty control

The underlying systems and services capacities are not static and can be expanded and refined through input from the ICS-ACI Coordinating Committee and university-wide Research CI Governance Structures


How ICS-ACI Faculty Governance Works

ICS-ACI Coordinating Committee consists of research faculty and IT staff with broad university representation to:

Serve as the forum to develop, review, and update ICS-ACI policies, processes, and effective delivery of services as the needs of the university and community change (e.g. the ICS-ACI SLAs will be reviewed at least once a year)

Obtain serious stakeholder input and guidance such that “best practices” for research cyberinfrastructure can be deployed

Create an exceptions process to handle policy-edge cases in a quick, efficient, open and uniform manner

Provide conduits to disseminate information, such as services, service levels and expectations, to the university research community

Serve as the mechanism to refer pressing or unresolved issues to the Research CI Governance Committee


Organization of this Document

1. Glossary and Definition of Terms

2. Overview of ICS-ACI computing System

3. High Level PI View

4. User Experience Batch Job Submission

5. ICS-ACI Models

6. Q&A About ICS-ACI Models

7. ICS-ACI Storage Models

8. Additional Information


1. Glossary and Definition of Terms


1. Glossary and Definition of Terms - I

ICS-ACI – Institute for CyberScience - Advanced Cyber Infrastructure

Service Level Agreement (SLA) - Agreement between ICS and Research PI in relation to research ICS-ACI resources, e.g. access, storage, computing

ACI-b – ICS-ACI sub-system configured to execute jobs submitted to a variety of queues, i.e. batch processing

ACI-u – ICS-ACI User specific “Development/Test” interactive subsystem where PI’s may specify a system configuration for user-specific interactive sessions including root access and user-defined software stack

ICS-ACI-Burst – Queue to allow usage of computing resources in the ACI-b subsystem above a PI’s physical allocation when needed for a short time period

ICS-ACI-Guaranteed – Queue providing access to the ACI-b subsystem within a guaranteed time, provided that the request is within a PI’s physical allocation

ICS-ACI-Open – Queue to provide user access to idle ACI-b computing resources during times when all jobs are running and idle resources remain available

Batch – Executing or processing of a series of programs (jobs) on a system without manual intervention


1. Glossary and Definition of Terms - II

Core – Data processing unit within a server. The total number of cores per server depends on the vendor’s server architecture.

Core Allocation – Amount of physical computing resources purchased by or granted to a user through ICS-ACI plans

F&A – Facilities and Administration charge, also referred to as “indirect” or “overhead”

Force Majeure – Unforeseeable circumstances that prevent fulfillment of a contract (e.g. natural disaster, fire)

GPFS – General Parallel File System

Group – A self-defined set of multiple users, for example the students and researchers in a faculty member's lab. Rights such as access to storage and allocation of resources can be delegated in an organized fashion by the PI.

Group Storage – Dedicated disk space for storing group-related data or research

Guaranteed Response Time – The maximum time that it takes for a job to start execution after submission to a queue

Home Directory – A user’s dedicated disk space for storing personal files, directories and programs; the directory that a user is taken to after logging into the system.

Legacy Systems – Pre-2015 ICS computing systems, such as the Lion-X clusters


1. Glossary and Definition of Terms - III

Login Nodes – Front end servers used to log in to the ICS-ACI computing system

NAS – Network Attached Storage

PI – Principal or Primary Investigator. Person, such as faculty, who is authorized to direct all of his or her research ICS-ACI resources, e.g. access, storage, computing

Pre-emption – The act of pausing or stopping a job that is currently processing in order to fulfill terms and conditions to other users under service level agreements

Scratch Directory – Disk space dedicated for temporary storage of data

System – The computing engine along with the software, storage, network, and peripheral devices that are necessary to make the computer function, e.g. ICS-ACI

Subsystem – A unit or device that is part of a larger system, e.g. ACI-b

User – A person, such as a student or faculty member, who has a User Account authorizing use of the ICS-ACI resources.

User Account – The authorization agreement under which a user is entitled to access a computing system. For details, see 'User Account Policies' on the ICS-ACI website.

Wall Time – A queue parameter that is set to define the maximum allowable execution time for a job once it has started

Work Directory – Disk space dedicated to a user for research data storage


2. Overview of ICS-ACI computing System


2. Overview of ICS-ACI computing System

“b”atch Systems (ACI-b) – Systems configured to execute jobs submitted to a variety of queues, i.e. batch processing.

“u”ser-specific “Development/Test” Interactive Systems (ACI-u) – Systems in which PIs may specify a system configuration for user-specific interactive sessions, including root access and user-defined software stacks.

“i”nteractive Systems (ACI-i) – Systems configured as a common GUI interactive system for testing, small jobs, and pre/post processing

[Diagram: ICS-ACI computing system. Labels shown: GPFS, NAS, and Tape storage; Standard Memory Cores, High Memory Cores, and Interactive Cores; Login Nodes; ACI-b, ACI-u, and ACI-i; links to Legacy Systems and to the PSU Research Network]


3. High Level PI View


3. High Level PI View

PI gets core allocation via one of two models (detailed later)

Explore model

GReaT subscription model

PI can flexibly choose how he/she wants to compute

Standard Memory Cores or Large Memory Cores

Batch Job Submission to 1 of 3 different queues, giving users a choice depending on their time-criticality or usage model needs:

• Guaranteed Response Time Queue

• Burst Queue

• Open Queue

Use Interactive computing resources

Enter the User-specified custom environment

Example: Big Data Use Case: PIs can use the User-specified custom environment and/or large memory computing resources

PI empowered to:

Choose model and define groups


4. User Experience Batch Job Submission


4. User Experience – I: Batch Job Submission to the ACI-b Subsystem

User with a specified core allocation can choose to submit jobs to one or both of the following queues that operate on the ACI-b subsystem:

GUARANTEED Queue – user submits jobs within their core allocation; the jobs will begin within a specified Guaranteed Response Time

BURST Queue – user submits jobs within 4x of their core allocation; the jobs will follow the “burst” allocation

If a user wants to submit jobs over 4x of his or her core allocation, the PI may request exceptions to these limits through a transparent faculty-governed exception process. The exception process, defined later, will use best efforts to resolve exceptions in a quick (sometimes automated), efficient, open and uniform manner

NOTE: NO wall-time limit on Guaranteed or Burst provided user has allocation remaining

Anyone with an account on ICS-ACI can submit jobs to the following queue that operates on the ACI-b subsystem:

ICS-ACI OPEN Queue - provides leveraged access to idle ACI-b computing resources during times when all jobs are running and idle resources remain available


4. User Experience – II: Examples of ACI-b Batch Jobs

Assume a PI with a 20 core allocation

Example 1 - PI needs 20 cores for 720 hours – PI submits job to Guaranteed Queue and job will begin within a specified Guaranteed Response Time

Example 2 - PI needs to meet a proposal deadline and needs a job of 40 cores for 180 hours – PI submits job to Burst Queue and job will follow “bursting” allocations

Example 3 - PI has an unbalanced workflow and needs to submit jobs twice a month requiring 40 cores for 90 hours – PI submits jobs to Burst Queue and jobs will follow “bursting” allocations

Example 4 - PI needs 100 cores for 72 hours – Job requires an exception and will follow the exceptions process
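The routing in these four examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_queue` and the returned labels are hypothetical, and the only rules encoded are the ones stated in the slides (requests within the core allocation fit the Guaranteed Queue, requests up to 4x the allocation fit the Burst Queue, and anything larger needs an exception). A PI may still choose the Burst or Open Queue for a smaller job; the sketch returns only the first queue whose limit the request fits.

```python
BURST_MULTIPLIER = 4  # Burst Queue ceiling stated in the slides: 4x the core allocation


def choose_queue(core_allocation, cores_requested):
    """Hypothetical helper: return the first route whose limit the request fits."""
    if core_allocation <= 0:
        # Without a core allocation only the OPEN Queue is available.
        return "OPEN"
    if cores_requested <= core_allocation:
        return "GUARANTEED"
    if cores_requested <= BURST_MULTIPLIER * core_allocation:
        return "BURST"
    # Above 4x the allocation the request goes through the exceptions process.
    return "EXCEPTION"


# The four examples for a PI with a 20-core allocation.
for example, cores in enumerate((20, 40, 40, 100), start=1):
    print(f"Example {example}: {cores} cores -> {choose_queue(20, cores)}")
```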


4. User Experience - III: How Bursting Allocations Work

PI’s core allocation is converted into a 90-day Core Hour Allocation

Example – PI with 20 cores

90-day Core Hour Allocation: 20 cores * 24 hours * 90 days = 43,200 core hours

Each PI has a sliding 90-day usage period

On day 1, PI has 100% of his/her Job Priority in the BURST Queue

As PI uses allocation, a “Daily Core Hours Used” total is calculated

Example – PI runs a job of 40 cores for 10 hours

Daily Core Hours Used: 40 cores * 10 hours = 400 core hours

PI’s Job Priority % is reduced based upon usage during the 90-day period

The first 1/3 of usage reduces job priority at a faster rate than the remaining 2/3 of usage

Once a PI has exhausted 100% of his or her 90-day Core Hour Allocation:

PI will not have any additional access to the GUARANTEED or BURST Queue

Start time based upon system availability (not guaranteed)

Capacity (max): 4x core allocation – but users can request exceptions to capacity limits through a transparent faculty-governed exception process
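The two calculations on this slide can be reproduced directly. The sketch below is illustrative only: the function names are hypothetical, and the numbers are the ones used in the slide's examples (a 20-core allocation and a 40-core job running for 10 hours).

```python
HOURS_PER_DAY = 24
PERIOD_DAYS = 90  # length of the sliding usage period


def core_hour_allocation(cores, days=PERIOD_DAYS):
    """Core allocation converted into core hours for one usage period."""
    return cores * HOURS_PER_DAY * days


def core_hours_used(cores, hours):
    """Core hours charged for a job that holds `cores` cores for `hours` hours."""
    return cores * hours


allocation = core_hour_allocation(20)  # 20 cores * 24 hours * 90 days
used = core_hours_used(40, 10)         # 40 cores * 10 hours
print(allocation, used, allocation - used)
```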


4. User Experience - IV: Example PI with 90-day allocation of 9000 Core-hours using Bursting in Rolling 90-day Usage Period(s)

Day 1: PI has 100% of highest priority in the BURST Queue & PI submits a job that executes and consumes 1500 core-hours

1500 core-hours are deducted from his/her available allocation (9000-1500=7500) & new priority is re-calculated = 67%

Day 6: PI has 67% of highest priority in the BURST Queue & PI submits a job that executes and consumes 1500 core-hours

1500 core-hours are deducted from his/her available allocation (7500-1500=6000) & new priority is re-calculated = 33%

Day 11: PI has 33% of highest priority in the BURST Queue & PI submits a job that executes and consumes 3000 core-hours

3000 core-hours are deducted from his/her available allocation (6000-3000=3000) & new priority is re-calculated = 16%

Day 16: PI has 16% of highest priority in the BURST Queue & PI submits a job that executes and consumes 3000 core-hours

PI has consumed all of his/her 90-day allocation

He/She will not have any additional access to the GUARANTEED or BURST Queue (can use OPEN Queue)

Day 17-91: OPEN Queue only for compute (NOTE: Any compute in the OPEN Queue does NOT count toward allocation)

Day 92 (red bracket): “Day 1’s” 1500 compute hours “roll-off”

1500 core-hours are added back into his/her available allocation (0+1500=1500) & new priority is re-calculated = 8%

Day 97 (green bracket): PI has 8% of highest priority in the BURST Queue and 0 jobs submitted, “Day 6’s” 1500 hours “roll-off”

1500 core-hours are added back into his/her available allocation (1500+1500=3000) & new priority is re-calculated = 16%

Day 102 (purple bracket): PI has 16% of highest priority in the BURST Queue and 0 jobs submitted, “Day 11’s” 3000 hours “roll-off”

3000 core-hours are added back into his/her available allocation (3000+3000=6000) & new priority is re-calculated = 33%

Day 107 (blue bracket): PI has 33% of highest priority in the BURST Queue and 0 jobs submitted, “Day 16’s” 3000 hours “roll-off”

3000 core-hours are added back into his/her available allocation (6000+3000=9000) & new priority is re-calculated = 100%

NOTE: First 1/3 usage decreases job priority at a faster rate than the remaining 2/3 usage

[Figure: timeline of Available Allocation, Used Allocation, and % Priority over the rolling 90-day Usage Period(s), marked at Day 1, Day 6, Day 11, Day 16, Days 17-91, Day 92, Day 97, Day 102, and Day 107]
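The rolling-window bookkeeping in this example can be simulated as a rough sketch. Two things in the code are assumptions, not documented behavior. First, `ROLL_OFF_AFTER = 91` is inferred from the example, where Day 1 usage returns on Day 92. Second, `priority_percent` is a hypothetical piecewise-linear curve chosen only because it lands within one percentage point of the slide's figures (100%, 67%, 33%, 16%, 8%, 0%); the slides do not give the scheduler's actual formula.

```python
from collections import deque

ALLOCATION = 9000    # 90-day core-hour allocation used in the example
ROLL_OFF_AFTER = 91  # assumption: inferred from the example (Day 1 usage returns on Day 92)


def priority_percent(available, allocation=ALLOCATION):
    """Hypothetical curve fitted to the example, not the scheduler's real formula."""
    fraction = available / allocation
    if fraction >= 2 / 3:
        # First third of usage: priority falls steeply from 100% to 33%.
        return 33 + (fraction - 2 / 3) * 3 * 67
    # Remaining two thirds of usage: priority falls from 33% to 0%.
    return fraction * 1.5 * 33


def simulate(jobs, last_day):
    """Return {day: (available core-hours, priority %)} for days when the balance changes."""
    window = deque()  # (day, core_hours) entries still counted against the allocation
    available = ALLOCATION
    history = {}
    for day in range(1, last_day + 1):
        changed = False
        # Usage older than the window rolls off and is added back.
        while window and day - window[0][0] >= ROLL_OFF_AFTER:
            _, hours = window.popleft()
            available += hours
            changed = True
        if day in jobs:
            window.append((day, jobs[day]))
            available -= jobs[day]
            changed = True
        if changed:
            history[day] = (available, priority_percent(available))
    return history


history = simulate({1: 1500, 6: 1500, 11: 3000, 16: 3000}, last_day=107)
for day, (available, priority) in history.items():
    print(f"Day {day}: {available} core-hours available, priority about {priority:.1f}%")
```

The deque keeps jobs in submission order, so the oldest usage is always at the left and rolls off first.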


5. ICS-ACI Models


5. ICS-ACI Models – I: ICS-ACI GReaT Model Plan Description

Provides access to the ACI-b and ACI-u systems within a guaranteed time while providing the flexibility to burst

For ACI-b systems, PI will be given a batch job queue for any PI-specified users. Submitted jobs will begin within a specified Guaranteed Response Time provided that the request is within a PI’s physical allocation. The current Guaranteed Response Time is 1 hour; however, quicker response times are expected.

For ACI-u systems, PIs may specify a system configuration for user-specific interactive sessions, including root access and user-defined software stacks. Sessions begin immediately, provided the request does not exceed the PI’s core allocation.

Burst access available for PI's requiring capacity exceeding their core allocation

Plan Specifics

PI's charge core-years to grant and/or general funds (F&A WAIVED)

Available in units of 2 cores* (½ Node minimum)

3-, 4-, 5-year ICS-ACI GReaT plan – Acquire 2 cores, get 1 in ICS-ACI match

These plans include 5TB of group storage for the equivalent term at no additional cost

1-, 2-year ICS-ACI GReaT plan – No (1-year) / Partial (2-year) match, but lower cost per core

* ICS-ACI recommends purchasing full node access for potential increased performance


5. ICS-ACI Models – II: ICS-ACI GReaT Plan Cost Model

Plan | Term | Node Type (1) | Effective Direct Cost (2,3) (per core) | Effective Total Cost (3) for 1 current node (1) (total cores)
ICS-ACI GReaT (4) | 3-year | Std-Mem | $396 | $7,920 (20)
ICS-ACI GReaT (4) | 3-year | High-Mem | $756 | $30,240 (40)
ICS-ACI GReaT (4) | 1-year | Std-Mem | $198 | $3,960 (20)
ICS-ACI GReaT (4) | 1-year | High-Mem | $378 | $15,120 (40)

Notes: (F&A WAIVED)

(1) Current Node Config.: Standard Memory – 20 cores/node; High Memory – 40 cores/node

(2) Cores available in units of 2 cores* (½ Node minimum purchase)

(3) Includes ICS-ACI match of 1 core per every 2 cores purchased (3, 4, and 5 year plans only) (2 year plan has a partial match)

(4) Plans are also available in (per core including match): 2-yr (Std-$330/High-$630), 4-yr (Std-$528/High-$1028), 5-yr (Std-$660/High-$1260)

* ICS-ACI recommends purchasing full node access for potential increased performance
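The per-node totals in the cost table are the per-core price multiplied by the cores in one current node. The check below is a sketch: the dictionary and function names are hypothetical, and the prices and node sizes are copied from the table and its note (1).

```python
# Effective direct cost per core, copied from the cost table (US dollars).
PER_CORE_COST = {
    ("3-year", "Std-Mem"): 396,
    ("3-year", "High-Mem"): 756,
    ("1-year", "Std-Mem"): 198,
    ("1-year", "High-Mem"): 378,
}

# Current node configurations from note (1).
CORES_PER_NODE = {"Std-Mem": 20, "High-Mem": 40}


def node_cost(term, node_type):
    """Effective total cost of one full node for the given plan term (hypothetical helper)."""
    return PER_CORE_COST[(term, node_type)] * CORES_PER_NODE[node_type]


for (term, node_type), price in PER_CORE_COST.items():
    cores = CORES_PER_NODE[node_type]
    print(f"{term} {node_type}: {cores} cores x ${price} = ${node_cost(term, node_type):,}")
```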


5. ICS-ACI Models – III: ICS-ACI Explore Model Plan Description

Provides special access to the ACI-b and ACI-u systems for PI's who need specific, deterministic and guaranteed system capacity to conduct medium to large scale experiments/research

Examples – “proof-of-concept” for a proposal/paper, strategic exploratory research

Service levels are identical to the ICS-ACI GReaT model in terms of guarantees and bursting.

Small scale experiments/research are NOT suited for ICS-ACI Explore; however, PIs have access to the ICS-ACI OPEN Queue for this research

Plan Specifics

No charge to PIs, 100% Provost-sponsored service

Allocation durations are in terms of days (30 day min. – 6 month max.)

Available in units of 2 cores (10 core minimum)

Allocations made upon approval through a simple faculty-governed application/review process (TBD – modeled after other PSU institutes)


6. Q&A About ICS-ACI Models


6. Questions – I: How does the ICS-ACI Exceptions Process Work?

For use cases that go beyond ICS-ACI policies, there is a need for a process to resolve issues in a quick, efficient, open and uniform manner

ICS-ACI Exceptions Process Characteristics:

Exceptions classifications will set clear limits for various functional areas (e.g. wall time limits, bursting, usage, recurring user exception requests)

Classifications: Acceptable, Discussion, Negotiation, Unacceptable

“Exceptions-process” mailbox and log will document the exception and its classification along with any recommended solutions

As needed, a small ICS-ACI faculty exceptions committee, drawn from the ICS-ACI Coordinating Committee, will quickly review exceptions and extend resolutions

Exceptions mailbox and logs will be reviewed periodically with the ICS-ACI faculty exceptions committee and the ICS-ACI Coordinating Committee

Exceptions limits will be revisited and revised based upon operational experience


6. Questions – I (cont.): ICS-ACI Exceptions Process Classifications

Classification | Impact to the user community and/or systems under normal operations (Example) | Resolution Process
Acceptable | Minimal to no impact (one-time 8x burst request, 24-hr wall time) | ICS-ACI system approval automatically/semi-automatically with documented action
Discussion | Moderate impact (multiple 20x burst requests, 96-hr wall time) | ICS approves with documented reasoning
Negotiation | Major impact (one-time 100x burst request, 192-hr wall time) | ICS makes resolution suggestions and sends to “faculty exceptions committee” for determination. Committee determines response
Unacceptable | Extensive impact (one-time 1000x burst request, 384-hr wall time) | ICS denies request with documented reasoning


6. Questions – II: How does suspending of jobs work?

Jobs begun in the GUARANTEED/BURST Queue and user-specific interactive sessions will never be suspended/killed*

Jobs begun in the OPEN Queue, which provides leveraged access to idle computing resources during times when all jobs are running and idle resources remain available, may be suspended if those resources are needed for pending jobs in the GUARANTEED Queue or pending requests for user-specific interactive sessions

Jobs suspended in the OPEN Queues will be resumed when resources become available and before new pending OPEN jobs begin

In extreme instances, multiple suspended jobs may leave systems in a disordered state; only at that point will these suspended jobs be killed for system stability

A killed job will automatically be resubmitted to its queue

* Exception – jobs that harm the system or other jobs (e.g. run-away jobs, compromised accounts)

ICS-ACI will be developing user training to show users:

How to use techniques, such as check-pointing, that save results periodically and so reduce the impact of a killed job

How to create code that uses techniques that will store the state and move partial results to resources allowing correct resumption of jobs


6. Questions – III: How do ICS-ACI GReaT Costs Compare?

3 Year Costs (1) Comparison per Computing Node

Computing Node Type | ICS-ACI GReaT (2) | Amazon (3) | PSU eBuy (4) | Former RCC (5)
Standard Memory (20 cores/node) | $7,920 | $10,060 | $10,216 | $6,800-$10,000 (5)
High Memory (40 cores/node) | $30,240 | N/A | $42,000 | N/A
F&A WAIVED | Yes | No | No |

(1) Direct Cost – F&A applied as specified

(2) Effective direct cost per node to the PI. The 3-year period applies to the CI service, not to the system life; components will be refreshed in approx. 4.5-year periods, and this is factored into the 3-year service cost

(3) Amazon costs extrapolated from 1 year/3 year per 32 vCPU node (4:1 over-commit) to match 20 physical core node

(4) eBuy costs of a single rack-mount node server with equivalent specs

(5) Uncollected F&A with improper management of fabrication projects. Some PIs had a higher cost


6. Questions – IV: How does the value of ICS-ACI GReaT Compare?

Plan | Characteristics

ICS-ACI GReaT | Subscription model with guaranteed response times and bursting. Approx. 1/5 of system for no-cost computing; Offers: Both Standard Memory (256GB) & High Memory (1TB) nodes w/traditional batch (HPC) and new user-specific interactive systems with virtualization capability; Separate, expanded storage pools (GPFS and NFS); Re-engineered for high performance; Faster, more reliable networking; Newly established i-ASK Center for support; Coordinated user development, training & technical consulting

Amazon | Subscription model with guaranteed response time and bursting. Offers: Only lower Standard Memory (60GB) nodes; No Infiniband for high performance parallel computing; External to PSU storage; Limited support

PSU eBuy | Capital purchase model that allows PI to keep the system for as long as it runs, but has to include the personnel effort for system operations. PI has to purchase or get access to a chassis, rack, cabling, power, storage, back-up, infrastructure and network, interconnect (e.g. IB), software stack, and other services as needed

Former RCC | Shared capital purchase model with no guaranteed response times and unlimited bursting. No open service level agreements and usage models, which is inappropriate for a university-wide community of researchers, e.g., of the 4,000 current users: 4 used 10% of system; 75 used 25% of system. Inappropriate scheduling: jobs run with increased capabilities, such as wall-time, even after usage allocation is exhausted. Offers separate clusters with only lower Standard Memory, except Lion-XV, w/traditional batch (HPC) only; No high memory nodes for “big data” workloads; Scratch and other storage pools in a single file system; Ad-hoc user training. No additional charge if system ran for 4+ years


6. Questions – V: How does ICS-ACI GReaT Compare with Other Univ.?

A comparison with a couple of institutions is shown in the next slide. NOTE: there are substantial variations in:

Levels of subsidy and services

System properties – including the ability to serve compute vs. data-intensive workloads

Access for exploratory research

Open access that leverages idle computing resources during times when all jobs are running and idle resources remain available

A thorough study will likely be undertaken by working groups through university-wide Research CI Governance that has been recently approved by the Provost

Professors Rob Hume and Scott Bennett are chairing the committee to implement the university-wide Research CI Governance structure


Topic | Penn State | U of Michigan | Purdue
Model Type | Subscription | Subscription – Over-sold | Subscription
Std. Memory Node, 3-yr Costs (20 cores/node) (Shared Memory/node) | $7,920 (256 GB/Node) | $8,438 (80 GB/Node); $13,121 with F&A; $7,389 with college subsidy + F&A | $3,000 (64 GB/Node)
High Memory Node, 3-yr Costs (40 cores/node) (Shared Memory/node) | $30,240 (1 TB/Node) | $34,300 (1 TB/Node); $53,337 with F&A; $29,781 with college subsidy + F&A | By request
F&A charged | WAIVED | Yes – On external funds | Waived
Included in the Costs | Cores, Storage, Network, Backup, Infrastructure, All HW/SW, + Classroom support | Cores, Storage, Network, Electric, Infrastructure, HW-only support | Cores Only
Time Period & Min-Size | 1-5 years, ½ Node Min | 1-month, 4 Cores Min | 5-year, 1 Node Min
Guaranteed Response Time | Yes – 1hr | No – tries to keep it short | Yes – 4hr
Bursting | Yes – 4x + exception for larger, no limits on wall-time | No – But can buy more cores for bursting months | Yes – Unlimited cores with a 4-hr wall-time
Fully Subsidized Access (Exploratory/Open) | Yes, Approx. 1/5 of system | No | Yes – separate very small 256 core system, w/limits
Compute Type | Traditional HPC & new virtualized/interactive systems (cloud) | Traditional HPC only | Traditional HPC only
Storage (H-Home, W-Work, G-Group, S-Scratch); Costs for extra Storage | NFS, H-10GB, W-128GB, S-1M Files, G-5TB; Add’l $100/1 TB per year | Additional, costs not known | ISOLON, H-10GB, W/G – 100GB; Add’l $150/1TB per year
Subsidy by University | 100% Explore/Open, 65% subscription | 75% (90% with college subsidy) | 90% or higher


6. Questions – VI: How do I monitor my usage?

ICS plans to launch a web portal to show PIs their system usage data

Broken down by Queue, Jobs, Core Hours, Configuration, and User

Access to remaining core hours and sliding 90-day usage period

Annual and total usage

ICS plans to send PIs e-mail notifications

When usage hits 50%, 75%, and 100% for the 90-day usage period

Monthly report of usage similar to the data found in the Web Portal


6. Questions – VII: What if I don’t have a core allocation or have used all of my core hour allocation?

ICS-ACI OPEN Queue provides leveraged access to idle ACI-b computing resources during times when all jobs are running and idle resources remain available

Jobs and interactive sessions begun under the ICS-ACI Open plan are delivered through a shared allocation with these characteristics:

All users are at an equal priority for access to idle resources.

Jobs and interactive sessions are started and will continue to run only when a sufficient number of idle cores are available.

Jobs and interactive sessions may be suspended or killed when demand for resources allocated to other Queues exceeds the supply.

There are no guarantees on completion time.

ICS-ACI OPEN Queue Specifics

No charge to PIs, 100% Provost-sponsored service

ICS-ACI Open capacity (max per user account): 100 jobs pending in the shared queue, 20 cores executing jobs at a given point in time, 24-hour job wall-times, 24-hour interactive sessions

Users may request exceptions to these capacity limits through a transparent faculty-governed exception process
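As a rough illustration of the per-user capacity limits listed above, the sketch below checks a new job request against them. The class and function names are hypothetical, and how the actual scheduler enforces these limits (rejecting, holding, or queuing a job) is not stated in the slides; only the numeric limits are taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenQueueLimits:
    """Per-user ICS-ACI Open limits as listed on the slide."""
    max_pending_jobs: int = 100
    max_running_cores: int = 20
    max_walltime_hours: int = 24


def open_queue_violations(pending_jobs, running_cores, requested_cores,
                          walltime_hours, limits=OpenQueueLimits()):
    """Return the names of the limits a new job request would exceed.

    Assumed semantics: the new job adds one pending job and, once started,
    adds requested_cores to the user's running cores.
    """
    violations = []
    if pending_jobs + 1 > limits.max_pending_jobs:
        violations.append("pending_jobs")
    if running_cores + requested_cores > limits.max_running_cores:
        violations.append("cores")
    if walltime_hours > limits.max_walltime_hours:
        violations.append("walltime")
    return violations


# Example: a user with 16 cores already running asks for 8 more for 12 hours.
print(open_queue_violations(pending_jobs=10, running_cores=16,
                            requested_cores=8, walltime_hours=12))
```

A request that exceeds a limit is the kind of case the exception process mentioned above is meant to handle.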

6. Questions – VIII: What if I want to “Try-ACI”?

PI can obtain one-time trial access to the ACI-b and ACI-u systems to make sure that the computing environment meets his/her research goals.

Once granted, service levels will be identical to the ICS-ACI GReaT Plan

Plan Specifics

No charge to PIs, 100% Provost-sponsored service

PI should request the number of cores that will be requested once the trial period expires. Approval is based upon system availability and capacity limits.

Capacity (max): 30 days from PI-indicated start time

Users may request exceptions to these capacity limits through a transparent faculty-governed exception process

7. ICS-ACI Storage Models

7. ICS-ACI Storage Models – I: Overview

By default, user accounts come with three storage areas, “Home”, “Work”, and “Scratch”, attached to the ICS-ACI cluster.

Group storage is available per the Group Storage Plans.

Capacity and capabilities of ICS-ACI storage

*ICS-ACI uses a high-performance parallel GPFS scratch storage system that is available to each user of the cluster. Scratch space is intended for temporary data required between program runs. Files are not backed up and are not recoverable, including after accidental deletion. The integrity of the scratch storage components is maintained by a redundant disk system. All efforts are made to maintain the integrity of the file system; however, circumstances beyond our control could result in the loss of data.

Removal Policy – files should be present for only 30 days from creation date. Users with files older than 45 days from creation date will be sent reminders at 45, 52, and 59 days to move the data. Files still present 60 days after creation date will be purged from the system.

If files are needed for longer than 30 days or require back-up, they should be placed in Group storage.

Storage Directory | Default Capacity | Capabilities
Home | 10 GB | NFS with Backup/Recovery
Work | 128 GB | NFS with Backup/Recovery
Scratch | 1 million files* | GPFS with no Backup*
Group | 5 TB blocks | NFS with Backup/Recovery and dual mount capability
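The scratch removal policy described on this slide (reminders at 45, 52, and 59 days, purge at 60 days) can be sketched as a simple age check. This is only an illustration of the stated timeline: the function name and return values are hypothetical, and the slides do not say how ICS-ACI determines a file's creation date or runs the purge.

```python
from datetime import date

REMINDER_DAYS = (45, 52, 59)  # days after creation on which a reminder is sent
PURGE_DAY = 60                # files this old (or older) are purged


def scratch_action(created, today):
    """Return "purge", "remind", or "keep" for a scratch file.

    Assumption: age is counted in whole calendar days since creation.
    """
    age_days = (today - created).days
    if age_days >= PURGE_DAY:
        return "purge"
    if age_days in REMINDER_DAYS:
        return "remind"
    return "keep"


# Example: a file created on 1 March 2015, checked on three later dates.
for check_date in (date(2015, 3, 20), date(2015, 4, 15), date(2015, 4, 30)):
    print(check_date, scratch_action(date(2015, 3, 1), check_date))
```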

7. ICS-ACI Storage Models – II: ICS-ACI “Group” Model Plan Description

For users who require storage in addition to the 5 TB block of Group Storage that comes with the 3-, 4-, and 5-year ICS-ACI GReaT Plan

Provides flexibility for storage

Available for group access and may be segregated into sub-level block partitioning by the PI (e.g. Level 1 – PI; Level 2 – PI Labs)

Available to be configured as Dual Mount Group Storage* (e.g. mount externally as a drive on your computer and internally in the ICS-ACI)

*Minimum connectivity requirements apply to dual-mount, e.g. research network

Plan Specifics

Charge plan to grant or general funds

Available in blocks of 5 TB; comes with back-up/recovery

Current plans (Total Costs – F&A WAIVED)

$7,500 per 25 TB* block for 3 years (4-yr: $10,000; 5-yr: $12,500)

$1,500 per 5 TB block for 3 years (4-yr: $2,000; 5-yr: $2,500)

*Sub-level block partitioning and Dual Mount available on 25 TB blocks
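To show how the block pricing above translates into a total, here is a small calculator sketch. The prices are copied from the slide; the function name and the rule of rounding up to whole blocks are assumptions for illustration, and this is not the ICS calculator tool mentioned elsewhere in the deck.

```python
import math

# Total cost per block in dollars, keyed by block size in TB and term in years.
BLOCK_PRICES = {
    5: {3: 1500, 4: 2000, 5: 2500},
    25: {3: 7500, 4: 10000, 5: 12500},
}


def group_storage_cost(terabytes_needed, block_tb, term_years):
    """Return (number_of_blocks, total_cost) for a Group storage request.

    Assumption: storage is sold only in whole blocks, so the request is
    rounded up to the next multiple of the block size.
    """
    if block_tb not in BLOCK_PRICES:
        raise ValueError("block size must be 5 or 25 TB")
    prices = BLOCK_PRICES[block_tb]
    if term_years not in prices:
        raise ValueError("term must be 3, 4, or 5 years")
    blocks = math.ceil(terabytes_needed / block_tb)
    return blocks, blocks * prices[term_years]


# Example: 12 TB for 3 years in 5 TB blocks.
print(group_storage_cost(12, 5, 3))
```

Both block sizes work out to $100 per TB per year for every listed term, so the choice between them mainly affects whether sub-level partitioning and Dual Mount are available.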

8. Additional Info

8. Additional Information – I: Terms & Conditions Summary

All SLAs come with:

SLA cover sheet indicating services, capacity, and service levels

Terms and Conditions that must be agreed to (copies available)

Key Terms and Conditions (T&C): ICS-ACI is governed by the ICS-ACI Coordinating Committee

User accounts come with a common and discipline-specific software stack. Additional common software packages will be installed if they meet established parameters or are approved through a faculty-governed exception process.

PI-specific software applications, packages, and libraries will be provided, installed, maintained, and supported by the PI in accordance with vendor licensing agreements and export controls, as necessary. PIs must verify package safety.

Support via the newly created ICS-ACI Solutions and Knowledge (i-ASK) Center

New faculty-governed ICS-ACI policies (e.g., for User Accounts and Data Retention) are being developed and will be posted on the ICS website

ICS-ACI will maintain the systems as a highly available resource (97.8% uptime goal)

ICS-ACI will refresh system life components and software every 4.5 years (approx.) to ensure latest computer architectures and performance.

ICS-ACI will use best efforts to meet T&C except in cases of Force Majeure
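For a sense of scale, the 97.8% uptime goal above can be converted into an allowed-downtime budget. The slides do not state the measurement window, so the 365-day year and 30-day month used below are assumptions, and the function name is hypothetical.

```python
def allowed_downtime_hours(uptime_goal_percent, period_hours):
    """Hours of downtime permitted in a period while still meeting the goal."""
    if not 0 <= uptime_goal_percent <= 100:
        raise ValueError("uptime_goal_percent must be between 0 and 100")
    return (100 - uptime_goal_percent) / 100 * period_hours


# Assumed windows: one 365-day year and one 30-day month.
print(round(allowed_downtime_hours(97.8, 365 * 24), 2))  # about 192.72 hours per year
print(round(allowed_downtime_hours(97.8, 30 * 24), 2))   # about 15.84 hours per 30 days
```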

8. Additional Info – II: New Plans Under Development for 2015

Hosting Plan - This plan will allow PIs to place their own hardware that conforms to ICS-ACI specifications within ICS-ACI clusters, leveraging staff, networking, storage, power, and software.

Services Plan - This plan will allow PIs to purchase ICS-ACI technical services, such as programming and consulting.

Archival Plan - This plan will allow PIs to purchase ICS-ACI archival services for their research data. (Handles PI use cases that don’t fit other PSU services)

Cloud Burst Plan* - This plan will allow PIs to burst jobs to cloud providers, such as Amazon, through ICS-ACI.

User Training and Development

Proposal letters, templates, text and other tools (e.g. calculator)

* Availability - CY ‘16

8. Additional Info – III: System Specifications

Availability of Nodes Deployed in Early S’15

Plan | % of System | Total Nodes (by Core Type): Standard Memory | Total Nodes (by Core Type): High Memory
ICS-ACI GReaT (35% Cost Recovery, 65% Provost Funded) | 70-80% | 168-192 | 16.8-19.2
ICS-ACI Explore & OPEN (100% Provost Funded) | 20-30% | 48-72 | 4.8-7.2
Total | 100% | 240 | 24
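The node ranges in the table follow directly from applying each plan's share of the system to the node totals (240 standard-memory and 24 high-memory nodes). The helper below is a hypothetical sketch of that arithmetic; fractional results such as 16.8 are planning figures, not whole machines.

```python
def node_range(total_nodes, low_percent, high_percent):
    """Return the (low, high) node counts for a share given in percent."""
    return total_nodes * low_percent / 100, total_nodes * high_percent / 100


# GReaT plan: 70-80% of the system.
print(node_range(240, 70, 80))  # standard-memory nodes
print(node_range(24, 70, 80))   # high-memory nodes

# Explore & OPEN: 20-30% of the system.
print(node_range(240, 20, 30))
print(node_range(24, 20, 30))
```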

8. Additional Info – IV: SLA Roll-out

After a friendly test and use period in February/March, it is anticipated that the ICS-ACI computing resources will be available to the entire research community in April 2015. If you are in need of ICS-ACI resources and would like to discuss SLAs, please contact [email protected]

ICS is currently working on proposal letters, templates, text, and other tools, such as calculators, for Service Level Agreements (SLAs). If you are interested, please contact [email protected]

Questions and comments can be addressed to [email protected]

8. Additional Info – V: New ICS-ACI Plans Summary

Old RCC Plans: Under the former RCC research computing model, some favored users received a lot more than they paid for while others were treated very inequitably.

The New ICS-ACI Plans: (Please see SLA details at ics.psu.edu)

Developed to scale university-wide and serve new research communities, including those with data-intensive processing needs

New and explicit rules intended to provide much fairer access to computing resources

Allocation and priority are tied directly to core usage and Service Level Agreements

Guaranteed Queue, Burst Queue, and ICS-ACI OPEN Queue will be configured to ensure conformity to Service Level Agreements and provide equitable access

Realistic limits are imposed, but

Burst Capability is automatically provided

Major exceptions can be requested via an open and transparent exceptions process

Includes quick and automated smaller limit exceptions, which will speed up approvals

Plan specifications will be reviewed after 12 months’ experience

We realize this is a transition, and to ensure a smoother transition, we are happy to work with all PIs on a case-by-case basis