TRANSCRIPT
Service Level Agreements
Service Level Agreements (SLAs) Feb 2015 ICS-ACI
1
ICS-ACI Goals
Providing high quality advanced computing and storage services for research with:
Open and transparent policies
Clearly defined services, service levels and expectations
Uniform access across the university-wide research community including those with data-intensive processing needs
Measurement-based approach to managing “quality of service”
Faculty governance through ICS-ACI Coordinating Committee
Service Level Agreements (SLAs) are the instrument to explain these services to the Penn State community
Understand that this is a transition and work to ensure a smoother transition by working with PI's on a case-by-case basis, as needed
2
Penn State Model Aspects
Provost is making significant investments so that CyberInfrastructure (CI) services can scale university wide and advance our research mission
Model takes into account the entire CI environment which includes high performance computing resources, high-bandwidth redundant network, fast and flexible storage, a comprehensive software stack, and a variety of services including operations, back-up, technical consulting, and training
Model is meant to be sustainable and extensible with approx. 1/5 of the capacity fully subsidized by the provost for exploratory research and free open access
Of the remaining capacity, approx. 65% will be subsidized by the Provost and 35% cost-recovered through funds under faculty control
The underlying systems and services capacities are not static and can be expanded and refined through input from the ICS-ACI Coordinating Committee and university-wide Research CI Governance Structures
3
How ICS-ACI Faculty Governance Works
ICS-ACI Coordinating Committee consists of research faculty and IT staff with broad university representation to:
Serve as the forum to develop, review, and update ICS-ACI policies, processes, and effective delivery of services as the needs of the university and community change (e.g. the ICS-ACI SLAs will be reviewed at least once a year)
Obtain serious stakeholder input and guidance such that “best practices” for research cyberinfrastructure can be deployed
Create an exceptions process to handle policy-edge cases in a quick, efficient, open and uniform manner
Provide conduits to disseminate information, such as services, service levels and expectations, to the university research community
Serve as the mechanism to refer pressing or unresolved issues to the Research CI Governance Committee
4
Organization of this Document
1. Glossary and Definition of Terms – Pages 5-8
2. Overview of ICS-ACI computing System – Pages 9-10
3. High Level PI View – Pages 11-12
4. User Experience Batch Job Submission – Pages 13-17
5. ICS-ACI Models – Pages 18-21
6. Q&A About ICS-ACI Models – Pages 22-32
7. ICS-ACI Storage Models – Pages 32-35
8. Additional Information – Pages 36-41
5
1. Glossary and Definition of Terms
6
1. Glossary and Definition of Terms - I
ICS-ACI – Institute for CyberScience - Advanced Cyber Infrastructure
Service Level Agreement (SLA) - Agreement between ICS and Research PI in relation to research ICS-ACI resources, e.g. access, storage, computing
ACI-b – ICS-ACI sub-system configured to execute jobs submitted to a variety of queues, i.e. batch processing
ACI-u – ICS-ACI User specific “Development/Test” interactive subsystem where PI’s may specify a system configuration for user-specific interactive sessions including root access and user-defined software stack
ICS-ACI-Burst – Queue to allow usage of computing resources in the ACI-b subsystem above a PI’s physical allocation when needed for a short time period
ICS-ACI-Guaranteed – Queue providing access to the ACI-b subsystem within a guaranteed time, provided that the request is within a PI’s physical allocation
ICS-ACI-Open – Queue to provide user access to idle ACI-b computing resources during times when all jobs are running and idle resources remain available
Batch – Executing or processing of a series of programs (jobs) on a system without manual intervention
7
Core – Data processing unit within a server. The total number of cores per server is dependent upon the vendor’s architecture of the server.
Core Allocation – Amount of physical computing resources purchased by or granted to a user through ICS-ACI plans
F&A – Facilities and Administration charge, referred to as “indirect” or “overhead”
Force Majeure - unforeseeable circumstances that prevent fulfillment of a contract, (e.g. Natural Disaster, Fire)
GPFS – General Parallel File System
Group – A self-defined set of multiple users--for example students and researchers in a faculty member's lab. Such rights as access to storage and allocation of resources can be delegated in an organized fashion by the PI.
Group Storage – Dedicated disk space for storing group-related data or research
Guaranteed Response Time – The maximum time that it takes for a job to start execution after submission to a queue
Home Directory – A user’s dedicated disk space for storing personal files, directories and programs. Directory that a user is taken to after logging into the system.
Legacy Systems – Pre-2015 ICS computing systems, such as the Lion-X clusters
1. Glossary and Definition of Terms - II
8
Login Nodes – Front end servers used to login to the ICS-ACI computing system
NAS – Networked Attached Storage
PI – Principal Investigator. A person, such as a faculty member, who is authorized to direct all of his or her research ICS-ACI resources, e.g. access, storage, computing
Pre-emption – The act of pausing or stopping a currently processing job in order to fulfill obligations to other users under service level agreements
Scratch Directory – Disk space dedicated for temporary storage of data
System – The computing engine along with the software, storage, network, and peripheral devices that are necessary to make the computer function, e.g. ICS-ACI
Subsystem - A unit or device that is part of a larger system, e.g. ACI-b
User - A person, such as a student or faculty member, who has a User Account authorizing use of the ICS-ACI resources.
User Account - The authorization agreement under which a user is entitled to access a computing system. For details, see 'User Account Policies' on the ICS-ACI website.
Wall Time – A queue parameter that is set to define the maximum allowable execution time for a job once it has started
Work Directory – Disk space dedicated to a user for research data storage
1. Glossary and Definition of Terms - III
9
2. Overview of ICS-ACI computing System
10
“b”atch Systems (ACI-b) – Systems configured to execute jobs submitted to a variety of queues, i.e. batch processing.
“u”ser-specific “Development/Test” Interactive Systems (ACI-u) – Systems in which PI's may specify a system configuration for user-specific interactive sessions, including root access and user-defined software stacks.
“i”nteractive Systems (ACI-i) – Systems configured as a common GUI interactive system for testing, small jobs, and pre/post processing.
2. Overview of ICS-ACI computing System
[Diagram: login nodes connect users to the ACI-b, ACI-u, and ACI-i subsystems (standard memory, high memory, and interactive cores), which share GPFS, NAS, and tape storage, with links to the Legacy Systems and the PSU Research Network.]
11
3. High Level PI View
12
PI gets core allocation via one of two models (detailed later)
Explore model
GReaT subscription model
PI can flexibly choose how he/she wants to compute
Standard Memory Cores or Large Memory Cores
Batch Job Submission to 1 of 3 different queues that give users a choice depending on their time-criticality or usage-model needs.
• Guaranteed Response Time Queue
• Burst Queue
• Open Queue
Use Interactive computing resources
Enter the User-specified custom environment
Example: Big Data Use Case: PI’s can use the User-specified custom environment and/or large memory computing resources
PI empowered to:
Choose model and Define groups
3. High Level PI View
13
4. User Experience Batch Job Submission
14
4. User Experience – I: Batch Job Submission to the ACI-b Subsystem
User with a specified core allocation can choose to submit jobs to one or both of following queues that operate on the ACI-b subsystem:
GUARANTEED Queue - user submits jobs within their core allocation; the jobs will begin within a specified Guaranteed Response Time
BURST Queue – user submits jobs within 4x of their core allocation; the jobs will follow the “burst” allocation
If a user wants to submit jobs over 4x of his or her core allocation, the PI may request exceptions to these limits through a transparent faculty-governed exception process. The exception process, defined later, will use best efforts to resolve exceptions in a quick (sometimes automated), efficient, open and uniform manner
NOTE: NO wall-time limit on Guaranteed or Burst provided user has allocation remaining
Anyone with an account on ICS-ACI can submit jobs to the following queue that operates on the ACI-b subsystem:
ICS-ACI OPEN Queue - provides leveraged access to idle ACI-b computing resources during times when all jobs are running and idle resources remain available
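The queue-eligibility rules above can be summarized in a small routing sketch. This is a reading of this slide, not actual scheduler code; the function name is illustrative:

```python
# Sketch of the queue rules above: GUARANTEED requires the job to fit within
# the PI's core allocation, BURST allows up to 4x, and anything larger goes
# through the faculty-governed exception process.

def route_job(cores_requested: int, core_allocation: int) -> str:
    if cores_requested <= core_allocation:
        return "GUARANTEED"   # starts within the Guaranteed Response Time
    if cores_requested <= 4 * core_allocation:
        return "BURST"        # follows the "bursting" allocation rules
    return "EXCEPTION"        # request via the faculty-governed process

# For example, with a 20-core allocation:
for cores in (20, 40, 100):
    print(cores, route_job(cores, 20))
```

With a 20-core allocation this routes a 20-core job to GUARANTEED, a 40-core job to BURST, and a 100-core job to the exception process, matching the examples on the next slide.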
15
4. User Experience – II: Examples of ACI-b Batch Jobs
Assume a PI with a 20 core allocation
Example 1 - PI needs 20 cores for 720 hours – PI submits job to Guaranteed Queue and job will begin within a specified Guaranteed Response Time
Example 2 - PI needs to meet a proposal deadline and needs a job of 40 cores for 180 hours – PI submits job to Burst Queue and job will follow “bursting” allocations
Example 3 - PI has an unbalanced workflow and needs to submit jobs twice a month requiring 40 cores for 90 hours – PI submits jobs to Burst Queue and jobs will follow “bursting” allocations
Example 4 - PI needs 100 cores for 72 hours – Job requires an exception and will follow the exceptions process
16
4. User Experience - III: How Bursting Allocations Work
PI’s core allocation is converted into a 90-day Core Hour Allocation
Example – PI with 20 cores: 90-day Core Hour Allocation = 20 cores * 24 hours * 90 days = 43,200 core hours
Each PI has a sliding 90-day usage period
On day 1, PI has 100% of his/her Job Priority in the BURST Queue
As PI uses allocation, a “Daily Core Hours Used” total is calculated
Example – PI runs a job of 40 cores for 10 hours: Daily Core Hours Used = 40 cores * 10 hours = 400 core hours
PI’s Job Priority % is reduced based upon usage during the 90-day period; the first 1/3 of usage decays priority at a faster rate than the remaining 2/3
Once a PI has exhausted 100% of his or her 90-day Core Hour Allocation:
PI will not have any additional access to the GUARANTEED or BURST Queue
Start time based upon system availability (not guaranteed)
Capacity (max): 4x core allocation – but users can request exceptions to capacity limits through a transparent faculty-governed exception process
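The allocation arithmetic above can be sketched directly. This is a minimal illustration of the stated formulas, not an ICS-ACI tool:

```python
# Sketch of the 90-day core-hour accounting described above, using the
# stated formula: allocation = cores * 24 hours * 90 days.

def core_hour_allocation(cores: int, days: int = 90) -> int:
    """Convert a physical core allocation into a rolling core-hour budget."""
    return cores * 24 * days

def core_hours_used(cores: int, hours: float) -> float:
    """Core hours consumed by a single job."""
    return cores * hours

# Examples from this slide: a PI with 20 cores, and a 40-core, 10-hour job.
print(core_hour_allocation(20))   # 43200 core hours over 90 days
print(core_hours_used(40, 10))    # 400 core hours
```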
Day 1: PI has 100% of highest priority in the BURST Queue & PI submits a job that executes and consumes 1500 core-hours
1500 core-hours are reduced from his/her available allocation (9000-1500=7500) & new priority is re-calculated = 67%
Day 6: PI has 67% of highest priority in the BURST Queue & PI submits a job that executes and consumes 1500 core-hours
1500 core-hours are reduced from his/her available allocation (7500-1500=6000) & new priority is re-calculated = 33%
Day 11: PI has 33% of highest priority in the BURST Queue & PI submits a job that executes and consumes 3000 core-hours
3000 core-hours are reduced from his/her available allocation (6000-3000=3000) & new priority is re-calculated = 16%
Day 16: PI has 16% of highest priority in the BURST Queue & PI submits a job that executes and consumes 3000 core-hours
PI has consumed all of his/her 90-day allocation
He/She will not have any additional access to the GUARANTEED or BURST Queue (can use OPEN Queue)
Day 17-91: OPEN Queue only for compute (NOTE: Any compute in the OPEN Queue does NOT count toward allocation)
Day 92 (red bracket): “Day 1’s” 1500 compute hours “roll off”
1500 core-hours are added back into his/her available allocation (0+1500=1500) & new priority is re-calculated = 8%
Day 97 (green bracket): PI has 8% of highest priority in the BURST Queue and 0 jobs submitted; “Day 6’s” 1500 hours “roll off”
1500 core-hours are added back into his/her available allocation (1500+1500=3000) & new priority is re-calculated = 16%
Day 102 (purple bracket): PI has 16% of highest priority in the BURST Queue and 0 jobs submitted; “Day 11’s” 3000 hours “roll off”
3000 core-hours are added back into his/her available allocation (3000+3000=6000) & new priority is re-calculated = 33%
Day 107 (blue bracket): PI has 33% of highest priority in the BURST Queue and 0 jobs submitted; “Day 16’s” 3000 hours “roll off”
3000 core-hours are added back into his/her available allocation (6000+3000=9000) & new priority is re-calculated = 100%
NOTE: First 1/3 usage decreases job priority at a faster rate than the remaining 2/3 usage
4. User Experience - IV: Example – PI with 90-day Allocation of 9000 Core-hours using Bursting in Rolling 90-day Usage Period(s)
17
[Chart: available allocation and BURST Queue priority over the rolling 90-day usage period(s)]
Day 1: 100% priority | 9000 hours available | 1500 hours used
Day 6: 67% priority | 7500 hours available | 1500 hours used
Day 11: 33% priority | 6000 hours available | 3000 hours used
Day 16: 16% priority | 3000 hours available | 3000 hours used
Days 17-91: 0% priority | 0 hours available (compute thru ICS-ACI Open only) | 0 hours used
Day 92: 8% priority | 1500 hours available | 0 hours used
Day 97: 16% priority | 3000 hours available | 0 hours used
Day 102: 33% priority | 6000 hours available | 0 hours used
Day 107: 100% priority | 9000 hours available | 0 hours used
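The priority percentages in this example are consistent with one simple piecewise-linear decay curve. The sketch below is an inference from the slide's numbers, not an official ICS-ACI formula (the slide appears to round some values loosely, e.g. 16.7% to 16%):

```python
# One piecewise-linear priority curve consistent with the percentages in this
# example (100% -> 67% -> 33% -> 16% -> 8% -> 0%): priority decays twice as
# fast over the first 1/3 of usage as over the remaining 2/3. Inferred, not
# documented.

def burst_priority(used: float, allocation: float) -> float:
    """BURST Queue job priority (0-100%) for a given rolling 90-day usage."""
    f = min(used / allocation, 1.0)      # fraction of the 90-day budget consumed
    if f <= 1 / 3:
        return 100.0 * (1.0 - 2.0 * f)   # fast decay over the first third
    return 50.0 * (1.0 - f)              # slower decay over the remaining two thirds

# Walking through the example's 9000 core-hour budget:
for used in (0, 1500, 3000, 6000, 7500, 9000):
    print(f"{used:>5} used -> {burst_priority(used, 9000):.1f}% priority")
```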
18
5. ICS-ACI Models
19
5. ICS-ACI Models – I: ICS-ACI GReaT Model Plan Description
Provides access to the ACI-b and ACI-u systems within a guaranteed time while providing the flexibility to burst
For ACI-b systems, PI will be given a batch job queue for any PI-specified users. Submitted jobs will begin within a specified Guaranteed Response Time provided that the request is within a PI’s physical allocation. The current Guaranteed Response Time is 1 hour; however, quicker response times are expected.
For ACI-u systems, PI's may specify a system configuration for user-specific interactive sessions, including root access and user-defined software stacks. Sessions begin immediately provided the request does not exceed the PI’s core allocation.
Burst access available for PI's requiring capacity exceeding their core allocation
Plan Specifics
PI's charge core-years to grant and/or general funds (F&A WAIVED)
Available in units of 2 cores* (½ Node minimum)
3-, 4-, 5-year ICS-ACI GReaT plan – Acquire 2 cores, get 1 in ICS-ACI match. These plans include 5TB of group storage for the equivalent term at no additional cost.
1-, 2-year ICS-ACI GReaT plan – No (1-yr) / partial (2-yr) match, but lower cost per core*
* - ICS-ACI recommends purchasing full node access for potential increased performance
20
5. ICS-ACI Models – II: ICS-ACI GReaT Plan Cost Model
Notes: (F&A WAIVED)
(1) Current Node Config.: Standard Memory – 20 cores/node; High Memory – 40 cores/node
(2) Cores Available in units of 2 cores* (½ Node minimum purchase)
(3) Includes ICS-ACI match of 1 core per every 2 cores purchased (3, 4, and 5 year plans only) (2 year plan has a partial match)
(4) Plans are also available in (per core including match): 2-yr (Std-$330 / High-$630), 4-yr (Std-$528 / High-$1028), 5-yr (Std-$660 / High-$1260)
* - ICS-ACI recommends purchasing full node access for potential increased performance
Plan | Term | Node Type(1) | Effective Direct Cost(2,3) (per Core) | Effective Total Cost(3) for 1 current node(1) (total cores)
ICS-ACI GReaT(4) | 3-year | Std-Mem | $396 | $7,920 (20)
ICS-ACI GReaT(4) | 3-year | High-Mem | $756 | $30,240 (40)
ICS-ACI GReaT(4) | 1-year | Std-Mem | $198 | $3,960 (20)
ICS-ACI GReaT(4) | 1-year | High-Mem | $378 | $15,120 (40)
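As a rough illustration, the per-core figures in the table and in note (4) can be turned into a small cost estimator. This is a hypothetical helper, not an ICS-ACI calculator; the prices are transcribed from this slide:

```python
# Hypothetical estimator built from the per-core figures above (effective
# direct cost per core, match included where applicable; F&A waived).

EFFECTIVE_COST_PER_CORE = {
    # term in years: (standard memory, high memory)
    1: (198, 378),
    2: (330, 630),
    3: (396, 756),
    4: (528, 1028),
    5: (660, 1260),
}

def great_plan_cost(cores: int, term_years: int, high_memory: bool = False) -> int:
    """Total effective direct cost for a purchased core count and term."""
    if cores % 2:
        raise ValueError("cores are sold in units of 2 (half-node minimum)")
    std, high = EFFECTIVE_COST_PER_CORE[term_years]
    return cores * (high if high_memory else std)

print(great_plan_cost(20, 3))                    # full standard-memory node, 3 years: 7920
print(great_plan_cost(40, 3, high_memory=True))  # full high-memory node, 3 years: 30240
```

The two printed values reproduce the table's node totals of $7,920 and $30,240.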
21
5. ICS-ACI Models – III: ICS-ACI Explore Model Plan Description
Provides special access to the ACI-b and ACI-u systems for PI's who need specific, deterministic and guaranteed system capacity to conduct medium to large scale experiments/research
Examples – “proof-of-concept” for a proposal/paper, strategic exploratory research
Service levels are identical to the ICS-ACI GReaT model in terms of guarantees and bursting.
Small scale experiments/research are NOT suited for ICS-ACI Explore, however, PI's have access to the ICS-ACI-OPEN Queue for this research
Plan Specifics
No charge to PI's, 100% Provost-sponsored service
Allocation durations are in terms of days (30 day min. – 6 month max.)
Available in units of 2 cores (10 core minimum)
Allocations made upon approval through a simple faculty-governed application/review process (TBD – modeled after other PSU institutes)
22
6. Q&A About ICS-ACI Models
23
For use cases that go beyond ICS-ACI policies, there is a need for a process to resolve issues in a quick, efficient, open and uniform manner
ICS-ACI Exceptions Process Characteristics:
Exceptions classifications will set clear limits for various functional areas (e.g. wall time limits, bursting, usage, recurring user exception requests)
Classifications: Acceptable, Discussion, Negotiation, Unacceptable
“Exceptions-process” mailbox and log will document the exception and its classification along with any recommended solutions
As needed, a small ICS-ACI faculty exceptions committee, drawn from the ICS-ACI Coordinating Committee, will quickly review exceptions and extend resolutions
Exceptions mailbox and logs will be reviewed periodically with ICS-ACI faculty exceptions committee and ICS-ACI coordinating committee
Exceptions limits will be revisited and revised based upon operational experience
6. Questions – I: How does the ICS-ACI Exceptions Process Work?
24
Classification | Impact to the user community and/or systems under normal operations (Example) | Resolution Process
Acceptable | Minimal to no impact (one-time 8x burst request, 24-hr wall-time) | ICS-ACI system approves automatically/semi-automatically with documented action
Discussion | Moderate impact (multiple 20x burst requests, 96-hr wall-time) | ICS approves with documented reasoning
Negotiation | Major impact (one-time 100x burst request, 192-hr wall-time) | ICS makes resolution suggestions and sends to “faculty exceptions committee”; committee determines response
Unacceptable | Extensive impact (one-time 1000x burst request, 384-hr wall-time) | ICS denies request with documented reasoning
6. Questions – I (cont.): ICS-ACI Exceptions Process Classifications
25
6. Questions – II: How does suspending of jobs work?
Jobs begun in the GUARANTEED/BURST Queue and user-specific interactive sessions will never be suspended/killed*
Jobs begun in the OPEN Queue, which provides leveraged access to idle computing resources during times when all jobs are running and idle resources remain available, may be suspended if those resources are needed for pending jobs in the GUARANTEED Queue or pending requests for user-specific interactive sessions
Jobs suspended in the OPEN Queues will be resumed when resources become available and before new pending OPEN jobs begin
In extreme instances, multiple suspended jobs may leave systems in a disordered state; only at that point will these suspended jobs be killed for system stability
A killed job will automatically be resubmitted to its queue
* Exception – jobs that harm the system or other jobs (e.g. run-away jobs, compromised accounts)
ICS-ACI will be developing user training to show users:
How to use techniques, such as check-pointing, that save results periodically, reducing the impact of a killed job
How to create code that uses techniques that store the state and move partial results to storage, allowing correct resumption of jobs
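A minimal sketch of the check-pointing technique described above, with a toy workload; the file name and checkpoint interval are illustrative, not ICS-ACI conventions:

```python
# Toy job that saves its state periodically so a suspended or killed run can
# resume from the last checkpoint instead of starting over.
import json
import os

CHECKPOINT = "checkpoint.json"   # illustrative file name

def run_job(total_steps, checkpoint_every=100):
    """Run (or resume) a job, checkpointing every `checkpoint_every` steps."""
    start, partial_sum = 0, 0
    if os.path.exists(CHECKPOINT):           # resume from saved state if present
        with open(CHECKPOINT) as f:
            state = json.load(f)
        start, partial_sum = state["step"], state["partial_sum"]
    for step in range(start, total_steps):
        partial_sum += step                  # stand-in for one unit of real work
        if (step + 1) % checkpoint_every == 0:
            with open(CHECKPOINT, "w") as f: # save state so a restart can resume
                json.dump({"step": step + 1, "partial_sum": partial_sum}, f)
    return partial_sum

result = run_job(1000)   # a re-run after a kill resumes at the last checkpoint
```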
26
6. Questions – III: How do ICS-ACI GReaT Costs Compare?
(1) Direct Cost – F&A applied as specified
(2) Effective direct cost per node to the PI. The 3-year term is for the CI service, not the system life; components will be refreshed in approx. 4.5-year periods, which is factored into the 3-year service cost
(3) Amazon costs extrapolated from 1 year/3 year per 32 vCPU node (4:1 over-commit) to match 20 physical core node
(4) eBuy costs of a single rack-mount node server with equivalent specs
(5) Uncollected F&A with improper management of fabrication projects. Some PI's had a higher cost
3-Year Costs(1) Comparison per Computing Node
Computing Node Type | ICS-ACI GReaT(2) | Amazon(3) | PSU eBuy(4) | Former RCC(5)
Standard Memory (20 cores/node) | $7,920 | $10,060 | $10,216 | $6,800-$10,000(5)
High Memory (40 cores/node) | $30,240 | N/A | $42,000 | N/A
F&A WAIVED | Yes | No | No |
27
6. Questions – IV: How does the value of ICS-ACI GReaT Compare?
Plan Characteristics
ICS-ACI GReaT – Subscription model with guaranteed response times and bursting. Approx. 1/5 of system for no-cost computing. Offers: both Standard Memory (256GB) & High Memory (1TB) nodes w/traditional batch (HPC) and new user-specific interactive systems with virtualization capability; separate, expanded storage pools (GPFS and NFS); re-engineered for high performance; faster, more reliable networking; newly established i-ASK Center for support; coordinated user development, training & technical consulting
Amazon – Subscription model with guaranteed response time and bursting. Offers: only lower Standard Memory (60GB) nodes; no Infiniband for high performance parallel computing; storage external to PSU; limited support
PSU eBuy – Capital purchase model that allows the PI to keep the system for as long as it runs, but the PI has to include the personnel effort for system operations. PI has to purchase or get access to a chassis, rack, cabling, power, storage, back-up, infrastructure and network, interconnect (e.g. IB), software stack, and other services as needed
Former RCC – Shared capital purchase model with no guaranteed response times and unlimited bursting. No open service level agreements and usage models, which is inappropriate for a university-wide community of researchers; e.g., of the 4,000 current users, 4 used 10% of the system and 75 used 25% of the system. Inappropriate scheduling: jobs run with increased capabilities, such as wall-time, even after usage allocation is exhausted. Offers separate clusters with only lower Standard Memory (except Lion-XV) w/traditional batch (HPC) only; no high memory nodes for “big data” workloads; scratch and other storage pools in a single file system; ad-hoc user training. No additional charge if system ran for 4+ years
28
6. Questions – V: How does ICS-ACI GReaT Compare with Other Universities?
Comparison to a couple of institutions is shown in the next slide – NOTE: there are substantial variations in:
Levels of subsidy and services
System properties – including the ability to serve compute vs data intensive work loads
Access for exploratory research
Open access that leverages idle computing resources during times when all jobs are running and idle resources remain available
A thorough study will likely be undertaken by working groups through the university-wide Research CI Governance structure that has been recently approved by the Provost
Professors Rob Hume and Scott Bennett are chairing the committee to implement the university-wide Research CI Governance structure
Topic | Penn State | U of Michigan | Purdue
Model Type | Subscription | Subscription – Over-sold | Subscription
Std. Memory Node – 3-yr Costs (20 cores/node) (Shared Memory/node) | $7,920 (256 GB/Node) | $8,438 (80 GB/Node); $13,121 with F&A; $7,389 with college subsidy + F&A | $3,000 (64 GB/Node)
High Memory Node – 3-yr Costs (40 cores/node) (Shared Memory/node) | $30,240 (1 TB/Node) | $34,300 (1 TB/Node); $53,337 with F&A; $29,781 with college subsidy + F&A | By request
F&A charged | WAIVED | Yes – on external funds | Waived
Included in the Costs | Cores, Storage, Network, Backup, Infrastructure, All HW/SW, + Classroom support | Cores, Storage, Network, Electric, Infrastructure, HW-only support | Cores only
Time Period & Min. Size | 1-5 years, ½ Node min | 1 month, 4 Cores min | 5 years, 1 Node min
Guaranteed Response Time | Yes – 1 hr | No – tries to keep it short | Yes – 4 hr
Bursting | Yes – 4x + exception for larger, no limits on wall-time | No – but can buy more cores for bursting months | Yes – unlimited cores with a 4-hr wall-time
Fully Subsidized Access (Exploratory/Open) | Yes, approx. 1/5 of system | No | Yes – separate very small 256-core system, w/limits
Compute Type | Traditional HPC & new virtualized/interactive systems (cloud) | Traditional HPC only | Traditional HPC only
Storage (H-Home, W-Work, G-Group, S-Scratch); Costs for extra Storage | NFS, H-10GB, W-128GB, S-1M Files, G-5TB; Add’l $100/1TB per year | Additional, costs not known | Isilon, H-10GB, W/G-100GB; Add’l $150/1TB per year
Subsidy by University | 100% Explore/Open, 65% subscription | 75% (90% with college subsidy) | 90% or higher
30
6. Questions – VI: How do I monitor my usage?
ICS plans to launch a web portal to show PI's their system usage data
Broken down by Queue, Jobs, Core Hours, Configuration, and User
Access to remaining core hours and the sliding 90-day usage period
Annual and total usage
ICS plans to send PI's e-mail notifications
When usage hits 50%, 75%, and 100% for the 90-day usage period
Monthly report of usage similar to the data found in the Web Portal
31
6. Questions – VII: What if I don’t have a core allocation or have used all of my core hour allocation?
ICS-ACI OPEN Queue provides leveraged access to idle ACI-b computing resources during times when all jobs are running and idle resources remain available
Jobs and interactive sessions begun under the ICS-ACI Open plan are delivered through a shared allocation with these characteristics:
All users are at an equal priority for access to idle resources.
Jobs and interactive sessions are started and will continue to run only when a sufficient number of idle cores are available.
Jobs and interactive sessions may be suspended or killed when demand for resources allocated to other Queues exceeds the supply.
There are no guarantees on completion time.
ICS-ACI OPEN Queue Specifics
No charge to PI's, 100% Provost-sponsored service
ICS-ACI Open capacity (max per user account): 100 jobs pending in the shared queue, 20 cores executing jobs at a given point in time, 24-hour job wall-times, 24-hour interactive sessions
Users may request exceptions to these capacity limits through a transparent faculty-governed exception process
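The OPEN Queue preemption behavior described above could be sketched as a simple decision rule. This is illustrative only, not scheduler internals; the selection order (largest jobs first) is an assumption:

```python
# Hypothetical sketch of the OPEN Queue rule: OPEN jobs run only on idle
# cores and are suspended when demand from the GUARANTEED Queue (or pending
# interactive sessions) exceeds the idle supply.

def open_jobs_to_suspend(idle_cores, guaranteed_demand, open_jobs):
    """Return names of OPEN jobs (name, cores) to suspend to cover a shortfall.

    Suspends largest jobs first (an assumed policy) until enough cores free up.
    """
    shortfall = guaranteed_demand - idle_cores
    suspended = []
    for name, cores in sorted(open_jobs, key=lambda j: j[1], reverse=True):
        if shortfall <= 0:
            break
        suspended.append(name)
        shortfall -= cores
    return suspended

# 10 idle cores, 30 cores of GUARANTEED demand, three OPEN jobs running:
print(open_jobs_to_suspend(10, 30, [("a", 8), ("b", 16), ("c", 4)]))
```

With no shortfall, nothing is suspended; suspended jobs would be resumed when resources free up, before new OPEN jobs begin, as the slide above describes.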
32
6. Questions – VIII: What if I want to “Try-ACI”?
PI can obtain one-time trial access to the ACI-b and ACI-u systems to make sure that the computing environment meets his/her research goals.
Once granted, service levels will be identical to the ICS-ACI GReaT Plan
Plan Specifics
No charge to PI's, 100% Provost-sponsored service
PI should request the number of cores that he/she plans to request once the trial period expires
Approval based upon system availability and capacity limits
Capacity (max): 30 days from PI-indicated start time
Users may request exceptions to these capacity limits through a transparent faculty-governed exception process
33
7. ICS-ACI Storage Models
34
7. ICS-ACI Storage Models – I: Overview
By default, user accounts come with three storage areas, “Home”, “Work”, and “Scratch”, attached to the ICS-ACI cluster.
Group storage is available per the Group Storage Plans.
Capacity and capabilities of ICS-ACI storage
*ICS-ACI uses a high performance parallel GPFS scratch storage system that is available to each user of the cluster. Scratch space is intended for temporary data required between program runs. Files are not backed up and are non-recoverable, including after accidental deletion. The integrity of the scratch storage components is maintained via a redundant disk system. All efforts are made to maintain the integrity of the file system; however, there may be circumstances beyond our control that could result in the loss of data.
Removal Policy – files should be present for only 30 days from their creation date. Users with files older than 45 days will be sent reminders at 45, 52, and 59 days to move the data. Files existing 60 days beyond their creation date will be purged from the system.
If files are needed for longer than 30 days or require back-up, they should be placed in Group storage.
Storage Directory | Default Capacity | Capabilities
Home | 10GB | NFS with Backup/Recovery
Work | 128GB | NFS with Backup/Recovery
Scratch | 1 million files* | GPFS with no Backup*
Group | 5TB blocks | NFS with Backup/Recovery and dual mount capability
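The scratch removal timeline above could be sketched as a dry-run classifier. This is hypothetical, not an ICS-ACI tool; it uses modification time as a stand-in for creation date, which is not portably available:

```python
# Dry-run sketch of the scratch policy above: reminders from day 45 onward
# (sent at 45, 52, and 59 days), purge at 60 days.
import os
import time

REMINDER_DAYS = (45, 52, 59)   # reminder schedule from the policy above
PURGE_DAYS = 60

def classify(path, now=None):
    """Return 'ok', 'remind', or 'purge' for one scratch file."""
    now = time.time() if now is None else now
    age_days = (now - os.stat(path).st_mtime) / 86400   # mtime as age proxy
    if age_days >= PURGE_DAYS:
        return "purge"
    if age_days >= REMINDER_DAYS[0]:
        return "remind"
    return "ok"
```

A real sweep would walk the scratch file system, e-mail owners of "remind" files, and delete "purge" files; this sketch only makes the per-file decision.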
35
7. ICS-ACI Storage Models – II: ICS-ACI “Group” Model Plan Description
For users who require storage in addition to the 5TB block of Group Storage that comes with the 3-,4-, and 5-year ICS-ACI GReaT Plan
Provides flexibility for storage
Available for group access and may be segregated into sub-level block partitioning by the PI (e.g. Level 1 – PI; Level 2 – PI Labs)
Available to be configured as Dual Mount Group Storage* (e.g. mount externally as a drive on your computer and internally in the ICS-ACI)
*Minimum connectivity requirements apply to dual-mount, e.g. research network
Plan Specifics
Charge plan to grant or general funds
Available in blocks of 5TB – comes with back-up/recovery
Current plans (Total Costs – F&A WAIVED)
$7,500 per 25TB* block for 3 years (4-yr: $10,000; 5-yr: $12,500)
$1,500 per 5TB block for 3 years (4-yr: $2,000; 5-yr: $2,500)
*Sub-level block partitioning and Dual Mount available on 25TB blocks
36
8. Additional Info
37
8. Additional Information – I: Terms & Conditions Summary
All SLAs come with:
SLA cover sheet indicating services, capacity, and service levels
Terms and Conditions that must be agreed to (copies available)
Key Terms and Conditions (T&C): ICS-ACI is governed by the ICS-ACI Coordinating Committee
User Accounts come with a common and discipline-specific software stack. Additional common software packages will be installed if they meet established parameters or are approved through a faculty-governed exception process.
PI-specific software applications, packages and libraries will be provided, installed, maintained, and supported by the PI in accordance with vendor licensing agreements and export controls, as necessary. PI's must verify package safety.
Support via the newly created ICS-ACI Solutions and Knowledge (i-ASK) Center
New faculty-governed ICS-ACI policies (e.g., for User Accounts and Data Retention) are being developed and will be posted on the ICS website
ICS-ACI will maintain the systems as a highly available resource (97.8% uptime goal)
ICS-ACI will refresh system life components and software every 4.5 years (approx.) to ensure latest computer architectures and performance.
ICS-ACI will use best efforts to meet T&C except in cases of Force Majeure
38
8. Additional Info – II: New Plans Under Development for 2015
Hosting Plan - This plan will allow PI's to place their own hardware that
conforms to ICS-ACI specifications, within ICS-ACI clusters, leveraging staff, networking, storage, power and software.
Services Plan - This plan will allow PI's to purchase ICS-ACI technical
services, such as programming and consulting.
Archival Plan - This plan will allow PI's to purchase ICS-ACI archival services
for their research data. (Handles PI use cases that don’t fit other PSU services)
Cloud Burst Plan* - This plan will allow PI's to burst jobs to cloud providers,
such as Amazon, through ICS-ACI.
User Training and Development
Proposal letters, templates, text and other tools (e.g. calculator)
* Availability - CY ‘16
39
8. Additional Info – III: System Specifications
Availability of Nodes Deployed in Early S’15
Plan | % of System | Total Nodes – Standard Memory | Total Nodes – High Memory
ICS-ACI GReaT (35% Cost Recovery, 65% Provost Funded) | 70-80% | 168-192 | 16.8-19.2
ICS-ACI Explore & OPEN (100% Provost Funded) | 20-30% | 48-72 | 4.8-7.2
Total | 100% | 240 | 24
40
8. Additional Info IV – SLA Roll-out
After a friendly test and use period in February/March, it is anticipated that the ICS-ACI computing resources will be available to the entire research community in April 2015. If you are in need of ICS-ACI resources and would like to discuss SLAs, please contact [email protected]
ICS is currently working on proposal letters, templates, text and other tools, such as calculators, for Service Level Agreements (SLAs); if you are interested, please contact [email protected]
Questions and comments can be addressed to [email protected]
41
Old RCC Plans: Under the former RCC research computing model, some favored users received a lot more than they paid for while others were treated very inequitably.
The New ICS-ACI Plans: (Please see SLA details at ics.psu.edu)
Developed to scale university-wide and serve new research communities, including those with data-intensive processing needs
New and explicit rules intended to provide much fairer access to computing resources
Allocation and priority are tied directly to core usage and Service Level Agreements
Guaranteed Queue, Burst Queue and ICS-ACI OPEN Queue will be configured to ensure conformity to Service Level Agreements and provide equitable access
Realistic limits are imposed, but
Burst Capability is automatically provided
Major exceptions can be requested via an open and transparent exceptions process
Includes quick and automated smaller limit exceptions which will speed-up approvals
Plan specifications will be reviewed after 12 months’ experience
We realize this is a transition and, to ensure a smoother transition, we are happy to work with all PI’s on a case-by-case basis
8. Additional Info V – New ICS-ACI Plans Summary