CERN IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it
Sebastien Goasguen, Belmiro Rodrigues Moreira, Ewan Roche, Ulrich Schwickerath, Romain Wartel
HEPIX2010, Cornell University
Lessons learned from large scale LSF scalability tests
PES
Outline
Why we did it
How we did it
Lessons learned:
– While doing it
– From the actual measurements
Why scalability tests ?
Facts about the CERN batch farm lxbatch
Size:
~ 3,500 physical hosts
~ 40 hardware types
~ 25,000 CPU cores
~ 80 processing queues
~ 180,000 jobs/day
~ 22,000 jobs in total at any one time
~ 20,000 registered users
User jobs: Chaotic and unpredictable mixture of CPU or I/O intensive jobs
mostly independent sequential jobs, few multi-threaded
Currently supported operating systems: SLC5 (64bit) and SLC4 (32/64bit)
Why scalability tests ?
Growth rate over the past years
>50% dedicated
One LSF instance
Why scalability tests ?
LSF limits
Vendor information (Platform, private communication) (*):
– Communication layer: ~20,000 hosts
– Batch system layer: ~5,000-6,000 hosts
– Up to ~500,000 jobs
Previous LSF scalability tests at CERN:
– Focus on batch queries and jobs
– Focus on fail-over system layout
– Limitations on WN not testable
(*) mostly based on simulations
Why scalability tests ?
We are already seeing slower response times at our current scale when there are many concurrent queries
Virtualizing the batch worker nodes has a big impact on the number of resources the batch system needs to manage
– 8:1 ratio now, even more in the future?
The use of virtualization enabled us to test LSF up to a larger scale than ever before
– up to 16,000 worker nodes
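As a rough back-of-the-envelope illustration of that impact, the sketch below multiplies the farm size quoted earlier (~3,500 physical hosts) by the 8:1 ratio and compares the result with the vendor figures. It assumes every physical host is virtualized at the same ratio, which is a simplification made here and not a statement from the slides; all names are illustrative.

```python
# Back-of-the-envelope estimate. The inputs are the approximate figures
# quoted on the slides; uniform 8:1 virtualization is an assumption.
PHYSICAL_HOSTS = 3_500       # ~ size of lxbatch
VMS_PER_HOST = 8             # 8:1 ratio quoted for "now"
BATCH_LAYER_LIMIT = 6_000    # upper end of the ~5,000-6,000 vendor figure
COMM_LAYER_LIMIT = 20_000    # vendor figure for the communication layer


def virtual_worker_nodes(physical_hosts: int, vms_per_host: int) -> int:
    """Worker nodes the batch system would see if every host were virtualized."""
    return physical_hosts * vms_per_host


nodes = virtual_worker_nodes(PHYSICAL_HOSTS, VMS_PER_HOST)
print(f"virtual worker nodes: {nodes}")
print(f"exceeds batch system layer figure: {nodes > BATCH_LAYER_LIMIT}")
print(f"exceeds communication layer figure: {nodes > COMM_LAYER_LIMIT}")
```

Under these assumptions a fully virtualized farm would sit above both vendor figures, which is the motivation for testing at this scale.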
How we did it
Temporary access to 480 new physical worker nodes
2 x XEON L5520 @ 2.27GHz CPUs (2 x 4 cores)
24GB RAM
~1TB local disk space
HS06 rating ~97
SLC5 XEN (1 rack with KVM for testing)
Up to 44 VMs per hypervisor (on private IPs)
How we did it
By making use of our internal Cloud infrastructure !
Opportunity to test and probe in addition:
The general infrastructure
The dynamic extending and shrinking of the batch farm
The worker node behavior at large scale (software setup)
The image distribution system
The image provisioning systems
How we did it
Initial limitations to protect infrastructure
No lemon monitoring of the virtual machines
No AFS support in the virtual machines
– Rapid coming and going of clients, issue with callbacks
Use of ssh keys to access VMs instead of krb5
No attempt to quattor-manage individual VMs
Lessons learned
General infrastructure:
DNS update rate exhausted when registering VMs
Throttle the creation/destruction rate
High LDAP query rate
Fixed batch worker node tools (pool cleaner)
Switched off quota on the VMs
May need more LDAP servers in the future
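The idea of throttling the creation/destruction rate can be sketched as a simple pacing loop. This is a minimal illustration, not the mechanism actually used in the tests: the function name `throttled`, the rate of 10 per second, and the `register` stand-in are all hypothetical.

```python
import time
from typing import Callable, Iterable


def throttled(actions: Iterable[Callable[[], None]],
              max_per_second: float,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> int:
    """Run each action, keeping start times at least 1/max_per_second apart."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    min_interval = 1.0 / max_per_second
    next_allowed = clock()
    count = 0
    for action in actions:
        now = clock()
        if now < next_allowed:
            # Too early: wait until the next slot opens.
            sleep(next_allowed - now)
        action()
        count += 1
        next_allowed = max(now, next_allowed) + min_interval
    return count


if __name__ == "__main__":
    # Hypothetical stand-in for "create a VM and register it in DNS".
    def register(name: str) -> None:
        print("registering", name)

    throttled((lambda n=f"vm{i:05d}": register(n) for i in range(5)),
              max_per_second=10)
```

The sleep and clock functions are parameters so that the pacing can be checked without waiting in real time.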
Lessons learned
LSF itself …
Main issue: dynamic resizing of the farm
Close collaboration with Platform Computing
Tests beyond what is officially supported
Several iterations with bug fixes for
– mbatchd
– sbatchd
– lim
Lessons learned
LSF software: main issues found/fixed
mbatchd performance issue with dynamic hosts
Internal reconfigurations slow down the system
Some mbatchd crashes, specifically while removing hosts
Internal 9999 limit on number of hosts in use
Large memory footprint of mbatchd
Will hopefully be addressed in future LSF versions
Lessons learned
LSF master hardware upgrade
Started with 24GB RAM dual Nehalem CPU server
– mbatchd forking problems with >12,000 VMs
Replaced by 48GB RAM dual Nehalem CPU server
– Enough to reach ~16,900 VMs and ~400,000 jobs
Lessons learned
LSF scalability test layout: the measurement grid
0 – 400,000 jobs
50,000 jobs steps
Use of job arrays of 1,000 jobs each
Short micro benchmark jobs
Most tests done with dispatching enabled
0 – 15,000 worker nodes
Bin size ~2,500 worker nodes
Initially XEN, later XEN + KVM
ISF and ONE used as provisioning systems
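The job axis of this grid can be sketched in a few lines: the job levels at which measurements are taken, the number of 1,000-job arrays needed per 50,000-job step, and the corresponding bsub argument lists using the standard LSF job array syntax (-J "name[1-N]"). The job name prefix and the benchmark command are hypothetical placeholders; the slides do not say how the submissions were scripted.

```python
from typing import List

ARRAY_SIZE = 1_000   # jobs per job array (from the slide)
JOB_STEP = 50_000    # jobs added per measurement step (from the slide)
MAX_JOBS = 400_000   # upper end of the job axis (from the slide)


def job_levels(max_jobs: int = MAX_JOBS, step: int = JOB_STEP) -> List[int]:
    """Job counts at which a measurement is taken: 0, 50,000, ..., 400,000."""
    return list(range(0, max_jobs + 1, step))


def arrays_per_step(step: int = JOB_STEP, array_size: int = ARRAY_SIZE) -> int:
    """Number of job arrays needed to add one step's worth of jobs."""
    if step % array_size:
        raise ValueError("step must be a multiple of array_size")
    return step // array_size


def bsub_commands(step_index: int,
                  command: str = "./microbench.sh",  # hypothetical benchmark
                  prefix: str = "scaletest",         # hypothetical job name
                  ) -> List[List[str]]:
    """Build, but do not run, the bsub argument lists for one step."""
    spec = f"[1-{ARRAY_SIZE}]"
    return [["bsub", "-J", f"{prefix}_{step_index}_{k}{spec}", command]
            for k in range(arrays_per_step())]


if __name__ == "__main__":
    print(len(job_levels()), "job levels,", arrays_per_step(), "arrays per step")
    print(" ".join(bsub_commands(1)[0]))
```

Building the argument lists separately from running them keeps the sketch safe to execute on a machine without LSF installed.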
Lessons learned
LSF scalability test layout: probing the system
Data acquisition from a monitoring script which:
– Ran on the master and some clients in a loop
– Recorded the time stamp, the number of nodes and the number of jobs, the average LSF communication response time, and the average LSF batch command response time
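A probe of this kind can be sketched as follows. This is not the authors' script: the record layout, the function names, and the choice of three repeats are assumptions, and the node and job counts are passed in by the caller because the slides do not say how they were obtained. On an LSF client the timed commands would be ones such as lsid (communication layer) and bjobs (batch layer).

```python
import subprocess
import sys
import time
from statistics import mean
from typing import Dict, Sequence


def time_command(argv: Sequence[str]) -> float:
    """Return the wall-clock seconds one invocation of argv takes."""
    start = time.monotonic()
    # Output is discarded; only the elapsed time matters here.
    subprocess.run(list(argv), stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start


def average_response(argv: Sequence[str], repeats: int = 3) -> float:
    """Average the response time of argv over a few repeats."""
    return mean(time_command(argv) for _ in range(repeats))


def sample(nodes: int, jobs: int,
           comm_cmd: Sequence[str], batch_cmd: Sequence[str],
           repeats: int = 3) -> Dict[str, float]:
    """Build one record with the quantities listed on the slide."""
    return {
        "timestamp": time.time(),
        "nodes": nodes,
        "jobs": jobs,
        "comm_response_s": average_response(comm_cmd, repeats),
        "batch_response_s": average_response(batch_cmd, repeats),
    }


if __name__ == "__main__":
    # On an LSF client these would be commands such as ["lsid"] and ["bjobs"];
    # a no-op Python process stands in so the sketch runs anywhere.
    noop = [sys.executable, "-c", "pass"]
    print(sample(nodes=0, jobs=0, comm_cmd=noop, batch_cmd=noop, repeats=2))
```

Calling sample in a loop with a fixed sleep between iterations, and appending each record to a file, would give a time series of the kind the slides describe.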
Lessons learned
[Figure: monitoring time series of Nodes (axis 0 to 16,000), Total jobs (axis 0 to 500,000), Running jobs and Response time (axes 0 to 16,000 and 0 to 60)]
Lessons learned
LSF communication layer: lsid (client view)
Behaves as expected
Fast response time
Some increase above 12,500 VMs
Lessons learned
LSF communication layer: lshosts (client view)
Returns a list of hosts in the system
Behaves as expected
Small dependency on job number
Lessons learned
Batch system layer: job submission with bsub (client view)
Statistics + binning ?
T0 requires t<1s !!!
Issues >10,000 nodes
Lessons learned
Batch system layer: job queries (client view)
Increase with jobs
Increase with worker nodes
Lessons learned
Scheduling performance: average job slot usage
Above 10,000 nodes up to 4%-5% empty job slots
Ratio of running jobs to job slots
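The quantity plotted here, the ratio of running jobs to job slots, and the corresponding fraction of empty slots can be written down as a small sketch. The function names and the example numbers are illustrative only and are not measurements from the tests.

```python
def slot_usage(running_jobs: int, job_slots: int) -> float:
    """Ratio of running jobs to available job slots."""
    if job_slots <= 0:
        raise ValueError("job_slots must be positive")
    return running_jobs / job_slots


def free_fraction(running_jobs: int, job_slots: int) -> float:
    """Fraction of job slots left empty by the scheduler."""
    return 1.0 - slot_usage(running_jobs, job_slots)


if __name__ == "__main__":
    # Illustrative numbers: 96,000 running jobs on 100,000 slots.
    print(f"usage: {slot_usage(96_000, 100_000):.1%}")
    print(f"free:  {free_fraction(96_000, 100_000):.1%}")
```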
Scheduling performance
Ratio of running jobs to job slots
1% free
4% free
Broadening distributions
Summary
Thanks to our IaaS infrastructure we have brought LSF to its limits
– Up to 20x the number of jobs
– Up to 4-5x the number of worker nodes
The results are consistent with expectations based on the vendor's statements
The results have been made available to Platform and discussed with them in detail
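As a quick consistency check on the factors quoted in this summary, the sketch below divides the maxima reached in the tests by the production figures given at the start of the talk. Taking the production farm (~22,000 jobs at a time, ~3,500 physical hosts) as the baseline is an assumption; the slides do not state the baseline explicitly.

```python
# Approximate figures quoted on the slides.
PRODUCTION_JOBS = 22_000     # jobs in the production system at a time
PRODUCTION_HOSTS = 3_500     # physical hosts in lxbatch
TESTED_JOBS = 400_000        # jobs reached in the tests
TESTED_NODES = 16_900        # VMs reached in the tests


def scale_factor(tested: int, baseline: int) -> float:
    """How many times larger the tested value is than the baseline."""
    return tested / baseline


job_factor = scale_factor(TESTED_JOBS, PRODUCTION_JOBS)
node_factor = scale_factor(TESTED_NODES, PRODUCTION_HOSTS)
print(f"jobs:  {job_factor:.1f}x")
print(f"nodes: {node_factor:.1f}x")
```

With this baseline the job factor comes out a little under 20 and the node factor between 4 and 5, in line with the figures quoted above.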