Building Large Scale Fabrics – A Summary
Marcel Kunze, FZK
ACAT 2002, Moscow
Observation
- Everybody seems to need unprecedented amounts of CPU, disk, and network bandwidth
- Trend towards PC-based computing fabrics and commodity hardware
Reports covered in this summary:
- LCG (CERN), L. Robertson
- CDF (Fermilab), M. Neubauer
- D0 (Fermilab), I. Terekhov
- Belle (KEK), P. Krokovny
- HERA-B (DESY), J. Hernandez
- LIGO, P. Shawhan
- Virgo, D. Busculic
- AMS, A. Klimentov

Considerable cost savings with respect to a RISC-based farm, which offers not enough "bang for the buck" (M. Neubauer).
AMS02 Benchmarks
Execution time of the AMS "standard" job compared to CPU clock 1), normalized to the Intel PII 450 MHz reference:

Brand, CPU, Memory                              | OS / Compiler           | "Sim" | "Rec"
------------------------------------------------+-------------------------+-------+------
Intel PII dual-CPU 450 MHz, 512 MB RAM          | RH Linux 6.2 / gcc 2.95 | 1     | 1
Intel PIII dual-CPU 933 MHz, 512 MB RAM         | RH Linux 6.2 / gcc 2.95 | 0.54  | 0.54
Compaq quad α-ev67 600 MHz, 2 GB RAM            | RH Linux 6.2 / gcc 2.95 | 0.58  | 0.59
AMD Athlon 1.2 GHz, 256 MB RAM                  | RH Linux 6.2 / gcc 2.95 | 0.39  | 0.34
Intel Pentium IV 1.5 GHz, 256 MB RAM            | RH Linux 6.2 / gcc 2.95 | 0.44  | 0.58
Compaq dual-CPU PIV Xeon 1.7 GHz, 2 GB RAM      | RH Linux 6.2 / gcc 2.95 | 0.32  | 0.39
Compaq dual α-ev68 866 MHz, 2 GB RAM            | Tru64 Unix / cxx 6.2    | 0.23  | 0.25
Elonex Intel dual-CPU PIV Xeon 2 GHz, 1 GB RAM  | RH Linux 7.2 / gcc 2.95 | 0.29  | 0.35
AMD Athlon 1800MP dual-CPU 1.53 GHz, 1 GB RAM   | RH Linux 7.2 / gcc 2.95 | 0.24  | 0.23
8-CPU Sun Fire 880, 750 MHz, 8 GB RAM           | Solaris 5.8 / C++ 5.2   | 0.52  | 0.45
24-CPU Sun UltraSPARC-III+, 900 MHz, 96 GB RAM  | RH Linux 6.2 / gcc 2.95 | 0.43  | 0.39
Compaq α-ev68 dual 866 MHz, 2 GB RAM            | RH Linux 7.1 / gcc 2.95 | 0.22  | 0.23

1) V. Choutko, A. Klimentov, AMS note 2001-11-01
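Lower is faster here: each entry is the job's wall-clock time divided by that of the PII 450 MHz reference. A minimal sketch of that normalization, with made-up placeholder times rather than the measured ones:

```python
# Normalize wall-clock times of the AMS "standard" job to the reference
# machine (the PII 450 MHz row, defined as 1). The times below are
# hypothetical placeholders, not the measurements behind the table.

REFERENCE = "Intel PII dual-CPU 450 MHz"

sim_times = {  # hypothetical wall-clock seconds per job
    "Intel PII dual-CPU 450 MHz": 1000.0,
    "AMD Athlon 1800MP dual-CPU 1.53 GHz": 240.0,
    "Compaq alpha-ev68 dual 866 MHz": 220.0,
}

ref = sim_times[REFERENCE]
for machine, t in sorted(sim_times.items(), key=lambda kv: kv[1]):
    print(f"{machine:40s} {t / ref:.2f}")  # prints 1.00 for the reference
```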
Fabrics and Networks: Commodity Equipment
[Diagram: CERN wide-area connectivity: SWITCH, C-IXP, WHO, TEN-155, KPNQwest, RENATER and other national research networks, mission-oriented links & USLIC, public Internet, IN2P3, Japan (JEG/Genesis project)]
Needed for LHC at CERN in 2006:
Storage:
- Raw recording rate 0.1-1 GB/s
- Accumulating at 5-8 PetaBytes/year (rough consistency check below)
- 10 PetaBytes of disk
Processing:
- 200,000 of today's (2001) fastest PCs
Networks:
- 5-10 Gbps between main Grid nodes
- Distributed computing effort to avoid congestion: 1/3 at CERN, 2/3 elsewhere
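The recording rate and the yearly volume are consistent if one assumes a typical accelerator year of roughly 10^7 live seconds; that figure is my assumption, not a number from the talk:

```python
# Back-of-envelope check of the accumulation figure, assuming ~10^7
# live seconds per accelerator year (an assumption, not from the talk).
rate_gb_per_s = 0.6          # inside the quoted 0.1-1 GB/s range
live_seconds = 1e7
petabytes_per_year = rate_gb_per_s * live_seconds / 1e6   # 1 PB = 10^6 GB
print(f"~{petabytes_per_year:.0f} PB/year")   # ~6, matching 5-8 PB/year
```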
PC Cluster 5 (Belle)
- 1U servers, Pentium III 1.2 GHz
- 256 CPUs (128 nodes)
PC Cluster 6
- Blade server (3U): LP Pentium III 700 MHz
- 40 CPUs (40 nodes)
Disk Storage
IDE Performance
Basic Questions
- Compute farms contain several thousands of computing elements
- Storage farms contain thousands of disk drives

- How to build scalable systems?
- How to build reliable systems?
- How to operate and maintain large fabrics?
- How to recover from errors?

- EDG deals with the issue (P. Kunszt)
- IBM deals with the issue (N. Zheleznykh); Project Eliza: self-healing clusters
- Several ideas and tools are already on the market
Storage Scalability
- Difficult to scale up to systems of thousands of components while keeping a single system image: NFS automounter, symbolic links, etc. (M. Neubauer, CAF: ROOTD does not need this and allows direct worldwide access to distributed files without mounts; see the sketch below)
- Scalability in size and throughput by means of storage virtualisation; allows non-TCP/IP based systems to handle multi-GB/s
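To make the ROOTD point concrete: a client opens a remote file through a root:// URL and reads it over the network, with no mount on the client side. A minimal sketch using PyROOT; the server name and path are hypothetical:

```python
# Open a remote file served by rootd via a root:// URL; no NFS mount
# or symbolic-link scheme is needed on the client.
import ROOT

f = ROOT.TFile.Open("root://dataserver.example.org//data/run42/events.root")
if f and not f.IsZombie():
    f.ls()      # browse the remote file's contents over the network
    f.Close()
```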
Virtualisation of Storage
[Diagram: data servers mount virtual storage as SCSI devices over a Storage Area Network (FC-AL, InfiniBand, ...); an input load-balancing switch connects Internet/intranet clients; shared data access (Oracle, PROOF); scalable, 200 MB/s sustained]
Storage Elements (M. Gasthuber)
- PNFS = Perfectly Normal FileSystem: stores metadata with the data; 8 hierarchies of file tags (see the sketch below)
- Migration of data (hierarchical storage systems): dCache, a development of DESY and Fermilab
- ACLs, Kerberos, ROOT-aware
- Web monitoring
- Cached as well as direct tape access
- Fail-safe
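PNFS exposes that per-directory metadata through "magic" file names inside the mounted namespace. A minimal sketch; the mount point and tag name are illustrative, not from the talk:

```python
# Read PNFS directory tags through the "magic" file names of the
# mounted namespace. Mount point and tag name are hypothetical.
import os

directory = "/pnfs/example.org/data/hera-b"

# ".(tags)()" lists the tags defined for a directory
with open(os.path.join(directory, ".(tags)()")) as f:
    print(f.read())

# ".(tag)(<name>)" reads a single tag, here a storage group of the
# kind used to steer migration to tape
with open(os.path.join(directory, ".(tag)(sGroup)")) as f:
    print(f.read().strip())
```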
Necessary Admin Tools (A. Manabe)
- System (SW) installation/update: Dolly++ (image cloning)
- Configuration: Arusha (http://ark.sourceforge.net), LCFGng (http://www.lcfg.org)
- Status monitoring / system health check (CPU/memory/disk/network utilization): Ganglia (http://ganglia.sourceforge.net), Palantir (http://www.netsonde.com)
- (Sub-)system service sanity check: PIKT (http://pikt.org), Pica (http://pica.sourceforge.net/wtf.html), cfengine
- Command execution: WANI, a web-based remote command executor (see the sketch below)
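WANI itself is built on Webmin (next slide); the sketch below only illustrates the underlying pattern of fanning one command out to many nodes and collecting each host's output. Host names are hypothetical and password-less ssh is assumed:

```python
# Fan a command out to many nodes over ssh and collect per-host
# results, the core of what a remote command executor like WANI does.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:03d}.example.org" for i in range(200)]  # hypothetical

def run(host, command="uptime"):
    try:
        proc = subprocess.run(["ssh", host, command],
                              capture_output=True, text=True, timeout=30)
        return host, proc.returncode, proc.stdout.strip(), proc.stderr.strip()
    except subprocess.TimeoutExpired:
        return host, -1, "", "timeout"

with ThreadPoolExecutor(max_workers=32) as pool:
    for host, rc, out, err in pool.map(run, NODES):
        print(f"{host}: {'OK' if rc == 0 else 'FAIL'} {out or err}")
```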
WANI is implemented on the `Webmin' GUI.
[Screenshot: command input field, node selection, start button]
[Screenshot: command execution result page, showing host names and the results from 200 nodes on one page; clicking a host name opens its stdout or stderr output]
CPU Scalability
- The current tools scale up to ~1000 CPUs (in the previous example, 10,000 CPUs would require checking 50 pages of 200 nodes each)
- Autonomous operation required: intelligent self-healing clusters
Resource Scheduling
- Problem: how to access local resources from the Grid?
- Local batch queues vs. global batch queues
- Extension of Dynamite (University of Amsterdam) to work with Globus: Dynamite-G (I. Shoshmina); see the sketch below
- Open question: how do we deal with interactive applications on the Grid?
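For context on the local-vs-global queue bridging, Globus Toolkit 2 shipped the globus-job-run client, which hands a job to a gatekeeper whose jobmanager feeds the local batch system. A minimal sketch; the contact string is hypothetical, and this shows the generic Globus path rather than Dynamite-G itself:

```python
# Submit a trivial job through a Globus gatekeeper into a local batch
# queue (jobmanager-pbs). Contact string is hypothetical; this is not
# Dynamite-G, only the generic Globus mechanism it builds on.
import subprocess

contact = "gatekeeper.example.org/jobmanager-pbs"
result = subprocess.run(["globus-job-run", contact, "/bin/hostname"],
                        capture_output=True, text=True)
print(result.stdout.strip())
```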
Conclusions
- A lot of tools exist
- A lot of work still needs to be done in the fabric area to get reliable, scalable systems