An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed Systems
Iosif Legrand
California Institute of Technology
June 2007
The MonALISA Framework
An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed Systems
MonALISA is a dynamic, distributed service system capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems. The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications. An agent-based architecture provides the ability to invest the system with increasing degrees of intelligence, to reduce complexity, and to make global systems manageable in real time. For an effective use of distributed resources, these services provide adaptability and self-organization.
The MonALISA Architecture
Regional or Global High Level Services, Repositories & Clients
• Secure and reliable communication
• Dynamic load balancing
• Scalability & replication
• AAA for clients
Distributed dynamic registration and discovery, based on a lease mechanism and remote events
Diagram layers: HL services; Proxies; Agents; MonALISA services; Network of JINI-Lookup Services (Secure & Public)
Distributed system for gathering and analyzing information based on mobile agents: customized aggregation, triggers, actions
Fully Distributed System with no Single Point of Failure
MonALISA Service & Data Handling
Diagram labels: Data Store; Data Cache Service & DB; Postgres; Configuration Control (SSL); Predicates & Agents; Data (via ML Proxy); Applications, Clients or Higher Level Services; WS Clients and service; Web Service (WSDL, SOAP); Lookup Service (Registration, Discovery); Agents; Filters / Triggers
Monitoring Modules: collect any type of information; dynamic loading; push and pull
Monitoring Grid Sites, Running Jobs, Network Traffic, and Connectivity
Panels: Topology; Jobs; Accounting; Running Jobs
ApMon – Application Monitoring
Plot: MonALISA CPU usage (%) versus messages per second (0 to 6000)
Diagram labels: APPLICATION; ApMon; Monitoring Data (UDP/XDR); MonALISA Service
App. Monitoring: Mbps_out: 0.52; Status: reading; MB_inout: 562.4
App. Monitoring: parameter1: value; parameter2: value; ...
System Monitoring: load1: 0.24; processes: 97; pages_in: 83
Time; IP; procID
ApMon Config: MonALISA hosts; Config Servlet; dynamic reloading
ApMon configuration generated automatically by a servlet / CGI script
No lost packets
ApMon is a lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA Services:
• High communication performance
• Flexible
• Complete system monitoring
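The slide indicates that ApMon ships named parameters to a MonALISA service as XDR-encoded UDP datagrams. The sketch below illustrates that general idea only: the field order in `encode_params` is an assumption made for illustration and is not the actual ApMon wire format, and the helper names (`xdr_string`, `xdr_double`, `encode_params`, `send_params`) are hypothetical.

```python
import socket
import struct


def xdr_string(text: str) -> bytes:
    """Encode a string the XDR way: 4-byte big-endian length, bytes, zero padding to a multiple of 4."""
    raw = text.encode("utf-8")
    padding = (4 - len(raw) % 4) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * padding


def xdr_double(value: float) -> bytes:
    """Encode a float as an XDR double (8-byte big-endian IEEE 754)."""
    return struct.pack(">d", value)


def encode_params(cluster: str, node: str, params: dict) -> bytes:
    """Pack a cluster name, a node name and numeric parameters into one payload.

    Assumed layout (illustrative, not the ApMon format):
    cluster, node, parameter count, then (name, double) pairs.
    """
    payload = xdr_string(cluster) + xdr_string(node) + struct.pack(">I", len(params))
    for name, value in params.items():
        payload += xdr_string(name) + xdr_double(float(value))
    return payload


def send_params(host: str, port: int, payload: bytes) -> int:
    """Send the payload as a single UDP datagram and return the number of bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

A caller would build a payload with something like `encode_params("SystemMonitoring", "node01", {"load1": 0.24, "processes": 97})` and pass it to `send_params` together with the address of a collector. Because UDP is connectionless, the sender never blocks on the monitoring service, which fits the lightweight, low-overhead design the slide describes.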
Monitoring the Execution of Jobs and the Time Evolution
Panels: Split Jobs; Lifelines for Jobs; Submit a Job (DAG)
Diagram labels: Job; Job1; Job2; Job3; Job31; Job32
LISA - Localhost Information Service Agent
End User / Client Agent
• Authorization
• Service discovery
• Local detection of the hardware and software configuration
• Complete end-system monitoring: per-process load, I/O and network throughput, end-to-end performance measurements
• Acts as an active listener for all events related to the requests generated by its local applications
• Can execute agents to reconfigure the system and roll back
LISA – Network Monitoring
Network monitoring module: monitors the network performance of the local workstation (network interfaces, traffic patterns, TCP/IP stack parameters) and is also able to configure TCP/IP parameters to optimize networking performance. It can use IPERF, Web100 and other network monitoring tools.
LISA Provides an Efficient Integration for Distributed Systems and Applications
Diagram labels: LISA; Lookup Service (Discovery, Registration); Best Service; MonALISA; Application Service
• Uses external services to identify the real IP of the end system, its network ID and AS
• Discovers MonALISA services and can select, based on service attributes, different applications and their parameters (location, AS, functionality, load, …)
• Based on information such as AS number or location, determines a list of the best possible services
• Registers as a listener for other service attributes (e.g. number of connected clients)
• Continuously monitors the network connection with several selected services and provides the best one to be used from the client's perspective
• Measures network quality, detects faults and informs upper-layer services so they can take appropriate decisions
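The selection steps described above (shortlist by attributes such as AS number and connected clients, then pick the best by measured network quality) can be sketched as follows. This is an illustrative outline, not LISA's actual logic: the `ServiceInfo` fields, the ranking key and the use of round-trip time as the quality metric are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceInfo:
    name: str
    as_number: int            # autonomous system the service sits in
    clients: int              # number of currently connected clients
    rtt_ms: Optional[float]   # latest measured round-trip time; None if unreachable


def rank_services(services: List[ServiceInfo], client_as: int, shortlist: int = 3) -> List[ServiceInfo]:
    """Shortlist services, preferring the client's own AS, then fewer connected clients."""
    ordered = sorted(services, key=lambda s: (s.as_number != client_as, s.clients))
    return ordered[:shortlist]


def best_service(candidates: List[ServiceInfo]) -> Optional[ServiceInfo]:
    """Return the reachable candidate with the lowest measured RTT, or None if none is reachable."""
    reachable = [s for s in candidates if s.rtt_ms is not None]
    return min(reachable, key=lambda s: s.rtt_ms) if reachable else None
```

Re-running `best_service` on every new measurement is one simple way to switch a client away from a service whose connection degrades or fails.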
Monitoring the Internet2 Backbone Network
Test for a Land Speed Record: ~ 7 Gb/s in a single TCP stream from Geneva to Caltech
Monitoring USLHCNet
• Operations & management assisted by agent-based software
• Used on the new CIENA equipment used for network management
USLHCNet Integrated Traffic
The UltraLight Network
BNL ESnet IN /OUT
Monitoring The GLORIAD Ring
Monitoring Network Topology: Latency, Routers
Panels: Networks; AS; Routers
Available Bandwidth Measurements
Embedded Pathload module.
Monitoring and Controlling Optical Switches
Port power monitoring
Controlling
Monitoring Optical Switches
ALICE : Job status & traffic - real-time map
Local and Global Decision Framework
Based on monitoring information, actions can be taken in:
• ML Service
• ML Global Services / Repository
Actions can be triggered by:
• Values above/below given thresholds
• Absence/presence of values
• Correlation between multiple values
Decision types:
• Alerts
o e-mail
o Instant messaging
o RSS feeds
• External commands
• Event logging
• Global Optimization Services
Global ML Services: actions based on global information (global decisions)
ML Service: actions based on local information (local decisions)
Sensors
• Traffic • Connectivity • Jobs • Hosts • Apps
• Temperature • Humidity • A/C Power • …
Alerts and Actions
• MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage.
• ALICE production jobs queue is automatically kept full by automatic resubmission. Trigger: threshold on the number of aliprod waiting jobs.
• Administrators are kept up to date on the services' status. Trigger: presence/absence of monitored information.
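The two trigger kinds used in these examples, a threshold on a monitored value and the presence or absence of monitoring data, can be sketched as a small class. This is an illustrative outline only; the class name `ThresholdTrigger`, its parameters and the callback style are assumptions and do not reflect the MonALISA API.

```python
import time
from typing import Callable, Optional


class ThresholdTrigger:
    """Run an action when a value exceeds a threshold or when samples stop arriving."""

    def __init__(self, threshold: float, max_silence_s: float,
                 on_threshold: Callable[[float], None],
                 on_missing: Callable[[], None]) -> None:
        self.threshold = threshold
        self.max_silence_s = max_silence_s
        self.on_threshold = on_threshold
        self.on_missing = on_missing
        self.last_seen: Optional[float] = None  # timestamp of the latest sample

    def update(self, value: float, now: Optional[float] = None) -> None:
        """Record a new sample and run on_threshold if it is above the limit."""
        self.last_seen = time.time() if now is None else now
        if value > self.threshold:
            self.on_threshold(value)

    def check_presence(self, now: Optional[float] = None) -> None:
        """Run on_missing if no sample has arrived within max_silence_s seconds."""
        now = time.time() if now is None else now
        if self.last_seen is None or now - self.last_seen > self.max_silence_s:
            self.on_missing()
```

For the MySQL example, `on_threshold` would be a function that restarts the daemon once the VSZ sample exceeds the configured limit, while `on_missing` would send the alert to the administrators.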
Monitoring Video Conference System: Reflectors and Communication Topology
Creating a Dynamic, Global, Minimum Spanning Tree to optimize the connectivity
A weighted connected graph G = (V,E) with n vertices and m edges. The quality of connectivity between any two reflectors is measured every 2 s. A minimum spanning tree T is built in near real time, where the weight of a tree is
$$w(T) = \sum_{(u,v) \in T} w((u,v))$$
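As an illustration of building a minimum spanning tree from measured link weights, here is a minimal sketch using Kruskal's algorithm with a union-find structure. The slides do not say which algorithm MonALISA uses, so Kruskal is a stand-in chosen for brevity; the node names and weights are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[float, str, str]  # (weight, u, v), e.g. weight = measured link quality cost


def minimum_spanning_tree(nodes: List[str], edges: List[Edge]) -> List[Edge]:
    """Return the edges of a minimum spanning tree of a connected weighted graph (Kruskal)."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(x: str) -> str:
        # Follow parent links to the set representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree: List[Edge] = []
    for weight, u, v in sorted(edges):      # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


def tree_weight(tree: List[Edge]) -> float:
    """Total weight w(T): the sum of the weights of the edges in T."""
    return sum(weight for weight, _, _ in tree)
```

With measurements arriving every 2 s, recomputing the tree from scratch on each round is affordable for a modest number of reflectors, since Kruskal runs in O(m log m) time.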
EVO: LISA Detects the Best Reflector for each Client and MonALISA Agents keep the reflectors connected in a MST
• Dynamic discovery of reflectors
• Creates and maintains, in real time, the optimal connectivity between reflectors (MST) based on periodic network measurements
• Detects and monitors the user configuration, its hardware, the connectivity and its performance
• Dynamically connects the client to the best reflector
• Provides secure administration
• Uses alarm triggers to notify of unexpected events
“On-Demand”, Dynamic Path Allocation
Diagram labels: Internet; A; B; Optical Switch (TL1); OS Agent; MonALISA Service; MonALISA Distributed Service System; LISA Agent; APPLICATION; DATA; Monitor; Control; Active light path; Regular IP path; Real time monitoring
Command: >FDT A/fileX B/path/
Status messages: OS path available; Configuring interfaces; Starting Data Transfer
LISA sets up: network interfaces, TCP stack, kernel parameters, routes
LISA to APPLICATION: “use eth1.2, …”
Creates an end-to-end path in < 1 s
Detects errors and automatically recreates the path in less than the TCP timeout
FDT – Fast Data Transfer
FDT is an application for efficient data transfers. It is easy to use, written in Java, and runs on all major platforms. It is based on an asynchronous, multithreaded system that uses the NIO library and is able to:
• stream continuously a list of files
• use independent threads to read and write on each physical device
• transfer data in parallel on multiple TCP streams, when necessary
• use appropriate buffer sizes for disk I/O and networking
• resume a file transfer session
FDT – Fast Data Transfer
Pool of buffers Kernel Space
Pool of buffers Kernel Space
Data Transfer Sockets / Channels
Independent threads per device
Restore the files from buffers
Control connection / authorization
FDT – Memory to Memory Tests in WAN
CPUs: dual-core Intel Xeon @ 3.00 GHz, 4 GB RAM, 4 x 320 GB SATA disks, connected with 10 Gb/s Myricom
FDT – Memory to Memory Tests in WAN
C1-NY -> C1-GVA
FDT – Memory to Memory Tests in WAN
C1-NY <-> C1-GVA
Memory-to-memory transfers in WAN in both directions with two pairs of servers:
C1-NY -> C1-GVA
C2-NY <- C2-GVA
Disk-to-Disk transfers in WAN
C1-NY -> C1-GVA (plot: Page_IN, MB/s)
Reads and writes on 4 SATA disks in parallel on each server
Mean traffic ~ 210 MB/s, ~ 0.75 TB per hour
CERN -> CALTECH
Reads and writes on 2 RAID controllers in parallel on each server
Mean traffic ~ 545 MB/s, ~ 2 TB per hour
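The per-hour figures follow directly from the mean rates. A minimal check, assuming decimal units (1 TB = 1,000,000 MB); the function name `tb_per_hour` is introduced here only for illustration:

```python
def tb_per_hour(mb_per_s: float) -> float:
    """Convert a rate in MB/s to TB per hour, using decimal units (1 TB = 1,000,000 MB)."""
    seconds_per_hour = 3600
    return mb_per_s * seconds_per_hour / 1_000_000


# Mean rates quoted on the slide for the two disk-to-disk tests.
sata_rate = tb_per_hour(210)   # 4 SATA disks per server
raid_rate = tb_per_hour(545)   # 2 RAID controllers per server
```

This gives about 0.756 TB per hour for the SATA test and about 1.96 TB per hour for the RAID test, consistent with the rounded values of ~ 0.75 and ~ 2 TB per hour.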
Controlling Optical Planes: Automatic Path Recovery
Diagram labels: CERN (Geneva); CALTECH (Pasadena); Starlight; Manlan; USLHCnet; Internet2
• “Fiber cut” simulations
• The traffic moves from one transatlantic line to the other one
• FDT transfer (CERN – CALTECH) continues uninterrupted
• TCP fully recovers in ~ 20 s
Plot: FDT transfer with 4 fiber cut simulations (marked 1 to 4)
Bandwidth Challenge at SC2005
151 Gb/s
~ 500 TB Total in 4h
FDT Used at SC 2006
Entire management was done with LISA & MonALISA
SC2006
Panels: Official BWC; Hyper BWC
Data Collection and Interfacing with Other Tools
MonALISA is interfaced with many monitoring tools and is capable of collecting information from different applications:
• Computing nodes / farms (system information, network traffic, …): SNMP, Ganglia, dedicated scripts
• Routers, switches, optical switches: SNMP, NetFlow, sFlow, TL1, MRTG, WS
• End-to-end network performance: Pathload, IPERF, Pipes, Abing, ABping, …
• Batch queuing systems: LSF, PBS, Condor, NQS, Grid Job Manager
• Applications: Root, Xrootd, CRAB, RRD, VRVS/EVO, …
Communities using MonALISA
Major communities: ALICE, OSG, CMS, STAR, VRVS, LGC RUSSIA, SE Europe GRID, APAC Grid, UNAM Grid (Mx), ITU
ABILENE, ULTRALIGHT, GLORIAD, LHC Net, RoEduNET, Enlightened
Demonstrated at:
Telecom World
WSIS 2003
SC 2004
Internet2 2005
TERENA 2005
IGrid 2005
SC 2005
CHEP 2006
CENIC 2006 Innovation Award for High-Performance Applications
MonALISA Today
• Running 24 x 7 at ~340 sites
• Collecting ~ 1 000 000 parameters in near real time
• Update rate of 20,000 parameter updates per second
• Monitoring 12,000 computers
• > 100 WAN links
• Thousands of Grid jobs running concurrently
The MonALISA Architecture Provides:
• Distributed registration and discovery for services and applications.
• Monitoring of all aspects of complex systems:
o System information for computer nodes and clusters
o Network monitoring, topology, end-to-end performance
o Monitoring the performance of applications, jobs or services
o The end-user systems and their performance
o Environment; video streaming
• Can interact with any other services to provide, in near real time, customized information based on monitoring data
• Secure, remote administration for services and applications
• Agents to supervise applications, trigger alarms, restart or reconfigure them, and notify other services when certain conditions are detected.
• The MonALISA framework is used to develop higher-level decision services, implemented as a distributed network of communicating agents, to perform global optimization tasks.
• Graphical user interfaces to visualize complex information
http://monalisa.caltech.edu