GLORIAD's New Measurement and Monitoring System for Addressing Individual Customer-based Performance across a Global Network Fabric APAN Meeting January 16, 2013 Greg Cole Principal Investigator GLORIAD [email protected]

Page 1: GLORIAD's New Measurement and Monitoring System

GLORIAD's New Measurement and Monitoring System for Addressing Individual Customer-based Performance across a Global Network Fabric

APAN Meeting

January 16, 2013

Greg Cole, Principal Investigator, GLORIAD, [email protected]

Page 2: GLORIAD's New Measurement and Monitoring System

Presentation

During the past year, GLORIAD has been working on a new system for measuring and monitoring global network infrastructure, focused less on "links" and more on addressing the needs of individual users. To accomplish its goal of actively improving global infrastructure for individual customers, the new system is designed to:

(1) understand the network needs and requirements of a global customer base by actively studying utilization;
(2) identify poor performance of individual applications by constantly (and in near-real-time) analyzing information on such per-flow metrics as load, packet loss, jitter and routing asymmetries;
(3) mitigate poor performance of applications by identifying fabric weaknesses; and
(4) build richly visual analysis applications such as GLORIAD-Earth and the new GloTOP to help make sense of the enormous volume of data.

To realize this new model of measurement and monitoring (focused less on links and more on individual customers), GLORIAD has recently moved from its netflow-based system (used since 1998 and storing approximately 1 million records per day) to a new, much more detailed system – collecting, storing and analyzing 200-400 million network utilization records per day – based on deployment of open-source Argus software (www.qosient.com/argus). The talk will focus on the benefits and the technical challenges of this new and actively evolving work.
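To put these record volumes in perspective, here is a quick back-of-the-envelope calculation. It assumes records arrive evenly over a 24-hour day, which is a simplification (real traffic is bursty); the function name is ours, for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def records_per_second(records_per_day: int) -> float:
    """Average record rate, assuming uniform arrival over the day."""
    return records_per_day / SECONDS_PER_DAY


old_rate = records_per_second(1_000_000)      # netflow-based system
new_low = records_per_second(200_000_000)     # Argus-based system, low end
new_high = records_per_second(400_000_000)    # Argus-based system, high end

print(f"netflow-based: {old_rate:.1f} records/s")
print(f"Argus-based:   {new_low:.0f} to {new_high:.0f} records/s")
```

Under that assumption the old system averaged roughly 12 records per second, while the new one must sustain on the order of 2,300 to 4,600 records per second.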

Page 3: GLORIAD's New Measurement and Monitoring System

International infrastructure (circuits) ..

“No GLIF no GLORIAD”

Partners: SURFnet, NORDUnet, CSTnet (China), e-ARENA (Russia), KISTI (Korea), CANARIE (Canada), SingaREN, ENSTInet (Egypt), Tata Institute of Fundamental Research / Bangalore Science Community, NLR/Internet2/NASA/FedNets, CERN/LHC

Sponsors: US NSF ($18.5M 1998-2015), Tata ($6M), USAID ($3.5M 2011-2013), all international partners (~$240M 1998-2015)

History: 1994 US-Russia Friends and Partners; 1996 US-Russia Civic Networking; 1997 US-Russia MIRnet; 2004 GLORIAD; 2009 GLORIAD/Taj; 2011 GLORIAD/Africa

Page 4: GLORIAD's New Measurement and Monitoring System

Thank you, GLORIAD-US Team (Anita, Harika, Karen, Kim, Naveen, Predrag, Susie)

Page 5: GLORIAD's New Measurement and Monitoring System

GLORIAD Metrics

• Utilization
• Performance
• Operations
• Security

(“you can’t manage [or improve] what you can’t measure” – quoting a wise NSF program official)

(“it’s all about ‘situational awareness’ and instrumenting towards that goal”)

Page 6: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

• “Top Talkers”
• Protocol utilization
• Application identification/utilization
• Traffic Analysis (DNS, etc.)
• Real-time (alerts, etc.)
• Historical timeline analysis

Page 7: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring

[Figure: GLORIAD Utilization, 2004-2012, by Country Source of Traffic. Monthly series from 2004-01 through 2012-01; y-axis: Gigabytes (0 to 200,000). Legend: United States, China, Korea (South), Canada, Russian Federation, Switzerland, Germany, Singapore, Hong Kong, Great Britain (UK), Norway, Taiwan, France, Italy, Brazil, Netherlands, Romania, Sweden, Poland, Other.]

[Figure: GLORIAD/MirNET Traffic 1999-2003. Monthly series from 1999-06 through 2003-12; y-axis label: % of Traffic / Monthly. Legend: Russian Federation, United States, Germany, Switzerland, Taiwan, Israel, Great Britain (UK), Poland, Netherlands, France, Japan, Sweden, Other.]

Page 8: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

Thank you, CSTnet and KISTI (special thanks to Tong, Haina, Chunjing, Jiangning, Xiaodan, Gang, Lei, Hui, Dongkyun, Buseung)

Page 9: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

Page 10: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

Page 11: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

Page 12: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

Page 13: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

Page 14: GLORIAD's New Measurement and Monitoring System

Utilization Monitoring (in (near) real-time)

Page 15: GLORIAD's New Measurement and Monitoring System

Performance Monitoring (in (near) real-time)

Key theme: we want to address real performance needs *before* users have to figure out who in the world to call about their “bad connection”, or before they decide that the “R&E Internet” is not adequate to their needs; i.e., proactive (instead of reactive) performance mitigation.

Another theme: we want to develop tools, technologies and experience that can be used throughout the global network fabric (local, campus, regional, national, international). The real “home” for these tools will ultimately be the local network operators who live closest to the customers.

Page 16: GLORIAD's New Measurement and Monitoring System

New GloTop Application

Page 17: GLORIAD's New Measurement and Monitoring System

New dvNOC Application

Page 18: GLORIAD's New Measurement and Monitoring System

dvNOC System

• Joint effort by US, China, Korea, Nordic teams (and, now, new GLORIAD/Taj partners)
• Based on solid measurement infrastructure, information management and information sharing
• Fueled by the open-source Argus system of flow monitoring (5 second updates on all flows, 200-400 million flow-records/day; handles multi-G flow rates with room to spare)
• Focused on (1) understanding utilization, (2) improving performance systemically, (3) ensuring appropriate use, (4) distributing (decentralizing) operations and management of R&E networks

Page 19: GLORIAD's New Measurement and Monitoring System

Former Metrics Data Source

Page 20: GLORIAD's New Measurement and Monitoring System

“Taj” Measurement/Monitoring Update

Picture of the new GLORIAD/Taj “nprobe” network measurement device. Hardware: Dell PowerEdge R410 server with an 8-core Intel processor and a 10GE Intel fiber card (ixgbe driver). This is a network utilization and performance measurement box operating at 10G line speed, designed to improve and extend the open-source nprobe netflow emitter software so that it emits extended netflow records, including detailed information on packet retransmissions. Software base: Luca Deri’s nprobe.

The two screenshots above illustrate data generated from the Taj project’s new “nprobe” boxes deployed in Chicago and Seattle. The first illustrates top flows on the network; the second illustrates large flows suffering from poor performance (i.e., high packet retransmits). This data was formerly generated from GLORIAD’s Packeteer system (limited to 1 Gbps circuit capacity).

2012 Transition to Argus
http://www.qosient.com/argus/
We run Argus on top of Luca Deri’s PF_RING (and we’re also exploring FreeBSD’s netmap)

Page 21: GLORIAD's New Measurement and Monitoring System

Argus

Flexible open-source software packet sensors to generate network flow records at line rate, for operations, performance and security.

Comprehensive, not statistical, bi-directional, with many flow models allowing you to track any network traffic, not just 5-tuple IP traffic.

Support for large scale collection, data processing, storage and archiving, sharing, visualization, with analytics, aggregation, geospatial, netspatial analysis.
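To illustrate what “bi-directional” means for a flow record, the following is a minimal sketch, not Argus’s actual implementation. It assumes a simple 5-tuple flow model and a hypothetical `Packet` type; both directions of a conversation are folded into one record by ordering the two endpoints canonically.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):  # hypothetical packet summary, for illustration only
    src: str
    sport: int
    dst: str
    dport: int
    proto: str
    nbytes: int


def flow_key(p: Packet):
    """Direction-independent key: order the two endpoints canonically."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (p.proto, a, b) if a <= b else (p.proto, b, a)


def aggregate(packets):
    """Merge both directions of each conversation into a single record."""
    flows = defaultdict(lambda: {"initiator": None, "fwd_bytes": 0, "rev_bytes": 0})
    for p in packets:
        rec = flows[flow_key(p)]
        if rec["initiator"] is None:
            rec["initiator"] = (p.src, p.sport)  # first packet seen defines "forward"
        if (p.src, p.sport) == rec["initiator"]:
            rec["fwd_bytes"] += p.nbytes
        else:
            rec["rev_bytes"] += p.nbytes
    return dict(flows)


packets = [
    Packet("10.0.0.1", 40000, "192.0.2.7", 443, "tcp", 300),
    Packet("192.0.2.7", 443, "10.0.0.1", 40000, "tcp", 1500),
    Packet("10.0.0.1", 40000, "192.0.2.7", 443, "tcp", 200),
]
flows = aggregate(packets)
for key, rec in flows.items():
    print(key, rec)
```

A unidirectional exporter would report these three packets as two separate flows; keeping them in one record makes it possible to compare the two directions of the same conversation.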

Page 22: GLORIAD's New Measurement and Monitoring System

Argus (author: Carter Bullard)

Page 23: GLORIAD's New Measurement and Monitoring System
Page 24: GLORIAD's New Measurement and Monitoring System

Current GLORIAD-US Deployment of Argus

KNOXVILLE RADIUM SERVER

Apple Xserve -
1) Processors - 2 x 2.93GHz Quad-Core Intel Xeon
2) Memory - 24GB (6x4GB)
3) Hard drive - 1TB Serial ATA
4) OS - 10.8
Argus Analysis Tools (running on various (mostly apple)

SEATTLE ARGUS NODE

DELL R410 servers -
1) Processors - 2 x Intel Xeon X55670, 2.93GHz (Quad cores)
2) Memory - 8 GB (4 x 2GB) UDIMMs
3) Hard drive - 500GB SAS
4) Intel 82599EB 10G NIC
5) OS - CentOS 6
6) modified for PF_RING
7) running argus daemon sending data to radium server in Knoxville

Seattle Force-10 Router

10G SPAN port

CHICAGO ARGUS NODE

DELL R410 servers -
1) Processors - 2 x Intel Xeon X55670, 2.93GHz (Quad cores)
2) Memory - 8 GB (4 x 2GB) UDIMMs
3) Hard drive - 500GB SAS
4) Intel 82599EB 10G NIC
5) OS - CentOS 6
6) modified for PF_RING
7) running argus daemon sending data to radium server in Knoxville

Chicago Force-10 Router

10G SPAN port

Page 25: GLORIAD's New Measurement and Monitoring System

Near-future GLORIAD-US Deployment of Argus

SEATTLE ARGUS NODE

DELL R410 servers -
1) Processors - 2 x Intel Xeon X55670, 2.93GHz (Quad cores)
2) Memory - 8 GB (4 x 2GB) UDIMMs
3) Hard drive - 500GB SAS
4) Intel 82599EB 10G NIC
5) OS - CentOS 6
6) modified for PF_RING
7) running argus daemon sending data to radium server in Knoxville

Seattle Force-10 Router

10G SPAN port (crossed out: use taps instead)

• Local Storage
• Local Analysis Hardware
• Ability to handle much more capacity

CHICAGO ARGUS NODE

DELL R410 servers -
1) Processors - 2 x Intel Xeon X55670, 2.93GHz (Quad cores)
2) Memory - 8 GB (4 x 2GB) UDIMMs
3) Hard drive - 500GB SAS
4) Intel 82599EB 10G NIC
5) OS - CentOS 6
6) modified for PF_RING
7) running argus daemon sending data to radium server in Knoxville

Chicago Force-10 Router

10G SPAN port (crossed out: use taps instead)

• Local Storage
• Local Analysis Hardware
• Ability to handle much more capacity

KNOXVILLE RADIUM SERVER

Apple Xserve -
1) Processors - 2 x 2.93GHz Quad-Core Intel Xeon
2) Memory - 24GB (6x4GB)
3) Hard drive - 1TB Serial ATA
4) OS - 10.8
Argus Analysis Tools (running on various (mostly apple)

Big Farm of Cisco-provided Blade Servers

• Fast Analysis
• Parallel Database Architecture

Page 26: GLORIAD's New Measurement and Monitoring System

Why all this power?

• Preparing the data for this graph from a 250G argus archive (which helped a large international R&E network systemically address a huge performance problem) took me 3 days with our current setup

• We want any of our partners to be able to do this in 3 minutes (or less)
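The gap between 3 days and 3 minutes can be quantified with some simple arithmetic. This sketch assumes that “250G” means 250 gigabytes (decimal) and that the job reads the whole archive once; neither is stated on the slide.

```python
archive_gb = 250                     # assumed: 250 gigabytes, decimal units
current_minutes = 3 * 24 * 60        # 3 days with the current setup
target_minutes = 3                   # the stated goal

speedup = current_minutes / target_minutes
current_mb_per_s = archive_gb * 1000 / (current_minutes * 60)
target_mb_per_s = archive_gb * 1000 / (target_minutes * 60)

print(f"required speedup:     {speedup:.0f}x")
print(f"current scan rate:    {current_mb_per_s:.2f} MB/s")
print(f"target scan rate:     {target_mb_per_s:.0f} MB/s")
```

Under these assumptions the goal is a 1,440-fold speedup, from just under 1 MB/s of effective throughput to roughly 1.4 GB/s, which is the kind of rate that motivates parallel hardware.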

• We want “room” to better research the area of performance, operations and security analytics with our international partners

Page 27: GLORIAD's New Measurement and Monitoring System

But we’re still designing for lesser needs as well (targeting single 1G and 10G networks)

• Linux
• Mac OS X
• FreeBSD

Page 28: GLORIAD's New Measurement and Monitoring System

Current Process

[Diagram; components as labeled]

• Chicago Argus Node (3 Mbps stream)
• Seattle Argus Node (3 Mbps stream)
• Knoxville Radium Server (chained radium servers)
• Additional “ad hoc” processing
• racluster process
• MySQL archive of top users (10 seconds)
• Live Apps (glo-earth, glo-top, dvnoc)
• rastream process (5 minutes)
• Disk archive (~300 million recs / day)
• MySQL archive (1.8 million recs / day)
• Analysis Applications
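The “top users (10 seconds)” stage amounts to grouping flow records into short time windows and ranking senders by volume. The sketch below is a stand-in for that kind of aggregation, not the racluster tool itself; the record layout `(timestamp, source, bytes)` is an assumption made for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 10  # matches the 10-second interval shown in the diagram


def top_talkers(records, n=3, window=WINDOW_SECONDS):
    """Rank senders by bytes within each fixed time window.

    records: iterable of (timestamp_seconds, src, nbytes) tuples.
    Returns {window_start: [(src, total_bytes), ...]} with at most n entries each.
    """
    bins = defaultdict(Counter)
    for ts, src, nbytes in records:
        window_start = int(ts // window) * window
        bins[window_start][src] += nbytes
    return {start: counts.most_common(n) for start, counts in sorted(bins.items())}


records = [
    (0.5, "a", 100), (3.0, "b", 700), (9.9, "a", 900),
    (10.0, "c", 50), (12.0, "b", 60),
]
result = top_talkers(records, n=2)
for start, ranking in result.items():
    print(start, ranking)
```

Each window's ranking is small enough to write to a database row by row, which is one plausible reason to keep only top users in the database while the full record stream goes to disk.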

Page 29: GLORIAD's New Measurement and Monitoring System

New Process

[Diagram; components as labeled]

• Chicago Argus Node (3 Mbps stream)
• Seattle Argus Node (3 Mbps stream)
• Knoxville Radium Server (chained radium servers)
• Additional “ad hoc” processing
• racluster process
• MySQL archive of top users (10 seconds)
• Live Apps (glo-earth, glo-top, dvnoc)
• rastream process (5 minutes)
• Disk archive (~300 million recs / day)
• MySQL archive (1.8 million recs / day)
• Analysis Applications

Page 30: GLORIAD's New Measurement and Monitoring System

New Process (Dec/2012-Jan/2013)

32 core Cisco Blade Server (freeBSD) with 128G RAM, 5T RAID storage

“Farm” of Perl/POE/IKC Daemons: Near-Realtime Analytics and Local Storage of Data
• “Top Users”
• DNS Analysis
• Bad Performers
• Link Analytics
• BGP Analysis
• ICMP Analysis
• Scan Analysis
• ...

Argus Nodes (for GLORIAD currently, Chicago and Seattle)

Argus Data (from Argus Nodes to a Core Radium Collector)

User Tools for Analysis, Operational Support and Visualization
• dvNOC
• GloTOP
• GLOEarth
• Ticketing System
• NOC Access
• ...

Page 31: GLORIAD's New Measurement and Monitoring System

More detail ..

“Farm” of Perl/POE/IKC Daemons: Near-Realtime Analytics and Local Storage of Data
(“Top Users”, DNS Analysis, Bad Performers, Link Analytics, BGP Analysis, ICMP Analysis, Scan Analysis, ...)

• Perl POE event-loop, event-driven programming for “cooperative multi-tasking”

• IKC for inter-kernel communications between “animals”

• Daemonized (fast)

• Use MySQL (or any other) for long-term storage; SQLite for local (fast) in-memory database

• Each “animal” on the “farm” is autonomous and very specialized

• Most read from a single argus RABINS stream

User Tools for Analysis and Visualization
(dvNOC, GloTOP, GLOEarth, Web Reports, NOC Access, ...)

• Built with Runrev LiveCode

• Multi-platform (Mac, Windows, Linux, iOS, Android)

• Event-driven, graphic/media rich applications
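The “farm” idea, several small specialized workers cooperatively sharing one input stream, can be sketched as follows. This uses Python’s asyncio as a stand-in for Perl POE and is not the GLORIAD code; the worker names, record fields and the 5% retransmit threshold are all assumptions made for illustration.

```python
import asyncio
from collections import Counter


async def top_users(queue, totals):
    """Specialized worker: accumulate bytes per source address."""
    while (rec := await queue.get()) is not None:
        totals[rec["src"]] += rec["bytes"]


async def bad_performers(queue, flagged, threshold=0.05):
    """Specialized worker: remember sources whose retransmit ratio is high."""
    while (rec := await queue.get()) is not None:
        if rec["pkts"] and rec["retrans"] / rec["pkts"] > threshold:
            flagged.append(rec["src"])


async def fan_out(stream, queues):
    """Deliver every record of the single input stream to every worker."""
    for rec in stream:
        for q in queues:
            await q.put(rec)
    for q in queues:
        await q.put(None)  # sentinel: end of stream


async def run_farm(stream):
    q1, q2 = asyncio.Queue(), asyncio.Queue()
    totals, flagged = Counter(), []
    # All three coroutines share one event loop and yield to each other at each await.
    await asyncio.gather(
        fan_out(stream, [q1, q2]),
        top_users(q1, totals),
        bad_performers(q2, flagged),
    )
    return totals, flagged


stream = [
    {"src": "x.x.3.226", "bytes": 5000, "pkts": 100, "retrans": 12},
    {"src": "x.x.9.1", "bytes": 900, "pkts": 50, "retrans": 0},
    {"src": "x.x.3.226", "bytes": 1000, "pkts": 20, "retrans": 0},
]
talkers, slow = asyncio.run(run_farm(stream))
print(talkers.most_common(), slow)
```

Because each worker only sees its own queue and keeps its own state, a new analysis can be added without touching the existing ones, which is the property the “autonomous and very specialized” bullet describes.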

Page 32: GLORIAD's New Measurement and Monitoring System

All of the software, tools, data specifications, etc. are being “GitHub’d”: it is the right thing to do (Argus, Perl, MySQL and SQLite are all open), and we want people to help us.

Page 33: GLORIAD's New Measurement and Monitoring System

GLORIAD github

Page 34: GLORIAD's New Measurement and Monitoring System

“Operationalizing” this Data

Page 35: GLORIAD's New Measurement and Monitoring System

New dvNOC Application

Page 36: GLORIAD's New Measurement and Monitoring System

“REQUEST TRACKER” FED BY DATA FROM MONITORING SYSTEMS

HOME http://

Page 37: GLORIAD's New Measurement and Monitoring System

Poor-Performance Analysis

Page 38: GLORIAD's New Measurement and Monitoring System

Active monitoring system - My TraceRoute (MTR)

Chicago
Source x.x.3.226
Destination x.x.244.210

Harika Tandra, GLORIAD

Page 39: GLORIAD's New Measurement and Monitoring System

Active monitoring system

• For each under-performing flow identified, MTR runs are triggered to the source and destination IPs

• Runs are triggered in near-real-time as the flow is detected, so the test packets experience network conditions similar to those seen by the real traffic

• Combining the two runs gives approximate end-to-end performance
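The triggering logic described above can be sketched roughly as follows. The thresholds, field names and example addresses are assumptions, not values from the talk; the sketch only builds the MTR command lines (using the standard `--report`, `--no-dns` and `--report-cycles` options) and does not execute them.

```python
import shlex

RETRANS_THRESHOLD = 0.02   # assumed cutoff for "under-performing"
MIN_BYTES = 1_000_000      # assumed: only large flows are considered


def is_underperforming(flow):
    """Flag large flows whose retransmit ratio exceeds the assumed cutoff."""
    if flow["bytes"] < MIN_BYTES or flow["pkts"] == 0:
        return False
    return flow["retrans"] / flow["pkts"] > RETRANS_THRESHOLD


def mtr_commands(flow, cycles=10):
    """One report-mode MTR run toward each end of the flow."""
    return [
        ["mtr", "--report", "--no-dns", "--report-cycles", str(cycles), ip]
        for ip in (flow["src"], flow["dst"])
    ]


flows = [  # documentation-range addresses, purely illustrative
    {"src": "192.0.2.10", "dst": "198.51.100.20",
     "bytes": 50_000_000, "pkts": 40_000, "retrans": 2_000},
    {"src": "192.0.2.11", "dst": "198.51.100.21",
     "bytes": 80_000_000, "pkts": 60_000, "retrans": 30},
]

commands = [cmd for f in flows if is_underperforming(f) for cmd in mtr_commands(f)]
for cmd in commands:
    print(shlex.join(cmd))
```

In a live system each command list could be handed to `subprocess.run` as soon as the flow is flagged, so the probes share the network conditions of the traffic that triggered them.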

Harika Tandra, GLORIAD

Page 40: GLORIAD's New Measurement and Monitoring System

Example network graphs for a few end hosts in U.S.

Harika Tandra, GLORIAD

Page 41: GLORIAD's New Measurement and Monitoring System

Example network graphs for a few end hosts in China

Representation:
• Graph node - router in paths discovered by MTR.
• Rectangular node - the end host.
• Node label:
  • 1st line - value of cost function
  • 2nd line - IP (anonymized)
  • 3rd line - Avg. % packet loss at the node.
• Color map ranges from yellow through orange to red; this graph is color mapped based on the ‘Avg. % packet loss’ value.
• Edge labels: ‘A-B’ where
  • A => Total number of MTR runs through the parent to child node.
  • B => Number of runs in which there was non-zero packet loss.
• Gray nodes are nodes which saw no packet loss.
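The representation above could be derived from raw MTR runs along these lines. This is a sketch under stated assumptions: each run is a list of `(hop, loss_percent)` pairs ordered from monitor to end host, loss for an edge is judged at the child hop, and the 20% color ceiling is arbitrary. The cost function on the first label line is not described in the talk and is omitted.

```python
from collections import defaultdict


def build_graph(runs):
    """Aggregate MTR runs into edge counts and per-node average loss.

    Returns (edges, avg_loss) where edges[(parent, child)] = [A, B]:
    A = total runs through that edge, B = runs with non-zero loss at the child.
    """
    edges = defaultdict(lambda: [0, 0])
    samples = defaultdict(list)
    for hops in runs:
        for ip, loss in hops:
            samples[ip].append(loss)
        for (parent, _), (child, child_loss) in zip(hops, hops[1:]):
            edges[(parent, child)][0] += 1
            if child_loss > 0:
                edges[(parent, child)][1] += 1
    avg_loss = {ip: sum(v) / len(v) for ip, v in samples.items()}
    return dict(edges), avg_loss


def color(avg_loss_pct, max_pct=20.0):
    """Gray for no loss, otherwise yellow -> orange -> red (assumed scale)."""
    if avg_loss_pct == 0:
        return "gray"
    frac = min(avg_loss_pct / max_pct, 1.0)
    return ("yellow", "orange", "red")[min(int(frac * 3), 2)]


runs = [
    [("r1", 0.0), ("r2", 0.0), ("host", 0.0)],
    [("r1", 0.0), ("r2", 10.0), ("host", 10.0)],
]
edges, avg_loss = build_graph(runs)
for (parent, child), (total, lossy) in edges.items():
    print(f"{parent} -> {child}: label '{total}-{lossy}'")
for ip, loss in avg_loss.items():
    print(f"{ip}: avg loss {loss:.1f}% -> {color(loss)}")
```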

Harika Tandra, GLORIAD

Page 42: GLORIAD's New Measurement and Monitoring System

Data Model

Use MySQL (partly for benefit of using the myisam heap tables (fast)) but also now using SQLite for local autonomous analysis engines (very fast, especially with :memory: tables)

Use BerkeleyDB for some things (tying perl hashes to disk data stores)

Two large databases

pflow - primary IP addresses, AS numbers, domains, all large flows (1998 - current, ~1.4 billion records), support tables (IP mapping tables, country codes, world regions, science disciplines, protocols, services, etc.)

summary - various tables to enable fast search/retrieval of flow information

Experimenting with Argus rasql tools (powerful)

Using Argus ralabel (with geoip) for live labeling of all flow updates with country codes, asnums, lat/long, etc.

Looking at hadoop, others for parallel capabilities
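As an illustration of the SQLite `:memory:` approach mentioned in this data model, here is a minimal sketch using Python’s standard `sqlite3` module. The table and column names are hypothetical and are not the pflow or summary schema; the rows are invented, with the country column standing in for a label already attached to each flow.

```python
import sqlite3

# An in-memory database lives only as long as the connection: no disk I/O.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE flows (
           ts INTEGER, src TEXT, dst TEXT, country TEXT,
           bytes INTEGER, pkts INTEGER, retrans INTEGER)"""
)

rows = [  # hypothetical, already-labeled flow updates
    (0, "x.x.3.226", "x.x.244.210", "US", 5000, 100, 12),
    (5, "x.x.9.1", "x.x.244.210", "CN", 900, 50, 0),
    (7, "x.x.3.226", "x.x.8.8", "US", 1000, 20, 0),
]
conn.executemany("INSERT INTO flows VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

by_country = conn.execute(
    """SELECT country, SUM(bytes) AS total
       FROM flows GROUP BY country ORDER BY total DESC"""
).fetchall()
print(by_country)
```

A local analysis engine can keep a short rolling window in such a table and periodically write summaries to the long-term database, which keeps the fast path entirely in memory.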

Page 43: GLORIAD's New Measurement and Monitoring System

Key Database Management Tables

Page 44: GLORIAD's New Measurement and Monitoring System

Summary: Core Technologies

• Argus as passive monitor (formerly Packeteer and then nprobe) running on top of pf_ring (or FreeBSD’s netmap)
• MTR as active monitor
• MySQL as underlying database (exploring alternatives now) along with SQLite and BerkeleyDB
• RunRev’s LiveCode for front-end client development (we formerly used Flash) (someday this should be html5 apps (?))
• Perl/POE/IKC for back-end “cooperative multitasking” server

Page 45: GLORIAD's New Measurement and Monitoring System

Summary

• Work builds on efforts since 1999
• Argus offers a *lot* of advantages over netflow or sflow
• Data management problem *is* solvable
• We hope to encourage an open, global community effort to deploy common standards and tools addressing metrics for R&E network performance, operations and security