better management of large-scale, heterogeneous networks · telemetry challenges snmp object...
TRANSCRIPT
![Page 1: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/1.jpg)
Better management of large-scale, heterogeneous networks toward a programmable management plane
Joshua George, Anees Shaikh Google Network Operations
www.openconfig.net
![Page 2: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/2.jpg)
Management plane challenges
Rethinking telemetry -- efficient, large-scale monitoring
OpenConfig -- community-driven API development 3
2
1
Agenda
2
![Page 3: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/3.jpg)
Management Plane Challenges
3
![Page 4: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/4.jpg)
Challenges of managing a large-scale network
● more than 8M OIDs collected every 5 minutes
● more than 20K CLI commands issued and scraped every 5 minutes
● many tools, and multiple generations of software
Opportunity for significant OPEX savings: reduced outage impact, simplification of management stack, better scaling
4
● 20+ network device roles
● more than half dozen vendors, multiple platforms
● 4M lines of configuration files
● up to ~30K configuration changes per month
![Page 5: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/5.jpg)
Management plane is way behind
● proprietary CLIs, lots of scripts
● imperative, incremental configuration
● lack of abstractions
● configuration scraping from devices
● SNMP monitoring -- not always “simple” and not often scalable
5
![Page 6: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/6.jpg)
Configuration
• describes configuration data structure and content
• common modeling language: YANG
• multiple data encodings: protobuf, XML, JSON, ...
Topology
• describes structure of the network
• common modeling language: multiple
• data encoding: protobuf, ...
Telemetry
• describes monitoring data structure and attributes
• common modeling language: exploring YANG
• data delivery: RPC, protobuf inside UDP
Model-driven network management
6
![Page 7: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/7.jpg)
Rethinking Network Telemetry
7
![Page 8: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/8.jpg)
Telemetry solutions today What do we use? Often SNMP is the default choice. ● legacy implementations -- designed for limited processing and
bandwidth
● expensive discoverability -- re-walk MIBs to discover new elements
● no capability advertisement -- test OIDs to determine support
● rigid structure -- limited extensibility to add new data
● proprietary data -- require vendor-specific mappings and multiple requests to reassemble data
● protocol stagnation -- no absorption of current data modeling and transmission techniques
8
![Page 9: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/9.jpg)
Telemetry challenges
● SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms, expected to grow 3x over
next 2 generations o similar for object collection frequency
● Future devices continue to grow in density and drive this trend o scale limitations in data acquisition at high frequencies
● Near-real-time acquisition and access to monitoring data is a requirement for <insert buzzword here> o traffic management, tight control loops, fast recovery
9
![Page 10: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/10.jpg)
I get it, you really don’t like SNMP…
10
but do you have a better idea?
![Page 11: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/11.jpg)
Rethinking telemetry...reverse the flow ○ stream data continuously --
with incremental updates based on subscriptions
○ observe network state through a time-series data stream
○ devices programmed with a data model describing desired structure and content
○ efficient, secure transport protocols
11
![Page 12: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/12.jpg)
Telemetry framework requirements ● network elements stream data to collectors (push model)
● data populated based on vendor-neutral models whenever possible
● utilize a publish/subscribe API to select desired data
● scale for next 10 years of density growth with high data freshness
o other protocols distribute load to hardware, so should telemetry
● utilize modern transport mechanisms with active development communities
o gRPC (HTTP/2), Thrift, etc.
o protocol buffer over UDP
12
![Page 13: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/13.jpg)
Example telemetry configuration flow
13
Network Management System
Structured inventory and set of telemetry capabilities pushed to the NMS
Rules for typical monitoring
Tactical overrides from operators
Generate monitoring configuration
Publish to network element
gRPC endpoint
![Page 14: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/14.jpg)
Example telemetry data flow
14
Collection Infrastructure
Send statistics every X seconds
Asynchronous event reporting
Operator requests for ad-hoc data.
3 types of telemetry events: ● Bulk time series data
○ All interface stats every 10 seconds.
● Event/edge driven updates ○ LSP A is now down.
● Operator request/response ○ Show me oper state for all
interfaces.
gRPC endpoint
![Page 15: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/15.jpg)
Practical realization
● Streaming telemetry is beyond an idea stage, but is far from a final product
● Multiple vendor implementations now available for experimentation
● Development is ongoing -- now is the time to share your requirements and make your voice heard !
15
![Page 16: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/16.jpg)
OpenConfig
16
![Page 17: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/17.jpg)
OpenConfig ● Informal industry collaboration of network operators
● Focus: define vendor-neutral configuration and operational state models based on real operations o Adopted YANG data modeling language (RFC 6020)
● Participants: Apple, AT&T, BT, Comcast, Cox, Facebook, Google Level3, Microsoft, Verizon, Yahoo!
● Primary output is model code, published as open source via public github repo
● Ongoing interactions with standards and open source communities (e.g., IETF, ONF, ODL, ONOS)
17
![Page 18: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/18.jpg)
Example configuration pipeline
configuration data vendor-neutral, validated
devices
18
OC YANG models
configuration generation
gRPC req
operators
intent API
“drain peering link”
update topology model
gRPC endpoint
![Page 19: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/19.jpg)
Extending OpenConfig models
● base OpenConfig model as a starting point
● vendors can offer augmentations / deviations
● operators can add locally consumed extensions
base model
X vendor modifications
local modifications
extended model
19
![Page 20: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/20.jpg)
OpenConfig releases and roadmap Data models (configuration and operational state) ● BGP and routing policy
o multiple vendor implementations in progress ● MPLS / TE consolidated model
o RSVP / TE and segment routing model as initial focus
● design patterns for operational state and model composition ● tools for translating YANG models to usable code artifacts
o e.g., pyangbind
20
Models in progress ● interfaces, system, optical transport, ...
![Page 21: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/21.jpg)
Models must be composed to be useful
● model composition framework is critical missing piece from existing model-building efforts
21
![Page 22: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/22.jpg)
Modeling operational state
22
Types of operational state data ● derived, negotiated, set by a protocol, etc. (negotiated BGP hold-time) ● operational state data for counters or statistics (interface counters) ● operational state data representing intended configuration (actual vs.
configured)
Clear benefits from using YANG to model both configuration and operational state in the same data model
● but … YANG focus has primarily been config, NETCONF-centric, lack of common conventions
![Page 23: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/23.jpg)
Summary ● New networking paradigms like SDN focus mostly on control
o it’s time for the management plane to join the age of SDN
● Core principles: o model-driven management o streaming telemetry to scale monitoring and improve freshness o vendor-neutral, extensible APIs for managing devices
● Architecture and emerging vendor implementations of multi-mode telemetry solutions
● OpenConfig is a focused effort by operators to develop vendor-neutral models to define management APIs
23
Operators: get involved and push your vendors for support on your gear!
![Page 24: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/24.jpg)
thank you !
![Page 25: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/25.jpg)
gRPC: multi-platform RPC framework gRPC features ● load-balancing, app-level flow control, call-cancellation ● serialization with protobuf (efficient wire encoding) ● multi-platform, many supported languages ● open source, under active development
gRPC leverages HTTP/2 as its transport layer ● binary framing, header compression ● bidirectional streams, server push support ● connection multiplexing across requests and streams
25
http://grpc.io
![Page 26: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/26.jpg)
Additional “observations” ● YANG and NETCONF should be decoupled -- each are
independently useful
● YANG needs to evolve more rapidly at this early phase, stabilize as real usage increases
● current YANG model versioning is not helpful -- treat models like software artifacts, not dated documents
● current standard models should be open for revisiting and revising
● should not rush to standardize more models until they are deployed and used in production
26 these are not necessarily OpenConfig consensus views
![Page 27: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/27.jpg)
Current OpenConfig “process”
● initial models developed by OpenConfig
● extensive collaboration with vendors
● leverage existing work where possible
● publish models and docs
27
![Page 28: Better management of large-scale, heterogeneous networks · Telemetry challenges SNMP object collection growing with each platform generation o e.g., 100K objects on current platforms,](https://reader034.vdocuments.mx/reader034/viewer/2022042117/5e9593660b5c56015047b7cf/html5/thumbnails/28.jpg)
Intent-based configuration flow abstract configuration models
Config Model
Topology Model
configuration intent operators
declarative API configuration flow
configuration pusher NETCONF, RESTCONF, JSON-RPC, ...
...
authoritative config store
config generation
device-level configuration
standard models
vendor-neutral configuration models
generated configuration instances
config generation
authoritative config store application
NB APIs
Network OS
SB protocols
analagous SDN stack
28