HFR - TAG High Availability
Ravi Narayanan ([email protected])
February 2002
© 2002, Cisco Systems, Inc. www.cisco.com
Cisco HFR Goal - High Availability
Goal: Non-Stop Availability (5-9’s or Greater Availability)
What customers require:
Quick Recovery from defects,
High MTBF, Low MTTR/DPM,
Built in Redundancy
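The relationship between these metrics can be sketched in a few lines. This is a minimal illustration, not taken from the deck: it assumes the usual steady-state formula availability = MTBF / (MTBF + MTTR), and it treats DPM as unavailability scaled to parts per million, which is an assumption about how the metric is defined here. At five nines the downtime budget is a little over five minutes per year.

```python
# Sketch only: the formulas are standard steady-state approximations,
# and the DPM definition used here is an assumption.
HOURS_PER_YEAR = 365.25 * 24  # average year, including leap years


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


def defects_per_million(avail):
    """Unavailability expressed in parts per million (assumed DPM definition)."""
    return (1.0 - avail) * 1_000_000


if __name__ == "__main__":
    five_nines = 0.99999
    print(round(downtime_minutes_per_year(five_nines), 2), "minutes of downtime per year")
    print(round(defects_per_million(five_nines), 1), "DPM")
    # A hypothetical unit with 100,000 h MTBF needs about a 1 h MTTR to reach five nines.
    print(availability(100_000, 1.0) > five_nines)
```

The last line shows why low MTTR matters as much as high MTBF: the same MTBF with a longer repair time falls below the target.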
Cisco HFR: A Five Nines Capable Router
• Architecture
– Hardware
– Software
• Development Process
• Test Process
• Accounting, Logging & Alarms
• Conclusion
Hardware Architecture
• Apply Prior Experience
• No Single Points of Failure
• Hardware Non Stop Forwarding (NSF)
• Automated Fault Injection
• Verify Architecture with Modeling
Apply Prior Experience
• ATM Switch Products
– Large Customer Frame Relay Network
– Many Years Measuring Availability
• GSR
– Now resets at RP/LC level (HFR provides finer granularity at component level)
– Routing NSF Developments Started
No Single Points of Failure
• Redundancy
– Active Standby
* (D) RP, SC
– Loadsharing
* Fabric, Power, Cooling, Management Interconnect (out of band ethernet 1:1)
– Port Protection (Linecards/PLIMs)
• No outage on Upgrade of Fabric
• Graceful Degradation of Fabric
System Control Network
[Figure: System Control Network. Labels: two Gig EtherSwitches (GE, optional 10G); two LC chassis, each with two LCs and two RPs; one fabric chassis with two S2 cards and two SCs; FE links.]
Graceful Degradation
[Figure: Graceful Degradation. OC192 line cards connect to fabric planes labeled 8 of 8, 2 of 8, and 1 of 8; each plane has S1, S2, and S3 stages.]
Hardware Non Stop Forwarding
• Reset Strategy
– Entire Board
– Individual Components on a Board
– CAM (HW forwarding database) Not reset unless desired
• Forwarding Strategy
– Metro - 176 PPEs forwarding using CAM
LC NSF Strategy
[Figure: LC NSF Strategy. Labels: PLU, TLU, STATS, DISTRIB, MUX, PPE0 ... PPE175, TCAM.]
Automated Fault Injection
• Designed into Hardware ASICs up front
• Makes testing easier and more complete
• Off the shelf parts must have mechanism for injection
• System Test and Reliability tests use automated fault insertion testing mechanisms
• Fault insertion testing at all stages
– Bring up, Design Verification, component test, system test
• Ability to test multiple failure scenarios - in hardware & software
Verify Architecture With Modeling
• Early modeling influenced architecture
– Memory soft error rates -> ECC
– Optical error rates -> FEC (Reed-Solomon)
– Board level MTBF >= 100,000 hours is a Cisco requirement
• Parts count model
– Telcordia TR-332 standards, close vendor interaction
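A parts count model adds up the failure rates of every component on a board and inverts the total to get the board MTBF. The sketch below shows only that core arithmetic. The part names, FIT values, and quantities are hypothetical, and the quality, stress, and temperature adjustments that a real prediction applies are left out.

```python
# Sketch of a parts count MTBF estimate. All part names and FIT values
# below are hypothetical placeholders, not data from the deck.
# FIT = failures per 10^9 device hours.
FIT_HOURS = 1e9


def board_mtbf_hours(parts):
    """parts maps a part name to (fit_per_part, quantity)."""
    total_fit = sum(fit * qty for fit, qty in parts.values())
    if total_fit <= 0:
        raise ValueError("total failure rate must be positive")
    return FIT_HOURS / total_fit


if __name__ == "__main__":
    hypothetical_board = {
        "asic": (1500, 2),
        "dram": (50, 40),
        "optics": (2000, 1),
        "passives": (10, 300),
    }
    mtbf = board_mtbf_hours(hypothetical_board)
    print(f"Estimated board MTBF: {mtbf:,.0f} hours")
    print("Meets 100,000 hour requirement:", mtbf >= 100_000)
```

Because the rates simply add, the model makes it easy to see which part family dominates the total and where redundancy or a better part pays off.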
Software Architecture
• Protected Memory Microkernel
• Separation of Control and Data Plane
• Software Non Stop Forwarding
• Scalable Distributed System
• Health Monitoring
• No Outage on upgrades - Packaging and Release Strategy
Protected Memory Microkernel
• Every Process Has a Private Address Space - contains faults
• Enables Process Restartability
• Enables Board Failover
• Enables Hitless Software Upgrade
1:1 Card Redundancy
[Figure: 1:1 Card Redundancy. Labels: Card 1, Card 2, Active Logical Slot 1, Standby Logical Slot 1, “Active” Processes and “Standby” Processes (Process A, B, and C on each card), Checkpointing between each process pair.]
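The checkpointing relationship in 1:1 card redundancy can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, state is a plain dictionary, and checkpoints are written synchronously on every update, which the deck does not specify.

```python
# Sketch of active/standby checkpointing. Names and the synchronous
# checkpoint policy are assumptions made for illustration.
class Process:
    def __init__(self, name):
        self.name = name
        self.state = {}


class RedundantPair:
    """Two cards that run the same set of processes, one active and one standby."""

    def __init__(self, process_names):
        self.active = {n: Process(n) for n in process_names}
        self.standby = {n: Process(n) for n in process_names}

    def update(self, name, key, value):
        # The active process changes its state, then checkpoints the change
        # to its peer on the standby card.
        self.active[name].state[key] = value
        self.standby[name].state[key] = value

    def switchover(self):
        # The standby card takes over using the state it already holds.
        self.active, self.standby = self.standby, self.active


if __name__ == "__main__":
    pair = RedundantPair(["A", "B", "C"])
    pair.update("B", "adjacencies", 12)
    pair.switchover()
    print(pair.active["B"].state)
```

Because the standby already holds the checkpointed state, the switchover itself is only a role change and no state has to be rebuilt from scratch.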
Active / Standby Switchover
[Figure: Active / Standby Switchover. Numbered steps 1 to 14 between Card 1 and Card 2. Labels: Process A, Process B, Process C, Process B’, System Mgr, RedCon, QSM, LR Daemon, Active SC.]
Separation of Control and Data Plane
• Redundancy in Control Plane
– All protocols support NSF over board fail over
• Port Protection in Data Plane
– SONET APS
– Link Bundling
Traffic Switchover - APS
[Figure: Traffic Switchover - APS. Numbered steps 1 to 6. Labels: Line Card A, Line Card B, APS Process, APS Manager, DRP, Line Card, FIB, Switching Fabric; traffic before and after the APS switch.]
Traffic Switchover - Bundled Link
[Figure: Traffic Switchover - Bundled Link. Numbered steps 1 to 5. Labels: Line Card, Link Monitor, Link Monitor Mgr, DRP, Bundled IF, FIB, Switching Fabric; traffic before and after the link failure.]
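The effect of a member link failure on a bundle can be illustrated with a small sketch. It assumes flows are spread over the live member links by hashing a flow identifier, which is a common approach but is not stated in the deck; the function and link names are hypothetical.

```python
# Sketch of member selection in a link bundle. Hash-based spreading and
# all names are assumptions for illustration.
import zlib


def pick_member(flow_id, members, link_up):
    """Choose a live member link for a flow; raise if the whole bundle is down."""
    live = [m for m in members if link_up[m]]
    if not live:
        raise RuntimeError("all member links are down")
    return live[zlib.crc32(flow_id.encode()) % len(live)]


if __name__ == "__main__":
    members = ["link0", "link1", "link2", "link3"]
    link_up = {m: True for m in members}
    flows = [f"10.0.0.{i}->10.1.0.{i}" for i in range(8)]

    before = {f: pick_member(f, members, link_up) for f in flows}
    link_up["link2"] = False  # the link monitor reports a failure
    after = {f: pick_member(f, members, link_up) for f in flows}

    for f in flows:
        print(f, before[f], "->", after[f])
```

After the failure no flow maps to the failed member, and the bundled interface stays up as long as at least one member link remains.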
Software Non Stop Forwarding
• Architected with HW NSF
• Process Restartability
• Separation of Control and Data Planes
• Protocol Support (BGP, ISIS, OSPF, Multicast, MPLS), Support for HSRP, VRRP
BGP NSF
[Figure: BGP NSF. Labels: RP, LC, Fabric, BGP Component, BPM, bRIB, BGP Speaker (three), LPTS/TCP connections to peers, SysDB, gRIB, Gig E, BCDL, FIB, HW FWD, incremental updates to FIB.]
Non Stop Forwarding MPLS
• No impact on MPLS forwarding when one or more MPLS processes fail.
• No impact on MPLS forwarding when the active card of an active/standby pair fails.
• Hitless software upgrade.
MPLS - NSF in Action
[Figure: MPLS - NSF in Action. Labels: MPLS Control, MPLS Forwarding, IP Forwarding, System Services, IP Network Services.]
• If the control plane fails, the forwarding plane can continue to send traffic (headless forwarding).
• Minimize the time forwarding remains headless.
MPLS Architecture
[Figure: MPLS Architecture. MPLS layers on the DRP, MPLS Forwarding on each LC.]
• Application: MPLS-TE. Recovery: from system services and checkpointing
• Label signaling: RSVP, LDP. Recovery: from applications and neighbors
• Infra: Label Manager. Recovery: from signaling layer
• MPLS Forwarding. Recovery: from Label Manager
MPLS Fast Reroute
• Supports Node, Path, and Link Protection
• Controlled by the routers at ends of a failed link
– link protection is configured on a per link basis
• Uses nested LSPs (stack of labels)
– original LSP nested within link protection LSP
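Nesting the original LSP inside a link protection LSP amounts to pushing one more label onto the stack at the router upstream of the failed link. The sketch below illustrates that idea only. The label values and function names are hypothetical, and signaling, penultimate hop popping, and timers are ignored.

```python
# Sketch of label stack handling for link protection. Label values and
# function names are hypothetical; index 0 is the top of the stack.
def forward_at_protecting_router(label_stack, swap_to, link_up, protection_label):
    """Swap the top label as usual; if the protected link is down,
    push the protection LSP label on top of the swapped label."""
    stack = list(label_stack)
    stack[0] = swap_to
    if not link_up:
        stack.insert(0, protection_label)
    return stack


def pop_protection_label(label_stack):
    """At the far end of the failed link, remove the outer (protection) label."""
    return list(label_stack[1:])


if __name__ == "__main__":
    incoming = [17]
    print(forward_at_protecting_router(incoming, 22, True, 500))
    rerouted = forward_at_protecting_router(incoming, 22, False, 500)
    print(rerouted)
    print(pop_protection_label(rerouted))
```

Once the outer label is removed, the packet carries exactly the label it would have had on the original path, so routers further downstream see no change.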
Scalable Distributed System
• Configuration and Operational Data Distributed Across System
– Allows system to scale, Logical Routers
– Fault containment and recovery (SysDB, IM, SC, dSC, d(LRSC) )
• Processing Distributed Across System
– Distributed RPs
– Enables faster convergence
Managing Configuration
• Designated SC (dSC) - An owner plane concept, Verifies Rack numbering among SCs
• Co-ordinates image management and versioning
• Co-ordinates LR membership information
• System Elected: Deterministic election through reboot
– Backup Elected as well
• d(LRSC) extends similar concept to a Logical Router configuration in LR plane.
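A deterministic election means every SC reaches the same answer from the same set of candidates, regardless of the order in which they are discovered. The deck does not state the election criterion, so the sketch below assumes the lowest (rack, slot) pair wins and the next lowest becomes the backup; the function name is hypothetical.

```python
# Sketch of a deterministic dSC election. The "lowest (rack, slot) wins"
# rule is an assumption; the deck does not give the actual criterion.
def elect_dsc(candidates):
    """candidates is an iterable of (rack, slot) tuples for live SCs.
    Returns (primary, backup); backup is None if only one SC is present."""
    ranked = sorted(set(candidates))
    if not ranked:
        raise ValueError("no SC available for election")
    primary = ranked[0]
    backup = ranked[1] if len(ranked) > 1 else None
    return primary, backup


if __name__ == "__main__":
    seen_by_sc_a = [(2, 0), (0, 1), (0, 0)]
    seen_by_sc_b = [(0, 0), (2, 0), (0, 1)]
    # Both observers compute the same result despite different discovery order.
    print(elect_dsc(seen_by_sc_a))
    print(elect_dsc(seen_by_sc_b))
```

Because the rule depends only on the candidate set, the same dSC and backup are chosen again after a reboot as long as the same SCs are present.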
[Figure: Managing Configuration. Labels: LC, SC, RP (one SC/RP pair marked dSC), GigE, SC, Fabric C.]
Managing Scaling/Distribution
[Figure: Managing Scaling/Distribution. LC, LC, DRP, RP, DRP, LC, LC, each labeled Local, plus a Shared label.]
Process Distribution
[Figure: Process Distribution. Labels: Logical Router, RP, DRP (two), Rack (two), sysmgr, LRd, LR config, Cisco pre-config, sysdb, .startup files of placeable applications, processes A, B, and C; shared, placed, standby, and replicated processes.]
Health Monitoring
• Online Diagnostics
– Minimizes double faults at switchover time
• Detect failures before they become critical
– Standby RP/DRP, Fabric plane
– Hot tested spare units
– Alarm cards, Logging & Alarm system (LED A/N display; minor, major, critical alarms)
No Outage on Software Upgrades
• Packaging Model
– Allows modular upgrade (sub package / package) and software patches (SMU) to key components and packages without affecting others.
• Software Release Strategy
– Takes into account upgrade timings and impacts on system availability
– Progressive upgrade path defined, Compatibility requirements taken into consideration.
Process restartability with NSF is the key enabler.
Development Process
• ISO compliant
• Mandatory design/code reviews
• API versioning controlled by tools
• Strictly enforced package boundaries (Tools)
• Continual automated measurement/improvement
• HA culture throughout program
Software Test Process
• Test Hierarchy (Waterfall model)
– Q Integrated Sanity System (QISS)
– Component and Feature Test
– Regression Test
– System Integration Test
– Early Field Trial (EFT) and Beta
• Test Operations
– Test Automation and Formal Script Review
– Central Reporting (online system - TIMS, Dashboard)
– Test Planning and Formal Review
Software Test Tools
• IXIA – traffic generation & analyzer
• Agilent QA Robot – protocol conformance testing
• Agilent RouterTester – interface & protocol scalability
• REX – resource exhaustion
• CTF – component testing
• FIT – fault injection
• ATS – test scripting
• e-ARMS – test scheduler
• Pagent – packet generator
• RouteM – net emulation
• CFLOW – code coverage
• DDTS – defect tracking
• TIMS – test reporting
• Dashboard – test summary
Legend: 3rd Party Tools / Internal Tools
Test Activities
• Up time/Longevity
• Boot time
• Interface Scalability
• Protocol Scalability
• Throughput
• Latency
• APS Protection
• Security Audit
• Fault Detection Time
• Fail over time
• Process restart/resync
• Online Insertion Removal (OIR)
• Hitless Software Upgrade (HSU)
• Hot Standby Router Protocol (HSRP)
• Fault Manager (FM)
• SW/HW Fault Injection
• Process Deadlock Simulation
• Process Restartability w/NSF
• SONET APS, DPT
• Reliability & Availability
• Standard Conformance
• Interop with IOS/JUNOS
Group labels: Test Measurements, MTTR, Test Validation
ACCOUNTING & HA
• Netflow support
– Multiple / Distributed collectors
• Persistent storage of accounting data
– Across failovers
– Checkpointed continually
LOGGING & ALARM SYSTEM
• HA Attributes
– All bistate alarms checkpointed
– Alarms are sequenced and can be retrieved at any time
• Alarm Cards
– Alarm lights lit on failure conditions
– System wide storage of data
HFR - High Availability (Bird’s Eye View)
Goal: Non-Stop Availability
Result: Quick Recovery (low MTTR/DPM)
• Physical redundancy: Dual processors, Power, Fabric, Cooling, OIR
• Logical redundancy/protection: SONET APS, DPT, HSRP/VRRP, MPLS FRR, Layer 3 load balancing, link bundling
• Hitless Software/Hardware Upgrades: Upgrade software/hardware while router is in service
• Non Stop Forwarding: No line card reboot upon processor fail over; forward user data during RP fail over
• Process Restartability/upgrade and NSF
Conclusion
• Target: 99.999% availability
• Availability modeling, availability design and fault injection testing incorporated as part of the development process
• Cisco uses HA analysis and modeling to identify areas for improvement in future designs
• High availability (in some operational areas) will need close cooperation with customers and the required support process is being developed.
Backup Slides
Cisco’s HA Products
Cisco is certifying a variety of its products for HA compliance.
• MSSBU: (PXM1, PXM45, AXSM)
• IP: GSR, ESR 10000, DSL (Austin), Fermi, HFR
• Optical: Monterey
• Cisco’s IOS has been certified for 99.999% Availability in many service provider environments
Cisco’s efforts for achieving High Availability are both platform oriented and cross-platform oriented.
IOS HA Initiatives
• RPR: Partial initialization of IOS in standby RP
• RPR+: Improves standby readiness over RPR (recognizes line cards and does not reset them on switchover)
• Single Line Card Reload: Problems in one VIP do not require an entire router reboot
• Fast reboot: Improves reboot time by 5 minutes
• Fast upgrade: Improves upgrade time by 5 minutes by pre-loading software onto standby
• Stateful switchover: Instant switchover to standby RP (includes non-stop forwarding routing protocol changes)
• In-service upgrade: Software upgrade without user impact
HFR System
[Figure: HFR System. Other labels: Shelf controller (three), 100m.]
• Fabric Shelves: contain fabric cards and system controllers
• Line Card Shelves: contain route processors, line cards, and system controllers
• EMS (full system view)
• Out of band GE control bus to all shelf controllers
Software Test Process
• Tools for HA Testing
– REX (Resource Exhaustion Tool) and CTF (Component Test Framework) simulate different test conditions; measure how HFR HA features respond to them.
• Test Restartability with Fault Simulation
– memory failures, thread create failures, dependent process failures, multiple related process failures, recovery on checkpoint process failure, restartability under high CPU usage
• Test Hitless Software Upgrade
– Test under high resource/CPU utilization conditions
• Fault Manager Testing
– Check that FM works properly under fault conditions
• MTTR Measurements
– Measure time to repair for most process/component failures
Specific Availability Requirements
Here is what I ask a BU to do (chronological):
• Create an availability model to gain perspective
• Reduce/remove single points of failure
• Design for over 100,000 hours MTBF
• Automate measurement of DPM
• Write online diagnostics on active and standby
• Write and execute network level availability test plan
• Perform fault insertion testing
• Write and test a troubleshooting guide
Phases: Arch, Design, Test, Field
Limit Headless Forwarding Time
• Checkpoint data that cannot be recovered otherwise
• Dedicate MPLS process resources to the recovery of LSPs that are already established. Processing of any newly configured LSP tunnels is temporarily suspended.
• Processing of new LSPs resumes when recovery completes.
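The recovery policy above can be sketched as a small scheduler: established LSPs are restored first, new requests are parked, and parked requests are processed once recovery completes. The class and method names are hypothetical and the model ignores signaling and timers.

```python
# Sketch of "recover established LSPs first, defer new ones".
# All names are hypothetical; this is not the HFR implementation.
from collections import deque


class MplsRecovery:
    def __init__(self, checkpointed_lsps):
        self.pending_recovery = deque(checkpointed_lsps)
        self.deferred_new = deque()
        self.active = []

    @property
    def recovering(self):
        return bool(self.pending_recovery)

    def request_new_lsp(self, name):
        # New tunnels are suspended while recovery is in progress.
        if self.recovering:
            self.deferred_new.append(name)
        else:
            self.active.append(name)

    def recover_one(self):
        # Restore one established LSP; when none remain, resume new requests.
        if self.pending_recovery:
            self.active.append(self.pending_recovery.popleft())
        if not self.pending_recovery:
            while self.deferred_new:
                self.active.append(self.deferred_new.popleft())


if __name__ == "__main__":
    rec = MplsRecovery(["tunnel1", "tunnel2"])
    rec.request_new_lsp("tunnel3")   # deferred: recovery still running
    rec.recover_one()
    rec.recover_one()                # recovery done, tunnel3 is now processed
    print(rec.active)
```

Deferring new work keeps the recovery path short, which is what bounds the time the forwarding plane runs without a control plane.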
TIMING GOALS
• Boot from Flash / TFTP (~3 min)
• Total Single Rack Bring up time (~5 min)
• OIR Recovery Time (~30 to 60 sec)
• Uptime = 14 days before ship
• BGP Aggregation Convergence ~ 60 sec
• BGP Backbone Convergence ~ 3 min
• OSPF Convergence ~ 25 sec
• IS-IS Convergence ~ 350 sec
Redundant Cards & Links
[Figure: Redundant Cards & Links. Labels: Fabric Chassis (SC0, SC1), Line Card Chassis (DRP/SC0, DRP/SC1), SC0 GE Links, SC1 GE Links, Inter-SC FE Links, External GE Switch 0, External GE Switch 1.]
SC/DRP Combo Switchover
[Figure: SC/DRP Combo Switchover. Numbered steps 1 to 11 between SC/DRP Combo 1 and SC/DRP Combo 2. Labels: LR Daemon, RedCon, SC1, DRP1.]
SC/RP Upgrade (Initial Config)
[Figure: SC/RP Upgrade (Initial Config). Labels: Card 1, Card 2, “Active” Processes, “Standby” Processes, Process A, Process B, Process C, Checkpt. Server, Active Logical Slot 1, Standby Logical Slot 1, Checkpointing.]
HFR HA Roadmap
Milestones: QFT-1, QFT-2, QFT-3, Beta/FOA
Target: GSR
• Demonstrate limited HSU
• NSF for ISIS, OSPF
• Multiple Verifier Support
• Checkpointing and Mirroring
• RP and DRP standby and failover
Target: GSR
• All processes restartable
• Restartability non-service-affecting to Routing and Forwarding plane apps
• RP and DRP standby
• Limited SC functionality and SC HA features
• NSF support with upgrade of config data
• Support for checkpoint data with version differences between releases
Target: HFR test hardware
• Full functionality of SC, RP, DRP, SP and Fabric SC will be demonstrated with high availability and failover features.
• Process redundancy mechanism across DRPs demonstrated
• All apps support HSU: forwarding, multicast, security and base.
• Multiple LRs support and fault isolation between LRs
• Software downgrade to at least 1 previous level
Target: HFR platform
• All QFT1 to QFT3 goals met
• Meet product requirements in HA PRD.
• Minimum .9999 standalone availability and .99999 network availability
• FCS/Post FCS: HA support and assurance programs, HA test support framework implementation