ElasticNet - ZTE's SDN/NFV Endeavor
TRANSCRIPT
![Page 1: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/1.jpg)
ZTE Proprietary
On Reliability of COTS Hardware
Dr. Li Mo
Chief Architect, CTO Group
![Page 2: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/2.jpg)
Agenda
• Differences between “Telecom Hardware” and COTS Hardware
• Analysis Framework
• Enhancing Application Reliability via Backups – Theoretical
• With one backup (1+1)
• With Different Types of Data Centers (type 1 – type 4)
• Remarks
![Page 3: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/3.jpg)
© ZTE Corporation. All rights reserved.Page 3
Comparing “Telecom Hardware” and COTS (Commercial Off-The-Shelf) Hardware
COTS Hardware
• May have smaller “mean time between failure” (MTBF)
• Relatively smaller “mean time to repair” (MTTR)
• COTS procedures for software upgrade, patching, and maintenance contribute more to “scheduled down time”
• Different grades of reliability for data centers
Telecom Hardware
• Strong fault detection and fault isolation capabilities at hardware level
• Well-established traditions on software upgrade, patching, and maintenance
• Reliable Central Office assumed
![Page 4: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/4.jpg)
New Item to Consider for COTS – Site Down Time
COTS Hardware
• Site downtime (scheduled and unscheduled) with varying duration and varying intervals
• May have smaller “mean time between failure” (MTBF)
• Relatively smaller “mean time to repair” (MTTR)
• COTS procedures for software upgrade, patching, and maintenance contribute more to “scheduled down time”
• Different grades of reliability for data centers
| Type of DC | Reliability |
| --- | --- |
| Type 1 | 99.671% |
| Type 2 | 99.741% |
| Type 3 | 99.982% |
| Type 4 | 99.995% |
![Page 5: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/5.jpg)
Common Mechanisms to Improve Reliability – Application Level Backup
[Diagram: Master Server and Slave Server (Backup Server), linked by Data Synchronization]
Benefits of Application Level Backup:
• Protects against failures
• Facilitates maintenance and upgrade
Various Failures
• Hardware Failure
• Hypervisor/OS Failure
• Application Failure
![Page 6: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/6.jpg)
![Page 7: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/7.jpg)
[Diagram: two COTS servers, each hosting several VMs, together with vSwitch 1 and vSwitch 2. One part of the system is labelled “Availability: P1” and the other “Availability: P2”.]
System availability: P1 × P2
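The product rule on this slide treats the two parts as a series system: service is up only when both parts are up, and the parts are assumed to fail independently. A minimal sketch of that arithmetic follows; the function name and the sample availabilities are illustrative and not taken from the slides.

```python
from math import prod


def series_availability(*parts):
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently of each other.
    """
    for p in parts:
        if not 0.0 <= p <= 1.0:
            raise ValueError("availability must be in [0, 1]")
    return prod(parts)


# Illustrative values only: two parts at four nines each.
p1, p2 = 0.9999, 0.9999
system = series_availability(p1, p2)
print(f"system availability   = {system:.8f}")
print(f"system unavailability = {1 - system:.2e}")
```

In a series chain the unavailabilities roughly add, so each part has to do better than the overall target.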
![Page 8: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/8.jpg)
Introducing COTS – Focusing on Server Part of Reliability
5 9s reliability is provided today with 4 9s availability per piece of networking equipment
Can servers with 4 9s (or even 3 9s) availability provide overall 5 9s reliability?
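As a first-order check on this question, the idealized availability of N redundant servers is 1 - (1 - A)^N when failures are independent, failover is instantaneous, and every failure is detected. Those simplifications are assumptions made here for illustration, and the helper name `redundant_availability` is hypothetical. Real figures would be lower once silent errors and switchover time are counted.

```python
def redundant_availability(a, copies=2):
    """Idealized availability of `copies` independent replicas, any one of which can serve.

    Assumes independent failures, instant failover, and no silent errors.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    return 1.0 - (1.0 - a) ** copies


FIVE_NINES = 0.99999

for label, a in (("3 9s", 0.999), ("4 9s", 0.9999)):
    combined = redundant_availability(a)
    verdict = "meets" if combined >= FIVE_NINES else "misses"
    print(f"{label} server with one backup: {combined:.10f} "
          f"({verdict} 5 9s under these assumptions)")
```

This is only an upper bound on what backup can deliver; it says nothing about how quickly the benefit erodes when failures go undetected.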
![Page 9: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/9.jpg)
Various Backup Schemes
[Diagram: four backup schemes. In each, the servers also host other apps, and masters and slaves are linked by data synchronization.]
• One Backup: Master Server and Slave Server
• Two Backups: Master Server, Slave Server (Backup 1), and Slave Server (Backup 2)
• Load Sharing: Master Servers 1 to N with N Backup Servers
• 1:1 Load Sharing: Master Server 1 with Slave Server 2 on one side, Master Server 2 with Slave Server 1 on the other
Annotations on the diagram: “1:1, Same as One Backup”, “Same as One Backup”, “Marginally Better”
http://wwwen.zte.com.cn/endata/magazine/ztecommunications/2014/3/
![Page 10: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/10.jpg)
![Page 11: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/11.jpg)
A Simple Example – Markov State Transition Model
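Legend: server is good; server is bad; current master (each marked on an M icon)
[Diagram: states S0 and S1. S0 moves to S1 with probability λ and stays with probability 1-λ; S1 moves to S0 with probability μ and stays with probability 1-μ.]
Chapman–Kolmogorov Equation
[equation not recovered]
The Result
[equation not recovered]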
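The equations on this slide were images and did not survive extraction. For the two-state chain on this slide (S0 with self-loop 1-λ, S1 with self-loop 1-μ), the standard steady-state derivation is sketched below. The symbols π0 and π1 for the stationary probabilities are notation introduced here, and the slide may have used a different form.

```latex
% Stationary distribution \pi = \pi P for the two-state chain
\begin{aligned}
P &= \begin{pmatrix} 1-\lambda & \lambda \\ \mu & 1-\mu \end{pmatrix}, \\
\pi_0 &= (1-\lambda)\,\pi_0 + \mu\,\pi_1
  \;\Longrightarrow\; \lambda\,\pi_0 = \mu\,\pi_1, \\
\pi_0 + \pi_1 &= 1, \\
\pi_0 &= \frac{\mu}{\lambda+\mu}, \qquad
\pi_1 = \frac{\lambda}{\lambda+\mu}.
\end{aligned}
```

With λ as the inverse of MTBF and μ as the inverse of MTTR, π0 reduces to the familiar MTBF / (MTBF + MTTR).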
![Page 12: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/12.jpg)
1+1 System – Markov State Transition Model
[Diagram: four states S0 to S3, each showing servers S1 and S2. Transition labels: μ, λ, (1-λ)λ, (1-λ)λ(1-s), and λ² + λ(1-λ)s]
Legend: server is good; server is bad; current master (each marked on an S icon)
• λ: inverse of server MTBF
• μ: inverse of server MTTR
• s: silent error probability
![Page 13: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/13.jpg)
Solving the Global Balance Equations to Obtain Overall System Availability (1+1)
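The equations for this slide were not captured. As an illustration of the method only, the sketch below writes global balance equations for a simplified three-state version of the 1+1 chain: U0 (both servers good), U1 (one server down, service up), and U2 (service down). The lumping into three states, the names U0 to U2, the stationary probabilities π_i, and the assumption that repair from U2 leads back to U1 are all choices made here, not the slide's model. The compound probability b is a label from the deck's 1+1 diagrams, and a is the sum of the two single-failure labels (1-λ)λ(1-s) and (1-λ)λ.

```latex
\begin{aligned}
a &= 2(1-\lambda)\lambda - (1-\lambda)\lambda s, \qquad
b = \lambda^2 + \lambda(1-\lambda)s, \\
(a+b)\,\pi_0 &= \mu\,\pi_1, \\
(\lambda+\mu)\,\pi_1 &= a\,\pi_0 + \mu\,\pi_2, \\
\mu\,\pi_2 &= b\,\pi_0 + \lambda\,\pi_1, \\
\pi_0 + \pi_1 + \pi_2 &= 1, \\
\pi_1 &= \frac{a+b}{\mu}\,\pi_0, \qquad
\pi_2 = \frac{b\,\mu + \lambda(a+b)}{\mu^2}\,\pi_0, \qquad
\pi_0 = \left(1 + \frac{a+b}{\mu}
  + \frac{b\,\mu + \lambda(a+b)}{\mu^2}\right)^{-1}.
\end{aligned}
```

Under this simplification the system availability is 1 - π2.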
![Page 14: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/14.jpg)
System Unavailable Probability for Various MTTR and Server MTBF (Log Scale)
[Plot: horizontal axis from 1 to 15, vertical axis from -12 to 0; axis label μ; legend entries 1/100, 1/1000, 1/10000, 1/100000; MTTR = 12 min and MTTR = 6 min]
![Page 15: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/15.jpg)
Differences in Availability between Theoretical Data and Simulation for 1+1 Backup Case
MTTR=6 minutes
Not Much Difference
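A comparison of this kind can be sketched as follows. The code uses a simplified three-state chain (both servers good; one server down with service up; service down) rather than the deck's exact model, takes one time step as one minute so that μ = 1/6 corresponds to MTTR = 6 minutes, and uses a deliberately pessimistic λ so that a short run sees enough failures. All function names, the state lumping, and the parameter values are assumptions made for illustration.

```python
import random


def transition_probs(lam, s):
    """Per-step probabilities of leaving the 'both good' state (simplified model, an assumption)."""
    a = 2 * (1 - lam) * lam - (1 - lam) * lam * s  # to 'one down, service up'
    b = lam ** 2 + lam * (1 - lam) * s             # to 'service down' (double or silent failure)
    return a, b


def theoretical_unavailability(lam, mu, s):
    """Stationary probability of 'service down' from the balance equations."""
    a, b = transition_probs(lam, s)
    p0 = 1.0                       # unnormalized weight of 'both good'
    p1 = p0 * (a + b) / mu         # balance of flows across the cut around state 0
    p2 = (b * p0 + lam * p1) / mu  # balance of flows around state 2
    return p2 / (p0 + p1 + p2)


def simulated_unavailability(lam, mu, s, steps, seed=0):
    """Fraction of time steps spent in 'service down' during one long random walk."""
    a, b = transition_probs(lam, s)
    rng = random.Random(seed)
    state, down_steps = 0, 0
    for _ in range(steps):
        if state == 2:
            down_steps += 1
        u = rng.random()
        if state == 0:
            state = 2 if u < b else 1 if u < a + b else 0
        elif state == 1:
            state = 2 if u < lam else 0 if u < lam + mu else 1
        else:
            state = 1 if u < mu else 2
    return down_steps / steps


if __name__ == "__main__":
    lam, mu, s = 0.01, 1 / 6, 0.1  # illustrative values, not from the slides
    theory = theoretical_unavailability(lam, mu, s)
    sim = simulated_unavailability(lam, mu, s, steps=500_000)
    print(f"theory     = {theory:.5f}")
    print(f"simulation = {sim:.5f}")
```

With realistic failure rates the down state is visited so rarely that a plain random walk like this needs a very long run, which is one reason a closed-form result is useful alongside simulation.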
![Page 16: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/16.jpg)
Improvement of 1+2 (Dual Backup) vs. 1+1 (Single Backup)
Defining the “percentage” of improvement as
Improvement Deteriorates Fast with Silent Error
![Page 17: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/17.jpg)
Revertive Maintenance
[Diagram: Master Server and Slave Server (Backup) placed in two data centers, DC A and DC B, linked by Data Synchronization; both data centers also host other apps. Three stages are shown.]
• Before Maintenance: start maintenance when there is no fault
• In Maintenance
• After Maintenance: start reverting when there is no fault
![Page 18: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/18.jpg)
The Impact of Site Maintenance is Negligible (Revertive)
![Page 19: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/19.jpg)
![Page 20: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/20.jpg)
Four Types of Data Centers (ANSI/TIA-942)
Type 1: Single non-redundant distribution path serving the IT equipment. Non-redundant capacity components. Basic site infrastructure with expected availability of 99.671%.
Type 2: Meets or exceeds all type 1 requirements. Redundant site infrastructure capacity components with expected availability of 99.741%.
Type 3: Meets or exceeds all type 2 requirements. Multiple independent distribution paths serving the IT equipment. All IT equipment must be dual-powered and fully compatible with the topology of a site's architecture. Concurrently maintainable site infrastructure with expected availability of 99.982%.
Type 4: Meets or exceeds all type 3 requirements. All cooling equipment is independently dual-powered, including chillers and heating, ventilating and air-conditioning (HVAC) systems. Fault-tolerant site infrastructure with electrical power storage and distribution facilities with expected availability of 99.995%.
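To make these percentages concrete, the sketch below converts each expected availability into hours of downtime per year. The availability figures are the ones quoted on this slide; the function name and the 8760-hour year are choices made here.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# Expected availability per data center type, as quoted on the slide.
DC_AVAILABILITY = {
    "Type 1": 0.99671,
    "Type 2": 0.99741,
    "Type 3": 0.99982,
    "Type 4": 0.99995,
}


def annual_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    return (1.0 - availability) * HOURS_PER_YEAR


for dc_type, availability in DC_AVAILABILITY.items():
    hours = annual_downtime_hours(availability)
    print(f"{dc_type}: {availability:.3%} available, about {hours:.1f} h down per year")
```

For comparison, a 5 9s target (99.999%) allows roughly 5.3 minutes of downtime per year, so none of the four types reaches it as a single site.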
![Page 21: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/21.jpg)
1+1 System with Site Error – Markov State Transition Model (Revertive)
[Diagram: seven states S0 to S6, each showing the two servers, with two groups labelled “Both Data Centers Work” and “Only One Data Center Works”. Transition labels: μ, λ, 2(1-λ)λ - (1-λ)λs, and λ² + λ(1-λ)s, plus η, η², η(1-η), 2η(1-η), and γ]
![Page 22: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/22.jpg)
Service Impact Error Probability for Various Data Centers
[Plot: horizontal axis 1/(DC MTBF); marked level 1×10^-5]
![Page 23: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/23.jpg)
![Page 24: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/24.jpg)
Remarks
COTS Hardware
• May have smaller “mean time between failure” (MTBF)
• Relatively smaller “mean time to repair” (MTTR)
• COTS procedures for software upgrade, patching, and maintenance contribute more to “scheduled down time”
• Different grades of reliability for data centers
• It is possible to provide 5 9s availability on COTS hardware with application level backup
• The impact of MTTR is not significant if it is reasonably small (e.g. less than 10 minutes) for typical hardware MTBF
• The impact of data center scheduled maintenance is negligible
• 5 9s availability can only be achieved with type 3 and type 4 data centers
![Page 25: ElasticNet * ZTE*s SDN/NFV Endeavor](https://reader035.vdocuments.mx/reader035/viewer/2022062906/584929491a28abc11a8b7c67/html5/thumbnails/25.jpg)
Thanks!
Bring Network Closer