Cisco, Liu Yang: From “Routing” Back to “Switching”

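From “Routing” Back to “Switching”: A Discussion of the Evolution of Data Center Networks. Technical Director, Internet Service Provider Business Unit, Cisco China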

Upload: guiyingshenxia

Post on 09-Dec-2014


DESCRIPTION

China Internet Operations and Maintenance Summit Forum

TRANSCRIPT

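Page 1: Cisco, Liu Yang: From “Routing” Back to “Switching”

From “Routing” Back to “Switching”: A Discussion of the Evolution of Data Center Networks

Liu Yang

Technical Director, Internet Service Provider Business Unit, Cisco China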

Page 2

The Troubles of “Switching”

• Physical connection hierarchy

• Spanning Tree Protocol, Layer 2 multipathing, network convergence

• Unicast flooding, loops, broadcast storms

Page 3

The Happy Life After “Routing”

• ECMP (Equal Cost Multi Path)

• Smooth scaling

• Fast convergence

• Protection against broadcast storms

Page 4

Troubles

• Cluster scale

• Subnet address planning

• Routing control plane

• Virtual machines

• Open platforms, cloud computing

• Price

• Dumb Big Flat

Page 5

From “Routing” Back to “Switching”: Switching Networks for Large Data Centers

• Turn your network into a Fabric!

• Key technologies: FabricPath / TRILL

FabricPath

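Page 6

FabricPath Innovations in Layer 2 Switching

• Multiple paths between switches forward traffic at the same time using ECMP (Equal Cost Multi Path); Spanning Tree Protocol is eliminated

• Smooth scaling, similar to a routed network

• Fast convergence

• Protection against broadcast storms (TTL)

• The existing Layer 2 network is preserved

• Conversational MAC address learning

• Lower cost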

Page 7

FabricPath Design Goals

[Diagram labels, grouped by heading; FabricPath is the label at the center of the diagram]

Switching

• Minimal Configuration

• Plug & Play

• Auto Discovery

• Auto Learning

• Flat Addressing

Spanning Tree Protocol (STP)

• Slow Convergence

• Single Path

• Edge-to-Root Rigid Design

• Single Multicast Tree

• Constrained Scalability

Routing

• Configuration Intense

• Configured Learning

• Configured Discovery

• Plan & Play

• Fast Convergence

• Multiple Paths

• Load Balancing

• Multiple Multicast Trees

• Hierarchical Forwarding

• Any-to-any Flexible Design

• Highly Scalable

Page 8

FabricPath Encapsulation: 16-Byte MAC-in-MAC Header

Classical Ethernet Frame

DMAC | SMAC | 802.1Q | Etype | Payload | CRC

Cisco FabricPath Frame

Outer DA (48) | Outer SA (48) | FP Tag (32) | Original CE Frame (DMAC | SMAC | 802.1Q | Etype | Payload) | CRC (new)

Outer DA and Outer SA layout (48 bits each)

Field              Width
Endnode ID (5:0)   6 bits
U/L                1 bit
I/G                1 bit
Endnode ID (7:6)   2 bits
RSVD               1 bit
OOO/DL             1 bit
Switch ID          12 bits
Sub-Switch ID      8 bits
Port ID            16 bits

FP Tag layout (32 bits)

Field   Width
Etype   16 bits
Ftag    10 bits
TTL     6 bits

Switch ID – Unique number identifying each FabricPath switch

Sub-Switch ID – Identifies devices/hosts connected via vPC+

Port ID – Identifies the destination or source interface

Ftag (Forwarding tag) – Unique number identifying topology and/or multidestination distribution tree

TTL – Decremented at each switch hop to prevent frames from looping infinitely
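The field widths on this slide can be checked with a small bit-packing sketch. It assumes the fields are packed most-significant-bit first in the order drawn on the slide; that ordering, the helper names `pack` and `unpack`, and the example values (including the Etype value 0x8903 and a TTL of 32) are assumptions for illustration, not taken from the slide.

```python
# Field names and widths (in bits) as listed on the slide.
# Assumption: fields are packed MSB-first in the order shown.
OUTER_ADDR_FIELDS = [
    ("endnode_id_5_0", 6), ("ul", 1), ("ig", 1), ("endnode_id_7_6", 2),
    ("rsvd", 1), ("ooo_dl", 1), ("switch_id", 12), ("sub_switch_id", 8),
    ("port_id", 16),
]
FP_TAG_FIELDS = [("etype", 16), ("ftag", 10), ("ttl", 6)]


def pack(fields, values):
    """Pack named field values into one integer; missing fields default to 0."""
    word = 0
    for name, width in fields:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Split an integer back into its named fields."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


# Hypothetical example: a frame addressed to Switch ID 20 on tree 1.
outer_da = pack(OUTER_ADDR_FIELDS, {"switch_id": 20})
fp_tag = pack(FP_TAG_FIELDS, {"etype": 0x8903, "ftag": 1, "ttl": 32})
```

The widths add up to 48 bits for each outer address and 32 bits for the FP tag, which gives the 16-byte header (48 + 48 + 32 = 128 bits) named in the slide title.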

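Page 9

FabricPath Control Plane: L2 IS-IS

L2 IS-IS replaces STP as the control-plane protocol

Introduces a link-state protocol to support ECMP in a Layer 2 environment

Exchanges Switch ID reachability and builds the forwarding topology

Improves failure detection, network convergence, and high availability

Minimal IS-IS knowledge required – no manual configuration by the user

Preserves the plug-and-play nature of Layer 2

[Figure: an STP domain exchanging STP BPDUs alongside a FabricPath domain running FabricPath IS-IS]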

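Page 10

A few key reasons:

Maintains only reachability information between devices and needs no IP address information – it is not an L3 protocol, but a protocol innovation that solves MAC address propagation in an L2 environment

Easily extended – custom TLVs can be used to carry information

Provides SPF – excellent topology construction and convergence

[Figure labels: FabricPath Port, CE Port, L2 Fabric]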

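Page 11

FabricPath Data Plane

[Figure: host MAC A attaches to the ingress FabricPath switch S10 and host MAC B attaches to the egress FabricPath switch S20, with the FabricPath core between them. Arrows distinguish FabricPath interfaces from CE interfaces.]

Frame on the CE interfaces: DMAC→B, SMAC→A, Payload

Frame inside the FabricPath core: DSID→20, SSID→10, followed by DMAC→B, SMAC→A, Payload

The ingress FabricPath switch determines the destination Switch ID and adds the FabricPath header

The destination Switch ID is the basis for the routing decision

No end-host MAC learning or lookup is needed inside the core

The egress FabricPath switch removes the FabricPath header and forwards the frame to the CE device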
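The three data-plane roles described on this slide can be sketched in a few lines of Python. This is a simplified model, not switch code: the class and function names, the initial TTL of 32, and the rule that a hop drops a frame whose TTL would reach zero are assumptions for illustration. The Switch IDs 10 and 20 and the MACs A and B follow the slide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CEFrame:
    dmac: str
    smac: str
    payload: bytes


@dataclass(frozen=True)
class FPFrame:
    dsid: int        # destination Switch ID
    ssid: int        # source Switch ID
    ttl: int
    inner: CEFrame   # the original Classical Ethernet frame, unchanged


def ingress_encap(frame: CEFrame, local_sid: int,
                  mac_to_sid: Dict[str, int], ttl: int = 32) -> FPFrame:
    """Ingress edge: look up the destination Switch ID and add the header."""
    return FPFrame(dsid=mac_to_sid[frame.dmac], ssid=local_sid,
                   ttl=ttl, inner=frame)


def core_forward(fp: FPFrame) -> Optional[FPFrame]:
    """Core hop: uses only the Switch ID, never the inner MACs."""
    if fp.ttl <= 1:
        return None  # drop instead of looping forever
    return FPFrame(fp.dsid, fp.ssid, fp.ttl - 1, fp.inner)


def egress_decap(fp: FPFrame, local_sid: int) -> CEFrame:
    """Egress edge: strip the header and hand the original frame to the CE port."""
    if fp.dsid != local_sid:
        raise ValueError("frame is not addressed to this switch")
    return fp.inner


original = CEFrame(dmac="B", smac="A", payload=b"hello")
in_core = ingress_encap(original, local_sid=10, mac_to_sid={"B": 20})
after_hop = core_forward(in_core)
delivered = egress_decap(after_hop, local_sid=20)
```

Note that `core_forward` never reads `inner`, which mirrors the point that the core needs no end-host MAC learning or lookup.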

Page 12

FabricPath MAC Forwarding Table

Edge switches maintain both a MAC address table and a Switch ID table

The ingress switch uses the MAC table to determine the destination Switch ID

The egress switch uses the MAC table to determine the output switchport

Local MACs point to switchports

Remote MACs point to Switch IDs

[Figure: FabricPath topology with spine switches S10, S20, S30, S40 and edge switches S100, S101, S200; hosts MAC A, MAC B, MAC C, MAC D attach at the edge]

FabricPath MAC Table on S100

MAC   IF/SID
A     e1/1
B     e1/2
C     S101
D     S200
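A minimal sketch of how an edge switch might act on such a table is shown below. The table contents are those of S100 on the slide; tagging each entry as "local" or "remote", the function name `forwarding_decision`, and the "flood" result for an unknown MAC are assumptions made for illustration.

```python
# MAC table of S100 from the slide. Each entry is tagged (an assumption of
# this sketch) as pointing at a local switchport or at a remote Switch ID.
S100_MAC_TABLE = {
    "A": ("local", "e1/1"),
    "B": ("local", "e1/2"),
    "C": ("remote", "S101"),
    "D": ("remote", "S200"),
}


def forwarding_decision(mac_table, dmac):
    """Return (action, target) for a frame arriving on a CE port."""
    entry = mac_table.get(dmac)
    if entry is None:
        # Unknown destination: sent along a multidestination tree.
        return ("flood", None)
    kind, target = entry
    if kind == "local":
        return ("switch_locally", target)           # target is a switchport
    return ("encapsulate_to_switch_id", target)     # target is a Switch ID


print(forwarding_decision(S100_MAC_TABLE, "B"))
print(forwarding_decision(S100_MAC_TABLE, "D"))
```

A local hit stays within the switch, while a remote hit yields only a Switch ID; which fabric link to use is a separate lookup in the Switch ID (routing) table.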

Page 13

FabricPath Routing Table

FabricPath IS-IS manages the Switch ID (routing) table

All FabricPath-enabled switches are automatically assigned a Switch ID (no user configuration required)

The algorithm computes the shortest (best) paths to each Switch ID based on link metrics

Equal-cost paths are supported between FabricPath switches

[Figure: FabricPath topology with spine switches S10, S20, S30, S40 and edge switches S100, S101, S200; links L1, L2, L3, L4 connect S100 to the spines]

FabricPath Routing Table on S100

Switch   IF
S10      L1
S20      L2
S30      L3
S40      L4
S101     L1, L2, L3, L4
…        …
S200     L1, L2, L3, L4

One ‘best’ path to S10 (via L1)

Four equal-cost paths to S101

Page 14

Building the FabricPath Routing Table Entries

[Figure: spine switches S10, S20, S30, S40 and edge switches S100, S101, S200 in the FabricPath domain, with hosts MAC A, MAC B, MAC C, MAC D at the edge. Links L1 to L4 attach S100, L5 to L8 attach S101, and L9 to L12 attach S200.]

Routing table on S100

Switch   IF
S10      L1
S20      L2
S30      L3
S40      L4
S101     L1, L2, L3, L4
…        …
S200     L1, L2, L3, L4

Routing table on S10

Switch   IF
S20      L1, L5, L9
S30      L1, L5, L9
S40      L1, L5, L9
S100     L1
S101     L5
…        …
S200     L9

Routing table on S40

Switch   IF
S10      L4, L8, L12
S20      L4, L8, L12
S30      L4, L8, L12
S100     L4
S101     L8
…        …
S200     L12

Routing table on S200

Switch   IF
S10      L9
S20      L10
S30      L11
S40      L12
S100     L9, L10, L11, L12
S101     L9, L10, L11, L12
…        …
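The routing tables above can be reproduced with a short shortest-path computation. The sketch below is not FabricPath IS-IS: it assumes every link has the same metric, so a breadth-first search stands in for SPF, and it assumes the link numbering L1 to L12 runs edge by edge across the four spines as drawn on the slide. The function name `routing_table` is hypothetical.

```python
from collections import deque

SPINES = ["S10", "S20", "S30", "S40"]
EDGES = ["S100", "S101", "S200"]

# Assumed link numbering: L1-L4 attach S100, L5-L8 attach S101, L9-L12 attach S200.
LINKS = {}
n = 1
for edge in EDGES:
    for spine in SPINES:
        LINKS[f"L{n}"] = (edge, spine)
        n += 1


def routing_table(source):
    """Map each other Switch ID to the local links on all equal-cost shortest paths."""
    adjacency = {}
    for link, (a, b) in LINKS.items():
        adjacency.setdefault(a, []).append((b, link))
        adjacency.setdefault(b, []).append((a, link))

    dist = {source: 0}
    first_hops = {source: set()}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor, link in adjacency[node]:
            # Leaving the source, the first hop is the link itself;
            # further out, a node inherits the first hops of its parent.
            hops = {link} if node == source else first_hops[node]
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                first_hops[neighbor] = set(hops)
                queue.append(neighbor)
            elif dist[neighbor] == dist[node] + 1:
                first_hops[neighbor] |= hops  # another equal-cost path

    del first_hops[source]
    return {sw: sorted(links, key=lambda name: int(name[1:]))
            for sw, links in first_hops.items()}


for switch, links in sorted(routing_table("S100").items()):
    print(switch, ", ".join(links))
```

Under these assumptions `routing_table("S100")` gives one link to each spine and all four links to S101 and S200, matching the tables on the slide.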

Page 15

Putting It All Together – Host A to Host B (1) Broadcast ARP Request

[Figure: spine switches S10, S20, S30, S40 and edge switches S100, S101, S200 in the FabricPath domain, with the root for Tree 1 and the root for Tree 2 marked. Links L1 to L4 attach S100, L5 to L8 attach S101, and L9 to L12 attach S200. Host MAC A attaches to S100 and host MAC B attaches to S200.]

Frame from Host A on the CE port: DMAC→FF, SMAC→A, Payload

Frame inside FabricPath (broadcast, forwarded along the tree selected by the Ftag): DSID→FF, Ftag→1, SSID→100, followed by DMAC→FF, SMAC→A, Payload

Frame delivered on the CE port of S200: DMAC→FF, SMAC→A, Payload

Multidestination Trees on Switch 100

Tree   IF
1      L1, L2, L3, L4
2      L4

Multidestination Trees on Switch 10

Tree   IF
1      L1, L5, L9
2      L9

Multidestination Trees on Switch 200

Tree   IF
1      L9
2      L9, L10, L11, L12

FabricPath MAC Table on S100

MAC   IF/SID
A     e1/1 (local)

FabricPath MAC Table on S200

MAC   IF/SID
(empty)

Learn MACs of directly connected devices unconditionally

Don’t learn MACs in flood frames
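The tree tables on this slide suggest a simple flooding rule, sketched below: a multidestination frame is sent out every fabric interface listed for its Ftag, except the one it arrived on. The per-switch tables are copied from the slide; the function name `flood_interfaces` and the exclusion of the ingress interface are assumptions of this sketch rather than a statement of the hardware behavior.

```python
# Multidestination tree tables from the slide: switch -> Ftag -> fabric interfaces.
TREES = {
    "S100": {1: ["L1", "L2", "L3", "L4"], 2: ["L4"]},
    "S10": {1: ["L1", "L5", "L9"], 2: ["L9"]},
    "S200": {1: ["L9"], 2: ["L9", "L10", "L11", "L12"]},
}


def flood_interfaces(switch, ftag, ingress=None):
    """Fabric interfaces a multidestination frame is replicated to."""
    return [link for link in TREES[switch][ftag] if link != ingress]


# The ARP request from Host A enters S100 on a CE port and uses Ftag 1.
print(flood_interfaces("S100", 1))          # from the edge towards all spines
print(flood_interfaces("S10", 1, "L1"))     # S10 relays it to the other edges
print(flood_interfaces("S200", 1, "L9"))    # nothing left in the fabric; S200 floods to CE ports
```

Because the Ftag pins the frame to one tree, each switch receives a single copy even though the physical topology contains many loops.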

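Page 16

Putting It All Together – Host A to Host B (2) Unicast ARP Reply

[Figure: spine switches S10, S20, S30, S40 and edge switches S100, S101, S200 in the FabricPath domain. Links L1 to L4 attach S100, L5 to L8 attach S101, and L9 to L12 attach S200. Host MAC A attaches to S100 and host MAC B attaches to S200.]

Frame from Host B on the CE port: DMAC→A, SMAC→B, Payload

FabricPath MAC Table on S200 (the lookup for A misses, so the frame is treated as unknown unicast)

MAC   IF/SID
B     e12/2 (local)

Frame inside FabricPath (unknown unicast, forwarded along the tree selected by the Ftag): DSID→MC1, Ftag→1, SSID→200, followed by DMAC→A, SMAC→B, Payload

Multidestination Trees on Switch 200

Tree   IF
1      L9
2      L9, L10, L11, L12

Multidestination Trees on Switch 10

Tree   IF
1      L1, L5, L9
2      L9

Multidestination Trees on Switch 100

Tree   IF
1      L1, L2, L3, L4
2      L4

FabricPath MAC Table on S100 (before the reply arrives)

MAC   IF/SID
A     e1/1 (local)

FabricPath MAC Table on S100 (after the reply arrives)

MAC   IF/SID
A     e1/1 (local)
B     S200 (remote)

Frame delivered to Host A on the CE port: DMAC→A, SMAC→B, Payload

If the DMAC is known, then learn the remote MAC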

Page 17

Putting It All Together – Host A to Host B (3) Unicast Data

[Figure: spine switches S10, S20, S30, S40 and edge switches S100, S101, S200 in the FabricPath domain. Links L1 to L4 attach S100, L5 to L8 attach S101, and L9 to L12 attach S200. Host MAC A attaches to S100 and host MAC B attaches to S200.]

Frame from Host A on the CE port: DMAC→B, SMAC→A, Payload

FabricPath MAC Table on S100 (the lookup for B returns S200)

MAC   IF/SID
A     e1/1 (local)
B     S200 (remote)

FabricPath Routing Table on S100 (the lookup for S200 returns four equal-cost links; a hash selects one)

Switch   IF
S10      L1
S20      L2
S30      L3
S40      L4
S101     L1, L2, L3, L4
…        …
S200     L1, L2, L3, L4

Frame inside FabricPath: DSID→200, Ftag→1, SSID→100, followed by DMAC→B, SMAC→A, Payload

FabricPath Routing Table on S30

Switch   IF
…        …
S200     L11

FabricPath Routing Table on S30

Switch   IF
…        …
S200     –

FabricPath MAC Table on S200 (before the data frame arrives)

MAC   IF/SID
B     e12/2 (local)

FabricPath MAC Table on S200 (after the data frame arrives)

MAC   IF/SID
A     S100 (remote)
B     e12/2 (local)

Frame delivered to Host B on the CE port: DMAC→B, SMAC→A, Payload
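The "Hash" step on this slide picks one of the four equal-cost links towards S200. The slide does not say which frame fields feed the hash or which function is used, so the sketch below simply assumes a CRC32 over the source and destination MAC; the function name `pick_path` is hypothetical.

```python
import zlib

# Equal-cost links from S100 towards S200, as listed in the routing table.
PATHS_TO_S200 = ["L1", "L2", "L3", "L4"]


def pick_path(paths, smac, dmac):
    """Choose one equal-cost path per flow.

    Assumption: the flow is identified by (SMAC, DMAC) and hashed with CRC32.
    Real switches may hash different fields with a different function.
    """
    key = f"{smac}->{dmac}".encode()
    return paths[zlib.crc32(key) % len(paths)]


chosen = pick_path(PATHS_TO_S200, "A", "B")
print(chosen)
```

Hashing per flow rather than per frame keeps all frames of one conversation on the same link, so they are not reordered, while different conversations spread across the four links.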

Page 18

Conversational MAC Learning

[Figure: edge switches S100, S200, and S300 around the FabricPath core; host MAC A attaches to S100, MAC B to S200, and MAC C to S300]

FabricPath MAC Table on S100

MAC   IF/SID
A     e1/1 (local)
B     S200 (remote)

FabricPath MAC Table on S200

MAC   IF/SID
A     S100 (remote)
B     e12/1 (local)
C     S300 (remote)

FabricPath MAC Table on S300

MAC   IF/SID
B     S200 (remote)
C     e7/10 (local)

Page 19

Conversational MAC Learning

[Figure labels: an STP Domain with four switches marked “500 MACs” and an L2 Fabric with four switches marked “250 MACs”]

All MACs need to be learned on every switch

Large L2 domains and virtualization present challenges to MAC table scalability

Local MAC: source-MAC learning happens only for traffic received on CE ports

Remote MAC: the source MAC of traffic received on FabricPath ports is learned only if the destination MAC is already known as local

[Figure: hosts A, B, and C attached to the L2 Fabric; host A sits behind switch S11. The MAC tables shown are:]

MAC   IF
C     3/1
A     S11

MAC   IF
B     2/1

MAC   IF
(empty)

Optimizes resource utilization – learning only the MAC addresses required
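The two learning rules above are easy to model. The sketch below replays the ARP request, ARP reply, and data frame from the earlier "Putting It All Together" slides against three edge switches; the class and method names are hypothetical, and the third switch S300 is included only to show that a bystander learns nothing.

```python
class EdgeSwitch:
    """Toy model of conversational MAC learning on a FabricPath edge switch."""

    def __init__(self, switch_id):
        self.switch_id = switch_id
        self.mac_table = {}  # MAC -> (port or Switch ID, "local" / "remote")

    def receive_on_ce_port(self, smac, port):
        # Rule 1: sources seen on CE ports are always learned as local.
        self.mac_table[smac] = (port, "local")

    def receive_on_fabricpath_port(self, smac, dmac, source_switch_id):
        # Rule 2: a remote source is learned only if the destination MAC
        # is already known as a local MAC on this switch.
        entry = self.mac_table.get(dmac)
        if entry is not None and entry[1] == "local":
            self.mac_table[smac] = (source_switch_id, "remote")


s100, s200, s300 = EdgeSwitch("S100"), EdgeSwitch("S200"), EdgeSwitch("S300")

# (1) Broadcast ARP request from A: flooded, so no remote switch learns A.
s100.receive_on_ce_port("A", "e1/1")
for sw in (s200, s300):
    sw.receive_on_fabricpath_port("A", "FF", "S100")

# (2) ARP reply from B to A: flooded as unknown unicast; only S100 knows A locally.
s200.receive_on_ce_port("B", "e12/2")
for sw in (s100, s300):
    sw.receive_on_fabricpath_port("B", "A", "S200")

# (3) Unicast data from A to B: S200 knows B locally, so it learns A.
s100.receive_on_ce_port("A", "e1/1")
s200.receive_on_fabricpath_port("A", "B", "S100")

for sw in (s100, s200, s300):
    print(sw.switch_id, sw.mac_table)
```

S300 saw both flooded frames but ends with an empty table, which is the scalability benefit the slide describes: a switch holds only the MACs its own hosts are talking to.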

Page 20

Architectural Approach for MSDC

CLOS

• Same node type used in all roles (Spine and Edge)

• Fine Grain Redundancy

• Additional density provided through density of node or additional layers

Scale-Up Spine, Scale-Out Leaf

• High density spine node

• Smaller fixed leaf

• Fewer control planes than pure Clos

Lean Core, Smart Edge

• Layer-1.5 Spine (Dumb Core)

• Intelligent Edge

Page 21

Building a General-Purpose Network Switching Platform with FabricPath

POD 1: VLANs 100-199

POD 2: VLANs 200-299

POD 3: VLANs 300-399

PODs 1-3: VLANs 100-399

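Page 22

A General-Purpose Network Switching Platform for Large-Scale Data Centers: Network Support for Flexible Service Deployment

Modular: easy to scale

Consistent network bandwidth and latency: independent of where the server is located

Rapid service deployment: flexible movement and allocation of compute resources

Any service on any server, at any time!!!

Scalability: service and cluster growth is no longer constrained by the network

Server utilization: servers can be reused

Manageability: plug and play, minimal configuration, little manual intervention

Reliability: impact of a single point of failure on the overall service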

Page 23

From “Routing” Back to “Switching”: Switching Networks for Small and Medium Data Centers

• Turn your network into a Switch

• Key technology: remote extension modules, FEX as ToR

Nexus 7000/5000 Virtualized chassis

+

Nexus 5000

Nexus 2000 Fabric Extender

=

Page 24

FEX Terminology

A FEX can be connected to a parent switch in three ways:

• single attached without any vPC running on the parent switch

• single attached with vPC running on the parent switch

• dual attached in vPC mode

[Figure: the three attachment options. Labels: Parent switch, vPC Primary, vPC Secondary, vPC 1, vPC 2, Fabric Links, NIFs, HIFs]

Page 25

FEX Inner Functioning: Inband Management Model

The fabric extender is discovered by the switch using an L2 Satellite Discovery Protocol (SDP) that runs on the uplink ports of the fabric extender

The core switch checks the software image

The core switch pushes programming data to the fabric extender

[Figure: a fabric extender with GigE ports 1-48 connected through uplinks 1, 2, 3, 4 to N5k01, which supplies the software image and configuration]

Page 26

• Flattened architecture

• Flexible deployment of applications across a larger area

• Line-rate network

Data Center-Wide Scalability at Layer 2

Page 27

Thank you