
T-110.5116 Computer Networks II
Data center networks
1.12.2012, Matti Siekkinen

(Sources: S. Kandula et al.: “The Nature of Data Center Traffic: Measurements & Analysis”, A. Greenberg: “Networking The Cloud”, M. Alizadeh et al.: “Data Center TCP (DCTCP)”, C. Kim: “VL2: A Scalable and Flexible Data Center Network”)

Outline

• What are data center networks?
• Layer 2 vs. Layer 3 in data center networks
• Data center network architectures
• TCP in data center networks
  – Problems of basic TCP
  – Data Center TCP (DCTCP)
• Conclusions

• 2

What is a data center?

• Contains servers and data
• Has a network
  – Connects servers together
• Runs applications and services
  – Internal and external
• Centrally managed
• Operated in a controlled environment
• Can have very different sizes
  – SME data center vs. Google

• 3

Applications and services

• External facing
  – Search, Mail, Shopping Carts, …
• Internal to the company/institution
  – E.g. ERP (Financial, HR, …)
• Services internal to the data center
  – Those necessary for the data center to work
    • E.g. network operations (DNS, NFS, DHCP), backup
  – Building blocks for external-facing apps
    • MapReduce, GFS, BigTable (Google), Dynamo (Amazon), Hadoop (Yahoo!), Dryad (Microsoft)
• Often distributed

• 4

Multi-tier architecture

• E.g. 3 tiers
  – Front-end servers: handle static requests
  – Application servers: handle dynamic content
  – Backend database servers: handle database transactions
• Advantages
  – Performance & scalability
  – Security

• 5

What does it look like?

• Servers in racks
  – Contain commodity servers (blades)
  – Connected to a Top-Of-Rack switch
  – Traffic aggregated to the next level
• Modular data centers
  – Shipping containers full of racks

[Photo: inside a container, Microsoft Chicago data center]

• 6

Large data center requires a lot of...

• Some statistics
  – Google: 450,000 servers in 2006, estimated over a million by now
  – Microsoft is doubling the number of servers every 14 months

• 7

Power and cooling
(Photos from Microsoft Chicago data center)

Cloud computing

• Cloud computing
  – Abstracts the underlying resources from the service provided
  – Abstraction on different levels: IaaS, PaaS, SaaS
• Virtualization enables many of the cloud’s properties
  – Elastic resource allocation
    • Of course limited by the number of physical servers
    • One user’s resources are limited by the SLA, not by a single piece of hardware
  – Efficient use of resources
    • No need to run all servers at full speed all the time
    • A client’s VMs can run on any physical server

• 8

Data center vs. Cloud

• Data center is physical
  – Physical infrastructure that runs services
• Cloud is not physical
  – Offers some service(s)
  – Physical infrastructure is virtualized away
• Cloud usually needs to be hosted in a data center
  – Depends on scale
• Data center does not need to host cloud services
• Private cloud vs. own data center
  – Not the same thing

• 9

Cloud DC vs. Enterprise DC

• Traditional enterprise DC: IT staff cost dominates
  – Human-to-server ratio: 1:100
  – Less automation in management
  – Scale up: a few high-priced servers
  – Cost borne by the enterprise
    • Utilization is not critical
• Cloud service DC: other costs dominate
  – Human-to-server ratio: 1:1000
  – Automation is more crucial
  – Distributed workload, spread out over lots of commodity servers
  – High upfront cost amortized over time and use
  – Pay per use for customers
    • Utilization is critical

• 10

What is a data center network (DCN)?

• Enables communication within the DC
  – Among the different servers
• In practice
  – HW: switches, routers, and cabling
  – SW: communication protocols (layers 2-4)

•  Principles evolved from enterprise networks

• 11

What is a data center network (DCN)?

• Both layer 2 (link) and layer 3 (network) present
  – Not only L3 routers but also L2 switches
  – Layer 2 subnets connected with layer 3
• Layer 4 (transport) needed, as in any packet network
• Note: does not have to be TCP/IP!
  – Not part of the routed Internet
    • Cannot resolve a DC server’s address directly from the Internet, only front-end servers
  – But often is TCP/IP…

• 12

[Figure: the Internet protocol stack hourglass – applications (email, WWW, phone, …), application protocols (SMTP, HTTP, SIP, …), transport (TCP, UDP, …), IP, link layers (Ethernet, PPP, WiFi, 3GPP, …), physical media (copper, fiber, radio, OFDM, FHSS, …)]

What makes DCNs special?

• Just plug all servers into an edge router and be done with it?
  – Several issues with this approach
• Scaling up capacity
  – Lots of servers need lots of switch ports
  – E.g. a state-of-the-art Cisco Nexus 7000 modular data center switch (L2 and L3) supports at most 768 1/10GE ports
• Switch capacity and price
  – Price goes up with the number of ports
  – E.g. list price for 768 ports with 10GE modules is somewhere beyond $1M
  – Buying lots of commodity switches is an attractive option
• Potentially the majority of traffic stays within the DC
  – Server to server

• 13

What makes DCNs special? (cont.)

• Requirements different from Internet applications
  – Large amounts of bandwidth
  – Very, very short delays
  – Still, Internet protocols (TCP/IP) often used
• Management requirements
  – Incremental expansion
  – Should withstand server failures, link outages, and server rack failures
    • Under failures, performance should degrade gracefully
• Requirements due to expenses
  – Cost-effectiveness; high throughput per dollar
  – Power efficiency

⇒ DCN topology and equipment matter a lot

• 14

Data Center Costs

Amortized cost* | Component            | Sub-components
~45%            | Servers              | CPU, memory, disk
~25%            | Power infrastructure | UPS, cooling, power distribution
~15%            | Power draw           | Electrical utility costs
~15%            | Network              | Switches, links, transit

• Total cost varies
  – Upwards of $1/4 B for a mega data center
• Server costs dominate
  – Network costs also significant

⇒ Network should allow high utilization of servers

Source: A. Greenberg et al. The Cost of a Cloud: Research Problems in Data Center Networks. SIGCOMM CCR, 2009. (*3-year amortization for servers, 15-year for infrastructure; 5% cost of money)

• 15

Outline

• What are data center networks?
• Layer 2 vs. Layer 3 in data center networks
• Data center network architectures
• TCP in data center networks
  – Problems of basic TCP
  – Data Center TCP (DCTCP)
• Conclusions

• 16

Switch vs. router: What’s the difference?

• Switch is a layer 2 device
  – Does not understand the IP protocol
  – Does not run any routing protocol
• Router is a layer 3 device
  – “Speaks” the IP protocol
  – Runs routing protocols to determine shortest paths
    • OSPF, RIP, etc.
• Terminology not so clear-cut
  – L2/3, i.e. multi-layer switches

• 17

Switch vs. router: Difference in basic functioning

• Router
  – Forwards packets based on destination IP address
    • Prefix lookup against routing tables (see the lookup sketch after this slide)
  – Routing tables built and maintained by routing algorithms and protocols
    • Protocols exchange information about paths to known destinations
    • Algorithms compute shortest paths based on this information
  – Broadcast sending usually not allowed
• Switch
  – Forwards frames based on destination MAC address
  – Uses a switch table
    • Equivalent to the routing table in a router
  – Broadcast sending is common
  – How is the switch table built and maintained, since there is no routing protocol?

• 18
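To make the contrast concrete, here is a minimal Python sketch (my illustration, not from the slides) of the two lookup styles: a switch does an exact match on a flat MAC address, while a router does a longest-prefix match on a hierarchical IP address. The table entries are made-up examples.

```python
import ipaddress

# Switch: exact match on destination MAC (flat address space).
mac_table = {"aa:aa:aa:aa:aa:aa": 1, "bb:bb:bb:bb:bb:bb": 2}  # MAC -> output port

def switch_lookup(dst_mac):
    # Unknown destination -> flood (None used here as a stand-in).
    return mac_table.get(dst_mac)

# Router: longest-prefix match on destination IP (hierarchical addresses).
routing_table = [  # (prefix, next hop) -- hypothetical entries
    (ipaddress.ip_network("10.0.0.0/8"), "core"),
    (ipaddress.ip_network("10.1.2.0/24"), "rack-42"),
]

def router_lookup(dst_ip):
    addr = ipaddress.ip_address(dst_ip)
    matches = [(net, nh) for net, nh in routing_table if addr in net]
    if not matches:
        return None  # no route
    # Pick the most specific (longest) matching prefix.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(switch_lookup("aa:aa:aa:aa:aa:aa"))  # 1
print(router_lookup("10.1.2.7"))           # rack-42 (/24 wins over /8)
```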

Switch is self-learning

• When a frame is received on a port (see the sketch after the figure below)
  – Switch learns that the sender is behind that port
  – Switch adds that information to the switch table
  – Soft state: forgotten after a while
• If the destination is not (yet) known
  – Flood to all other ports
• Flooding can lead to forwarding loops
  – Switches connected in a cyclic manner
  – These loops can create broadcast storms
• Spanning tree protocol (STP) used to avoid loops
  – Generates a loop-free topology
  – Avoids using some ports when flooding
  – Rapid Spanning Tree Protocol (RSTP)
    • Faster convergence after a topology change

• 19

[Figure: a frame <Src=AA, Dest=DD> flooded by two switches connected through hubs on ports 1 and 2 keeps looping back and forth, each switch re-learning AA on alternating ports. And so on… there is no TTL in L2 headers!]
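A minimal sketch (illustrative only) of the self-learning behavior described above: learn the source MAC on the incoming port, forward on an exact match, otherwise flood. Soft-state aging and STP are left out.

```python
class LearningSwitch:
    """Self-learning L2 switch: learn source MACs, flood unknown destinations."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}  # MAC address -> port (soft state; aging omitted)

    def receive(self, in_port, src_mac, dst_mac):
        # Learn: the sender is reachable behind the incoming port.
        self.table[src_mac] = in_port
        if dst_mac in self.table:
            return [self.table[dst_mac]]                       # forward to known port
        return sorted(p for p in self.ports if p != in_port)   # flood everywhere else

sw = LearningSwitch(ports=[1, 2, 3, 4])
print(sw.receive(1, "AA", "DD"))  # DD unknown -> flood to ports 2, 3, 4
print(sw.receive(4, "DD", "AA"))  # AA known  -> forward to port 1 only
```

Note that without STP, wiring two such switches in a loop makes the flooded frame circulate forever (no TTL at L2), which is exactly the broadcast-storm scenario in the figure.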

Layer 2 vs. Layer 3 in DCN

• Management
  – L2 is close to plug-and-play
  – L3 usually requires some manual configuration (subnet masks, DHCP)
• Scalability and performance
  – L2 broadcasting and STP scale poorly
  – L2 forwarding is less scalable than L3 forwarding
    • L2 is based on flat MAC addresses
    • L3 is based on hierarchical IP addresses (prefix lookup)
  – L2 has no load balancing over multiple paths like L3 does
  – L2 loops may still happen in practice, even with STP

• 20

Layer 2 vs. Layer 3 in DCN

• Flexibility
  – VM migration may require a change of IP address in an L3 network
    • Need to conform to the subnet address
  – An L2 network allows any IP address for any server
• Some reasons may prevent a pure L3 design
  – Some servers may need L2 adjacency
    • Servers performing the same functions (load balancing, redundancy)
    • Heartbeat or application packets may not be routable
  – Dual-homed servers may need to be in the same L2 domain
    • Connected to two different access switches
    • Some configurations require both primary and secondary to be in the same L2 domain

• 21

VLAN

• VLAN = Virtual Local Area Network
• Some servers may need to belong to the same L2 broadcast domain
  – See previous slide…
• VLANs overcome limitations of the physical topology
  – E.g. running out of switch ports
• VLAN allows flexible growth while maintaining layer 2 adjacency
  – L2 domain across routers
• VLAN can be port-based or MAC-based

• 22

Port-based VLAN

• Traffic isolation (see the sketch after this slide)
  – Frames to/from ports 1-8 can only reach ports 1-8
  – Can also define a VLAN based on MAC addresses of endpoints, rather than switch ports
• Dynamic membership
  – Ports can be dynamically assigned among VLANs
• Forwarding between VLANs is done via routing

• 23

[Figure: one switch with VLAN1 on ports 1-8 and VLAN2 on ports 9-15, connected to a router for inter-VLAN forwarding]
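A small sketch (illustrative only) of port-based VLAN isolation on a single switch, using the port ranges from the figure: flooding stays within the sending port's VLAN, and reaching the other VLAN goes via the router.

```python
# Port-based VLAN membership on one switch (port numbers as in the figure).
vlans = {
    "VLAN1": set(range(1, 9)),    # ports 1-8
    "VLAN2": set(range(9, 16)),   # ports 9-15
}

def flood_ports(in_port):
    """Flood only within the VLAN of the incoming port (traffic isolation)."""
    for members in vlans.values():
        if in_port in members:
            return sorted(members - {in_port})
    return []  # port not assigned to any VLAN

print(flood_ports(3))   # stays within ports 1-8
print(flood_ports(12))  # stays within ports 9-15
# Traffic from VLAN1 to VLAN2 must be forwarded via the router (L3).
```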

VLANs spanning multiple switches

• VLANs can span multiple switches
• Also over different routed subnets
  – Routers in between

[Figure: VLAN1 and VLAN2 spanning two switches connected through a router; on the second switch, ports 2, 3, and 5 belong to VLAN1 and ports 4, 6, 7, and 8 belong to VLAN2]

• 24

Outline

• What are data center networks?
• Layer 2 vs. Layer 3 in data center networks
• Data center network architectures
• TCP in data center networks
  – Problems of basic TCP
  – Data Center TCP (DCTCP)
• Conclusions

• 25

Design Alternatives for DCN

Two high-level choices for interconnects:

• Specialized hardware and communication protocols
  – E.g. InfiniBand seems common
  – Pros:
    • Can provide high bandwidth & extremely low latency
    • Custom hardware takes care of some reliability tasks
    • Relatively low-power physical layer
  – Cons:
    • Expensive
    • Not natively compatible with TCP/IP applications
• Commodity (1/10 Gb) Ethernet switches and routers
  – Compatible
  – Cheaper
  – We focus on this

• 26

Conventional DCN architecture

• Topology: two- or three-level trees of switches or routers
  – Multipath routing
  – High bandwidth by appropriate interconnection of many commodity switches
  – Redundancy

[Figure: conventional tree topology – Internet, layer-3 router, layer-2/3 aggregation switches, layer-2 Top-Of-Rack access switches, servers]

• 27

Issues with conventional architecture

• Bandwidth oversubscription (see the sketch after this slide)
  – Total bandwidth at the core/aggregation level is less than the summed-up bandwidth at the access level
  – Limited server-to-server capacity
  – Application designers need to be aware of the limitations
• No performance isolation
  – VLANs typically provide reachability isolation only
  – One server (service) sending/receiving too much traffic hurts all servers sharing its subtree

•  There are more…

• 28
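A quick back-of-the-envelope sketch of what oversubscription means, with made-up numbers (not figures from the slides): 40 servers at 1 Gb/s under a ToR with two 10 Gb/s uplinks gives 2:1 oversubscription, i.e. only ~0.5 Gb/s of cross-rack bandwidth per server in the worst case.

```python
def oversubscription(num_servers, server_gbps, num_uplinks, uplink_gbps):
    """Ratio of access bandwidth offered to servers vs. uplink bandwidth available."""
    access = num_servers * server_gbps
    uplink = num_uplinks * uplink_gbps
    return access / uplink

# Hypothetical rack: 40 x 1G servers, 2 x 10G uplinks from the ToR.
ratio = oversubscription(40, 1, 2, 10)
print(f"oversubscription {ratio}:1")                                  # 2.0:1
print(f"worst-case per-server cross-rack bandwidth: {2 * 10 / 40} Gb/s")  # 0.5
```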

One solution to oversubscription

• Fat-tree topology with a special lookup scheme
  – Add more commodity switches
    • Carefully designed topology
    • All ports have the same capacity as the servers
  – Enables (see the sizing sketch after the figure)
    • Full bisection bandwidth
    • Lower cost, because all switch ports have the same capacity
  – Drawbacks
    • Needs customized switches
      – Special two-level lookup scheme to distribute traffic
    • Lots of cabling

• 29

M. Al-Fares et al. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM 2008.

[Figure: fat tree with core, aggregation, and edge switch layers]
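For intuition on why a fat tree reaches full bisection bandwidth using only identical commodity ports, here is the standard sizing arithmetic for a k-ary fat tree (as described by Al-Fares et al.); the sketch just evaluates those formulas.

```python
def fat_tree(k):
    """Sizes of a k-ary fat tree built entirely from k-port switches."""
    assert k % 2 == 0
    edge = aggregation = k * (k // 2)   # k pods, k/2 switches of each type per pod
    core = (k // 2) ** 2
    hosts = k * (k // 2) * (k // 2)     # k pods * k/2 edge switches * k/2 hosts each
    return {"hosts": hosts, "edge": edge, "aggregation": aggregation, "core": core}

# With 48-port commodity switches:
print(fat_tree(48))  # 27648 hosts, 1152 edge + 1152 aggregation + 576 core switches
```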

One solution to performance isolation: VLB

• Random flow spreading with Valiant Load Balancing (VLB)
  – Similar fat-tree-like topology with commodity switches
  – Every flow “bounced” off a random intermediate switch
  – Provably hotspot-free for any admissible traffic matrix
  – No need to modify switches (standard forwarding)
    • Relies on ECMP and clever addressing (see the hashing sketch after the figure)
  – Requires some changes to servers

• 30

[Figure: VL2-style Clos topology – D/2 intermediate (VLB) switches, D aggregation switches with D/2 10G uplink ports each, Top-Of-Rack switches with 20 server ports, (D²/4)·20 servers in total]

A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
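A rough sketch of the flow-spreading idea: hash the flow's 5-tuple (ECMP-style) to pick one intermediate switch, so all packets of a flow take the same path while different flows are spread randomly. The switch names and hash choice are mine for illustration, not VL2's actual addressing/encapsulation scheme.

```python
import hashlib

intermediate_switches = ["int-0", "int-1", "int-2", "int-3"]  # hypothetical names

def pick_intermediate(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """ECMP-style: hash the 5-tuple so a flow sticks to one random intermediate."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return intermediate_switches[digest % len(intermediate_switches)]

# All packets of this flow bounce off the same intermediate switch:
print(pick_intermediate("10.0.1.5", "10.0.9.7", 40123, 80))
# A different flow is (with high probability) spread to a different one:
print(pick_intermediate("10.0.1.5", "10.0.9.7", 40999, 80))
```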

DCN architectures in research

• Lots of alternative architectures proposed in recent years
• Goals
  – Overcome limitations of today’s typical architectures
  – Use commodity standard equipment
• VL2 & Monsoon & CamCube (MSR)
• PortLand (UCSD)
• DCell & BCube (MSR, Tsinghua, UCLA)
• …

• 31

Outline

• What are data center networks?
• Layer 2 vs. Layer 3 in data center networks
• Data center network architectures
• TCP in data center networks
  – Problems of basic TCP
  – Data Center TCP (DCTCP)
• Conclusions

• 32

TCP in the Data Center

• TCP rules as the transport inside the DC
  – 99.9% of traffic
• DCNs are a different environment for TCP than normal Internet e2e transport
  – Very short delays
  – Specific application workloads
• How well does TCP work in DCNs?
  – Several problems…

• 33

Partition/Aggregate Application Structure

[Figure: a request from the Internet fans out through aggregators to worker nodes; deadlines tighten down the tree – 250 ms at the top level, 50 ms at mid-level aggregators, 10 ms at the workers]

• The foundation for many large-scale web applications
  – Web search, social network composition, ad selection, etc.
• Time is money → strict deadlines
• A missed deadline means a lower-quality result

• 34

Partition/Aggregate Application Structure (cont.)

[Figure: same fan-out tree with per-level deadlines of 250 ms, 50 ms, and 10 ms]

• Deadlines at the lower levels of the hierarchy must fit within the all-up deadline (see the toy sketch after this slide)
• Iterative requests are common
  – 1-4 iterations typical
  – Workers have tight deadlines
• 99.9th percentiles of delay matter for companies
  – 1 out of 1000 responses
  – Can potentially impact a large number of customers

• 35
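A toy sketch (my illustration, not from the slides) of the partition/aggregate pattern and its deadline budget: an aggregator fans a query out to workers, waits at most its deadline, and returns whatever answers arrived in time, so late workers degrade result quality instead of stalling the response.

```python
import random

def worker(part, rng):
    """Pretend worker: returns its partial answer after a random 'latency' in ms."""
    latency_ms = rng.expovariate(1 / 5.0)   # mean 5 ms, occasionally much longer
    return latency_ms, f"answer-for-{part}"

def aggregate(parts, deadline_ms, seed=0):
    """Fan out to all partitions; keep only answers that beat the deadline."""
    rng = random.Random(seed)
    results = []
    for part in parts:
        latency_ms, answer = worker(part, rng)
        if latency_ms <= deadline_ms:
            results.append(answer)          # on time: contributes to the result
        # else: dropped -> lower-quality result, but the deadline is met
    return results

answers = aggregate(parts=range(40), deadline_ms=10)
print(f"{len(answers)}/40 partitions answered within the 10 ms budget")
```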

Workloads

• Query-response traffic
  – Partition/Aggregate
  – Part of the “mice” flows
  – Requires minimal delay
• Background traffic
  – Short messages [50KB-1MB]
    • Coordination, control state
    • Part of the “mice” flows
  – Large flows [1MB-50MB]
    • Updating data on each server
    • The “elephant” flows
    • Require high throughput
• Problem:
  – All this traffic goes through the same switches
  – Requirements are conflicting

• 36

Traffic patterns from one cluster of Microsoft’s DCN

[Figure: heatmap of ln(bytes) exchanged between server pairs per 10 s]

• Traffic exchanged between server pairs in a 10 s period
• Servers within a rack are adjacent on the axis
• Work-Seeks-Bandwidth (W-S-B)
  – Small squares around the diagonal
• Scatter-Gather (S-G)
  – Horizontal and vertical lines

Traffic patterns from one cluster of Microsoft’s DCN (cont.)

• Work-seeks-bandwidth
  – Need to make an effort to place jobs under the same ToR
• Scatter-gather patterns
  – A server pushes/pulls data to/from many servers across the cluster
  – Distributed query processing: map, reduce
    • Data divided into small parts
    • Each server works on a particular part
    • Answers aggregated
  – Need for inter-ToR communication
    • Computation constrained by the network

DCN characteristics

• Network characteristics
  – Large aggregate bandwidths
  – Very short round-trip delays (<1 ms)
• Typical switches
  – Large numbers of commodity switches used
  – A commodity switch typically has shared memory
    • Common memory pool for all ports
  – Why not separate memory spaces?
    • A cost issue for commodity switches

• 39

Resulting problems with TCP in DCN

•  Incast

•  Queue Buildup

•  Buffer Pressure

• 40

Problems: Incast

• 41

[Figure: four workers respond simultaneously to an aggregator through the same switch port]

• Synchronized mice collide
  – Caused by Partition/Aggregate

Incast

• What happens next?
  – TCP timeout
  – Default minimum timeout values are 200-400 ms depending on the OS
• Why is that a major problem?
  – Several orders of magnitude longer than the RTT → huge penalty (see the arithmetic sketch after this slide)
  – Deadlines missed at all levels

• 42
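To see why incast bites, here is simple arithmetic with assumed numbers (not measurements from the slides): a shallow shared buffer of a commodity ToR port compared with a synchronized burst of worker responses, followed by the penalty of waiting out RTOmin.

```python
def incast_penalty(num_workers, response_kb, port_buffer_kb, rtt_us, rto_min_ms):
    """Rough incast arithmetic: does a synchronized burst overflow the port buffer?"""
    burst_kb = num_workers * response_kb
    overflow = burst_kb > port_buffer_kb
    # If losses push a flow into timeout, RTOmin dominates the completion time.
    penalty_vs_rtt = (rto_min_ms * 1000) / rtt_us if overflow else 1
    return burst_kb, overflow, penalty_vs_rtt

# Assumed values: 40 workers x 20 KB responses, ~128 KB usable buffer per port,
# 100 us RTT, 300 ms minimum retransmission timeout.
burst, overflow, penalty = incast_penalty(40, 20, 128, 100, 300)
print(f"burst = {burst} KB, overflows buffer: {overflow}")
print(f"a single timeout costs ~{penalty:.0f}x the RTT")   # ~3000x
```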

Problems: Incast

• 43

[Figure: one worker’s response suffers a TCP timeout while the aggregator waits; RTOmin = 300 ms]

Problems: Queue Buildup

• Remember the different workloads
  – Small “mice” flows
  – Large “elephant” flows
• Large flows can eat up the shared buffer space
  – Same outgoing port
• The result is similar to incast

• 44

Problems: Queue Buildup

[Figure: two senders share a switch port toward one receiver]

• Big flows build up queues
  – Increased latency for short flows
  – Packet loss

• 45

Problems: Buffer pressure

• A generalization of the previous problem
• Increased queuing delay and packet loss due to long flows traversing other ports
  – Shared memory pool
  – Packets entering and leaving through different ports still eat up the common buffer space

• 46

Outline

• What are data center networks?
• Layer 2 vs. Layer 3 in data center networks
• Data center network architectures
• TCP in data center networks
  – Problems of basic TCP
  – Data Center TCP (DCTCP)
• Conclusions

• 47

Data Center Transport Requirements

• 48

1. High burst tolerance
   – Cope with the incast problem
2. Low latency
   – Short flows, queries
3. High throughput
   – Continuous data updates, large file transfers

We want to achieve all three at the same time

Exploring the solution space

Proposal                                        | Throughput                            | Burst tolerance (incast)                                  | Latency
Deep switch buffers                             | Can achieve high throughput           | Tolerates large bursts                                    | Queuing delays increase latency
Shallow buffers                                 | Can hurt throughput of elephant flows | Cannot tolerate bursts well                               | Avoids long queuing delay
Jittering                                       | No major impact                       | Prevents incast                                           | Increases median latency
Shorter RTOmin                                  | No major impact                       | Helps recover faster                                      | Doesn't help queue buildup
Network-assisted congestion control (ECN style) | High throughput with high utilization | Helps in most cases; problem if even 1 packet is too much | Reacts early to queue buildup

• 49


Jittering

• Add a random delay before responding (see the sketch after this slide)
  – Desynchronizes the responding sources to avoid buffer overflow
• Jittering trades off the median against high percentiles

[Figure: MLA query completion time (ms), jittering off vs. jittering on; requests are jittered over a 10 ms window]

• 50
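A minimal sketch of jittering as described above (illustrative, not a production mechanism): each worker sleeps for a random delay within a 10 ms window before sending its response, which desynchronizes the burst at the cost of a higher median completion time.

```python
import random, time

JITTER_WINDOW_MS = 10  # responses are spread over this window (as on the slide)

def send_response(payload):
    print(f"sent {len(payload)} bytes at t={time.monotonic():.4f}")

def respond_with_jitter(payload, rng=random):
    """Delay the response by a random amount within the jitter window."""
    delay_s = rng.uniform(0, JITTER_WINDOW_MS / 1000.0)
    time.sleep(delay_s)            # desynchronizes otherwise-simultaneous senders
    send_response(payload)

respond_with_jitter(b"x" * 2000)   # median latency goes up by ~5 ms on average
```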

Exploring the solution space (cont.)

(Table revisited; the one revision: a shorter RTOmin improves throughput by recovering faster, though it still doesn't help queue buildup.)

• 51

Review: TCP with ECN

• 52

[Figure: two senders toward one receiver; the congested switch sets an ECN mark (1 bit) in passing packets]

• ECN = Explicit Congestion Notification
• Q: How do TCP senders react?
  A: Cut sending rate by half

DCTCP: Two key ideas

1. React in proportion to the extent of congestion, not just its presence
   – Reduces variance in sending rates, lowering queuing requirements
2. Mark based on instantaneous queue length
   – Fast feedback to better deal with bursts

ECN marks           | TCP               | DCTCP
1 0 1 1 1 1 0 1 1 1 | Cut window by 50% | Cut window by 40%
0 0 0 0 0 0 0 0 0 1 | Cut window by 50% | Cut window by 5%

Q: Why does normal TCP with ECN not behave like DCTCP?
A: Fairness…

• 53

Data Center TCP Algorithm

• Switch side:
  – Mark packets when queue length > K
• Sender side:
  – Maintain a moving average of the fraction of packets marked (α):
    α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT
  – In each RTT, adaptive window decrease (see the sketch after this slide):
    cwnd ← cwnd · (1 − α/2)
    • Note: the decrease factor is between 1 (α = 0) and 2 (α = 1)

• 54

[Figure: switch marking threshold K (in KBytes) – mark packets when the queue exceeds K, don’t mark below]
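A condensed Python sketch of the DCTCP control law described above, applied once per RTT, under the usual simplifications (congestion avoidance only, window in packets, g = 1/16 as in the paper); real implementations live inside the kernel TCP stack.

```python
class DctcpSender:
    """Sketch of DCTCP's sender-side reaction to ECN marks (per-RTT update)."""

    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd      # congestion window in packets
        self.alpha = 0.0      # moving average of the marked fraction
        self.g = g            # averaging weight

    def on_rtt_end(self, packets_sent, packets_marked):
        frac = packets_marked / packets_sent if packets_sent else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if packets_marked:
            # Scale the cut with the extent of congestion: cwnd <- cwnd * (1 - alpha/2)
            self.cwnd *= 1 - self.alpha / 2
        else:
            self.cwnd += 1    # standard additive increase
        return self.cwnd

s = DctcpSender()
print(s.on_rtt_end(10, 1))   # light marking -> tiny decrease
print(s.on_rtt_end(10, 10))  # heavy marking -> alpha ramps up; cut approaches 50% over successive RTTs
```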

DCTCP in Action

• 55

Why does DCTCP work?

• High burst tolerance
  – Aggressive marking → sources react before packets are dropped
  – Large buffer headroom → bursts fit
• Low latency
  – Small buffer occupancies → low queuing delay
• High throughput
  – ECN averaging → smooth rate adjustments, low variance
  – Leads to high utilization

• 56

Completely solves the Incast problem?

•  Remember Incast: large number of synchronized small flows hit the same queue

• Depends on the number of small flows
  – Does not help if the number is so high that even 1 packet from each flow is enough to overwhelm the buffer in a synchronized burst
    • No congestion control helps
    • The only solution is to somehow schedule the responses (e.g. jittering)
• Helps if each flow has several packets to transmit
  – Windows build up over multiple RTTs
  – Bursts in subsequent RTTs would lead to packet drops
  – DCTCP sources receive enough ECN feedback to prevent buffer overflows

• 57

Comparing TCP and DCTCP

• Emulate traffic within 1 rack of a Bing cluster
  – 45 1G servers, one 10G server for external traffic
• Generate query and background traffic
  – Flow sizes and arrival times follow distributions seen in Bing
• Metric:
  – Flow completion time for queries and background flows
• RTOmin = 10 ms for both TCP & DCTCP
  – A more-than-fair comparison

• 58

Comparing TCP and DCTCP (cont.)

[Figures: flow completion times for background flows and query flows]

• 59

Comparing TCP and DCTCP (cont.)

[Figures: flow completion times for background flows and query flows]

Low latency for short flows

• 60

Comparing TCP and DCTCP (cont.)

[Figures: flow completion times for background flows and query flows]

High throughput for long flows

• 61

Comparing TCP and DCTCP (cont.)

[Figures: flow completion times for background flows and query flows]

High burst tolerance for query flows

• 62

DCTCP summary

• DCTCP
  – Handles bursts well
  – Keeps queuing delays low
  – Achieves high throughput
• Features:
  – Simple change to TCP and a single switch parameter
  – Based on existing mechanisms

• 63

TCP for DCN research

• Data transport in DCNs has received a lot of attention recently
• Several solutions proposed just this year
  – Deadline-Aware Datacenter TCP (D2TCP) (Purdue, Google)
  – DeTail (a cross-layer solution) (Berkeley, Facebook)
  – …

• 64

Outline

• What are data center networks?
• Layer 2 vs. Layer 3 in data center networks
• Data center network architectures
• TCP in data center networks
  – Problems of basic TCP
  – Data Center TCP (DCTCP)
• Conclusions

• 65

Wrapping up

• Data center networks pose specific networking challenges
  – Potentially huge scale
  – Different requirements than traditional Internet applications
• Recently a lot of research activity
  – New proposed architectures and protocols
  – A big deal to companies with mega-scale data centers: $$
• The popularity of cloud computing accelerates this evolution

• 66

Want to know more?

1. M. Arregoces and M. Portolani. Data Center Fundamentals. Cisco Press, 2003.
2. S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The Nature of Data Center Traffic: Measurements & Analysis. In Proceedings of IMC 2009.
3. V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In Proceedings of ACM SIGCOMM 2009.
4. A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In Proceedings of ACM SIGCOMM 2009.
5. C. Guo et al. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In Proceedings of ACM SIGCOMM 2008.
6. M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In Proceedings of ACM SIGCOMM 2008.
7. R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In Proceedings of ACM SIGCOMM 2009.
8. D. A. Joseph, A. Tavakoli, and I. Stoica. A Policy-aware Switching Layer for Data Centers. In Proceedings of ACM SIGCOMM 2008.
9. C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In Proceedings of ACM SIGCOMM 2009.
10. H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. Symbiotic Routing in Future Data Centers. In Proceedings of ACM SIGCOMM 2010.
11. Check the SIGCOMM 2012 program as well.

• 67
