FAST TCP
Cheng Jin, David Wei, Steven Low
netlab.caltech.edu
Acknowledgments
Caltech: Bunn, Choe, Doyle, Hegde, Jayaraman, Newman, Ravot, Singh, X. Su, J. Wang, Xia
UCLA: Paganini, Z. Wang
CERN: Martin
SLAC: Cottrell
Internet2: Almes, Shalunov
MIT Haystack Observatory: Lapsley, Whitney
TeraGrid: Linda Winkler
Cisco: Aiken, Doraiswami, McGugan, Yip
Level(3): Fernes
LANL: Wu
Outline
• Motivation & approach
• FAST architecture
• Window control algorithm
• Experimental evaluation
skip: theoretical foundation
Congestion control
[Figure: sources with rates x_i(t) and links with congestion measures p_l(t)]
Example congestion measure p_l(t):
• Loss (Reno)
• Queueing delay (Vegas)
TCP/AQM
Congestion control is a distributed asynchronous algorithm to share bandwidth.
It has two components:
• TCP: adapts sending rate (window) to congestion
• AQM: adjusts & feeds back congestion information
They form a distributed feedback control system:
• Equilibrium & stability depend on both TCP and AQM
• And on delay, capacity, routing, #connections
[Figure: feedback loop. TCP (Reno, Vegas) sets x_i(t); AQM (DropTail, RED, REM/PI, AVQ) feeds back p_l(t)]
Difficulties at large window
Equilibrium problem
• Packet level: AI too slow, MD too drastic
• Flow level: required loss probability too small
Dynamic problem
• Packet level: must oscillate on binary signal
• Flow level: unstable at large window
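The equilibrium problem can be made concrete with a little arithmetic. The sketch below assumes the Reno equilibrium window W ≈ 1.225/√p (the Mathis formula quoted elsewhere in this talk), 1500-byte packets, and a 100 ms round trip; the packet size, the RTT, the link speeds, and all function names are illustrative assumptions, not values from the slides.

```python
PACKET_BITS = 1500 * 8  # assumption: 1500-byte packets


def window_pkts(capacity_bps, rtt_s):
    """Window (packets) that fills the path: the bandwidth-delay product."""
    return capacity_bps * rtt_s / PACKET_BITS


def reno_required_loss(window):
    """Loss probability at which Reno's equilibrium window equals `window`.

    Inverts W = sqrt(1.5 / p), i.e. W ~ 1.225 / sqrt(p).
    """
    return 1.5 / window ** 2


def reno_recovery_time_s(window, rtt_s):
    """Time to climb back from W/2 to W at one packet per RTT (the AI phase)."""
    return (window / 2.0) * rtt_s


if __name__ == "__main__":
    rtt = 0.1  # assumption: 100 ms round trip
    for capacity in (100e6, 1e9, 10e9):
        w = window_pkts(capacity, rtt)
        print(f"{capacity / 1e9:5.1f} Gbps: W = {w:9.0f} pkts, "
              f"p = {reno_required_loss(w):.2e}, "
              f"recovery = {reno_recovery_time_s(w, rtt):7.1f} s")
```

Under these assumptions a 10 Gbps path needs a window of roughly 83,000 packets, which Reno sustains only if the loss probability is about 2e-10, and a single halving takes over an hour of additive increase to undo.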
Packet & flow level
Packet level: Reno TCP
• ACK: W ← W + 1/W
• Loss: W ← W - 0.5W
Flow level
• Equilibrium: W = 1.225/√p pkts (Mathis formula)
• Dynamics: [equation lost in extraction]
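The step from the packet-level rules to the Mathis formula can be sketched as follows, under the usual idealization of a periodic sawtooth with exactly one loss per cycle. The symbols W_max (peak window), W-bar (average window), and p (loss probability) are introduced here for illustration.

```latex
% Reno sawtooth: the window climbs from W_max/2 to W_max at one packet
% per RTT (W_max/2 round trips), then halves on a single loss.
\begin{align*}
\text{packets per cycle} &\approx \frac{W_{\max}}{2}\cdot\frac{3}{4}W_{\max}
  = \frac{3}{8}W_{\max}^{2},\\
p &= \frac{1}{\text{packets per cycle}} = \frac{8}{3W_{\max}^{2}},\\
\bar{W} &= \frac{3}{4}W_{\max} = \frac{3}{4}\sqrt{\frac{8}{3p}}
  = \sqrt{\frac{3}{2p}} \approx \frac{1.225}{\sqrt{p}}\ \text{pkts}.
\end{align*}
```

Dividing the average window by the round-trip time gives the same relation as a rate in pkts/sec.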
Reno TCP
• Packet level: designed and implemented first
• Flow level: understood afterwards
• Flow level dynamics determines
  • Equilibrium: performance, fairness
  • Stability
Design flow level equilibrium & stability
Implement flow level goals at packet level
Packet level design of FAST, HSTCP, STCP guided by flow level properties
Packet level
Reno: AIMD(1, 0.5)
• ACK: W ← W + 1/W
• Loss: W ← W - 0.5W
HSTCP: AIMD(a(w), b(w))
• ACK: W ← W + a(w)/W
• Loss: W ← W - b(w)W
STCP: MIMD(a, b)
• ACK: W ← W + 0.01
• Loss: W ← W - 0.125W
FAST
• RTT: W ← W · baseRTT/RTT + α
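The packet-level rules on this slide can be written side by side as small functions. This is a minimal sketch, not any stack's implementation. The HSTCP functions a(w) and b(w) are not given on the slide, so they are passed in as caller-supplied callables. The FAST line is garbled in the extraction; it is read here as W ← W · baseRTT/RTT + α, following the published FAST TCP description, and the name `alpha` is an assumption.

```python
def reno_on_ack(w):
    """Reno, AIMD(1, 0.5): each ACK adds 1/W, about one packet per RTT."""
    return w + 1.0 / w


def reno_on_loss(w):
    """Reno: halve the window on loss."""
    return w - 0.5 * w


def hstcp_on_ack(w, a):
    """HSTCP, AIMD(a(w), b(w)): `a` is a caller-supplied function of w."""
    return w + a(w) / w


def hstcp_on_loss(w, b):
    """HSTCP: `b` is a caller-supplied function of w."""
    return w - b(w) * w


def stcp_on_ack(w):
    """STCP, MIMD: a fixed 0.01 per ACK, about 1% growth per RTT."""
    return w + 0.01


def stcp_on_loss(w):
    """STCP: cut the window by one eighth on loss."""
    return w - 0.125 * w


def fast_per_rtt(w, base_rtt, rtt, alpha):
    """FAST (assumed reading): once per RTT, scale by baseRTT/RTT, add alpha."""
    return w * base_rtt / rtt + alpha


if __name__ == "__main__":
    w = 100.0
    print("Reno after loss:", reno_on_loss(w))
    print("STCP after loss:", stcp_on_loss(w))
    # With no queueing (rtt == base_rtt) FAST grows by alpha per RTT;
    # with queueing the first term shrinks the window toward a fixed point.
    print("FAST, empty queue:", fast_per_rtt(w, 0.1, 0.1, 20.0))
    print("FAST, queue built:", fast_per_rtt(w, 0.1, 0.2, 20.0))
```

The three loss-based rules react only to the binary ACK/loss signal, while the FAST rule takes the measured RTT as input, which is what lets its step size depend on how far the queueing delay is from its target.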
Flow level: Reno, HSTCP, STCP, FAST
Similar flow level equilibrium (rate in pkts/sec, Mathis formula)
α = 1.225 (Reno), 0.120 (HSTCP), 0.075 (STCP)
Flow level: Reno, HSTCP, STCP, FAST
Different gain and utility U_i
• They determine equilibrium and stability
Different congestion measure p_i
• Loss probability (Reno, HSTCP, STCP)
• Queueing delay (Vegas, FAST)
Common flow level dynamics!
window adjustment = control gain × flow level goal
Implementation strategy
Common flow level dynamics
window adjustment = control gain × flow level goal
• Small adjustment when close, large far away
  • Need to estimate how far current state is from target
  • Scalable
• Window adjustment independent of p_i
  • Depends only on current window
  • Difficult to scale
Architecture
[Figure: architecture diagram with labels: RTT timescale, loss recovery, <RTT timescale]
Each component is designed independently and can be upgraded asynchronously.
Window control
• Uses delay as congestion measure
• Delay provides finer congestion info
• Delay scales correctly with network capacity
• Can operate with low queueing delay
FAST-TCP basic idea
[Figure: queue delay and loss versus window, with capacity C marked; operating points of FAST and loss-based TCP]
Window control algorithm
• Full utilization: regardless of bandwidth-delay product
• Globally stable: exponential convergence
• Fairness: weighted proportional fairness, parameter α
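The utilization and convergence claims can be illustrated with a toy single-flow, single-bottleneck model. This is a sketch under stated assumptions, not the FAST implementation. The window rule is taken as W ← W · baseRTT/RTT + α, applied once per RTT. The smoothing factor `gamma`, the fluid queue model (RTT = baseRTT + backlog/capacity), and all names and numbers are assumptions made for illustration.

```python
def simulate_fast(capacity_pps, base_rtt, alpha, gamma=0.5, w0=10.0, steps=200):
    """Iterate a smoothed FAST-style window update for one flow.

    Returns a list of (window, rtt) pairs, one per update. The rtt in each
    pair is the value measured just before that update was applied.
    """
    bdp = capacity_pps * base_rtt          # packets in flight with an empty queue
    w = w0
    trace = []
    for _ in range(steps):
        backlog = max(0.0, w - bdp)        # packets queued at the bottleneck
        rtt = base_rtt + backlog / capacity_pps
        target = w * base_rtt / rtt + alpha
        w = (1.0 - gamma) * w + gamma * target   # assumed smoothing step
        trace.append((w, rtt))
    return trace


if __name__ == "__main__":
    # Assumed numbers: 10,000 pkts/s bottleneck, 100 ms base RTT, alpha = 50.
    trace = simulate_fast(capacity_pps=10_000, base_rtt=0.1, alpha=50.0)
    w, rtt = trace[-1]
    print(f"window = {w:.1f} pkts, rtt = {rtt * 1000:.1f} ms, "
          f"throughput = {w / rtt:.0f} pkts/s")
```

In this model the window settles at the bandwidth-delay product plus α, so the flow keeps α packets queued and the link stays full whatever the bandwidth-delay product is. Once the queue is non-empty the distance to that fixed point shrinks by a factor of (1 - gamma) per update, which is the geometric convergence the slide refers to.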
Outline
• Motivation & approach
• FAST architecture
• Window control algorithm
• Experimental evaluation
  • Abilene-HENP network
  • Haystack Observatory
  • DummyNet
Abilene Test
[Figure: Abilene test path with OC48 and OC192 links]
[Figures: throughput traces (Yang Xia, Harvey Newman, Caltech)]
• Periodic losses every 10 mins
• FAST backs off to make room for Reno
Haystack Experiments
Lapsley, MIT Haystack
Haystack - 1 Flow (Atlanta -> Japan)
• Iperf used to generate traffic.
• Sender is a Xeon 2.6 GHz.
• Window was constant: burstiness in rate due to host processing and ack spacing.
Haystack - 2 Flows from 1 machine (Atlanta -> Japan)
Linux Loss Recovery
Timeout: all outstanding packets marked as lost.
1. SACKs reduce lost packets
2. Lost packets retransmitted slowly as cwnd is capped at 1 (bug).
DummyNet Experiments
Experiments using emulated network: 800 Mbps emulated bottleneck in DummyNet.
Sender PC: Dual Xeon 2.6 GHz, 2Gb, Intel GbE, Linux 2.4.22
DummyNet PC: Dual Xeon 3.06 GHz, 2Gb, FreeBSD 5.1, 800 Mbps
Receiver PC: Dual Xeon 2.6 GHz, 2Gb, Intel GbE, Linux 2.4.22
Dynamic sharing: 3 flows
[Figure panels: FAST, Linux, HSTCP, BIC]
Dynamic sharing on Dummynet: capacity = 800 Mbps, delay = 120 ms, 3 flows, iperf throughput, Linux 2.4.x (HSTCP: UCL)
• Steady throughput
[Figure panels: throughput, loss, and queue over 30 min for FAST, Linux, STCP, HSTCP, BIC]
Dynamic sharing on Dummynet: capacity = 800 Mbps, delay = 120 ms, 14 flows, iperf throughput, Linux 2.4.x (HSTCP: UCL)
• Room for mice!
Average Queue vs Buffer Size
Dummynet: capacity = 800 Mbps, delay = 200 ms, 1 flow, buffer size: 50, …, 8000 pkts
(S. Hegde, B. Wydrowski, etc., Caltech)
Is large queue necessary for high throughput?
FAST TCP: motivation, architecture, algorithms, performance. IEEE Infocom March 2004
-release: April 2004. Source freely available for any non-profit use
netlab.caltech.edu/FAST
Aggregate throughput
[Figure: aggregate throughput against ideal performance; small window (800 pkts) and large window (8000 pkts) cases]
Dummynet: cap = 800 Mbps; delay = 50-200 ms; #flows = 1-14; 29 expts
Fairness
[Figure: Jain's index for each protocol; annotation: HSTCP ~ Reno]
Dummynet: cap = 800 Mbps; delay = 50-200 ms; #flows = 1-14; 29 expts
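For reference, Jain's index for n flows with throughputs x_1..x_n is (Σx)² / (n · Σx²): it equals 1 when all flows get the same throughput and 1/n when one flow takes everything. A minimal sketch follows; the function name and the sample numbers are illustrative, not data from these experiments.

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one flow with non-zero throughput")
    total = sum(throughputs)
    return total * total / (n * sum_sq)


if __name__ == "__main__":
    print(jain_index([5.0, 5.0, 5.0]))       # equal shares
    print(jain_index([1.0, 0.0, 0.0, 0.0]))  # one flow takes everything
    print(jain_index([400.0, 250.0, 150.0])) # made-up unequal split
```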
Stability
[Figure: stability results; annotation: stable in diverse scenarios]
Dummynet: cap = 800 Mbps; delay = 50-200 ms; #flows = 1-14; 29 expts
BACKUP Slides
IP Rights
• Caltech owns IP rights
  • applicable more broadly than TCP
  • leave all options open
• IP freely available if FAST TCP becomes IETF standard
• Code available on FAST website for any non-commercial use
WAN in Lab
Caltech: John Doyle, Raj Jayaraman, George Lee, Steven Low (PI), Harvey Newman, Demetri Psaltis, Xun Su, Yang Xia
Cisco: Bob Aiken, Vijay Doraiswami, Chris McGugan, Steven Yip
netlab.caltech.edu
NSF
Key Personnel
• Steven Low, CS/EE
• Harvey Newman, Physics
• John Doyle, EE/CDS
• Demetri Psaltis, EE
Cisco
• Bob Aiken
• Vijay Doraiswami
• Chris McGugan
• Steven Yip
Raj Jayaraman, CS; Xun Su, Physics; Yang Xia, Physics; George Lee, CS
2 grad students, 3 summer students, Cisco engineers
Spectrum of tools
[Figure: log(cost) versus log(abstraction)]
• Math: Mathis formula, optimization, control theory, nonlinear model, stochastic model
• Simulation: NS, SSFNet, QualNet, JavaSim
• Emulation: DummyNet, EmuLab, ModelNet, WAIL
• Live network: PlanetLab, Abilene, NLR, DataTAG, CENIC, WAIL, etc.
• WAN in Lab: ?
…we use them all
Spectrum of tools
math, simulation, emulation, live network, WAN in Lab
Distance High High High
Speed High High Low
Realism High High Low
Traffic High Low Low
Configurable Low Medium High
Monitoring Low Medium High
Cost High Medium Low
Critical in development, e.g. Web100
Goal
• State-of-the-art hybrid WAN
  • High speed, large distance: 2.5G - 10G, 50 - 200 ms
  • Wireless devices connected by optical core
• Controlled & repeatable experiments
• Reconfigurable & evolvable
• Built-in monitoring capability
WAN in Lab
5-year plan
• 6 Cisco ONS15454
• 4 routers
• 10s servers
• Wireless devices
• 800 km fiber
• ~100 ms RTT
[Figure: 5-year network diagram. Optical network of six ONS15454 nodes (Sites A-F) joined by 10GE links over 100 km; four Cisco 7613 bottleneck routers with ML-Series network modules; OSPF areas 10, 20, 30, 40; Linux farm servers, Itanium 10GE servers, and wireless components on subnets 192.168.10/24 through 192.168.40/24 and 10.0.1/24 through 10.0.5/24]
V. Doraiswami (Cisco), R. Jayaraman (Caltech)
WAN in Lab
Year-1 plan
• 3 Cisco ONS15454
• 2 routers
• 10s servers
• Wireless devices
[Figure: year-1 network diagram. Optical network of ONS15454 nodes at Sites A, B, D, plus ONS15454 chassis to support additional ML-Series cards; 10GE over 100 km; two Cisco 7613 bottleneck routers; OSPF areas 10 and 20; servers, Itanium 10GE servers, and wireless components on subnets 192.168.10/24, 192.168.20/24, 10.0.1/24, 10.0.2/24]
V. Doraiswami (Cisco), R. Jayaraman (Caltech)
Hybrid Network
Scenarios:
• Ad hoc network
• Cellular network
• Sensor network
How does the optical core support wireless edges?
X. Su (Caltech)
Experiments
• Transport & network layer
  • TCP, AQM, TCP/IP interaction
• Wireless hybrid networking
  • Wireless media delivery
  • Fixed wireless access
  • Sensor networks
• Optical control plane
• Grid computing
  • UltraLight
Unique capabilities
• WAN in Lab
  • Capacity: 2.5 - 10 Gbps
  • Delay: 0 - 100 ms round trip
  • Delay: 0 - 400 ms round trip
• Configurable & evolvable
  • Topology, rate, delays, routing
  • Always at cutting edge
• Flexible, active debugging
  • Passive monitoring, AQM
• Integral part of R&A networks
  • Transition from theory, implementation, demonstration, deployment
  • Transition from lab to marketplace
• Global resource
  • Part of global infrastructure UltraLight led by Newman
Experiment
[Figure: WAN in Lab at Caltech connected to research & production networks: Calren2/Abilene, StarLight (Chicago), SURFNet (Amsterdam), CERN (Geneva); multi-Gbps, 50-200 ms delay]
Network debugging
• Performance problems in real network
  • Simulation will miss
  • Emulation might miss
  • Live network hard to debug
• WAN in Lab
  • Passive monitoring inside network
  • Active debugging possible
Passive monitoring
[Figure: fiber splitter feeding a monitor with DAG card, GPS timestamping of headers, and RAID storage]
• No overhead on system
• Can capture full info at OC48
  • UofWaikato's DAG card captures at OC48 speed
  • Can filter if necessary
  • Disk speed = 2.5 Gbps × 40/1500 ≈ 66.7 Mbps
• Monitors synchronized by GPS or cheaper alternatives
• Data stored for offline analysis
D. Wei (Caltech)
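The disk-speed figure on this slide is the link rate scaled by the fraction of each packet that is kept. A small sketch of that arithmetic follows, reading the slide's 40 and 1500 as header bytes and packet bytes; the function and parameter names are illustrative.

```python
def header_capture_rate_bps(link_bps, header_bytes=40, packet_bytes=1500):
    """Write rate needed to store only the headers of a fully loaded link."""
    return link_bps * header_bytes / packet_bytes


if __name__ == "__main__":
    rate = header_capture_rate_bps(2.5e9)   # OC48 is roughly 2.5 Gbps
    print(f"{rate / 1e6:.1f} Mbps")
```

The 1500-byte figure assumes full-size packets. A link carrying smaller packets has more headers per second, so the required write rate rises accordingly.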
Passive monitoring
[Figure: servers and routers with monitors attached along the path via fiber splitters (DAG, GPS-timestamped headers, RAID); Web100, MonALISA]
D. Wei (Caltech)
UltraLight testbed
UltraLight team (Newman)
Status
• Hardware
  • Optical transport design: finalized
  • IP infrastructure design: finalized (almost)
  • Wireless infrastructure design: finalized
  • Price negotiation/ordering/delivery: summer 04
• Software
  • Passive monitoring: summer student
  • Management software: 2005 -
• Physical lab
  • Renovation: to be completed by summer 04
Status
[Figure: timeline 2003-2007. Milestones: NSF funds 10/03, ARO funds 5/04, usable testbed 12/04, useful testbed 12/05. Activities: fundraising, hardware design, physical building, monitoring, traffic generation, connected UltraLight, expansion, support, management]
[Figure: lab location. CS Dept, Jorgensen Lab; NetLab; WAN in Lab]
G. Lee, R. Jayaraman, E. Nixon (Caltech)
Summary
• Testbed driven by research agenda
  • Rich and strong networking effort
  • Integrated approach: theory + implementation + experiments
  • "A network that can break"
• Integral part of real testbeds
  • Part of global infrastructure
  • UltraLight led by Harvey Newman (Caltech)
• Integrated monitoring & measurement facility
  • Fiber splitter passive monitors
  • MonALISA