End-To-End Provisioned Optical Network Testbed for Large-Scale eScience Applications
Nagi Rao, Bill Wing
Computer Science and Mathematics Division
Oak Ridge National Laboratory
[email protected], [email protected]
Nov 12, 2003
Project Kick-off Meeting, University of Virginia
Sponsored by
NSF Experimental Infrastructure Networks Program
Tony Mezzacappa
Physics Division
Oak Ridge National Laboratory
Outline of Presentation
1. Project details
2. TSI network and application interface requirements
3. Transport for dedicated channels
   a. Dynamics of shared streams
   b. Channel stabilization
4. Work plan
ORNL Project Details
Principal Investigators:
• Nagi Rao – Computer Scientist/Engineer
• Bill Wing – Network Engineer/Scientist
• Tony Mezzacappa – Astrophysicist

Technical Staff:
• Qishi Wu – Post-Doctoral Fellow
• Menxia Zhu – PhD Student
• Steven Carter – Systems and Network Support

Budget: 850K (364K in year 1)
TSI Computations: Networking Support

Networking activities:
• Data transfers: archive and supply massive amounts of data (terabytes/day)
• Interactive visualizations: visualize archival or on-line data
• Remote steering and control: steer computations and visualizations into regions of interest
• Coordinated operations: collaborative visualization and steering

(Diagram: visualization, control, and data streams connecting the TSI components.)
Types of Networking Channels

High-bandwidth data channels:
• Off-line transfers: terabyte datasets
  – Supercomputers – high-performance storage systems
  – Storage – host nodes and visualization servers
• On-line transfers: supercomputers – visualization nodes

Control and steering channels:
• Interactive visualization – human response time
• Computational steering – respond to the "inertia" of the computation

Coordinated channels:
• Coordinated visualization, steering, and archival
• Multiple visualization and steering nodes

On the Internet these channels can be supported only in a limited way: it is difficult to sustain large data rates in a fair manner, and the unpredictability of transport dynamics makes stability very hard to achieve.
Data Transfers Over Dedicated Channels

Several candidate protocols (to be tested):

UDP-based data transport:
• UDT (SABUL), Tsunami, Hurricane, RBUDP, IQ-RUDP, and others
• Advantages: application-level implementations and conceptually simple methods
• Disadvantages: unstable code and hard-to-configure parameters

Tuned TCP methods:
• net100: tune flow windows large enough to avoid self-created losses
• Advantages: known mechanisms and tested kernel code
• Disadvantages: physical losses are problematic – TCP interprets physical losses as congestion and reduces throughput

Host issues at 1-10 Gbps rates – impedance matching:
• Buffering in the NIC, kernel, and application; disk speeds
• Zero-copy kernel patch and ST
• OS bypass, RDMA
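To make the tuned-TCP option concrete, here is a minimal sketch (not from the talk) of sizing socket buffers to the bandwidth-delay product so the flow window never self-limits; the 1 Gbps rate and 40 ms RTT are illustrative assumptions, and the kernel may clamp the request to its sysctl caps.

import socket

# Bandwidth-delay product sizing (illustrative values): a 1 Gbps channel
# with a 40 ms RTT needs ~5 MB of window to keep the pipe full.
BANDWIDTH_BPS = 1_000_000_000   # assumed channel rate
RTT_SEC = 0.040                 # assumed round-trip time
bdp_bytes = int(BANDWIDTH_BPS / 8 * RTT_SEC)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request send/receive buffers at least as large as the BDP so the flow
# window never self-limits; the kernel may clamp these to its sysctl caps.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
print("requested", bdp_bytes, "bytes; granted",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF), "bytes")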
Multiple Streams Over Dedicated Channels
Example:
• Monitor the computation through a visualization channel
• Interactive visualization – rotate, project different subspaces
• Computational steering – specify parameters on the fly
• Archive/load the data – store the interesting data

(Diagram: visualization, steering/visualization-control, and data streams between hosts and high-performance storage.)

Option 1: Dedicated channel for each stream
• 4 NICs – 4 MSPP slots

Option 2: Share dedicated channels
• Single NIC and MSPP slot
• Realize sharing at the protocol or application level

Option 3: Visualization streams on one channel; data and steering streams on another
• Two NICs and MSPP slots
• Realize sharing at the protocol or application level
Terminology Review

• Connection – logical: host site to host site
• Circuit / channel / bandwidth pipe – physical: NIC to NIC
• Stream – logical: application to application

(Diagram: visualization, control, and data streams multiplexed over one connection.)
Dedicated NIC-NIC Channels

Advantages:
• No other traffic on the channel
• Simpler protocols: rate controllers with loss-recovery mechanisms would suffice for data transfers and for control channels on host-host connections (see the sketch below)
• Coordination between the streams can be handled at the application/middleware level

Disadvantages:
• Scaling problems: a single connection requires 4 NIC-NIC pairs and 4 channels in the example; a main computation site supporting 5 users requires a host with 20 NICs, 20 channels, and an MSPP with at least 20 slots (e.g., 5 blades, each with 4 GigE ports)
• Utilization problems: even a small control stream needs an entire channel (the minimum provisioning granularity), e.g., a 10 Mbps control stream on a GigE channel
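As an illustration of the "rate controller with loss recovery" idea, here is a minimal sketch assuming a hypothetical wire format (4-byte sequence-number prefix) and a receiver that reports missing block numbers back as NACK datagrams; it is not the project's protocol.

import socket
import time

def paced_udp_send(blocks, dest, rate_mbps):
    """Send numbered datagrams at a fixed rate, then retransmit any block
    numbers the receiver reports back as NACK datagrams (hypothetical
    wire format: 4-byte big-endian sequence-number prefix)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(0.2)                        # NACK collection window
    pending = dict(enumerate(blocks))
    while pending:
        for seq, block in pending.items():
            sock.sendto(seq.to_bytes(4, "big") + block, dest)
            # open-loop pacing: one datagram's wire time at the target rate
            time.sleep((len(block) + 4) * 8 / (rate_mbps * 1e6))
        nacked = {}
        try:
            while True:                         # drain NACKs until timeout
                msg, _ = sock.recvfrom(4)
                seq = int.from_bytes(msg, "big")
                if seq in pending:
                    nacked[seq] = pending[seq]
        except socket.timeout:
            pass                                # silence: everything arrived
        pending = nacked                        # next pass resends losses only

On a dedicated channel with no cross traffic, this open-loop pacing plus NACK recovery is the entire protocol; there is no congestion estimation to get wrong.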
Multiple Streams on Single NIC-NIC Channel

Streams interact and affect each other. Packets may be "pooled" at the source and destination nodes:
• NIC – interrupt coalescing and buffer clearing
• NIC-kernel transfers through buffers
• Kernel-application transfers
• Processor load determines interrupt response times at finer levels

Two important consequences:
• Protocols or applications need to "share" the channel: we need protocols that allow appropriate bandwidth sharing – a TCP-like paradigm but a more structured problem, since the total bandwidth is known and the competing traffic is host-generated (see the allocator sketch below)
• Protocol interaction could generate complicated dynamics: we need protocols that stabilize the dynamics of control channels; very few existing protocols protect against "underflow"

A combination of existing and newer protocols is needed.
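Because the channel capacity and the competing streams are both known to the hosts, sharing can be computed directly rather than probed TCP-style. A max-min-fair allocator sketch (illustrative, not the project's algorithm):

def partition_channel(capacity_mbps, demands_mbps):
    """Max-min-fair split of a known dedicated-channel capacity among
    host-generated streams: small streams (e.g., a 10 Mbps control stream)
    get their full demand; bulk streams share what remains."""
    alloc, remaining = {}, capacity_mbps
    unsatisfied = dict(demands_mbps)
    while unsatisfied:
        share = remaining / len(unsatisfied)
        small = {s: d for s, d in unsatisfied.items() if d <= share}
        if not small:                       # everyone wants more: equal split
            alloc.update({s: share for s in unsatisfied})
            break
        for s, d in small.items():          # satisfy small demands fully
            alloc[s] = d
            remaining -= d
            del unsatisfied[s]
    return alloc

# a GigE channel shared by data, visualization, and control streams
print(partition_channel(1000, {"data": 900, "viz": 400, "control": 10}))
# -> {'control': 10, 'viz': 400, 'data': 590.0}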
TSI Application interfaces and networking modules
(Architecture diagram: application modules 1-3 – computational steering, dynamics visualization, data transfers – sit on application interfaces; middleware provides bulk transport modules, streaming protocols, stabilization modules, and control modules; these run over dedicated provisioned channels.)
Interfacing with visualization modules

Overall approach – separate the steering and display components:
• Steering module – connect it to the visualization control channel
• Display module – separate the rendering and display sub-modules, locate them at hosts, and connect the sub-modules over data channels

Candidates under consideration – all need hooks to use dedicated channels:
• OpenGL, VTK codes: the code must be modified with appropriate calls – non-trivial
• enSight: can operate across IP networks without firewalls; high cost and no access to source code
• Paraview: stability problems and hard to use
• Aspect (?): developed at ORNL; functionality similar to Paraview plus additional analysis modules; developers are willing to incorporate CHEETAH modules

Key needs: on-line streaming; large datasets
Optimizing visualization pipeline on a network

Decomposition of the visualization pipeline – the "links" have different bandwidths:
• Geometry can be larger than the data
• Display bandwidth can be much smaller – human consumption

The tasks require different computational power:
• Large datasets require a cluster to compute the geometry
• Rendering can be done on graphics-enabled machines
• Display can be transferred to an X-enabled machine

The pipeline can be realized over the network, with the display forwarded to the user's host (see the cut-placement sketch below).

(Diagram: data storage → geometry computation → rendering → display on the host node.)
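One way to view the mapping decision: pick the pipeline cut whose inter-stage flow is cheapest to ship over the network. The stage volumes below are illustrative assumptions, not TSI numbers.

# Stages: data storage -> geometry computation -> rendering -> display.
# Illustrative per-timestep volumes (Gb) crossing each inter-stage link;
# note the geometry can exceed the raw data, while pixels are tiny.
FLOW_GB = {("storage", "geometry"): 8.0,
           ("geometry", "rendering"): 20.0,
           ("rendering", "display"): 0.1}

def best_cut(network_gbps):
    """Pick the pipeline cut that minimizes network transfer time per
    timestep, assuming stages left of the cut run at the compute site and
    the rest on the user's host."""
    seconds = {edge: vol / network_gbps for edge, vol in FLOW_GB.items()}
    return min(seconds, key=seconds.get)

print(best_cut(1.0))   # -> ('rendering', 'display'): ship pixels, not geometry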
Protocols for dedicated channels – multiple data streams

The problem is simpler than on the Internet:
• The total available channel bandwidth is known
• All traffic is generated by the nodes and is "known"
• Fairness issues are simpler – the nodes can allocate bandwidth among the streams

TCP addresses these problems over the Internet with slow-start to estimate the available bandwidth, packet loss and timeouts to infer traffic levels, and AIMD to adjust the flow rate.

Bandwidth partitioning among data streams may still require closed-loop control; simple (open-loop) control of data rates at the application level does not always work. Example: when the NIC has higher capacity than the provisioned channel,
1. packets may be combined and sent out at a higher rate by the NIC, causing losses at the MSPP, and
2. packets can be coalesced at the receiver NIC, yielding rates different from those sent.
A pacing sketch that bounds such bursts follows.
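A sketch of application-level pacing under a token-bucket discipline (illustrative; closed-loop control as discussed above is still needed when receiver-side coalescing skews the delivered rate):

import time

class TokenBucketPacer:
    """Application-level pacer: caps average rate and burst size so a fast
    NIC cannot coalesce writes into line-rate bursts that overflow a
    slower provisioned channel."""
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0            # bytes per second
        self.burst = burst_bytes              # token cap bounds burstiness
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def wait_for(self, nbytes):
        """Block until nbytes may be sent without exceeding rate or burst."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# e.g., pace writes to 950 Mbps with at most 64 kB back-to-back
pacer = TokenBucketPacer(950_000_000, 65_536)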
Protocols for dedicated channels – multiple data and control streams

The problem is to maintain "steady" dynamics for the control streams between applications – not just between the NICs or on the line. Complicated end-to-end dynamics can be caused by several factors:
• Channel losses: physical losses, and losses due to the sum of the streams exceeding the capacity
• Impedance mismatches: between NIC and line, NIC and kernel, and kernel and application

On the Internet, only a probabilistic solution is possible because of the complicated cross-traffic dynamics – our solutions are based on stochastic approximation. TCP does not solve the problem:
• Multiple TCP/UDP streams generate chaos-like dynamics
• A single TCP stream on a dedicated channel has an underflow problem: even if the flow window is tuned to the desired level so AIMD does not kick in, a burst of losses can kill the stream, since TCP interprets losses as congestion

This problem is still simpler than the Internet case: here the cross-traffic is generated by the nodes and is "known". Channels must be explicitly stabilized using application-level closed-loop control.
Complicated Dynamics of Interacting Streams

Simulation results – TCP-AIMD exhibits chaos-like trajectories:
• TCP streams competing with each other on a dedicated link (Veres and Boda 2000)
• TCP competing with UDP on a dedicated link (Rao and Chua 2002)

Analytical results (Rao and Chua 2002) – TCP-AIMD has chaotic regimes:
• Competing with steady UDP streams on a dedicated link
• State-space analysis and Poincaré maps

Internet measurements (2003, last few weeks): TCP-AIMD traces are a complicated mixture of stochastic and chaotic components. Note: on dedicated links we expect less or no chaotic component. A toy model illustrating such irregular trajectories follows.
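A toy discrete-time AIMD model (a sketch in the spirit of the cited simulations, not their code): two flows on a fixed-capacity dedicated link with unsynchronized losses that hit the larger flow; the trajectories are irregular rather than the classical saw-tooth.

CAPACITY = 100.0   # packets per RTT the dedicated link can carry

def aimd_trajectory(w1, w2, steps=50, alpha=1.0, beta=0.5):
    """Two AIMD flows on a fixed-capacity link with unsynchronized losses:
    on overflow, only the larger flow halves. Deterministic, yet the
    windows wander irregularly instead of tracing a clean saw-tooth."""
    history = []
    for _ in range(steps):
        w1, w2 = w1 + alpha, w2 + alpha        # additive increase per RTT
        if w1 + w2 > CAPACITY:                 # buffer overflow on the link
            if w1 >= w2:
                w1 *= beta                     # the larger flow takes the drop
            else:
                w2 *= beta
        history.append((round(w1, 2), round(w2, 2)))
    return history

for point in aimd_trajectory(3.0, 37.0)[:5]:
    print(point)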
Internet Measurements – Joint work with Jianbo Gao

Question: How relevant are the simulation and analytical results on chaotic trajectories?
Answer: Only partially.

Internet (net100) traces show that TCP-AIMD dynamics are a complicated mixture of chaotic and stochastic regimes:
• Chaotic – TCP-AIMD dynamics
• Stochastic – TCP's response to network traffic

Basic point: TCP traces collected on all Internet connections showed complicated dynamics:
• The classical "saw-tooth" profile is not seen even once
• This is not a criticism of TCP – it was not designed for smooth dynamics
(Figure: cwnd time series for the ORNL-LSU connection. Path: OC192 to Atlanta-Sox; Internet2 to Houston; LAnet to LSU.)
Both stochastic and chaotic parts are dominant:
• Lorenz (chaotic): common envelope
• Uniform random: spread out
• TCP traces have a common envelope and spread out at certain scales
Characterized as anomalous diffusion (log-log displacement curves): the large exponent is typical of chaotic systems with injected noise.
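A sketch of the displacement analysis (an illustrative reconstruction, not the authors' code): integrate the series, compute mean-squared displacement versus lag, and fit the log-log slope.

import numpy as np

def diffusion_exponent(series, max_lag=100):
    """Slope of log mean-squared displacement vs. log lag for the
    integrated series; slope ~1 means ordinary diffusion, while larger
    exponents are typical of chaotic systems with injected noise."""
    x = np.cumsum(series - np.mean(series))   # integrated (walk) view
    lags = np.arange(1, max_lag)
    msd = [np.mean((x[lag:] - x[:-lag]) ** 2) for lag in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope

print(diffusion_exponent(np.random.randn(5000)))   # ~1 for plain white noise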
End-to-End Delay Dynamics Control: End Filtering
Objective: achieve smooth end-to-end delay.
Solution:
1. Reduce end-to-end delay using two paths via daemons: ORNL-OU and ORNL-ODU_OU
2. Filter the output at the destination (a filter sketch follows)

(Diagram: ORNL (source) to U. Oklahoma (destination), directly and via Old Dominion University, over Internet connections, with a filter at the destination. Plots: x-axis – message sizes (bytes); y-axis – end-to-end delay (sec).)
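Step 2 can be sketched as a playout-style filter that releases messages at a smooth delay just above the typical observed delay; the window and slack parameters below are illustrative assumptions, not the project's design.

from collections import deque

class DelayFilter:
    """Destination-side filter: release each message at a smooth delay just
    above the typical (median) observed one-way delay, holding early
    arrivals instead of passing jitter to the application."""
    def __init__(self, window=32, slack=0.1):
        self.delays = deque(maxlen=window)   # recent delay samples (sec)
        self.slack = slack                   # headroom above typical delay

    def release_time(self, send_time, arrival_time):
        self.delays.append(arrival_time - send_time)
        typical = sorted(self.delays)[len(self.delays) // 2]   # median
        return max(arrival_time, send_time + typical + self.slack)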
Throughput Stabilization – Joint work with Qishi Wu
Niche application requirement: provide stable throughput at a target rate – typically much below the peak bandwidth:
• Commands for computational steering and visualization
• Control loops for remote instrumentation

TCP AIMD is not suited for stable throughput:
• Complicated dynamics
• Underflows with sustained traffic
Measurements: ORNL-LSU

ORNL-LSU old connection: ESnet peering with Abilene in New York; both hosts have 10 Mbps NICs.

Throughput stabilized within seconds at the target rate and remained stable under:
• Large and small FTP transfers at the hosts and on the LAN
• Web browsing
Stochastic Approximation: UDP window-based method
(Diagram: transport control loop – source node S sends data packets to destination node D at transmission rate r(t); acknowledgements flow back. Source goodput $g_S(t)$; destination goodput $g_D(t)$.)

Objective: adjust the source rate to achieve (almost) fixed goodput at the destination.
Difficulty: data packets and acknowledgements are subject to random processes.
Approach: rely on statistical properties of the data paths.
Throughput and loss rates vs. window size and cycle time
Objective: adjust source rate to yield the desired throughput at destination
(Plots: a typical day vs. Christmas day.)
Adaptation of source rate
• Adjust the window size
• Adjust the cycle time
• Both are special cases of the classical Robbins-Monro method:
Window-size update (cycle time $T_s$ held fixed):
$W_{c,n+1} = W_{c,n} - \frac{a T_s}{n}\left(\hat{g}_n - g^*\right)$

Cycle-time update (window $W_c$ held fixed):
$\frac{1}{T_{s,n+1}} = \frac{1}{T_{s,n}} - \frac{a}{W_c\,n}\left(\hat{g}_n - g^*\right)$

Both are instances of the source-rate update
$r_{n+1} = r_n - a_n\left(\hat{g}(r_n) - g^*\right)$, with step sizes $a_n \ge 0$, $a_n \to 0$ (e.g., $a_n = a/n$),
where $g^*$ is the target throughput and $\hat{g}(r_n)$ is a noisy estimate of the goodput at rate $r_n$.
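A direct rendering of the update above as a control loop, with a toy channel standing in for the measured goodput (the 9 Mbps capacity and the noise level are illustrative assumptions):

import random

def stabilize_rate(measure_goodput, g_star, r0=5.0, a=0.8, steps=200):
    """Robbins-Monro iteration of the update above: nudge the source rate
    with a decaying gain a/n so the noisy goodput estimate settles at the
    target g*."""
    r = r0
    for n in range(1, steps + 1):
        g_hat = measure_goodput(r)             # noisy estimate from acks
        r = max(0.0, r - (a / n) * (g_hat - g_star))
    return r

# toy channel: goodput tracks the rate up to a 9 Mbps capacity, plus noise
toy = lambda r: min(r, 9.0) + random.gauss(0.0, 0.3)
print(stabilize_rate(toy, g_star=3.0))         # settles near 3.0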
Performance Guarantees
• Summary: stabilization is achieved with high probability using a very simple estimation of the source rate.
• Basic result: for the general update
$r_{n+1} = r_n - \frac{a}{n^{\alpha}}\left(\hat{g}(r_n) - g^*\right)$, $a > 0$, $\alpha \in (0, 1]$,
the iterates converge in mean square to the stabilizing rate $r^*$:
$E\left[(r_n - r^*)^2\right] = O\!\left(n^{-\min(2\gamma,\,1)}\right)$,
where the constant $\gamma > 0$ is determined by $a$, $\alpha$, and the slope of the goodput curve at $r^*$.
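A quick Monte Carlo check of the polynomial mean-square decay, using the same toy channel as in the sketch above (illustrative only, not the paper's experiment):

import random

def mse_after(n_final, g_star=3.0, a=0.8, trials=300):
    """Mean-square error of the stabilized rate after n_final iterations,
    averaged over independent runs of the toy channel above."""
    errs = []
    for _ in range(trials):
        r = 5.0
        for n in range(1, n_final + 1):
            g_hat = min(r, 9.0) + random.gauss(0.0, 0.3)
            r = max(0.0, r - (a / n) * (g_hat - g_star))
        errs.append((r - g_star) ** 2)
    return sum(errs) / trials

for n in (50, 200, 800):
    print(n, mse_after(n))   # the error shrinks polynomially in n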
Internet Measurements
• ORNL-LSU connection (before the recent upgrade):
  – Hosts with 10 Mbps NICs
  – 2000-mile network distance
  – ORNL-NYC: ESnet; NYC-DC-Houston: Abilene; Houston-LSU: local network
• ORNL-GaTech connection:
  – Hosts with GigE NICs
  – ORNL-Juniper router: 1 Gig link
  – Juniper-ATL Sox: OC192 (1 Gig link)
  – Sox-GaTech: 1 Gig link
ORNL-LSU Connection

(Diagram: ORNL – ESnet – local network – LSU.)
Goodput Stabilization: ORNL-LSU Experimental Results

• Case 1: target goodput = 1.0 Mbps, rate control through the congestion window, a = 0.8.
(Plot: datagram acknowledging time (s) vs. source rate (Mbps) and goodput (Mbps).)

• Case 2: target goodput = 2.0 Mbps, rate control through the congestion window, a = 0.8.
(Plot: datagram acknowledging time (s) vs. source rate (Mbps) and goodput (Mbps).)
Goodput Stabilization: ORNL-LSU Experimental Results

• Case 3: target goodput = 3.0 Mbps, rate control through the congestion window, a = 0.8.
(Plot: datagram acknowledging time (s) vs. source rate (Mbps) and goodput (Mbps).)
Goodput Stabilization: ORNL-LSU Experimental Results

• Case 4: target goodput = 2.0 Mbps, rate control through sleep time, a = 0.8.
(Plot: datagram acknowledging time (s) vs. source rate (Mbps) and goodput (Mbps).)

• Case 5: target goodput = 2.0 Mbps, rate control through sleep time, a = 0.9.
Throughput Stabilization: ORNL-GaTech
• Desired goodput level = 20.0 Mbps, a = 0.8, adjustment made on the congestion window.
• Desired goodput level = 2.0 Mbps, a = 0.8, adjustment made on sleep time.
Experiments with Tsunami: firebird.ccs.ornl.gov – ccil.cc.gatech.edu

• Network transport control settings:
  – NIC speed and path bandwidth: 1 Gbps
  – Transferred file size: 204,800,000 bytes
  – Using default_block_size: 32768 bytes
• Transmission statistics from Tsunami:
  – Average sending rate: 296.05 Mbps
  – Loss rate: 64.32%
  – Transfer time: 17.51 sec
  – Throughput: 93.6 Mbps
  – Sending time & receiving time vs. block sequence number (figure on next slide)
Tsunami measurements: ozy4.csm.ornl.gov – resource.rrl.lsu.edu

• Path bandwidth: 10 Mbps
• Using datagram size: 1400 bytes (the default does not work)
• File size: 10,240,000 bytes
• Case 1: only Tsunami running
  – Throughput: 9.47 Mbps (receiver, client)
  – Goodput: 4.20 Mbps (sender, server)
  – Sending time & receiving time vs. datagram sequence number (figure at right)
• Case 2: only ONTCOU (throughput-maximization SA) running
  – Source goodput: 3.5 Mbps
  – Sending time & acknowledging time vs. datagram sequence number
  – Sending rate vs. source goodput
• Case 3: Tsunami and ONTCOU running simultaneously with the same datagram size
  – Tsunami: did not complete
  – ONTCOU: transmission completed; throughput 0.533 Mbps; sending time & acknowledging time vs. datagram sequence number (figure next)
ORNL Year 1 Tasks

Design and test transport protocols for dedicated channels:
1. Single data streams – collaboration with UVa
2. One data stream and two control streams
Testing on the ORNL-ATL-ORNL GigE-SONET link.

Interfaces with visualization software:
• Simple supernova computation at ORNL hosts on the dedicated link
• Developing interfaces to Aspect visualization modules and testing
• Test Paraview and enSight

(Diagram: ORNL Linux hosts – Juniper M160 router at ORNL – OC192 – SOX router in Atlanta.)
ORNL Year 2 Tasks

Design and test transport protocols for dedicated channels:
• Multiple data, visualization, and control streams
• Testing on the CHEETAH testbed

Interface with visualization:
• Interfacing supernova visualization modules over CHEETAH
• Developing interfaces to Aspect visualization modules with the TSI dataset
ORNL Year 3 Tasks

Design and test transport protocols for dedicated channels:
• Coordinating multiple data, visualization, and control streams
• Testing on the CHEETAH testbed

Interface with visualization:
• Interfacing supernova visualization and computation modules over CHEETAH
• Developing interfaces to Aspect visualization modules with TSI on-line computations
• Optimizing the mapping of the visualization pipeline
Feedback and Corrections
Interfacing with steering modules

The dynamics of visualization control and steering streams must be stabilized from application to application:
• It is not enough to stabilize the lower transport levels
• NIC-to-line transfers may not be smooth
• Application-to-kernel transfers depend on the processor load

Provide a user interface for steering and connect it to the transport modules.