A Study of Applications for
Optical Circuit-Switched Networks
Xiuduan Fang
May 1, 2006
Supported by NSF ITR-0312376, NSF EIN-0335190,
and DOE DE-FG02-04ER25640 grants
Outline
Introduction
CHEETAH Background
― CHEETAH concept and network
― CHEETAH end-host software
Analytical Models of GMPLS Networks
Application (App) I: Web Transfer App
App II: Parallel File Transfers
Summary and Conclusions
Introduction
Many optical connection-oriented (CO) testbeds
― E.g., CANARIE's CA*net 4, UKLight, and CHEETAH
― Primarily designed for e-Science apps
Use Generalized Multiprotocol Label Switching (GMPLS)
― Immediate request, call blocking
Motivation: extend these GMPLS networks to millions of users
Problem statement
― What apps are well served by GMPLS networks?
― Design apps to use GMPLS networks efficiently
Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH)
Designed as an "add-on" service to the Internet, leveraging the Internet's existing services
[Figure: CHEETAH concept. Each end host has two NICs: NIC I connects through IP routers to the packet-switched Internet, while NIC II connects through an Ethernet-SONET gateway to the optical circuit-switched CHEETAH network.]
CHEETAH Network
[Figure: CHEETAH network map. Two Sycamore SN16000 switches (Atlanta, GA and MCNC, NC) are interconnected by an OC-192 lambda. End hosts zelda1 through zelda5 (ORNL, TN), wukong, and mvstu6 attach over 1G links via direct fibers, VLANs, and MPLS tunnels; other elements include an MCNC Catalyst 7600, a UVa Catalyst 4948, an NCSU M20, a Centuar FastIron FESX448, the WASH Abilene T640, HOPI Force10 switches (WASH and NYC), and a Foundry switch with a host at CUNY.]
CHEETAH End-host Software
[Figure: CHEETAH end-host software architecture. On each end host, the application sits above a routing-decision module, an OCS client, an RSVP-TE client, and C-TCP; TCP/IP over NIC 1 reaches the Internet, while C-TCP over NIC 2 reaches the CHEETAH network.]
OCS: Optical Connectivity Service; RD: Routing Decision; RSVP-TE: ReSerVation Protocol-Traffic Engineering; C-TCP: Circuit-TCP
Outline
Introduction
CHEETAH Background
― CHEETAH concept and network
― CHEETAH end-host software
Analytical Models of GMPLS Networks
Application (App) I: Web Transfer App
App II: Parallel File Transfers
Summary and Conclusions
Analytical Models of GMPLS Networks
Problem: what apps are suitable for GMPLS networks?
― Measures of suitability: call-blocking probability Pb; link utilization U
― App properties: per-circuit BW C/m; call-holding time 1/μ
Assumptions:
― Call arrival rate λ (Poisson process)
― Single link
― Single class: all apps are of the same type
A link of capacity C; m circuits; per-circuit BW = C/m
― m is a measure of high-throughput vs. moderate-throughput apps; for high-throughput apps (e.g., e-Science), m is small
BW Sharing Models
Two kinds of apps: whether 1/μ is dependent on C/m
― 1/μ is independent of C/m
― 1/μ is dependent on C/m; file-size distribution with shape α and scale k, and a crossover file size
Call blocking is given by the Erlang-B formula (shown below)
[Figure: the two bandwidth-sharing models, each with N hosts sharing a link L of capacity C]
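For reference, the Erlang-B formula named above, which gives the call-blocking probability for m circuits at offered load A = λ/μ, has the standard form:

    P_b \;=\; \frac{A^m / m!}{\sum_{j=0}^{m} A^j / j!}, \qquad A = \frac{\lambda}{\mu}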
Numerical Results: 1/μ is independent of C/m
Two equations, four variables: fix U and m, compute Pb and λ/μ
Numerical Results: 1/μ is independent of C/m
m = 10: Pb = 23.62% (a code sketch of this computation follows below)
Conclusions: to get high U
― Small m (~10): high Pb, thus book-ahead or call queuing
― Large m (~1000): high offered load λ/μ, thus a large number of users N
― Intermediate m (~100): a large call-holding time 1/μ is preferred
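A small self-contained sketch (ours, not from the thesis) of the computation behind these numbers: it evaluates the Erlang-B recursion and bisects on the offered load A = λ/μ until the carried utilization A(1 - Pb)/m reaches the target U.

    /* Erlang-B blocking for m circuits at offered load A, via the
       standard recursion B(n) = A*B(n-1) / (n + A*B(n-1)). */
    #include <stdio.h>

    static double erlang_b(int m, double A) {
        double B = 1.0;
        for (int n = 1; n <= m; n++)
            B = A * B / (n + A * B);
        return B;
    }

    int main(void) {
        int m = 10;          /* circuits on the link */
        double U = 0.80;     /* target utilization */
        double lo = 0.0, hi = 10.0 * m, A = 0.0, Pb = 0.0;
        for (int i = 0; i < 100; i++) {  /* bisect on offered load */
            A = 0.5 * (lo + hi);
            Pb = erlang_b(m, A);
            if (A * (1.0 - Pb) / m < U) lo = A; else hi = A;
        }
        /* For m = 10 and U = 0.80 this prints Pb of about 24%, in line
           with the 23.62% quoted on the slide above. */
        printf("m=%d U=%.2f: A=%.2f Pb=%.2f%%\n", m, U, A, 100.0 * Pb);
        return 0;
    }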
Numerical Results: 1/μ is dependent on C/m (k = 1 MB; α = 1.1, 1.25)
Conclusions: to get high U
― Small m (~10): high Pb, thus book-ahead or call queuing
― As m increases, N does not increase
― m = 100: to get U > 80% and Pb < 5%, the crossover file size must lie between 6 MB and 29 MB (call-holding times 1/μ of roughly 0.51 s to 2.3 s)
Conclusions for Analysis
Ideal apps require BW on the order of one-hundredth of the link capacity (m ≈ 100) as the per-circuit rate
Apps where 1/μ is independent of C/m
― A long call-holding time is preferred
Apps where 1/μ is dependent on C/m
― Need a short call-holding time (worked example below)
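A worked instance of the dependent case (assuming, for illustration only, a per-circuit rate C/m of 100 Mb/s, which is consistent with the holding times on the numerical-results slide):

    \frac{1}{\mu} \;=\; \frac{\text{file size}}{C/m} \;=\; \frac{29\,\text{MB} \times 8\,\text{bits/byte}}{100\,\text{Mb/s}} \;\approx\; 2.3\,\text{s}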
Outline
Introduction
CHEETAH Background
― CHEETAH concept and network
― CHEETAH end-host software
Analytical Models of GMPLS Networks
Application (App) I: Web Transfer App
App II: Parallel File Transfers
Summary and Conclusions
APP I: Web Transfer App on CHEETAH
Why web transfer?
― Web-based apps are ubiquitous
― Based on the previous analysis, m = 100 is suitable for CHEETAH
Consists of a software package, WebFT
― Leverages CGI for deployment without modifying web-client or web-server software (see the sketch below)
― Integrated with the CHEETAH end-host software APIs to allow use of the CHEETAH network in a mode transparent to users
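To make the CGI-based deployment concrete, here is a minimal sketch in C of a redirection script in the spirit of WebFT's redirection.cgi; the reachability stub and both URLs are illustrative assumptions, not WebFT's actual logic:

    /* Minimal CGI sketch: decide, per client, whether to serve the file
       over a CHEETAH circuit or over the ordinary Internet path. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Stub standing in for the OCS API's reachability check. */
    static int client_on_cheetah(const char *client_ip) {
        (void)client_ip;
        return 1; /* pretend the client has a CHEETAH-attached NIC */
    }

    int main(void) {
        const char *client = getenv("REMOTE_ADDR"); /* set by the web server */
        /* A CGI script emits HTTP headers on stdout; a Location header
           redirects the browser without any client-side changes. */
        if (client && client_on_cheetah(client))
            printf("Location: /cgi-bin/download.cgi?file=Test.rm\r\n\r\n");
        else
            printf("Location: /files/Test.rm\r\n\r\n");
        return 0;
    }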
WebFT Architecture
[Figure: WebFT architecture. A web browser (e.g., Mozilla) and the WebFT receiver run on the client host; a web server (e.g., Apache) with CGI scripts (download.cgi and redirection.cgi) and the WebFT sender run on the server host. Both sides use the CHEETAH end-host software APIs and daemons (OCS, RD, RSVP-TE, C-TCP). Control messages travel over the Internet; data transfers use a dedicated circuit.]
Experimental Testbed for WebFT
zelda3 and wukong: Dell machines running Linux FC3 with ext2/3 file systems and RAID-0 SCSI disks
RTT between them: 24.7 ms on the Internet path and 8.6 ms over the CHEETAH circuit
Apache HTTP Server 2.0 is loaded on zelda3
[Figure: WebFT testbed. zelda3 and wukong (NCSU), each with NIC I and NIC II, are connected both through IP routers over the Internet and through Sycamore SN16000 switches (Atlanta, GA and MCNC, NC) over the CHEETAH network.]
Experimental Results for WebFT
[Figure: the web page used to test WebFT]
Test parameters:
― Test.rm: 1.6 GB; circuit rate: 1 Gbps
Test results:
― Throughput: 680 Mbps; delay: 19 s
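As a consistency check, the measured delay follows from the file size and throughput:

    \text{delay} \;\approx\; \frac{1.6\,\text{GB} \times 8\,\text{bits/byte}}{680\,\text{Mb/s}} \;\approx\; 19\,\text{s}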
Outline
Introduction
CHEETAH Background
― CHEETAH concept and network
― CHEETAH end-host software
Analytical Models of GMPLS Networks
Application (App) I: Web Transfer App
App II: Parallel File Transfers
Summary and Conclusions
APP II: Parallel File Transfers on CHEETAH
Motivation: e-Science projects need to share large volumes of data (TB or PB)
Goal: achieve multi-Gb/s throughput
Two factors limit throughput
― TCP's congestion-control algorithm
― End-host limitations
Solutions to relieve end-host limitations
― Single-host solution
― Cluster solution, with two variations: the general case (non-split source file) and the special case (split source file)
General-Case Cluster Solution
[Figure: the original source file is split across hosts 1 through n, transferred in parallel to hosts 1' through n', and assembled at the original sink.]
Software Tools: GridFTP and PVFS2
GridFTP: a data-transfer protocol for the Grid
― Extends FTP with features for partial file transfer, multi-streaming, and striping
― We mainly use the GridFTP striped-transfer feature
PVFS: Parallel Virtual File System
― An open-source implementation of a parallel file system
― Stripes a file across multiple I/O servers, like RAID-0
― A second version: PVFS2
GridFTP Striped Transfer
[Figure: globus-url-copy coordinates the sending and receiving front ends. The receiving front end answers a SPAS command with a list of host-port pairs for its data nodes R1 through Rn; the sending front end passes that list to its data nodes via SPOR, and the sending data nodes S1 through Sn initiate data connections to the receiving nodes. On each side, a parallel file system stripes the file (blocks 1, n+1, ...) across the data nodes.]
General-Case Cluster Solution: Design Steps
Splitting and assembling:
― GridFTP partial file transfer: wastes disk space; performance overhead
― Socket program: avoids wasting disk space, but still has performance overhead
― pvfs2-cp: avoids wasting disk space
Transferring:
― GridFTP partial file transfer: many independent transfers incur much overhead to set up and release connections
― GridFTP striped transfer: a single file transfer
General-Case Cluster Solution: Implementation
To get high throughput, we need to make each data node responsible for the data blocks on its local disks
― Make PVFS2 and GridFTP use the same stripe pattern
[Figure: sending data nodes S1 through Sn and receiving data nodes R1 through Rn, with blocks 1, n+1, ... striped across the PVFS2 file system on each side]
Problems:
― PVFS2 1.0.1 does not provide a utility to inspect data distribution
― Data connections between sending and receiving nodes are random
Random Data Connections
[Figure: the data connections between sending nodes S1 through Sn and receiving nodes R1 through Rn are matched at random, so the GridFTP stripe order need not line up with the PVFS2 block placement on either side.]
Implementation: Modifications to PVFS2
Goal: know a priori how a file is striped in PVFS2
Used the strace command to trace the system calls made by pvfs2-cp
― pvfs2-fs-dump gives the (non-deterministic) I/O-server order of the file distribution
― pvfs2-cp ignores the -s option for configuring the stripe size
Modified the PVFS2 code (see the sketch after this list):
― For load balance, PVFS2 stripes files starting at a random server: jitter = (rand() % num_io_servers);
― Set jitter = -1 to get a fixed order of data distribution
― Changed the default stripe size (original: 64 KB)
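A toy, runnable illustration (not PVFS2 source) of why fixing jitter makes placement deterministic; the placement rule server = (jitter + 1 + i) mod num_io_servers is our assumption, chosen so that jitter = -1 starts striping at server 0:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        int num_io_servers = 5, num_blocks = 8;
        srand((unsigned)time(NULL));
        int jitter = rand() % num_io_servers; /* PVFS2 default: random start */
        jitter = -1;                          /* the modification: fixed order */
        for (int i = 0; i < num_blocks; i++)  /* block i's I/O server */
            printf("block %d -> io server %d\n",
                   i + 1, (jitter + 1 + i) % num_io_servers);
        return 0;
    }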
Implementation: Modifications to GridFTP
Goal: use a deterministic matching sequence between sending and receiving data nodes
Method: modify the implementation of the SPAS and SPOR commands (a sorting sketch follows below)
― SPAS: sort the list of host-port pairs for the receiving data nodes by IP address
― SPOR: request that the sending data nodes initiate data connections to the receiving data nodes sequentially
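An illustrative sketch (not the actual GridFTP source) of the SPAS-side change: sort the host-port list by numeric IP address so both ends enumerate the receiving data nodes in the same order. The addresses and ports are made up.

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct hostport { const char *ip; int port; };

    /* Compare two entries by their 32-bit IP address in host byte order. */
    static int by_ip(const void *a, const void *b) {
        uint32_t x = ntohl(inet_addr(((const struct hostport *)a)->ip));
        uint32_t y = ntohl(inet_addr(((const struct hostport *)b)->ip));
        return (x > y) - (x < y);
    }

    int main(void) {
        struct hostport nodes[] = {
            { "10.0.0.12", 50001 }, { "10.0.0.10", 50003 }, { "10.0.0.11", 50002 },
        };
        size_t n = sizeof nodes / sizeof nodes[0];
        qsort(nodes, n, sizeof nodes[0], by_ip);
        for (size_t i = 0; i < n; i++)  /* deterministic matching order */
            printf("%s:%d\n", nodes[i].ip, nodes[i].port);
        return 0;
    }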
Experimental Results
Conducted on a 22-node cluster, sunfire
Reduced network-and-disk contention
Performance of the PVFS2 implementation was poor
Summary and Conclusions
Analytical models of GMPLS networks
― Ideal apps require BW on the order of one-hundredth of the link capacity as the per-circuit rate
Application I: Web Transfer Application
― Provides deterministic data service to CHEETAH clients over dedicated end-to-end circuits
― Requires no modifications to web-client or web-server software, by leveraging CGI
Application II: Parallel File Transfers
― Implemented a general-case cluster solution using PVFS2 and GridFTP striped transfer
― Modified PVFS2 and GridFTP code to reduce network-and-disk contention
Publications
M. Veeraraghavan, X. Fang, and X. Zheng, "On the suitability of applications for GMPLS networks," submitted to IEEE GLOBECOM 2006.
X. Fang, X. Zheng, and M. Veeraraghavan, "Improving web performance through new networking technologies," IEEE ICIW '06, February 23-25, 2006, Guadeloupe, French Caribbean.
Future Work
Analytical models of GMPLS networks
― Multi-class
― Multiple links and network models
Application I: Web Transfer Application
― Design a web partial-CO transfer to enable non-CHEETAH hosts to use CHEETAH
― Connect multiple CO networks to further reduce RTT
Application II: Parallel File Transfers
― Test the general-case cluster solution on CHEETAH
― Work on PVFS2, or try GPFS, to get high I/O throughput
A Classification of Networks that Reflects Sharing Modes
The Flow Chart for the WebFT Sender
[Flow chart: check whether the client can be reached via the CHEETAH network (OCS). If yes, request a CHEETAH circuit (routing decision) and set up the circuit (RSVP-TE client). If setup succeeds, send the file via C-TCP, release the circuit (RSVP-TE client), and return success. If any of these steps fails, return failure.]
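The same control flow as a compact C sketch; every function here is a hypothetical stand-in for the corresponding CHEETAH end-host API (OCS, RD, RSVP-TE, C-TCP), whose real signatures this deck does not show:

    #include <stdio.h>

    /* Hypothetical stubs for the CHEETAH end-host software APIs. */
    static int ocs_client_reachable(const char *c) { (void)c; return 1; }
    static int rd_request_circuit(const char *c)   { (void)c; return 1; }
    static int rsvp_te_setup(const char *c)        { (void)c; return 1; }
    static void rsvp_te_release(const char *c)     { (void)c; }
    static int ctcp_send_file(const char *c, const char *f)
    { (void)c; (void)f; return 1; }

    /* Control flow of the WebFT sender, per the flow chart above. */
    static int webft_send(const char *client, const char *file) {
        if (!ocs_client_reachable(client)) return 0; /* OCS check */
        if (!rd_request_circuit(client))   return 0; /* routing decision */
        if (!rsvp_te_setup(client))        return 0; /* circuit setup */
        int ok = ctcp_send_file(client, file);       /* data over C-TCP */
        rsvp_te_release(client);                     /* release circuit */
        return ok;
    }

    int main(void) {
        puts(webft_send("wukong", "Test.rm") ? "Return Success"
                                             : "Return Failure");
        return 0;
    }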
The WebFT Receiver
― Integrates with the CHEETAH end-host software modules, similar to the WebFT sender
― Runs as a daemon in the background on the client host to avoid manual intervention
― Also provides the WebFT sender with a desired circuit rate
Experimental Results for WebFT
PVFS2 Architecture
Experimental Configuration
Configuration of PVFS2 I/O servers
― The 1st PVFS2: sunfire1 through sunfire5
― The 2nd PVFS2: sunfire10, and sunfire6 through sunfire9
Configuration of GridFTP servers
― Sending front end: sunfire1, with data nodes sunfire1 through sunfire5
― Receiving front end: sunfire10, with data nodes sunfire10 and sunfire6 through sunfire9
GridFTP striped transfer:
globus-url-copy -vb -dbg -stripe ftp://sunfire1:50001/pvfs2/test_1G ftp://sunfire10:50002/pvfs2/test_1G1 2>dbg1.txt
Four Conditions to Avoid Unnecessary Network-and-disk Contention
Know a priori how data are striped in PVFS2
PVFS2 I/O servers and GridFTP servers run on the same hosts
GridFTP stripes data across data nodes in the same sequence as PVFS2 does across PVFS2 I/O servers
GridFTP and PVFS2 have the same stripe size
The Specific Cluster Solution for TSI
[Figure: the orbitty cluster at NCSU (controller-0 "rudi", controller-1 "orbitty", compute-0-0 through compute-0-19, disk-0-0 through disk-4-0, a monitoring host, and a Dell 5224 switch) connected over the CHEETAH LAN to the zelda hosts at ORNL (zelda1 through zelda5, behind a Dell 5424 switch) and to the X1E at ORNL.]
Numerical Results: 1/μ is dependent on C/m
Conclusions: large m (~1000) does not increase N