
Page 1: Long Distance experiments of Data Reservoir system

K.Hiraki University of Tokyo

Long Distance experiments of Data Reservoir system

Data Reservoir / GRAPE-DR project

Page 2: Long Distance experiments of Data Reservoir system

K.Hiraki University of Tokyo

List of experiment members

The University of Tokyo

Fujitsu Computer Technologies

WIDE project

Chelsio Communications

Cooperating networks (West to East): WIDE, APAN / JGN2, IEEAF / Tyco Telecommunications, Northwest Giga Pop, CANARIE (CA*net 4), StarLight, Abilene, SURFnet, SCinet, Universiteit van Amsterdam, CERN

Page 3

Computing System for real Scientists

• Fast CPU, huge memory and disks, good graphics
  – Cluster technology, DSM technology, graphics processors
  – Grid technology

• Very fast remote file accesses
  – Global file system, data-parallel file systems, replication facilities

• Transparency to local computation
  – No complex middleware, and no or only small modifications to existing software

• Real scientists are not computer scientists

• Computer scientists are not a workforce for real scientists

Page 4

Objectives of Data Reservoir / GRAPE-DR(1)

• Sharing scientific data between distant research institutes
  – Physics, astronomy, earth science, simulation data

• High utilization of available bandwidth
  – Transferred file data rate > 90% of available bandwidth, including header overheads and initial negotiation overheads

• OS and file-system transparency
  – Storage-level data sharing (high-speed iSCSI protocol on stock TCP)
  – Fast single-file transfer

Page 5

Objectives of Data Reservoir / GRAPE-DR(2)

• GRAPE-DR: very high-speed processor
  – 2004 – 2008
  – Successor of the GRAPE-6 astronomical simulator

• 2 PFLOPS on a 128-node cluster system
  – 1 GFLOPS / processor
  – 1024 processors / chip

• Semi-general-purpose processing
  – N-body simulation, gridless fluid dynamics
  – Linear solvers, molecular dynamics
  – High-order database searching (genome, protein data, etc.)

Page 6

Data intensive scientific computation through global networks

[Figure: Data Reservoir systems connected by a very high-speed network link data sources (Belle experiments, X-ray astronomy satellite ASUKA, SUBARU telescope, Nobeyama Radio Observatory (VLBI), nuclear experiments, Digital Sky Survey, GRAPE-6) to data analysis at the University of Tokyo, providing local accesses and distributed shared files.]

Page 7

Basic Architecture

[Figure: Two Data Reservoirs connected over a high-latency, very high-bandwidth network share data in a DSM-like architecture; each side serves local file accesses from cache disks, while disk-block-level parallel, multi-stream transfer synchronizes the sites.]

Page 8

File accesses on Data Reservoir

[Figure: User programs and scientific detectors access file servers through an IP switch; the file servers stripe data across disk servers (1st-level and 2nd-level striping), with disk access by iSCSI. Servers are IBM x345 (2.6 GHz x 2).]

Page 9

Global Data Transfer

[Figure: During global data transfer, disk servers at the two sites exchange data directly by iSCSI bulk transfer across the global network, bypassing the file servers.]

Page 10

1st Generation Data Reservoir

• 1st generation Data Reservoir (2001-2002)
  • Efficient use of network bandwidth (SC2002 technical paper)
  • Single filesystem image (by striping of iSCSI disks)
  • Separation of local file accesses by users from remote disk-to-disk replication
  • Low-level data sharing (disk-block level, iSCSI protocol)
• 26 servers for 570 Mbps data transfer
  • High sustained efficiency (≒93 %)
• Low TCP performance
• Unstable TCP performance

Page 11

Problems of BWC2002 experiments

• Low TCP bandwidth due to packet losses
  – TCP congestion window size control
  – Very slow recovery after the fast-recovery phase (> 20 min)

• Imbalance among parallel iSCSI streams
  – Packet scheduling by switches and routers
  – Users, and other users of the network, care only about the total behavior of the parallel TCP streams
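The slow recovery stems from TCP's additive increase: in congestion avoidance the window grows by only about one segment per RTT, so refilling a long-fat-pipe window after a single loss takes many minutes. A rough sketch of the arithmetic (the path parameters are illustrative assumptions, not the exact BWC2002 circuit):

```python
# Estimate how long standard TCP takes to grow its congestion window
# back to the bandwidth-delay product after a single loss halves it.

def aimd_recovery_time(bandwidth_bps, rtt_s, mss_bytes=1460):
    """Seconds of additive increase (1 MSS per RTT) needed to climb
    from W/2 back to W, where W fills the pipe."""
    window_segments = bandwidth_bps * rtt_s / (mss_bytes * 8)  # W in segments
    return (window_segments / 2) * rtt_s  # W/2 RTTs, each rtt_s long

# Illustrative long-fat-network path: 1 Gbps, 200 ms RTT
t = aimd_recovery_time(1e9, 0.2)
print(f"recovery takes about {t / 60:.0f} minutes")
```

Even at a modest 1 Gbps, recovery is on the order of half an hour, consistent with the "> 20 min" observation above.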

Page 12

2nd Generation Data Reservoir

• 2nd generation Data Reservoir (2003)
• Inter-layer coordination for parallel TCP streams (SC2004 technical paper)
  » Transmission Rate Controlled TCP (TRC-TCP) for Long Fat pipe Networks
  » "Dulling Edges of Cooperative Parallel streams" (DECP): load balancing among parallel TCP streams
  » LFN tunneling NIC (Comet TCP)
• 10 times performance increase from 2002
  » 7.01 Gbps (disk to disk, TRC-TCP), 7.53 Gbps (Comet TCP)

Page 13

[Figure: Experiment route map — 24,000 km (15,000 miles) total path, with 15,680 km (9,800 miles) and 8,320 km (5,200 miles) segments over OC-192, OC-48 x 3, and GbE x 4 links via a Juniper T320 router.]

Page 14

Results

• BWC2002
  – Tokyo → Baltimore, 10,800 km (6,700 miles)
  – Peak bandwidth (on network): 600 Mbps
  – Average file transfer bandwidth: 560 Mbps
  – Bandwidth-distance product: 6,048 terabit-meters/second

• BWC2003
  – Phoenix → Tokyo → Portland → Tokyo, 24,000 km (15,000 miles)
  – Average file transfer bandwidth: 7.5 Gbps
  – Bandwidth-distance product: ≒ 168 petabit-meters/second
  – More than 25 times improvement over the BWC2002 performance (bandwidth-distance product)
  – Still low performance for each TCP stream (460 Mbps)
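The bandwidth-distance product is simply the transfer rate times the path length. A quick check of the BWC2002 figure above:

```python
# Bandwidth-distance product (the Internet Land Speed Record metric),
# expressed in bit-meters per second.

def bw_distance_product(bandwidth_bps, distance_km):
    return bandwidth_bps * distance_km * 1000  # bit-meters / second

bwc2002 = bw_distance_product(560e6, 10_800)  # 560 Mbps over 10,800 km
print(bwc2002 / 1e12, "terabit-meters/second")  # ≈ 6048
```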

Page 15

Problems of BWC2003 experiments

• Still low single-stream TCP performance
  – NICs are GbE
  – Software overheads

• Large number of server and storage boxes for a 10 Gbps iSCSI disk service
  – Disk density
  – TCP single-stream performance
  – CPU usage for managing iSCSI and TCP

Page 16

3rd Generation Data Reservoir

• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple filesystems (detection of modified disk blocks)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger

Page 17

Features of a single-server Data Reservoir

Low-cost and high-bandwidth disk service
• A server + two RAID cards + 50 disks
• Small size, scalable to a 100 Gbps Data Reservoir (10 servers --- 2 racks)
• Mother system for the GRAPE-DR simulation engine
• Compatible with the existing Data Reservoir system

• Single-stream TCP performance (must be more than 5 Gbps)
• Automatic adaptation to network conditions
  – Maximize multi-stream TCP performance
  – Load balancing among parallel TCP streams

Page 18

Utilization of 10Gbps network

• A single-box 10 Gbps Data Reservoir server
• Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
• Two Chelsio T110 TCP-offloading NICs
• Disk arrays for the necessary disk bandwidth
• Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)

[Figure: Quad Opteron server (SUN V40z, Linux 2.6.6) running the Data Reservoir software, with two Chelsio T110 TCP NICs and two SCSI adaptors on separate PCI-X buses; the NICs connect via 10GBASE-SR to a 10G Ethernet switch, and the SCSI adaptors drive Ultra320 SCSI disk arrays.]

Page 19

Single stream TCP performance

• Importance of hardware-supported "inter-layer optimization" mechanisms
  – Comet TCP hardware (TCP tunnel over a non-TCP protocol)
  – Chelsio T110 NIC (pacing by network processor)
  – Dual-ported FPGA-based NIC (under development)

• Standard TCP, standard Ethernet frame size

• Major problems
  – Conflict between long-RTT and short-RTT flows
  – Detection of optimal parameters

[Photos: Chelsio T110; dual-ported NIC by U. of Tokyo]
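Pacing of the kind done on the T110's network processor spaces frames evenly at the target rate instead of emitting window-sized bursts. The required inter-frame gap is simple to compute (a sketch; 7.2 Gbps is an illustrative rate near the reported results):

```python
# Inter-frame gap needed to pace a flow at a target rate instead of
# sending a full congestion window back-to-back.

def pacing_gap_us(rate_bps, frame_bytes=1500):
    """Microseconds between frame transmissions at the target rate."""
    return frame_bytes * 8 / rate_bps * 1e6

# Pacing 7.2 Gbps with standard 1500-byte Ethernet frames:
print(f"{pacing_gap_us(7.2e9):.2f} us between frames")
```

A gap of well under 2 µs per frame is far below what host software timers can schedule reliably, which is why the pacing has to live in NIC hardware.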

Page 20

Tokyo-CERN experiment (Oct.2004)

• CERN - Amsterdam - Chicago - Seattle - Tokyo
  – SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
  – 18,500 km WAN PHY connection

• Performance results
  – 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  – 7.53 Gbps (TCP payload), 8K jumbo frames, iperf
  – 8.8 Gbps disk-to-disk performance
    • 9 servers, 36 disks
    • 36 parallel TCP streams

Page 21

Network topology of CERN-Tokyo experiment

[Figure: Tokyo - Seattle - Vancouver - Minneapolis - Chicago - Amsterdam - CERN (Geneva) over WIDE / IEEAF, CA*net 4, and SURFnet 10G links (10GBASE-LW WAN PHY). Each end hosts a Data Reservoir: IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory, Linux 2.6.6 on No. 2-7, Linux 2.4.x on No. 1, GbE) plus an Opteron server (dual Opteron 248, 2.2 GHz, 1 GB memory, Linux 2.6.6) with a Chelsio T110 NIC. The Tokyo side connects through a Fujitsu XG800 12-port switch and a Foundry BI MG8 at T-LEX; the path passes the Pacific Northwest Gigapop, StarLight, and NetherLight, with Foundry FEX x448, Foundry NetIron 40G, and Extreme Summit 400 switches along the way.]

Page 22

Difficulty on CERN-Tokyo Experiment

• Detecting SONET-level errors
  – Synchronization errors
  – Low-rate packet losses

• Lack of tools for debugging the network
  – Loopback at L1 is the only useful method
  – No tools for performance bugs

• Very long latency

Page 23

SC2004 bandwidth challenge (Nov.2004)

• CERN - Amsterdam - Chicago - Seattle - Tokyo - Chicago - Pittsburgh
  – SURFnet – CA*net 4 – IEEAF/Tyco – WIDE – JGN2 – Abilene
  – 31,248 km (RTT 433 ms); a shorter distance, 20,675 km, by LSR rules

• Performance results
  – 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  – 1.6 Gbps disk-to-disk performance
    • 1 server, 8 disks
    • 8 parallel TCP streams

Page 24

Bandwidth Challenge: Pittsburgh-Tokyo-CERN single-stream TCP

31,248 km connection between the SC2004 booth and CERN through Tokyo; latency 433 ms RTT.

[Figure: Path from the Data Reservoir booth at SC2004 (Pittsburgh) through the SCinet backbone and Abilene (Juniper T640, Force10 switches) to Chicago (StarLight), across CANARIE (Minneapolis - Calgary - Vancouver, ONS 15454s) and IEEAF/Tyco/WIDE across the Pacific to Seattle (Pacific Northwest Gigapop, Foundry NetIron 40G) and Tokyo (T-LEX), then via WIDE and APAN/JGN II (Procket routers) and SURFnet (HDXc, ONS 15454) across the Atlantic to Amsterdam (NetherLight, University of Amsterdam) and CERN (Geneva, Force10 switch, Fujitsu XG800). Endpoints at both sites: Opteron servers (dual Opteron 248, 2.2 GHz, 1 GB memory, Linux 2.6.6) with Chelsio T110 NICs; routers/L3 switches and L1/L2 switches are distinguished in the original diagram.]

Page 25

Pittsburgh-Tokyo-CERN single-stream TCP speed measurement

• Server configuration
  • CPU: dual AMD Opteron 248, 2.2 GHz
  • Motherboard: RioWorks HDAMA rev. E
  • Memory: 1 GB, Corsair Twinx CMX512RE-3200LLPT x 2, PC3200, CL=2
  • OS: Linux kernel 2.6.6
  • Disk: Seagate IDE 80 GB, 7200 rpm (disk speed not essential for performance)

• Network interface card
  • Chelsio T110 (10GBASE-SR), TCP offload engine (TOE)
  • PCI-X/133 MHz I/O bus connection
  • TOE function ON

Technical points
• Traditional tuning and optimization methods
  • Congestion window size, buffer size, queue sizes, etc.
  • Ethernet frame size (standard frame, jumbo frame)
• Tuning and optimization by inter-layer coordination (Univ. of Tokyo)
  • Fine-grained pacing by the offload engine
  • Optimization of the slow-start phase
  • Utilization of flow control

• Performance of single-stream TCP on the 31,248 km circuit
  • 7.21 Gbps (TCP payload), standard Ethernet frame ⇒ 224,985 terabit-meters/second, 80% more than the previous Internet Land Speed Record

[Figure: Network used in the experiment (Pittsburgh - Chicago - Minneapolis - Calgary - Vancouver - Seattle - Tokyo - Amsterdam - Geneva over Abilene, CANARIE, IEEAF/Tyco/WIDE, WIDE/APAN/JGN II, and SURFnet), and observed bandwidth on the network at the Tokyo T-LEX NI40G.]
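The buffer-size tuning above is governed by the bandwidth-delay product: a single TCP stream needs socket buffers of at least rate × RTT to keep the pipe full. With the slide's own numbers:

```python
# Minimum socket buffer (the bandwidth-delay product) for one TCP
# stream to fill a long-fat-pipe path.

def bdp_bytes(rate_bps, rtt_s):
    return rate_bps * rtt_s / 8

buf = bdp_bytes(7.21e9, 0.433)  # 7.21 Gbps payload, 433 ms RTT
print(f"needs about {buf / 2**20:.0f} MiB of socket buffer")
```

Hundreds of megabytes per connection is far beyond default kernel limits, which is why explicit buffer tuning was a prerequisite for these records.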

Page 26

Difficulty on SC2004 LSR challenge

• Method for performance debugging
  – Dividing the path (CERN to Tokyo, Tokyo to Pittsburgh)
• Jumbo frame performance (unknown reason)
• Stability of the network
• QoS problem in the Procket 8800 routers

Page 27

Year end experiments

• Target
  – > 30,000 km LSR distance
  – L3 switching at Chicago and Amsterdam
  – Period of the experiment
    • 12/20 – 1/3
    • Holiday season, when public research networks are relatively idle

• System configuration
  – A pair of Opteron servers with Chelsio T110 (at N-Otemachi)
  – Another pair of Opteron servers with Chelsio T110 for competing-traffic generation
  – ClearSight 10 Gbps packet analyzer for packet capturing

Page 28

Single-stream TCP: Tokyo – Chicago – Amsterdam – NY – Chicago – Tokyo

[Figure: Round-trip path from the Data Reservoir at the University of Tokyo (Opteron servers with Chelsio T110 NICs, Fujitsu XG800 switch, ClearSight 10 Gbps capture) through T-LEX and WIDE/IEEAF/Tyco to Seattle (Pacific Northwest Gigapop), across CANARIE (Vancouver - Calgary - Minneapolis, ONS 15454s, OME 6550) to Chicago (StarLight, Force10 E1200, Foundry NetIron 40G), over SURFnet OC-192 to Amsterdam (NetherLight, University of Amsterdam, Force10 E600) and to New York (MANLAN, HDXc, Juniper T640), returning via Abilene and APAN/JGN TransPAC (Procket 8801/8812, Cisco 12416/6509) to Tokyo. Links are OC-192 and 10GE WAN PHY; routers/L3 switches and L1/L2 switches are distinguished in the original diagram.]

Page 29

[Figure 1a: Network topology (v0.06, Dec 21, 2004, [email protected]) — device-level path from the Data Reservoir at Tokyo/N-Otemachi (opteron1 and opteron3 with T110 NICs, XG800, ClearSight 10GE analyzer on a tap, Foundry NI40G at T-LEX) through Seattle and Vancouver (UW, OME 6500), CANARIE and StarLight (E1200, PRO8801/PRO8812, ONS 15454s), MANLAN in NYC (HDXc, T640), and SURFnet to the UvA in Amsterdam (E600), returning via Abilene (T640, 12416 routers, 6509) and APAN/JGN II TransPAC. VLAN A (911), VLAN B (10), and VLAN D (914) carry the traffic over "operational and production-oriented high-performance research and education networks"; links are OC-192, 10GE-LW, 10GE-LR, and 10GE-SR.]

Page 30

Network used in the experiment

[Figure 2: Network connection — Tokyo - Seattle - Vancouver - Calgary - Minneapolis - Chicago - NYC - Amsterdam over IEEAF/Tyco/WIDE, CANARIE (CA*net 4), SURFnet, and APAN/JGN2/Abilene; routers or L3 switches and L1/L2 switches are marked.]

Page 31

Difficulty on Tokyo-AMS-Tokyo experiment

• Packet loss and flow control
  – About 1 loss per several minutes with flow control off
  – About 100 losses per several minutes with flow control on
  – Flow control at the edge switch does not work properly

• Network behavior differs by direction
• Artificial bursty behavior caused by switches and routers
  – Outgoing packets from the NIC are paced properly

• Jumbo frame packet losses
  – Packet losses begin from MTU 4500
  – NI40G and Force10 E1200, E600 are on the path

• Lack of peak network flow information
  – 5-minute average bandwidth is useless for TCP traffic
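Even one loss per several minutes is significant at this scale. The well-known Mathis et al. approximation bounds loss-limited standard-TCP throughput at about (MSS/RTT)·C/√p; inverting it shows how small the loss rate must be to sustain multi-gigabit rates over a 433 ms RTT (a sketch using the earlier experiment's rate and RTT):

```python
import math

# Mathis et al. approximation: loss-limited TCP throughput is roughly
#   rate ≈ (MSS * 8 / RTT) * C / sqrt(p),  with C ≈ 1.22.
# Inverted, it gives the largest loss rate p compatible with a target rate.

def max_loss_rate(target_bps, rtt_s, mss_bytes=1460, c=1.22):
    """Largest packet-loss probability that still allows target_bps."""
    return (mss_bytes * 8 * c / (rtt_s * target_bps)) ** 2

p = max_loss_rate(7.21e9, 0.433)  # the slide's 7.21 Gbps at 433 ms RTT
print(f"needs loss rate below {p:.1e}")
```

The tolerable loss rate comes out around 2 × 10⁻¹¹, i.e. on the order of one loss per tens of billions of packets, which is why a handful of losses per minute dominates the tuning effort.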

Page 32

Network traffic on routers and switches

[Traffic graphs: StarLight Force10 E1200; University of Amsterdam Force10 E600; Abilene T640 NYCM to CHIN; TransPAC Procket 8801. The submitted run is marked.]

Page 33

History of single-stream IPv4 Internet Land Speed Record

[Chart: distance-bandwidth product (Tbit·m/s, log scale from 1,000 to 1,000,000) versus year, 2000-2007. 2004/11/9, Data Reservoir project / WIDE project: 148,850 Tbit·m/s Internet Land Speed Record. 2004/12/25, Data Reservoir project / WIDE project / SURFnet / CANARIE: 216,300 Tbit·m/s experimental result.]

Page 34

Setting for instrumentation

• Capture packets at both source end and destination end

• Packet filtering by src-dst IP address

• Capture up to 512 MB

• Precise packet counting was very useful

[Figure: The network under test sits between source and destination servers; a Fujitsu XG800 10G L2 switch, or a tap, mirrors the traffic to the ClearSight 10 Gbps capture system.]
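The 512 MB limit bounds how much of a run can be recorded losslessly; at 10 Gbps line rate the window is under half a second (simple arithmetic on the slide's numbers):

```python
# How long a fixed capture buffer lasts at line rate — why lossless
# 10 Gbps capture windows are short and triggering matters.

def capture_window_s(buffer_bytes, line_rate_bps):
    return buffer_bytes * 8 / line_rate_bps

t = capture_window_s(512 * 2**20, 10e9)  # 512 MB buffer at 10 Gbps
print(f"buffer covers about {t:.2f} s of line-rate traffic")
```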

Page 35

Packet count example

Page 36

Traffic monitor example

Page 37

Possible arrangement of devices

• 1U lossless packet-capturing box developed at U. of Tokyo
  – Two 10GBASE-LR interfaces
  – Lossless capture up to 1 GB
  – Precise time stamping

• Packet-generating server
  – 3U Opteron server + 10 Gbps Ethernet adaptor
  – Pushes the network at more than 5 Gbps TCP/UDP

• Taps and tap-selecting switch
  – Currently, port mirroring is not good for instrumentation
  – Captures backbone traffic and server in/out traffic

Page 38

2 port packet capturing box

• Originally designed for emulating a 10 Gbps network
  – Two 10GBASE-LR interfaces
  – 0 to 1000 ms artificial latency (very long FIFO queue)
  – Error rate 0 to 0.1 % (in 0.001 % steps)

• Lossless packet capturing
  – Utilizes the FIFO queue capability
  – Data uploadable through USB
  – Cheaper than ClearSight or Agilent Technologies analyzers

Page 39

Setting for instrumentation

• Capture packets at both source end and destination end

• Packet filtering by src-dst IP address

[Figure: Two placements of the 2-port packet-capture box — inline next to a backbone router or switch, with a server + 10G Ethernet adaptor driving the network under test, or attached via a TAP on the path.]

Page 40

More expensive plan

• Capture packets at both source end and destination end

• Packet filtering by src-dst IP address

[Figure: As on the previous slide, but with TAPs at both ends of the network under test, each feeding its own 2-port packet-capture box.]

Page 41

Summary

• Single-stream TCP
  – We removed the TCP-related difficulties
  – Now I/O bus bandwidth is the bottleneck
  – Cheap and simple servers can enjoy a 10 Gbps network

• Lack of methodology for high-performance network debugging
  – 3 days of debugging (working overnight)
  – 1 day of stable operation (usable for measurements)
  – Networks seem to "feel fatigue": some trouble always happens
  – We need something more effective

• Detailed issues
  – Flow control (and QoS)
  – Buffer size and policy
  – Optical-level settings

Page 42

Thanks