Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

P. Balaji, K. Vaidyanathan, S. Narravula, H.-W. Jin and D. K. Panda
Network Based Computing Laboratory (NBCL), Computer Science and Engineering, Ohio State University
D. K. Panda (The Ohio State University), 04/26/06


Page 1: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

D. K. Panda (The Ohio State University), 04/26/06

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

P. Balaji, K. Vaidyanathan, S. Narravula, H.-W. Jin and D. K. Panda

Network Based Computing Laboratory (NBCL)

Computer Science and Engineering

Ohio State University

Page 2: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Introduction and Motivation

• Interactive Data-driven Applications
  – Scientific as well as Enterprise/Commercial Applications
    • Static Datasets: Medical Imaging Modalities
    • Dynamic Datasets: Stock value datasets, E-commerce, Sensors
  – E-science
  – Ability to interact with, synthesize and visualize large datasets
  – Data-centers enable such capabilities
• Clients initiate queries (over the web) to process specific datasets
  – Data-centers process data and reply to queries

Page 3: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Typical Multi-Tier Data-center Environment

• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
• If not cached, the request is forwarded to the next tier (the application server)
• The application server performs the business logic (CGI, Java servlets, etc.)
  – Retrieves appropriate data from the database to process the requests

[Figure: multi-tier data-center with Clients, WAN, Proxy Server, Web Server (Apache), Application Server (PHP), Database Server (MySQL) and Storage; annotated "More Computation and Communication Requirements".]

Page 4: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Limitations of Current Data-centers

• Communication Requirements
  – TCP/IP used even in the data-center: sub-optimal performance
    • InfiniBand and other interconnects provide more features
  – High Performance Sockets (e.g., SDP)
    • Superior performance with no modifications
• Advanced Data-center Services
  – Minimize the computation requirements
    • Improved caching of documents
    • Issues with caching Dynamic (or Active) Content
  – Maximize compute resource utilization
    • Efficient resource monitoring and management
    • Issues with heterogeneous load characteristics of data-centers

Page 5: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Proposed Architecture

[Figure: layered architecture alongside the Existing Data-Center Components, top to bottom:]

• Advanced System Services: Dynamic Content Caching, Active Resource Adaptation
• Data-Center Service Primitives: Soft Shared State, Distributed Lock Manager, Global Memory Aggregator, Point-To-Point
• Advanced Communication Protocols and Subsystems: Sockets Direct Protocol, Async. Zero-copy Communication, Packetized Flow-control, Protocol Offload, RDMA, Atomic, Multicast
• Network

Page 6: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services


Presentation Layout

Introduction and Motivation

Advanced Communication Protocols and Subsystems

Data-center Service Primitives

Dynamic Content Caching Services

Active Resource Adaptation Services

Conclusions and Ongoing Work

Page 7: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

High Performance Sockets (e.g., SDP)

The Sockets Protocol Stack

[Figure: two stacks side by side. Berkeley sockets implementation: App #1, App #2, ..., App #N, Sockets Interface, Traditional Sockets, TCP, IP, Device Driver, High-speed Network. High-performance sockets: Application, Sockets Interface, Traditional Sockets (TCP, IP, Device Driver) next to a Lower-level Interface, High-speed Network with Offloaded Protocol and Advanced Features.]

The Sockets Protocol Stack allows applications to utilize the network performance and capabilities with NO or MINIMAL modifications

Page 8: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

InfiniBand and Features

• An emerging open standard high performance interconnect
• High Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (~1.0-3.0 microsec), high bandwidth (~10-20 Gbps) and low CPU utilization (5-10%)
• Flexibility for WAN communication
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (unique)
    • Enable high performance and scalable implementations of distributed locks, semaphores, collective communication operations
• Range of Network Features and QoS Mechanisms
  – Service Levels (priorities)
  – Virtual lanes
  – Partitioning
  – Multicast
    • Allow the design of a new generation of scalable communication and I/O subsystems with QoS

Page 9: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

SDP Latency and Bandwidth

[Figure, two panels. Latency: latency (usec) and CPU utilization (%) vs. message size (2 bytes to 4K). Unidirectional Bandwidth: bandwidth (Mbps) and CPU utilization (%) vs. message size (4 bytes to 64K). Series: TCP/IP, SDP, Native IBA, TCP/IP CPU, SDP CPU.]

“Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial?”, P. Balaji, S. Narravula, K. Vaidyanathan, K. Savitha, D. K. Panda. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), ’04.

Page 10: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Zero-Copy Communication for Sockets

[Figure: sender/receiver timeline. For each send (Buffer 1, then Buffer 2) the sender issues SRC AVAIL, the receiver performs Get Data and returns GET COMPLETE, and the send completes. The application blocks until each send completes.]

Page 11: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Asynchronous Zero-Copy SDP

[Figure: sender/receiver timeline. On each send (Buffer 1, Buffer 2) the buffer is memory-protected (Memory Protect); SRC AVAIL, Get Data and GET COMPLETE follow, and each buffer is released with Memory Unprotect.]

Page 12: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Throughput and Comp./Comm. Overlap

[Figure, two panels. Throughput: throughput (Mbps) vs. message size (1 byte to 1M). Comp./Comm. Overlap: throughput (Mbps) vs. delay (0 to 200 usec). Series: BSDP, ZSDP, AZ-SDP.]

“Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol (SDP) over InfiniBand”. P. Balaji, S. Bhagvat, H.-W. Jin and D. K. Panda. Workshop on Communication Architecture for Clusters (CAC); with IPDPS ’06.

Page 13: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services


Presentation Layout

Introduction and Motivation

Advanced Communication Protocols and Subsystems

Data-center Service Primitives

Dynamic Content Caching Services

Active Resource Adaptation Services

Conclusions and Ongoing Work

Page 14: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Data-Center Service Primitives

• Common Services needed by Data-Centers
  – Better resource management
  – Higher performance provided to higher layers
• Service Primitives
  – Soft Shared State
  – Distributed Lock Management
  – Global Memory Aggregator
• Network Based Designs
  – RDMA, Remote Atomic Operations

Page 15: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Soft Shared State

[Figure: six Data-Center Application instances around a central Shared State; three issue Put operations and three issue Get operations.]
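The Get/Put interface on this slide can be illustrated with a minimal sketch. It assumes an in-process, lock-protected dictionary in place of the memory region that the design would expose over RDMA; the class and method names (`SoftSharedState`, `put`, `get`) and the per-key version number are hypothetical and not taken from the slides.

```python
import threading


class SoftSharedState:
    """Toy stand-in for a shared-state region (hypothetical API).

    A real design would keep the state in registered memory and reach it
    with RDMA operations; here a lock-protected dict plays that role.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._store = {}  # key -> (version, value)

    def put(self, key, value):
        """Write a value and return its new version number."""
        with self._lock:
            version = self._store.get(key, (0, None))[0] + 1
            self._store[key] = (version, value)
            return version

    def get(self, key, default=None):
        """Read the latest (version, value) pair for a key."""
        with self._lock:
            return self._store.get(key, (0, default))


state = SoftSharedState()
state.put("load:app-server-1", 0.42)   # one application publishes its load
state.put("load:app-server-1", 0.57)   # and later refreshes it
version, load = state.get("load:app-server-1")  # another application reads it
```

Keeping a version next to each value is one simple way to let readers detect that a state entry has changed since they last looked.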

Page 16: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services


Presentation Layout

Introduction and Motivation

Advanced Communication Protocols and Subsystems

Data-center Service Primitives

Dynamic Content Caching Services

Active Resource Adaptation Services

Conclusions and Ongoing Work

Page 17: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Caching

• Dynamic data caching – challenging!
• Cache Consistency and Coherence
  – Become more important than in static case

[Figure: User Requests arrive at the Proxy Nodes; an Update is applied at the Back-End Nodes.]

Page 18: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Cache Design

• Efficient mechanisms needed
  – RDMA based design
  – Load resiliency
• Our cooperation protocols
  – No-Dependency
  – Invalidate-All
• Client Polling based design

Page 19: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

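RDMA based Client Polling Design

[Figure: front-end/back-end timeline. For a request, the front-end performs a Version Read on the back-end. On a Cache Hit it returns the Response itself; on a Cache Miss the request goes to the back-end, which returns the Response.]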
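The client-polling flow can be sketched as follows. This is a single-process illustration under stated assumptions: `read_version` stands in for a one-sided RDMA read of a version counter kept on the back-end, and all class, method and URL names are hypothetical.

```python
class BackEnd:
    """Holds the authoritative documents and one version counter per document."""

    def __init__(self):
        self.docs = {}      # url -> content
        self.versions = {}  # url -> int, bumped on every update

    def update(self, url, content):
        self.docs[url] = content
        self.versions[url] = self.versions.get(url, 0) + 1

    def read_version(self, url):
        # Stands in for a one-sided RDMA read of the version counter.
        return self.versions.get(url, 0)

    def handle_request(self, url):
        # Full request processing on the back-end (the expensive path).
        return self.docs[url], self.versions[url]


class FrontEnd:
    """Proxy that validates its cached copy before every reply."""

    def __init__(self, back_end):
        self.back_end = back_end
        self.cache = {}  # url -> (content, version)
        self.hits = 0
        self.misses = 0

    def request(self, url):
        current = self.back_end.read_version(url)  # "Version Read"
        cached = self.cache.get(url)
        if cached is not None and cached[1] == current:
            self.hits += 1           # cache hit: answer locally
            return cached[0]
        self.misses += 1             # cache miss or stale copy
        content, version = self.back_end.handle_request(url)
        self.cache[url] = (content, version)
        return content


back = BackEnd()
back.update("/quote", "price=10")
front = FrontEnd(back)
first = front.request("/quote")    # miss: fetched from the back-end
second = front.request("/quote")   # hit: version unchanged
back.update("/quote", "price=11")  # the document changes at the back-end
third = front.request("/quote")    # stale copy detected, refetched
```

Because the version check is a read of back-end memory rather than a message handled by a back-end process, a loaded back-end would not delay it in an RDMA-based design; the sketch does not model that timing.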

Page 20: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Caching - Performance

[Figure, two panels. Data-Center Throughput: throughput for Trace 2 to Trace 5 (traces with increasing update rate); series: No Cache, Invalidate All, Dependency Lists. Effect of Load: throughput vs. load (0 to 64 compute threads); series: No Cache, Dependency Lists.]

• Higher overall performance – up to an order of magnitude
• Performance is sustained under loaded conditions

Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand. S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. CCGrid-2005

Page 21: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Multi-tier Cooperative Caching

• RDMA based schemes
• Effective use of system-wide memory from across multiple tiers
• Significant performance benefits
  – Our Schemes
    • BCC, CCWR, MTACC and HYBCC
  – Up to 2-3 times compared to the base case

[Figure: Performance Improvement. Improvement ratio (0 to 3) for BCC, CCWR, MTACC and HYBCC; series: 8k, 16k, 32k, 64k.]

S. Narravula, H.-W. Jin, K. Vaidyanathan and D. K. Panda, Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks. IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 06).

Page 22: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services


Presentation Layout

Introduction and Motivation

Advanced Communication Protocols and Subsystems

Data-center Service Primitives

Dynamic Content Caching Services

Active Resource Adaptation Services

Conclusions and Ongoing Work

Page 23: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Resource Adaptation

• Increasing popularity of Shared data-centers
• How to decide the number of proxy nodes vs. application servers vs. database servers
• Current approach
  – Use a rigid configuration
  – Over-Provisioning
• Active Resource Adaptation
  – Reconfigure nodes from one tier to another tier
  – Allocate resources based on system load and traffic pattern
  – Meet QoS and Prioritization constraints
  – Load Resiliency

Page 24: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Resource Adaptation in Shared Data-Centers

[Figure: clients reach three load-balancing clusters (Site A, Site B, Site C) over the WAN; the clusters front the servers of Website A (low priority), Website B (medium priority) and Website C (high priority).]

Reconf-PQ reconfigures nodes for different websites but also guarantees a fixed number of nodes to low-priority requests

Hard QoS Maintained

Page 25: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Resource Adaptation Design

[Figure: timeline between a Load Balancer and the servers of Website A (Not Loaded) and Website B (Loaded). Steps shown: Load Query to each server (RDMA), Successful Atomic (Lock), Successful Atomic (Update Counter), Reconfigure Node, Successful Atomic (Unlock), after which the load is shared.]
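The lock, update-counter and unlock steps in this design can be sketched with the two atomic operations InfiniBand provides, compare-and-swap and fetch-and-add. The sketch below is a single-process model: `RemoteWord` is a hypothetical stand-in for a word in a remote node's memory, and the policy of keeping at least one node per site is an assumption added for the example.

```python
import threading


class RemoteWord:
    """A word on a remote node, changed only through atomic operations.

    The mutex only models the atomicity the network adapter would provide.
    """

    def __init__(self, value=0):
        self._value = value
        self._mutex = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Store `new` if the word equals `expected`; return the old value."""
        with self._mutex:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def fetch_and_add(self, delta):
        """Add `delta` to the word; return the value before the addition."""
        with self._mutex:
            old = self._value
            self._value = old + delta
            return old


UNLOCKED, LOCKED = 0, 1


def reconfigure_one_node(lock, donor_nodes, receiver_nodes):
    """Move one node from the lightly loaded site to the loaded site.

    Returns True if this caller performed the move, False if another
    load balancer held the lock or the donor has no node to spare.
    """
    if lock.compare_and_swap(UNLOCKED, LOCKED) != UNLOCKED:
        return False                      # atomic (lock) failed
    try:
        if donor_nodes.fetch_and_add(0) <= 1:
            return False                  # assumed policy: keep one node per site
        donor_nodes.fetch_and_add(-1)     # atomic (update counter)
        receiver_nodes.fetch_and_add(1)
        return True
    finally:
        lock.compare_and_swap(LOCKED, UNLOCKED)   # atomic (unlock)


lock = RemoteWord(UNLOCKED)
site_a_nodes = RemoteWord(4)   # Website A: not loaded
site_b_nodes = RemoteWord(2)   # Website B: loaded
moved = reconfigure_one_node(lock, site_a_nodes, site_b_nodes)
```

A failed compare-and-swap tells a load balancer that a reconfiguration is already in progress, so it can back off without involving any process on the remote node.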

Page 26: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Dynamic Reconfigurability using RDMA operations

[Figure, two panels. Throughput: TPS (0 to 60000) for 1K, 2K, 4K, 8K and 16K; series: Rigid, Reconf, Over-Provisioning. QoS Meeting Capability: % of QoS met (0% to 100%) for Case 1 to Case 3; series: Reconf, Reconf-P, Reconf-PQ.]

“On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand”. P. Balaji, S. Narravula, K. Vaidyanathan, H.-W. Jin and D. K. Panda. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) ’05.

Page 27: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services


Presentation Layout

Introduction and Motivation

Advanced Communication Protocols and Subsystems

Data-center Service Primitives

Dynamic Content Caching Services

Active Resource Adaptation Services

Conclusions and Ongoing Work

Page 28: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Conclusions

• Proposed a novel framework for data-centers to address the current limitations
  – Low performance due to high communication overheads
  – Lack of efficient support of advanced features such as active caching, dynamic resource adaptation, etc.
• Three-layer Architecture
  – Communication Protocol Support
  – Data-Center Primitives
  – Data-Center Services
• Novel approaches using the advanced features of InfiniBand
  – Resilient to the load on the back-end servers
  – Order of magnitude performance gain for several scenarios

Page 29: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Work-in-Progress

• Data-Center Primitives
  – Efficient System-Wide Soft Shared State Mechanisms
  – Efficient Distributed Lock Manager Mechanisms
• Fine-Grained Active Resource Adaptation
  – Fine-grain resource monitoring
  – Resource adaptation with database servers and multi-stage reconfigurations
• Detailed Data-Center Evaluation with the proposed framework

Page 30: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services


Web Pointers

Website: http://www.cse.ohio-state.edu/~panda

Group Homepage: http://nowlab.cse.ohio-state.edu

Email: [email protected]

NBCL

Page 31: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Backup Slides (Sockets Direct Protocol)

Page 32: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Sockets Direct Protocol (SDP)

• IBA Specific Protocol for Data-Streaming
• Defined to serve two purposes:
  – Maintain compatibility for existing applications
  – Deliver the high performance of IBA to the applications
• Two approaches for data transfer: Copy-based and Z-Copy
• Z-Copy specifies Source-Avail and Sink-Avail messages
  – Source-Avail allows destination to RDMA Read from source
  – Sink-Avail allows source to RDMA Write to the destination
• Current implementation limitations:
  – Only supports the Copy-based implementation
  – Does not support Source-Avail and Sink-Avail
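The two Z-Copy exchanges can be sketched as below. This is an illustration only: a slice assignment between two `bytearray` objects models the RDMA transfer, the function names are hypothetical, and the message labels are the ones used on the zero-copy slides of this deck (SRC AVAIL / GET COMPLETE and SINK AVAIL / PUT COMPLETE).

```python
def source_avail_transfer(src_buffer, dst_buffer):
    """Source-Avail path: the destination pulls the data with an RDMA Read."""
    log = ["SRC AVAIL"]                  # source advertises its buffer
    n = min(len(src_buffer), len(dst_buffer))
    dst_buffer[:n] = src_buffer[:n]      # models the RDMA Read by the destination
    log.append("RDMA READ")
    log.append("GET COMPLETE")           # destination reports completion
    return n, log


def sink_avail_transfer(src_buffer, dst_buffer):
    """Sink-Avail path: the source pushes the data with an RDMA Write."""
    log = ["SINK AVAIL"]                 # destination advertises its buffer
    n = min(len(src_buffer), len(dst_buffer))
    dst_buffer[:n] = src_buffer[:n]      # models the RDMA Write by the source
    log.append("RDMA WRITE")
    log.append("PUT COMPLETE")           # source reports completion
    return n, log


payload = bytearray(b"hello data-center")
user_buffer = bytearray(len(payload))
copied, messages = source_avail_transfer(payload, user_buffer)
```

In both paths the data moves directly between application buffers, which is what distinguishes Z-Copy from the copy-based path through intermediate socket buffers.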

Page 33: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

High Performance Sockets (e.g., SDP)

The Sockets Protocol Stack

[Figure: the same two stacks as on Page 7, the Berkeley sockets implementation next to the high-performance sockets stack with a Lower-level Interface, Offloaded Protocol and Advanced Features.]

The Sockets Protocol Stack allows applications to utilize the network performance and capabilities with NO or MINIMAL modifications

Page 34: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing High-Performance Sockets

• Basic Idea of High-Performance Sockets
  – “Hijack” standard sockets calls to use our implementation of sockets
  – Hijacking is done through environment variables: non-intrusive
• TCP/IP based sockets
  – Uses simple yet generic approaches for data communication
  – Copy data to temporary buffers
  – Credit-based flow-control mechanism to avoid overrunning the receiver
• High-performance Sockets can use similar approaches
  – Network deals with reliability, data integrity, etc.
  – Some performance benefits are possible
  – Several disadvantages
  – Advanced mechanisms (e.g., RDMA) are not utilized

Page 35: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

TCP/IP-like Credit-based Flow Control

[Figure: sender (Application Buffer, Sockets Buffers) and receiver (Sockets Buffers, Application Buffer); the receiver returns an ACK.]

Page 36: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Limitations with Credit-based Flow Control

[Figure: sender (Application Buffers, Sockets Buffers, Credits = 4) and receiver (Sockets Buffers, Application Buffers Not Posted); the receiver returns an ACK.]

• Receiver controlled buffer management
  – Statically sized temporary buffers
• Can lead to excessive wastage of buffers
  – E.g., if application buffers are 1 byte each and the socket buffers are 8KB each
  – 99.98% of the socket buffers remain unused
• All messages going out on the network are 1 byte each
  – Network performance is under-utilized for small messages
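The wastage figure follows from simple arithmetic: a 1-byte message occupying an 8KB (8192-byte) socket buffer leaves 1 - 1/8192 of it unused, about 99.988%, in line with the slide's 99.98%. A small check, with a hypothetical helper name:

```python
def unused_fraction(message_bytes, socket_buffer_bytes):
    """Fraction of one socket buffer left unused by a single message."""
    if not 0 < message_bytes <= socket_buffer_bytes:
        raise ValueError("message must fit in one socket buffer")
    return 1.0 - message_bytes / socket_buffer_bytes


wasted = unused_fraction(1, 8 * 1024)
print(f"{wasted:.4%} of each 8KB buffer is unused")
```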

Page 37: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Packetized Flow-Control

[Figure: sender (Application Buffers, Sockets Buffers, Credits = 4) and receiver (Sockets Buffers, Application Buffers Not Posted); the receiver returns an ACK.]

• Packetization: Socket buffer is packetized to 1 byte granularity
  – Sender side buffer management
• Utilizes advanced network features such as RDMA
  – Avoids buffer wastage when transmitting small messages
  – Improves throughput for small messages
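A rough way to see the benefit is to count how many small messages the sender can have outstanding before it must wait for an ACK. The sketch assumes the example values from these slides (4 credits, 8KB socket buffers, 1-byte messages) and, for the packetized case, that the sender may pack messages back to back into the receiver's socket buffer space; the function names are hypothetical.

```python
def in_flight_credit_based(num_buffers, message_bytes, buffer_bytes):
    """Credit-based: every message consumes one whole receiver buffer."""
    if message_bytes > buffer_bytes:
        raise ValueError("message must fit in one socket buffer")
    return num_buffers


def in_flight_packetized(num_buffers, message_bytes, buffer_bytes):
    """Packetized: messages are packed at byte granularity by the sender."""
    return (num_buffers * buffer_bytes) // message_bytes


credit = in_flight_credit_based(4, 1, 8 * 1024)
packed = in_flight_packetized(4, 1, 8 * 1024)
print(credit, packed)
```

Under these assumptions the same receiver memory holds 4 one-byte messages with credit-based flow control and 32768 with packetized flow control.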

Page 38: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

High Performance Sockets over VIA

[Figure, two panels. Latency: latency (usecs) vs. message size (4 bytes to 4K). Unidirectional Bandwidth: bandwidth (Mbps) vs. message size (4 bytes to 64K). Series: VIA, SocketVIA, TCP/IP.]

“Impact of High Performance Sockets on Data Intensive Applications”, P. Balaji, J. Wu, T. Kurc, U. Catalyurek, D. K. Panda and J. Saltz. In the proceedings of IEEE High Performance Distributed Computing (HPDC) ’03.

Page 39: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Evaluating Sockets over VIA (Data-Cutter Library)

• Designed by University of Maryland
• Component framework
• User-defined pipeline of components
  – Stream based communication
  – Flow control between components
• Several applications supported
  – Virtual Microscope
  – ISO Surface Oil Reservoir Simulator

[Figure: Virtual Microscope, with TCP and HPS shown against the required bandwidth (Reqd BW).]

Page 40: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Virtual Microscope Application

• Blind run
  – Performance benefits: 3.5 times
• After re-distributing data
  – Read chunks are smaller
  – Load balancing is more fine-grained
  – Benefits: Order of magnitude
  – Can reach better image fetch rates
  – Note: still NO application changes!

[Figure: update latency (usecs, 0 to 4000) vs. guaranteed image fetch rate; series: TCP, SocketVIA, SocketVIA (with DR).]

“Impact of High Performance Sockets on Data Intensive Applications”, P. Balaji, J. Wu, T. Kurc, U. Catalyurek, D. K. Panda and J. Saltz. In the proceedings of IEEE High Performance Distributed Computing (HPDC) ’03.

Page 41: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Parallel Virtual File System (PVFS)

[Figure: four Compute Nodes connected through the Network to a Meta-Data Manager (holding the meta-data) and three I/O Server Nodes (each holding data).]

• Relies on Striping of data across different nodes
• Tries to aggregate I/O bandwidth from multiple nodes
• Utilizes the local file system on the I/O Server nodes
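Striping can be illustrated with a short mapping from a file offset to the I/O server that stores it. The sketch assumes plain round-robin striping with a fixed stripe size; the 64 KB stripe size, the three servers and the function name are assumptions for the example, not values taken from the slides.

```python
def locate(offset, stripe_size, num_io_servers):
    """Map a file offset to (I/O server index, offset in that server's local file)."""
    stripe_index = offset // stripe_size
    server = stripe_index % num_io_servers
    # Each server stores every num_io_servers-th stripe back to back.
    local_offset = (stripe_index // num_io_servers) * stripe_size + offset % stripe_size
    return server, local_offset


STRIPE = 64 * 1024  # assumed stripe size
placement = [locate(off, STRIPE, 3) for off in (0, STRIPE, 2 * STRIPE, 200_000)]
```

A large sequential read touches consecutive stripes and therefore consecutive servers, which is how bandwidth from several I/O nodes gets aggregated.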

Page 42: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Parallel I/O in Clusters via PVFS

• PVFS: Parallel Virtual File System
  – Parallel: stripe/access data across multiple nodes
  – Virtual: exists only as a set of user-space daemons
  – File system: common file access methods (open, read/write)
• Designed by ANL and Clemson

[Figure: applications use Posix or MPI-IO on top of libpvfs and communicate over the network with the mgr daemon (control) and the iod daemons (data), each iod on top of a local file system.]

“PVFS over InfiniBand: Design and Performance Evaluation”, Jiesheng Wu, Pete Wyckoff and D. K. Panda. International Conference on Parallel Processing (ICPP), 2003.

Page 43: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Evaluating Sockets over IBA (PVFS Performance)

[Figure, two panels. Read Bandwidth (3 IODs) and Write Bandwidth (3 IODs): bandwidth (MBps) vs. number of clients (1 to 5). Series: TCP/IP, SDP, Native IBA.]

“Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial?”, P. Balaji, S. Narravula, K. Vaidyanathan, K. Savitha, D. K. Panda. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), ’04.

“The Convergence of Ethernet and Ethernot: A 10-Gigabit Ethernet Perspective”, P. Balaji, W. Feng and D. K. Panda. IEEE Micro Journal ’06.

Page 44: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

SDP Latency and Bandwidth

[Figure, two panels. Latency: latency (usec) and CPU utilization (%) vs. message size (2 bytes to 4K). Unidirectional Bandwidth: bandwidth (Mbps) and CPU utilization (%) vs. message size (4 bytes to 64K). Series: TCP/IP, SDP, Native IBA, TCP/IP CPU, SDP CPU.]

“Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial?”, P. Balaji, S. Narravula, K. Vaidyanathan, K. Savitha, D. K. Panda. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), ’04.

Page 45: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Data-Center Response Time

[Figure, two panels. Client Response Time: response time (ms) vs. message size (32K to 2M bytes). Web Server Delay: time spent (ms) vs. message size (32K to 2048K bytes). Series: IPoIB, SDP.]

• SDP shows very little improvement: Client network (Fast Ethernet) becomes the bottleneck
• Client network bottleneck reflected in the web server delay: up to 3 times improvement with SDP

Page 46: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Data-Center Response Time (Fast Clients)

[Figure: response time (ms) vs. message size (32K to 2M bytes); series: IPoIB, SDP.]

• SDP performs well for large files; not very well for small files

Page 47: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Data-Center Response Time Split-up

[Figure: two pie charts of the response-time breakdown, labeled IPoIB and SDP. The values are listed in the order they appear; which chart carries which label is not recoverable here.]

Component          First chart   Second chart
Init + Qtime       8%            9%
Request Read       3%            3%
Core Processing    10%           12%
URL Manipulation   1%            1%
Back-end Connect   32%           14%
Request Write      2%            2%
Reply Read         14%           15%
Cache Update       2%            3%
Response Write     25%           38%
Proxy End          3%            3%

Page 48: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Data-Center Response Time (Without Connection Time Overhead)

[Figure: response time (ms) vs. message size (32K to 2M bytes); series: IPoIB, SDP.]

• Without the connection time, SDP would perform well for all file sizes

Page 49: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Zero-copy Communication

• Copy-based approaches can significantly limit performance
  – Excessive CPU utilization and memory traffic
  – Can limit performance to less than 35% of peak in some cases [jpdc05]

[Figure: two sender/receiver exchanges. First: SRC Available, RDMA Read Data, GET Complete. Second: SINK Available, RDMA Write Data, PUT Complete.]

“Exploiting NIC Architectural Support for Enhancing IP based Protocols on High Performance Networks”. H.-W. Jin, P. Balaji, C. Yoo, J. Y. Choi and D. K. Panda. Journal of Parallel and Distributed Computing (JPDC) ’05

Page 50: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Asynchronous Zero-copy Comm.: Design Issues

• Handling a Page Fault
  – Block-on-Write: Wait till the communication has finished
  – Copy-on-Write: Copy data to internal buffer and carry on communication
• Handling Buffer Sharing
  – Buffers shared through mmap()
• Handling Unaligned Buffers
  – Memory protection is only in the granularity of a page
  – Malloc hook overheads
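The unaligned-buffer issue can be made concrete with a little address arithmetic. Because protection applies to whole pages, protecting a buffer that does not start and end on page boundaries also protects neighbouring data in the same pages. The sketch assumes a 4096-byte page and uses a hypothetical helper name; a real implementation would query the system page size.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_to_protect(address, length, page_size=PAGE_SIZE):
    """Return (start, size) of the page-aligned range covering the buffer."""
    if length <= 0:
        raise ValueError("length must be positive")
    start = address - address % page_size
    end = ((address + length + page_size - 1) // page_size) * page_size
    return start, end - start


# A 512-byte send buffer that straddles a page boundary.
start, protected = pages_to_protect(address=0x1F00, length=512)
collateral = protected - 512   # bytes protected although not part of the buffer
```

Here a 512-byte buffer forces two full pages (8192 bytes) under protection, so a write to any of the other 7680 bytes would also trigger the page-fault handling described above.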

Page 51: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Impact of Page-faults on AZ-SDP

[Figure, two panels. Effect of Page Faults (1MB message size) and Effect of Page Faults (64KB message size): throughput (Mbps) vs. window size (1 to 10). Series: BSDP, ZSDP, AZ-SDP.]

• AZ-SDP has performance drawbacks if data is touched too often before send completes
• If applications don’t touch data frequently, AZ-SDP outperforms both the other schemes

Page 52: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Backup Slides (Shared State)

Page 53: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Backup Slides (Dynamic Content Caching)

Page 54: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Basic Client Polling Architecture

[Figure: front-end/back-end timeline. On a Cache Hit the front-end returns the Response itself; on a Cache Miss the Request goes to the back-end, which returns the Response.]

Page 55: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Caching Architecture

[Figure: Proxy Servers and Application Servers, each Server Node running a module (Mod); the modules cooperate across the two tiers (Cooperation).]

Cache Lookup Counter maintained on the Application Servers

Page 56: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Caching - Basic Design

• Home Node based Client Polling
  – Cache Documents assigned home nodes
• Proxy Server Modules
  – Client polling functionality
• Application Server Modules
  – Support “Version Reads” for client polling
  – Handle updates

Page 57: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Caching - Mapping Schemes

• Dependency Lists
  – Home node based
  – Complete dependency lists
    • Keep track of all dependencies
• Invalidate All
  – Single Lookup Counter for a given class of queries
  – Low application server overheads
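The trade-off between the two mapping schemes can be sketched as follows. The example assumes that cached documents depend on named database tables and that staleness is detected by comparing version counters; the class names, the `on_update` hook and the sample documents are hypothetical.

```python
class InvalidateAll:
    """One lookup counter for a whole class of queries; any update bumps it."""

    def __init__(self):
        self.counter = 0

    def version_for(self, doc):
        return self.counter

    def on_update(self, table):
        self.counter += 1


class DependencyLists:
    """One counter per document, bumped only when a table it reads changes."""

    def __init__(self, dependencies):
        self.dependencies = dependencies              # doc -> set of tables
        self.counters = {doc: 0 for doc in dependencies}

    def version_for(self, doc):
        return self.counters[doc]

    def on_update(self, table):
        for doc, tables in self.dependencies.items():
            if table in tables:
                self.counters[doc] += 1


def stale_docs(scheme, docs, table):
    """Documents whose cached copies become stale after `table` is updated."""
    before = {doc: scheme.version_for(doc) for doc in docs}
    scheme.on_update(table)
    return sorted(doc for doc in docs if scheme.version_for(doc) != before[doc])


deps = {"/cart": {"orders"}, "/catalog": {"items"}}
stale_all = stale_docs(InvalidateAll(), deps, "orders")
stale_dep = stale_docs(DependencyLists(deps), deps, "orders")
```

Invalidate-All does constant work per update but discards the unrelated `/catalog` entry as well, while the dependency-list scheme keeps it cached at the cost of tracking every dependency.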

Page 58: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Active Caching - Handling Updates

[Figure: message flow between the Application Servers and the Database Server. Labels shown: HTTP Request, DB Query (TCP), DB Response, Update Notification, VAPI Send, Local Search and Coherent Invalidate, Ack (Atomic), HTTP Response.]

Page 59: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Backup Slides (Active Resource Adaptation)

Page 60: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Efficient Fine-Grained Resource Monitoring

• Fine-grained resource monitoring can help in providing better system-level services like process migration, load balancing, etc.

• How can fine-grained and accurate resource information about loaded back-end servers be provided to the front-end node?

• Current approach
– Use a two-sided communication mechanism like TCP/IP
– Asynchronous vs. synchronous approach

• Can we design a fine-grained resource monitoring scheme using RDMA operations?
– Use RDMA operations in the kernel space and pin kernel data structures for capturing the system load
– Synchronous by nature
– Apart from accuracy and no back-end CPU involvement, this approach provides more system information like interrupts pending on CPUs

• Scheme can be used for other system-level services like reconfiguration, process migration
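The accuracy argument for a synchronous read can be illustrated with a toy simulation. It is a sketch under stated assumptions: no RDMA is involved, the `BackEnd`, `AsyncMonitor`, and `SyncMonitor` classes and the `interval` parameter are hypothetical, and a plain attribute read stands in for a one-sided read of a pinned load counter.

```python
class BackEnd:
    """Hypothetical back-end server; `load` stands in for a pinned kernel load counter."""

    def __init__(self):
        self.load = 0


class AsyncMonitor:
    """Back-end pushes its load every `interval` ticks; the front-end sees the last push."""

    def __init__(self, backend, interval):
        self.backend = backend
        self.interval = interval
        self.view = backend.load

    def tick(self, t):
        if t % self.interval == 0:
            self.view = self.backend.load  # periodic report arrives

    def observed(self):
        return self.view


class SyncMonitor:
    """Front-end reads the load at decision time (stand-in for a one-sided read)."""

    def __init__(self, backend):
        self.backend = backend

    def observed(self):
        return self.backend.load


def total_deviation(trace, interval):
    """Sum |actual - observed| over a load trace for both monitors."""
    backend = BackEnd()
    async_mon = AsyncMonitor(backend, interval)
    sync_mon = SyncMonitor(backend)
    async_dev = sync_dev = 0
    for t, actual in enumerate(trace):
        backend.load = actual
        async_mon.tick(t)
        async_dev += abs(actual - async_mon.observed())
        sync_dev += abs(actual - sync_mon.observed())
    return async_dev, sync_dev
```

In this idealized model the synchronous view never lags, while the asynchronous view accumulates error between reports; real deviations would also depend on network latency and back-end load, which the sketch ignores.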

Page 61: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Connection Load Accuracy and Impact on Load Balancing

[Figure: % Improvement (-10% to 40%) vs. Zipf alpha values (α = 0.9, 0.75, 0.5, 0.25); series: Socket-Sync, RDMA-Async, RDMA-Sync, e-RDMA-Sync]

Accuracy of RDMA-Sync closely matches the actual connection load in comparison with all other schemes

RDMA-Sync monitoring assists load-balancing in improving the throughput in comparison with the Socket-Async scheme

[Figure: Deviation (0 to 60) vs. Time; series: Socket-Async, Socket-Sync, RDMA-Async, RDMA-Sync]

Page 62: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Reconfiguration Implementation Details

• History Aware Reconfiguration

– Avoiding Server Thrashing by maintaining a history of the load pattern

• Reconfigurability Module Sensitivity

– Time Interval between two consecutive checks

• Maintaining a System Wide Shared State

• Shared State with Concurrency Control

• Tackling Load-Balancing Delays
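One way to picture history-aware reconfiguration is a trigger that fires only when the load has stayed high across several consecutive checks. The sketch below is an assumption about how such a history could be used; `HistoryAwareTrigger`, `threshold`, and `window` are hypothetical names and parameters, and the check interval mentioned above corresponds to how often `check` is called.

```python
from collections import deque


class HistoryAwareTrigger:
    """Fires only when every sample in the recent history exceeds the threshold."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.history = deque(maxlen=window)  # most recent load samples

    def check(self, load):
        """Record one load sample; return True if reconfiguration should start."""
        self.history.append(load)
        full = len(self.history) == self.history.maxlen
        return full and all(sample > self.threshold for sample in self.history)
```

A short load spike fills only part of the window, so it does not move a node back and forth between web-sites; a longer window trades responsiveness for stability.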

Page 63: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Locking Mechanisms

• We propose a two-level hierarchical locking mechanism
– Internal Lock for each web-site cluster
   • Only one load-balancer in a cluster can attempt a reconfiguration
– External Lock for performing reconfiguration
   • Only one web-site can convert any given node
– Both locks performed remotely using InfiniBand Atomic Operations

[Figure: Websites A, B, and C with Server and Load Balancer nodes; labels: Internal Lock, External Lock]
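The two-level locking scheme can be sketched with a simulated compare-and-swap word. This is an illustration only: `AtomicWord` is a hypothetical local stand-in for a remotely accessible word (a Python lock replaces the network adapter's atomicity), and `try_reconfigure` and its parameters are assumed names. Balancer identifiers must be nonzero because 0 marks a free lock.

```python
import threading

FREE = 0  # lock word value meaning "not held"


class AtomicWord:
    """Stand-in for a remotely accessible word updated with compare-and-swap."""

    def __init__(self, value=FREE):
        self._value = value
        self._guard = threading.Lock()

    def compare_and_swap(self, expected, new):
        # Returns the value seen before the operation, like a hardware CAS.
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old


def try_reconfigure(balancer_id, internal_lock, external_lock, convert_node):
    """Attempt one reconfiguration; return True if this balancer performed it."""
    # Level 1: only one load balancer per web-site cluster may proceed.
    if internal_lock.compare_and_swap(FREE, balancer_id) != FREE:
        return False
    try:
        # Level 2: only one web-site may convert the given node.
        if external_lock.compare_and_swap(FREE, balancer_id) != FREE:
            return False
        try:
            convert_node()
            return True
        finally:
            external_lock.compare_and_swap(balancer_id, FREE)
    finally:
        internal_lock.compare_and_swap(balancer_id, FREE)
```

Taking the internal lock first keeps contention on the external lock down to at most one load balancer per web-site.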

Page 64: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Tackling Load-Balancing Delays

• Load-Balancing Delays
– After a reconfiguration, balancing of load might take some time
– Locking mechanisms only ensure no simultaneous transitions
– We need to ensure that all load-balancers are aware of reconfigurations

[Figure: message sequence involving Server, Load Balancer, Website A, and Website B; states: Not Loaded, Loaded; steps: Load Query, Load Query, Successful Atomic (Lock), Successful Atomic (SUC), Reconfigure Node, Successful Atomic (Unlock), Load Shared, Load Shared]

• Dual Counters
– Shared Update Counter (SUC)
– Local Update Counter (LUC)

• On reconfiguration:
– LUC should be equal to SUC
– All remote SUCs are incremented
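A minimal sketch of the dual-counter rule follows. It reflects one reading of the slide: a load balancer may reconfigure only when its LUC equals its SUC, and a successful reconfiguration increments the SUC of every other load balancer. The `LoadBalancer` class, the `catch_up` step, and the `reconfigure` function are hypothetical, and a plain increment stands in for a remote atomic operation.

```python
class LoadBalancer:
    """Hypothetical load balancer holding the two counters."""

    def __init__(self, name):
        self.name = name
        self.suc = 0  # Shared Update Counter: incremented remotely by peers
        self.luc = 0  # Local Update Counter: reconfigurations this node has seen

    def is_up_to_date(self):
        return self.luc == self.suc

    def catch_up(self):
        # Assumed step: re-read load information, then acknowledge the updates.
        self.luc = self.suc


def reconfigure(initiator, balancers):
    """Perform a reconfiguration only if the initiator has seen all earlier ones."""
    if not initiator.is_up_to_date():
        return False
    for peer in balancers:
        if peer is not initiator:
            peer.suc += 1  # stands in for a remote atomic increment
    return True
```

A load balancer whose counters differ knows that its load information predates some reconfiguration, so it refreshes its view before deciding to convert another node.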

Page 65: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Basic Dynamic Reconfigurability Performance

[Figure: Transactions per Second (0 to 50000) vs. Burst Length (requests: 1K, 2K, 4K, 8K, 16K); series: Rigid (Small), Reconf, Rigid (Large)]

• Large Burst Length allows reconfiguration of the system closer to the best case; reconfiguration time is negligible

• Performs comparably with the static scheme for small burst sizes

[Figure: Number of busy nodes (0 to 7) vs. Iterations (0, 9421, 18520, 23570, 28570); series: Reconf, Rigid-Small, Rigid-Large]

Page 66: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Reconfigurability Performance with Prioritization and QoS

[Figure: High Priority Requests Performance; Transactions per Second (0 to 25000) for Case 1, Case 2, Case 3; series: Reconf, Reconf-P, Reconf-PQ]

• Reconf does not perform any additional reconfiguration

• Reconf and Reconf-P allocate the maximum number of nodes to the low-priority website, whereas Reconf-PQ allocates nodes according to the QoS guaranteed to that website

[Figure: Low Priority Requests Performance; Transactions per Second (0 to 25000) for Case 1, Case 2, Case 3; series: Reconf, Reconf-P, Reconf-PQ]

Case 1: A load of high priority requests arrives when a load of low priority requests already exists

Case 2: A load of low priority requests arrives when a load of high priority requests already exists

Case 3: Both high and low priority requests arrive simultaneously

Page 67: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

QoS Meeting Capability

[Figure: Hard QoS Meeting Capability (High Priority Requests); % of times QoS met (0% to 100%) for Case 1, Case 2, Case 3; series: Reconf, Reconf-P, Reconf-PQ]

• Reconf and Reconf-P perform well only in some cases and lack consistency in providing the guaranteed QoS requirements to both websites

• Reconf-PQ meets the guaranteed QoS requirements in all cases

[Figure: Hard QoS Meeting Capability (Low Priority Requests); % of times QoS met (0% to 100%) for Case 1, Case 2, Case 3; series: Reconf, Reconf-P, Reconf-PQ]

Page 68: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

QoS Meeting Capability – Zipf and Worldcup Traces

[Figure: Hard QoS Meeting Capability (Low Priority Requests); % of times QoS met (0% to 100%) for Case 1, Case 2, Case 3; series: Reconf, Reconf-P, Reconf-PQ]

• Similar trends are seen for Zipf and Worldcup traces with QoS meeting capability of nearly 100% for Reconf-PQ

[Figure: second chart with the same title, Hard QoS Meeting Capability (Low Priority Requests); % of times QoS met (0% to 100%) for Case 1, Case 2, Case 3; series: Reconf, Reconf-P, Reconf-PQ]

Page 69: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Backup Slides (Soft Shared State)

Page 70: Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Efficient Soft Shared State Primitive

• Higher-level services use some kind of a shared state

• Current approach
– Lack of a software layer; ad hoc in manner
– Uses two-sided communication mechanism like TCP/IP
– Does not cater to the requirements of higher-level services such as coherency, consistency, timestamping, etc.

• Need for Soft Shared State Primitive
– Ease of use, simple operations like get(), put()
– Better Performance using advanced operations such as RDMA and atomics

• Proposed Architecture
– Coherent Shared State
– Non-Coherent Shared State
– Timestamp-based Shared State
– Memory Stacked Shared State
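To make the get()/put() interface concrete, here is a small single-process sketch of a timestamp-based variant. Only the operation names `get` and `put` come from the slide; the `SoftSharedState` class, the version numbers, the `max_age` parameter, and the injectable `clock` are assumptions added for illustration, and a local dictionary stands in for memory that would be reached with RDMA and atomic operations.

```python
import time


class SoftSharedState:
    """Hypothetical shared-state store with versioned, timestamped entries."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (value, version, timestamp)

    def put(self, key, value):
        """Store a value and return its new version number."""
        _, version, _ = self._store.get(key, (None, 0, 0.0))
        self._store[key] = (value, version + 1, self._clock())
        return version + 1

    def get(self, key, max_age=None):
        """Return (value, version), or None if the key is missing or too old."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, version, stamp = entry
        if max_age is not None and self._clock() - stamp > max_age:
            return None  # entry is staler than the caller tolerates
        return value, version
```

A caller that needs fresh data passes a small `max_age`, while a caller that tolerates staleness omits it; this is one possible way to let a single primitive serve services with different consistency needs.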