
TPT-RAID: A High Performance Multi-Box Storage System

Erez Zilber, Yitzhak Birk

Technion

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

Basic Terminology

- SCSI (Small Computer System Interface): standard protocol between computers and peripheral devices (mainly storage devices). Developed in the T10 working group of ANSI. Uses a client-server model.
- iSCSI (Internet SCSI): mapping of SCSI over TCP. The iSCSI client (e.g., a host computer) is called the 'initiator'; the iSCSI server (e.g., a disk box) is called the 'target'.

Basic Terminology (cont.)

- RAID (Redundant Array of Inexpensive Disks): using multiple drives to store data, with redundancy spread among the drives. A number of standard "RAID levels" are defined:
  - RAID-1: an exact copy of the data on two or more disks.
  - RAID-4: uses striping with a dedicated parity disk.
  - RAID-5: similar to RAID-4, with the parity data distributed across all member disks (see the parity sketch below).
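To make the parity levels concrete, here is a minimal sketch (my illustration, not from the original slides) of RAID-4/5 parity: the parity block is the bytewise XOR of the stripe's data blocks, so any single missing block can be rebuilt by XOR-ing the parity with the surviving blocks.

    # Illustrative sketch: RAID-4/5 stripe parity and single-block reconstruction.
    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def stripe_parity(data_blocks):
        parity = bytes(len(data_blocks[0]))          # all-zero block
        for blk in data_blocks:
            parity = xor_blocks(parity, blk)
        return parity

    blocks = [bytes([i] * 4) for i in (1, 2, 3)]     # three toy 4-byte data blocks
    parity = stripe_parity(blocks)
    # A lost block is the XOR of the parity and the surviving data blocks.
    recovered = xor_blocks(xor_blocks(parity, blocks[0]), blocks[2])
    assert recovered == blocks[1]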

RAID - Examples

[Figure: example layouts of RAID-4, RAID-5 and RAID-1 (mirroring), highlighting a new data block and the corresponding new parity block.]

Storage Trends

- Originally: direct-attached storage that belongs to its computer.
- 1990s: "mainframe" storage servers.
- 2000: separation of control from the actual storage boxes:
  - Control: network-attached storage (NAS) with a file interface, or storage area networks (SAN) with a block interface.
  - Storage boxes: RAID of some type.
- In almost all of these, an entire RAID group is within a single box.

The problem

- Storage devices are becoming cheaper.
- However, highly available single-box storage systems are still expensive.
- Even such systems are susceptible to failures that affect the entire box.

[Figure: a single-box storage system - a RAID controller and its disks.]

Multi-Box RAID

- A single, fault-tolerant controller connected to multiple storage boxes (targets).
- Any given parity group utilizes at most one disk drive from any given box.
- The controller and the disks reside in separate machines; iSCSI may be used in order to send SCSI commands and data.

[Figure: a multi-box storage system.]

Multi-Box RAID (cont.)

- Advantages:
  - There is no single point of storage-box failure.
  - Highly available, expensive storage boxes are no longer needed.
- Disadvantages:
  - Transferring data over a network is not as efficient as using the DMA engine in a single-box RAID system; merely using storage protocols (e.g., iSCSI) over conventional network infrastructure is not enough.
  - Bottleneck in the controller ⇒ poor scalability.
  - Preserving the storage-box capacity (cost effectiveness) may be problematic for the controller.

[Figure: a multi-box storage system.]

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

InfiniBand

- InfiniBand defines a high-speed network for interconnecting processing nodes and I/O nodes (>10 Gbit/s end-to-end).
- InfiniBand supports RDMA (Remote DMA):
  - High speed
  - Low latency
  - Very lean, with no CPU involvement

iSCSI Extensions for RDMA (iSER)

- iSER is an IETF standard that maps iSCSI over a network that provides RDMA services.
- Data is transferred directly into SCSI I/O buffers without intermediate data copies.
- Splits control and data:
  - RDMA is used for data transfer.
  - Sending of control messages is left unchanged.
  - The same physical path may be used for both.

iSCSI over iSER: Read Requests

[Figure: message flow of a SCSI READ request, comparing plain iSCSI (over TCP packets) with iSCSI over iSER (RDMA). In plain iSCSI, the initiator sends the Command Request, the target queues the SCSI READ command and returns the data in a series of Data-in PDUs ending with a final Data-in, followed by the SCSI Response (status and sense) and command completion. In iSCSI over iSER, the target instead delivers the data with RDMA Writes directly into the initiator's buffers and then sends the SCSI Response as a control message.]

iSER + Multi-Box RAID

- iSCSI over iSER solves the problem of inefficient data transfer.
- The separation of control and data is really a protocol separation over the same path.
- The scalability problem remains:
  - All data passes through the controller.
  - When using RAID-4/5, the controller has to perform parity calculations.

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

Removing the Controller from the Data Path – 3rd Party Transfer

- 3rd Party Transfer: one iSCSI entity instructs a 2nd iSCSI entity to read data from, or write data to, a 3rd iSCSI entity.
- Data is transferred directly between hosts and targets under controller command:
  - Lower zero-load latency, especially for large requests - one hop instead of two.
  - The controller's memory, busses and InfiniBand link do not become a bottleneck.
- Out-of-band controllers already exist, but:
  - RDMA makes out-of-band data transfers more transparent.
  - We carry the idea into the RAID.

RDMA and Out-of-Band Controller

- 3rd Party Transfer is more transparent when combined with RDMA:
  - Transparent from the host's point of view.
  - Almost transparent from the target's point of view.
- Adding 3rd Party Transfer to iSCSI over iSER is essential for removing the controller from the data path.

Distributed Parity Calculation

- The controller is not in the data path ⇒ it cannot compute parity, so parity is calculated by the targets.
- Side benefit: relieves another possible controller bottleneck.

Distributed Parity Calculation – a Binary Tree

[Figure: a binary tree of XOR operations over the stripe's blocks (data blocks 0 through N-2 and the parity block). Pairs of blocks are XOR-ed into temporary results, and the temporary results are XOR-ed in turn until the new parity block is produced. A sketch of this reduction follows.]
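A minimal sketch (my illustration, not the authors' code) of the tree-style reduction: blocks are combined pairwise in rounds, so the XOR work can be spread across the participating targets and the number of rounds grows only logarithmically with the number of blocks.

    # Illustrative pairwise (binary-tree) XOR reduction of a stripe's blocks.
    # In TPT-RAID the XORs are performed inside the targets; here each "block"
    # is just a bytes object and the reduction runs locally.
    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def tree_parity(blocks):
        level = list(blocks)
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level) - 1, 2):
                nxt.append(xor_blocks(level[i], level[i + 1]))   # temp result
            if len(level) % 2:                                   # odd block out is
                nxt.append(level[-1])                            # carried to next round
            level = nxt
        return level[0]                                          # new parity block

    blocks = [bytes([b] * 8) for b in (1, 2, 3, 4, 5)]
    flat = blocks[0]
    for blk in blocks[1:]:
        flat = xor_blocks(flat, blk)
    assert tree_parity(blocks) == flat    # same result as a flat XOR chain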

Example: 3rd Party Transfer and Distributed Parity Calculation

- The host sends a command to the RAID controller.
- The RAID controller sends commands to the targets.
- The targets perform RDMA operations to the host.
- The RAID controller sends commands to recalculate the parity block (only for WRITE requests).
- The targets calculate the new parity block:
  - Target-to-target data transfers
  - XOR operation in the receiving target (see the sketch below)

[Figure: host, RAID controller and targets 0-4; CMD arrows from the host to the controller and from the controller to the targets, RDMA arrows between the targets and the host, and parity calculation among the targets.]
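A minimal simulation of this WRITE flow, simplified to a full-stripe write (hypothetical classes and names, not the TPT-RAID code or protocol): the controller only issues commands; the data moves from the host to the targets, and the targets then compute the new parity among themselves, so no user data passes through the controller.

    # Illustrative simulation of a full-stripe WRITE with 3rd Party Transfer
    # and distributed parity calculation. All names are hypothetical.
    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    class Host:
        def __init__(self, blocks):
            self.blocks = blocks                 # new data blocks to be written
        def rdma_read(self, idx):                # a target reads host memory directly
            return self.blocks[idx]

    class Target:
        def __init__(self):
            self.block = None
        def write_from_host(self, host, idx):    # 3rd Party Transfer
            self.block = host.rdma_read(idx)
        def xor_from_peer(self, peer):           # target-to-target transfer + XOR
            self.block = peer.block if self.block is None else xor_blocks(self.block, peer.block)

    class Controller:
        """Sends commands only; no user data passes through it."""
        def full_stripe_write(self, host, data_targets, parity_target):
            for i, tgt in enumerate(data_targets):   # commands to the data targets
                tgt.write_from_host(host, i)
            parity_target.block = None
            for tgt in data_targets:                 # commands for parity recalculation
                parity_target.xor_from_peer(tgt)
            return "GOOD"                            # SCSI response back to the host

    host = Host([bytes([i] * 4) for i in (1, 2, 3, 4)])
    data_targets = [Target() for _ in range(4)]
    parity_target = Target()
    assert Controller().full_stripe_write(host, data_targets, parity_target) == "GOOD"
    assert parity_target.block == bytes([1 ^ 2 ^ 3 ^ 4] * 4)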

Compared Systems

- Baseline RAID: hosts, an in-band controller (Baseline controller), targets; iSCSI over iSER.
- TPT-RAID: hosts, an out-of-band controller (TPT controller), TPT targets; iSCSI over iSER with 3rd Party Transfer and Distributed Parity Calculation.

Amount of Transferred Data (READ)

- Baseline system:
  - Controller: read from the target: 1; write to the host: 1; total: 2 blocks.
  - Targets: write to the controller: 1; total: 1 block.
- TPT system:
  - Controller: no data transfers; total: 0 blocks.
  - Targets: write to the host: 1; total: 1 block.

Amount of Transferred Data (WRITE)

- Baseline system:
  - Controller: read from the host: 1; read old data from the targets: 2; write new data and parity to the targets: 2; total: 5 blocks.
  - Targets: write old data to the controller: 2; read new data and parity from the controller: 2; total: 4 blocks.
- TPT system:
  - Controller: no data transfers; total: 0 blocks.
  - Targets: read new data from the host: 1; parity calculation between targets: 1; total: 2 blocks.
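These counts reflect the usual small-write parity update (my note, not stated on the slide): to rewrite one data block, the old data block and the old parity block are read so that the new parity can be formed as new parity = old parity XOR old data XOR new data. In the baseline system those old blocks must travel through the controller, whereas in TPT-RAID the targets already hold them, so only the new data block and one inter-target parity transfer cross the network.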

RDP with 3rd Party Transfer

- Row-Diagonal Parity (RDP) is an extension to RAID-5 that calculates two sets of parity information:
  - Row parity
  - Diagonal parity
- Can tolerate two failures.

RDP with 3rd Party Transfer (cont.)

- READ commands: similar to RAID-5.
- WRITE commands: more parity calculations are required.
- 3rd Party Transfer and Distributed Parity Calculation relieve the RAID controller bottleneck.

Mirroring with 3rd Party Transfer

- READ commands: similar to RAID-5.
- WRITE commands may be executed in one of two ways (see the sketch below):
  - All targets read the new data directly from the host.
  - A single target reads the new data directly from the host and transfers it to the other targets.
- Using 3rd Party Transfer for mirroring relieves the RAID controller bottleneck.
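A minimal sketch (my illustration, not the authors' code) of the two mirroring WRITE alternatives: every mirror target pulls the new data straight from the host, or one target pulls it and relays it to the others over target-to-target transfers.

    # Illustrative only: mirroring WRITE with 3rd Party Transfer. Each "target"
    # is modelled as a list holding the blocks it has stored.
    def write_fan_out(host_block: bytes, targets):
        # Alternative 1: all targets read the new data directly from the host.
        for tgt in targets:
            tgt.append(host_block)            # stands in for an RDMA read of the host

    def write_relay(host_block: bytes, targets):
        # Alternative 2: a single target reads from the host and forwards the data.
        first, rest = targets[0], targets[1:]
        first.append(host_block)              # RDMA read of the host
        for tgt in rest:
            tgt.append(first[-1])             # target-to-target transfer

    mirrors_a = [[], [], []]
    write_fan_out(b"new-data", mirrors_a)
    mirrors_b = [[], [], []]
    write_relay(b"new-data", mirrors_b)
    assert all(m == [b"new-data"] for m in mirrors_a + mirrors_b)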

Degraded Mode

- When a target fails, the system moves to degraded mode.
- Failure identification is similar to the Baseline system.
- Execution of READ commands is similar to the execution of WRITE commands in normal mode, so the same performance improvement that is achieved for WRITE commands in normal mode is achieved for READ commands in degraded mode.

Required Protocol Changes

- Host: minor change - the host must accept InfiniBand connection requests.
- RAID controller and targets:
  - SCSI: additional commands; no SCSI hardware changes are required.
  - iSCSI: small changes in the login/logout process; an extra field is added to the iSCSI Command PDU.
  - iSER: added and modified iSER primitives.
  - InfiniBand: no changes were made; however, allowing scatter-gather of remote memory handles could have improved performance.

Agenda

- Introduction
- Improving Communication Efficiency
- Relieving the Controller Bottleneck
- Performance

Test Setup

- Hardware:
  - Nodes (all types): Intel dual-Xeon 3.2 GHz
  - Memory disks
  - Mellanox MHEA28-1T (10 Gb/s) InfiniBand HCA
  - Mellanox MTS2400 InfiniBand switch
- Software:
  - Linux SuSE 9.1 Professional (2.6.4-52 kernel)
  - Voltaire InfiniBand host stack
  - Voltaire iSER initiator and target

System Configurations

- Baseline system: host, in-band RAID controller, 5 targets.
- TPT-RAID system: host, TPT RAID controller, 5 TPT targets.
- Both systems use iSER (RDMA) over InfiniBand.

[Figure: test topology - the host, the RAID controller and the targets connected through an InfiniBand switch.]

Scalability

- TPT-RAID (almost) doesn't add work relative to the Baseline:
  - No extra disk (media) operations.
  - No extra XOR operations.
- Added communication to the targets: more commands, more data transfers.
- The extra communication is divided among all targets.

Controller Scalability – RAID-5 (WRITE)

- Assumptions: unlimited number of hosts, unlimited number of targets; InfiniBand BW is not a limiting factor (multiple hosts and targets).

  Req. size   Block size   Max. hosts (Baseline)   Max. hosts (TPT)
  1 MB        32 KB        1 (75%)                 1
  1 MB        64 KB        1 (72%)                 2
  8 MB        32 KB        1 (78%)                 2
  8 MB        64 KB        1 (78%)                 4

Max. Thpt. with One Host – RAID-5 (WRITE)

- Even when only a single host is used, the Baseline controller is the bottleneck!
- Setup: single host, single target set.

Controller Scalability – RDP (WRITE)

- Assumptions: unlimited number of hosts, unlimited number of targets.

  Req. size   Block size   Max. hosts (Baseline)   Max. hosts (TPT)
  1 MB        32 KB        1 (33%)                 1 (70%)
  1 MB        64 KB        1 (33%)                 1 (80%)
  8 MB        32 KB        1 (33%)                 2
  8 MB        64 KB        1 (33%)                 3

Controller Scalability – Mirroring (WRITE)

- Assumptions: unlimited number of hosts, unlimited number of targets.

  Req. size (Blk = 32 KB)   Max. hosts (Baseline)   Max. hosts (TPT)
  256 KB                    1 (50%)                 9
  512 KB                    1 (50%)                 18
  1 MB                      1 (50%)                 36
  8 MB                      1 (50%)                 293

Max. Thpt. with One Host - Mirroring (WRITE)

- Even when a single host is used, the Baseline controller is a bottleneck.
- For TPT, the bottleneck is the host or the targets.
- Setup: single host, single target set.

Degraded Mode

- READ performance in degraded mode is the same as the performance of WRITE commands in normal mode.

Summary

- Multi-box RAID: improved availability and low cost.
- Using a single controller retains simplicity.
- The single-box DMA engine is replaced by RDMA.
- Adding 3rd Party Transfer and Distributed Parity Calculation allows scalability:
  - Can manage a larger system with more activity.
  - For a given workload: larger max. throughput ⇒ shorter waiting times ⇒ lower latency.
- Cost reduction is taken another step while retaining performance and simplicity.

InfiniBand support

- InfiniBand currently allows scattering/gathering of memory.
- Memory registration returns a memory handle.
- Scattering/gathering of memory handles would improve the performance of TPT-RAID dramatically.