TPT-RAID: A High Performance Multi-Box Storage System
Erez Zilber, Yitzhak Birk, Technion
TRANSCRIPT
Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance
Basic Terminology
SCSI (Small Computer System Interface): standard protocol between computers and peripheral devices (mainly storage devices). Developed by the T10 technical committee (ANSI/INCITS). Uses a client-server model.
iSCSI (Internet SCSI): a mapping of SCSI over TCP. The iSCSI client (e.g., a host computer) is called the 'initiator'; the iSCSI server (e.g., a disk box) is called the 'target'.
Basic Terminology (cont.)
RAID (Redundant Array of Inexpensive Disks): uses multiple drives, distributing and/or replicating data among them for performance and fault tolerance. Specifies a number of standard "RAID levels":
RAID-1: an exact copy of the data on two or more disks.
RAID-4: striping with a dedicated parity disk.
RAID-5: similar to RAID-4, with the parity data distributed across all member disks.
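To make the parity mechanism concrete, here is a minimal Python sketch (my illustration, with made-up block contents): the parity block of a RAID-4/5 stripe is the bytewise XOR of its data blocks, and RAID-5 merely rotates which disk holds it.

# Sketch: parity in a RAID-4/5 stripe (illustrative only).
from functools import reduce

def parity(blocks):
    """Bytewise XOR of equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def parity_disk(stripe_no, n_disks, raid5=True):
    """Disk holding the parity: fixed for RAID-4, rotating for RAID-5."""
    return (n_disks - 1 - stripe_no % n_disks) if raid5 else n_disks - 1

stripe = [bytes([b] * 8) for b in (3, 14, 15, 92)]  # 4 data blocks
p = parity(stripe)
assert parity(stripe + [p]) == bytes(8)  # data XOR parity is all zeros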
Storage Trends
Originally: direct-attached storage that belongs to its computer.
1990s: "mainframe" storage servers.
2000: separation of control from the actual storage boxes:
Control: network attached storage (NAS) with a file interface, or storage area networks (SAN) with a block interface.
Storage boxes: RAID of some type.
In almost all of these, an entire RAID group resides within a single box.
The problem
Storage devices are becoming cheaper.
However, highly-available single-box storage systems are still expensive.
Even such systems are susceptible to failures that affect the entire box.
[Figure: a single-box storage system: a RAID controller and its disks.]
Multi-Box RAID
A single, fault-tolerant controller connected to multiple storage boxes (targets).
Any given parity group utilizes at most one disk drive from any given box.
The controller and the disks reside in separate machines; iSCSI may be used to send SCSI commands and data.
[Figure: a multi-box storage system: one controller in front of several storage boxes.]
Multi-Box RAID (cont.)
Advantages:
There is no single point of storage-box failure.
Expensive, highly available storage boxes are no longer needed.
Disadvantages:
Transferring data over a network is not as efficient as using the DMA engine in a single-box RAID system; merely running storage protocols (e.g., iSCSI) over conventional network infrastructure is not enough.
A bottleneck in the controller leads to poor scalability.
Preserving the full capacity of the storage boxes (needed for cost effectiveness) may be problematic for the controller.
[Figure: a multi-box storage system.]
Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance
InfiniBand
InfiniBand defines a high speed network for interconnecting processing nodes and I/O nodes (>10 Gbit/s end-to-end).
InfiniBand supports RDMA (Remote DMA):
High speed
Low latency
Very lean, with no CPU involvement
iSCSI Extensions for RDMA (iSER)
iSER is an IETF standard that maps iSCSI onto a network that provides RDMA services.
Data is transferred directly into the SCSI I/O buffers without intermediate data copies.
iSER splits control and data:
RDMA is used for data transfer.
The sending of control messages is left unchanged.
The same physical path may be used for both.
iSCSI over iSER: Read Requests
[Figure: two READ sequence diagrams between the iSCSI initiator and the iSCSI target. Plain iSCSI: the initiator sends the SCSI Read command, the target queues it and returns the data as a series of Data-in PDUs carried in TCP packets, ending with the SCSI Response (status and sense). iSCSI over iSER: the command and the response travel as Send Control / Control Notify messages, while the target's iSER layer delivers the data ("Put Data") with RDMA Writes directly into the initiator's buffers.]
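The control/data split is easy to model. In this toy Python sketch (class and method names are mine, not the real verbs or iSER API), the READ command's status travels as a small control message, while the payload lands directly in the initiator's pre-registered buffer, standing in for an RDMA Write:

# Toy model of an iSER READ: control via messages, data via "RDMA".
class Initiator:
    def __init__(self, length):
        self.buffer = bytearray(length)  # pre-registered SCSI I/O buffer

class Target:
    def __init__(self, storage):
        self.storage = storage

    def handle_read(self, initiator, offset, length):
        # "RDMA Write": the target places the data directly into the
        # initiator's buffer; no intermediate copy, no initiator CPU work.
        initiator.buffer[:length] = self.storage[offset:offset + length]
        return "SCSI Response: GOOD"     # still a plain control message

tgt = Target(storage=bytes(range(256)))
ini = Initiator(length=16)
status = tgt.handle_read(ini, offset=32, length=16)
assert bytes(ini.buffer) == bytes(range(32, 48)) and "GOOD" in status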
iSER + Multi-Box RAID
iSCSI over iSER solves the problem of inefficient data transfer.
The separation of control and data is really a protocol separation over the same path.
The scalability problem remains:
All data passes through the controller.
When using RAID-4/5, the controller has to perform parity calculations.
Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance
Removing the Controller from the Data Path – 3rd Party Transfer
3rd Party Transfer: one iSCSI entity instructs a second iSCSI entity to read data from, or write data to, a third iSCSI entity.
Data is transferred directly between hosts and targets under controller command:
Lower zero-load latency, especially for large requests: one hop instead of two.
The controller's memory, busses and InfiniBand link do not become a bottleneck.
Out-of-band controllers already exist, but:
RDMA makes out-of-band data transfers more transparent.
We carry the idea into the RAID.
RDMA and Out-of-Band Controller
3rd Party Transfer is more transparent when combined with RDMA:
Transparent from the host's point of view.
Almost transparent from a target's point of view.
Adding 3rd Party Transfer to iSCSI over iSER is essential for removing the controller from the data path.
Distributed Parity Calculation
The controller is not in the data path, so it cannot compute parity; the parity must therefore be computed by the targets themselves.
Side benefit: relieves another possible controller bottleneck.
Distributed Parity Calculation – a Binary Tree
[Figure: a binary XOR tree. Data blocks 0 through N-2 and the parity block feed pairwise XOR operations; the temporary results are XORed again, level by level, until a single new parity block remains.]
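A minimal sketch of the tree reduction (Python; the pairing scheme is the natural one, and the real command exchange between targets is not modeled): blocks are XORed in pairs over parallel rounds, so N inputs need only about log2(N) sequential rounds instead of a chain of N-1 XORs.

# Sketch: distributed parity as a binary XOR tree. In the real system,
# one target of each pair pulls its partner's block (target-to-target
# transfer) and XORs it locally; here the rounds are simulated.
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def tree_parity(blocks):
    level = list(blocks)
    while len(level) > 1:
        nxt = [xor_blocks(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:        # an odd block rides up to the next round
            nxt.append(level[-1])
        level = nxt
    return level[0]

blocks = [bytes([17 * i % 256] * 4) for i in range(6)]
assert tree_parity(blocks) == reduce(xor_blocks, blocks)  # same result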
Example: 3rd Party Transfer and Distributed Parity Calculation
The host sends a command to the RAID controller.
The RAID controller sends commands to the targets.
The targets perform RDMA operations to the host.
The RAID controller sends commands to recalculate the parity block (only for WRITE requests).
The targets calculate the new parity block:
Target-to-target data transfers
XOR operation in the receiving target
[Figure: the host, the RAID controller and targets 0-4; CMD arrows from the controller to the targets, RDMA arrows between the targets and the host, and a parity calculation among the targets.]
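The sketch below strings these steps together as a message trace (Python; the entity names and command vocabulary are mine, not the actual TPT command set). The point to notice is that the controller only ever issues commands; every data movement is host-to-target or target-to-target.

# Trace of a TPT-RAID WRITE: the controller orchestrates but never
# touches the data (illustrative control flow, not the real protocol).
def tpt_write(host, targets, parity_target):
    trace = [f"{host} -> controller: WRITE command"]
    for t in targets:
        trace.append(f"controller -> {t}: fetch your new block from {host}")
        trace.append(f"{t}: RDMA Read of the block directly from {host}")
    trace.append("controller -> targets: recompute parity")
    trace.append(f"targets: pairwise transfers + XOR tree into {parity_target}")
    trace.append(f"controller -> {host}: SCSI Response (GOOD)")
    return trace

print("\n".join(tpt_write("host0", ["tgt0", "tgt1", "tgt2", "tgt3"], "tgt4")))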
Compared Systems
Baseline RAID: hosts, an in-band controller (the Baseline controller), targets; iSCSI over iSER.
TPT-RAID: hosts, an out-of-band controller (the TPT controller), TPT targets; iSCSI over iSER with 3rd Party Transfer and Distributed Parity Calculation.
Amount of Transferred Data (READ)
Baseline system:
Controller: read from the target: 1; write to the host: 1. Total: 2 blocks.
Targets: write to the controller: 1. Total: 1 block.
TPT system:
Controller: no data transfers. Total: 0 blocks.
Targets: write to the host: 1. Total: 1 block.
Amount of Transferred Data (WRITE)
Baseline system:
Controller: read from the host: 1; read old data from the targets: 2; write new data and parity to the targets: 2. Total: 5 blocks.
Targets: write old data to the controller: 2; read new data and parity from the controller: 2. Total: 4 blocks.
TPT system:
Controller: no data transfers. Total: 0 blocks.
Targets: read new data from the host: 1; parity calculation between targets: 1. Total: 2 blocks.
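This bookkeeping is easy to check mechanically. The helper below (Python; it simply encodes the per-block counts listed above for a RAID-5 READ and small WRITE) returns how many blocks cross the controller's link versus the targets' links:

# Per-block transfer counts, reproducing the totals above.
def transfers(op, system):
    if system == "baseline":
        if op == "read":   # target -> controller -> host
            return {"controller": 1 + 1, "targets": 1}
        if op == "write":  # host -> controller; old data/parity up; new down
            return {"controller": 1 + 2 + 2, "targets": 2 + 2}
    if system == "tpt":
        if op == "read":   # target RDMA-writes straight to the host
            return {"controller": 0, "targets": 1}
        if op == "write":  # target reads from host + target-to-target XOR
            return {"controller": 0, "targets": 1 + 1}

assert transfers("write", "baseline") == {"controller": 5, "targets": 4}
assert transfers("write", "tpt") == {"controller": 0, "targets": 2}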
RDP with 3rd Party Transfer
Row-Diagonal Parity (RDP) is an extension to RAID-5 that calculates two sets of parity information:
Row parity
Diagonal parity
It can tolerate two failures.
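A compact sketch of the two parity sets (Python, following the published RDP construction for a prime p; the layout details are my illustration, not code from this work): row parity is the usual XOR across each row, while each diagonal parity block XORs one diagonal running through the data and row-parity columns, with one diagonal deliberately left unstored. With both sets available, two lost columns can be recovered by alternating row and diagonal reconstruction.

# Sketch: RDP's two parity sets for prime p (here p = 5): p-1 data
# columns, one row-parity column, one diagonal-parity disk.
from functools import reduce

def xor(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

p = 5
rows = p - 1
data = [[bytes([16 * r + d] * 4) for d in range(p - 1)] for r in range(rows)]

# Row parity: XOR across each row, exactly as in RAID-4/5.
row_par = [xor(data[r]) for r in range(rows)]

# Diagonal parity: block (r, d), with d spanning the data and row-parity
# columns, lies on diagonal (r + d) mod p; diagonal p-1 is not stored.
array = [data[r] + [row_par[r]] for r in range(rows)]
diag_par = [xor([array[r][d]
                 for r in range(rows) for d in range(p)
                 if (r + d) % p == diag])
            for diag in range(p - 1)]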
RDP with 3rd Party Transfer (cont.)
READ commands: similar to RAID-5.
WRITE commands: more parity calculations are required.
3rd Party Transfer and Distributed Parity Calculation relieve the RAID controller bottleneck.
Mirroring with 3rd Party Transfer
READ commands: similar to RAID-5.
WRITE commands may be executed in one of the following two ways (both sketched below):
All targets read the new data directly from the host.
A single target reads the new data directly from the host and transfers it to the other targets.
Using 3rd Party Transfer for mirroring relieves the RAID controller bottleneck.
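Both options are easy to state as a toy Python trace (names are illustrative). Fan-out reads the host's buffer once per replica; the relay variant reads it once and then moves the data target-to-target, trading host-link load for an extra store-and-forward step.

# Two ways to execute a mirrored WRITE with the controller out of the path.
def write_fanout(host, targets):
    # Every replica RDMA-reads the new data directly from the host.
    return [f"{t}: RDMA Read from {host}" for t in targets]

def write_relay(host, targets):
    # One replica reads from the host, then forwards to the others.
    first, rest = targets[0], targets[1:]
    return ([f"{first}: RDMA Read from {host}"] +
            [f"{first} -> {t}: target-to-target transfer" for t in rest])

print(write_fanout("host0", ["tgt0", "tgt1"]))
print(write_relay("host0", ["tgt0", "tgt1"]))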
Degraded Mode
When a target fails, the system moves to degraded mode.
Failure identification is similar to the Baseline system.
Execution of READ commands is similar to the execution of WRITE commands in normal mode. The same performance improvement that is achieved for WRITE commands in normal mode is therefore achieved for READ commands in degraded mode.
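The reason a degraded READ resembles a WRITE is visible in a few lines (a Python sketch; in the real system the XOR would be distributed over the same tree as above): the missing block is the XOR of every surviving block in its parity group, so the request must fan out to all remaining targets.

# Sketch: rebuilding a block that lived on the failed target.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

stripe = [bytes([b] * 4) for b in (3, 14, 15, 92)]  # data blocks
parity = stripe[0]
for blk in stripe[1:]:
    parity = xor_blocks(parity, blk)

failed = 2                                          # target 2 is down
survivors = [blk for i, blk in enumerate(stripe) if i != failed] + [parity]
rebuilt = survivors[0]
for blk in survivors[1:]:
    rebuilt = xor_blocks(rebuilt, blk)
assert rebuilt == stripe[failed]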
Required Protocol Changes
Host: a minor change: the host must accept InfiniBand connection requests.
RAID controller and targets:
SCSI: additional commands; no SCSI hardware changes are required.
iSCSI: small changes in the login/logout process; an extra field added to the iSCSI Command PDU.
iSER: added and modified iSER primitives.
InfiniBand: no changes were made. However, allowing scatter-gather of remote memory handles could have improved performance.
Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance
Test Setup
Hardware:
Nodes (all types): Intel dual-Xeon 3.2 GHz, memory disks, Mellanox MHEA28-1T (10 Gb/s) InfiniBand HCA
Mellanox MTS2400 InfiniBand switch
Software:
Linux SuSE 9.1 Professional (2.6.4-52 kernel)
Voltaire InfiniBand host stack
Voltaire iSER initiator and target
System Configurations
Baseline system: host, in-band RAID controller, 5 targets.
TPT-RAID system: host, TPT RAID controller, 5 TPT targets.
Both systems use iSER (RDMA) over InfiniBand.
[Figure: the host, the RAID controller and the targets connected through an InfiniBand switch.]
Scalability
TPT-RAID adds (almost) no work relative to the Baseline:
No extra disk (media) operations.
No extra XOR operations.
It does add communication to the targets:
More commands
More data transfers
The extra communication is divided among all targets.
Controller Scalability – RAID-5 (WRITE)
Unlimited number of hosts; unlimited number of targets.
InfiniBand BW is not a limiting factor (multiple hosts and targets).

Req. size   Block size   Max. hosts (Baseline)   Max. hosts (TPT)
1 MB        32 KB        1 (75%)                 1
1 MB        64 KB        1 (72%)                 2
8 MB        32 KB        1 (78%)                 2
8 MB        64 KB        1 (78%)                 4
Max. Thpt. with One Host – RAID-5 (WRITE)
Even when only a single host is used, the Baseline controller is the bottleneck!
[Figure: maximum throughput with a single host and a single target set.]
Controller Scalability – RDP (WRITE)
Unlimited number of hosts; unlimited number of targets.

Req. size   Block size   Max. hosts (Baseline)   Max. hosts (TPT)
1 MB        32 KB        1 (33%)                 1 (70%)
1 MB        64 KB        1 (33%)                 1 (80%)
8 MB        32 KB        1 (33%)                 2
8 MB        64 KB        1 (33%)                 3
Controller Scalability – Mirroring (WRITE)
Unlimited number of hosts; unlimited number of targets.

Req. size (block = 32 KB)   Max. hosts (Baseline)   Max. hosts (TPT)
256 KB                      1 (50%)                 9
512 KB                      1 (50%)                 18
1 MB                        1 (50%)                 36
8 MB                        1 (50%)                 293
Max. Thpt. with One Host - Mirroring (WRITE)
Even when a single host is used, the Baseline controller is a bottleneck.
For TPT, the bottleneck is the host or the targets.
[Figure: maximum throughput with a single host and a single target set.]
Summary
Multi-box RAID: improved availability and low cost.
Using a single controller retains simplicity.
The single-box DMA engine is replaced by RDMA.
Adding 3rd Party Transfer and Distributed Parity Calculation enables scalability:
It can manage a larger system with more activity.
For a given workload: larger maximum throughput, shorter waiting times, lower latency.
Cost reduction is taken another step while retaining performance and simplicity.