


Hardware Acceleration of Key-Value Stores
Sagar Karandikar, Howard Mao, Albert Ou, Yunsup Lee
Advisor: Krste Asanović

Motivation

- In datacenter applications, the path through CPU/kernel/application accounts for 86% of total request latency
- Goal: Serve popular requests without CPU interruption
- Solution: Hardware key-value store attached to the network interface controller
- Many workloads have an access pattern suitable for a small dedicated cache
- Per a Facebook study, 10% of keys represent 90% of requests
- Most values are relatively small in size (about 1 kB)

Related Work

- A 2013 paper by Lim et al. proposed a system dubbed “Thin Servers with Smart Pipes”, which served memcached GET requests from FPGA hardware.
- However, the FPGA hardware handled GET requests by accessing DRAM, not a local SRAM cache.

Infrastructure

[Figure: Annotated photo of the board, showing the SFP cage (1 GbE SFP, not pictured), DDR3 SDRAM, and the Xilinx XC7Z045 FPGA hosting the Rocket core, accelerator, traffic manager, and DMA engine]

Xilinx ZC706 Evaluation Platform

- Zynq-7000 SoC
- Brocade 1 GbE copper SFP transceiver
- Xilinx Tri-Mode Ethernet MAC
- Xilinx 1000BASE-X PCS/PMA
- 64-bit RISC-V Rocket core (50 MHz)
- Single-issue, in-order, 6-stage pipeline
- ASIC version most nearly comparable to an ARM Cortex-A5

[Figure: Rocket core pipeline diagram with stages PC Generation, I$ Access, ITLB, Instruction Decode, Int.RF, Int.EX, DTLB, D$ Access, and Commit, plus FP.RF and FP.EX1–FP.EX3 for the floating-point path]

- No pre-existing I/O peripherals for the Rocket core
- Built the first RISC-V hardware device: a register-mapped NIC
- Programmed I/O with a custom Linux kernel driver (see the sketch below)
- First telnet/ssh session into a physical RISC-V machine
- Evolved into a DMA-based NIC for performance
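
To make the programmed-I/O path concrete, the following is a minimal sketch of how a driver might push a frame through a register-mapped NIC. The base address, register offsets, and status-bit layout are hypothetical, not the actual device map.

    /* Minimal sketch of programmed I/O to a register-mapped NIC.
     * The base address, register offsets, and status-bit layout are
     * hypothetical; the real device map is not given on the poster. */
    #include <stddef.h>
    #include <stdint.h>

    #define NIC_BASE     0x40000000UL   /* hypothetical MMIO base address      */
    #define NIC_STATUS   0x00           /* bit 0: transmitter ready (assumed)  */
    #define NIC_TX_DATA  0x08           /* payload words are written here      */
    #define NIC_TX_LEN   0x10           /* writing the length starts a send    */

    static inline volatile uint64_t *nic_reg(uintptr_t off)
    {
        return (volatile uint64_t *)(NIC_BASE + off);
    }

    /* Send one frame by pushing it word-by-word through MMIO registers.
     * Every byte crosses the CPU, which is why this path was slow and was
     * later replaced by the DMA-based NIC. */
    static void nic_send_pio(const uint64_t *frame, size_t nwords)
    {
        while (!(*nic_reg(NIC_STATUS) & 1))
            ;                               /* busy-wait until TX is ready */
        for (size_t i = 0; i < nwords; i++)
            *nic_reg(NIC_TX_DATA) = frame[i];
        *nic_reg(NIC_TX_LEN) = nwords * sizeof(uint64_t);
    }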

Software

- Manages which keys and values are stored on the accelerator
- Controls the accelerator through the RoCC co-processor interface, which provides custom instructions for setting keys and values (see the sketch below)
- Responsible for implementing cache replacement policies:
  - Identification of the most popular keys as candidates for offloading
  - Invalidation of stale entries
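
A minimal sketch of how this control software might drive the accelerator is shown below. Only the custom-0 opcode and the R-type RoCC instruction shape follow the standard convention; the funct codes and operand meanings are assumptions for illustration.

    /* Sketch of driving the key-value accelerator over the RoCC interface.
     * The funct codes and operand meanings below are hypothetical
     * placeholders, not the accelerator's real command set. */
    #include <stddef.h>
    #include <stdint.h>

    #define KV_SET_KEY    0   /* assumed: rs1 = key address,   rs2 = key length   */
    #define KV_SET_VALUE  1   /* assumed: rs1 = value address, rs2 = value length */
    #define KV_INVALIDATE 2   /* assumed: rs1 = key address,   rs2 = key length   */

    #define STR_(x) #x
    #define STR(x)  STR_(x)

    /* Issue a custom-0 RoCC instruction with two source registers and no
     * destination register (funct3 = 0b011 marks rs1 and rs2 as in use). */
    #define KV_ROCC(funct7, rs1, rs2)                                        \
        asm volatile (".insn r CUSTOM_0, 0x3, " STR(funct7) ", x0, %0, %1"   \
                      :: "r"(rs1), "r"(rs2) : "memory")

    /* Offload one key/value pair to the accelerator's local store. */
    static inline void kv_offload(const void *key, size_t klen,
                                  const void *val, size_t vlen)
    {
        KV_ROCC(KV_SET_KEY,   (uintptr_t)key, klen);
        KV_ROCC(KV_SET_VALUE, (uintptr_t)val, vlen);
    }

    /* Evict a stale entry so the accelerator stops answering for it. */
    static inline void kv_invalidate(const void *key, size_t klen)
    {
        KV_ROCC(KV_INVALIDATE, (uintptr_t)key, klen);
    }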

System Architecture

[Figure: System block diagrams.
Baseline: a Rocket scalar core with 16 KiB I$ and 32 KiB D$ in the tile; a 2:1 arbiter connects the caches to the L2 coherence agent over TileLink in the uncore, alongside HTIF and the DMA engine; the NIC attaches over AXI, and 512 MiB of SDRAM backs the system.
Enhanced: adds the key-value store accelerator to the tile via the RoCC interface and a second 2:1 arbiter, and interposes the traffic manager between the DMA engine and the NIC.
Accelerator internals: two hashers, a writer, a current-key cache, an all-keys store, key compare, value memory, a memory handler, and a controller, attached to the RoCC command and memory ports through a mux, with key-in/value-out ports facing the traffic manager.
Traffic manager internals: splitter, main buffer, defer buffer, main writer, responder, arbiters, and a controller, sitting between MAC Rx/Tx and DMA Rx/Tx, with key-out/result-in ports to the accelerator.]

- The accelerator accepts a key and computes primary and secondary hash values, which it uses to retrieve the value from its local block RAM.
- The traffic manager, interposed between the NIC and the DMA engine, implements the specialized memcached logic.
- For intercepted memcached GET requests, the traffic manager queries the accelerator and constructs the response packet if the key is found (a software model of this lookup path is sketched below).
- Unhandled frames are forwarded to the DMA engine for processing by the operating system.
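
For intuition, here is a purely software model of that lookup path. The hash functions, table size, and entry layout are invented for the sketch; the real accelerator implements this logic in hardware against its on-chip block RAM.

    /* Software model of the accelerator's two-hash GET path. Table size,
     * entry layout, and hash functions are illustrative only. */
    #include <stdint.h>
    #include <string.h>

    #define KV_SLOTS   1024   /* assumed number of cached entries     */
    #define KV_MAX_KEY   64   /* assumed maximum key length           */
    #define KV_MAX_VAL 1024   /* poster: most values are roughly 1 kB */

    struct kv_entry {
        uint8_t  key[KV_MAX_KEY];
        uint32_t key_len;
        uint8_t  val[KV_MAX_VAL];
        uint32_t val_len;
        uint8_t  valid;
    };

    static struct kv_entry table[KV_SLOTS];

    /* Two independent hash functions standing in for the hardware hashers. */
    static uint32_t hash_primary(const uint8_t *k, uint32_t n)
    {
        uint32_t h = 2166136261u;                     /* FNV-1a */
        for (uint32_t i = 0; i < n; i++) { h ^= k[i]; h *= 16777619u; }
        return h % KV_SLOTS;
    }

    static uint32_t hash_secondary(const uint8_t *k, uint32_t n)
    {
        uint32_t h = 5381u;                           /* djb2 */
        for (uint32_t i = 0; i < n; i++) h = h * 33u + k[i];
        return h % KV_SLOTS;
    }

    /* Probe the primary slot, then the secondary slot; a full key compare
     * guards against collisions. Returns the value length, or -1 when the
     * key is not cached, in which case the traffic manager would forward
     * the frame to the DMA engine for the operating system to handle. */
    static int kv_get(const uint8_t *key, uint32_t klen, uint8_t *out)
    {
        uint32_t idx[2] = { hash_primary(key, klen), hash_secondary(key, klen) };
        for (int i = 0; i < 2; i++) {
            struct kv_entry *e = &table[idx[i]];
            if (e->valid && e->key_len == klen && memcmp(e->key, key, klen) == 0) {
                memcpy(out, e->val, e->val_len);
                return (int)e->val_len;
            }
        }
        return -1;
    }

In the hardware version, both probes and the key compare run against block RAM without involving the CPU, which is what lets the traffic manager answer GET requests on its own.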

DMA Engine

- Performs uncached memory accesses via the TileLink protocol
- Transfers a 512-bit cache block per request
- Front-end/back-end decoupling allows load prefetching to hide memory latency
- Buffer descriptor rings exposed as queues through processor control registers (see the sketch below)
- Provides 250× the query throughput of programmed I/O
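
A minimal sketch of the software side of the transmit descriptor queue, under stated assumptions: the register offsets, base address, and use of plain MMIO accesses are guesses, since the poster only says that the rings are exposed as queues through processor control registers.

    /* Sketch of enqueuing transmit descriptors to the DMA engine. All
     * register offsets, the base address, and the access mechanism
     * (plain MMIO here) are hypothetical. */
    #include <stdint.h>

    #define DMA_BASE      0x40001000UL  /* hypothetical control-register base   */
    #define DMA_TX_ADDR   0x00          /* descriptor field: buffer address     */
    #define DMA_TX_LEN    0x08          /* byte count; write commits (assumed)  */
    #define DMA_TX_SPACE  0x10          /* read: free descriptor slots (assumed)*/
    #define DMA_TX_COMP   0x18          /* read: completion pointer (cptr)      */

    static inline uint64_t dma_read(uintptr_t off)
    {
        return *(volatile uint64_t *)(DMA_BASE + off);
    }

    static inline void dma_write(uintptr_t off, uint64_t v)
    {
        *(volatile uint64_t *)(DMA_BASE + off) = v;
    }

    /* Enqueue one transmit buffer. The engine's front end then fetches the
     * 512-bit cache blocks over uncached TileLink and can prefetch ahead of
     * the back end, hiding memory latency. */
    static int dma_tx_enqueue(uint64_t buf_addr, uint64_t byte_count)
    {
        if (dma_read(DMA_TX_SPACE) == 0)
            return -1;                  /* descriptor ring full; retry later */
        dma_write(DMA_TX_ADDR, buf_addr);
        dma_write(DMA_TX_LEN,  byte_count);
        return 0;
    }

    /* Poll the completion pointer instead of waiting for the IRQ. */
    static inline uint64_t dma_tx_completions(void)
    {
        return dma_read(DMA_TX_COMP);
    }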

[Figure: DMA engine block diagram. The Tx front end fetches data from memory with TileLink acquire/grant/finish transactions while the Tx back end streams the data (with a last? marker) toward the NIC; on the receive side, the Rx front end accepts incoming data and the Rx back end writes it to memory via acquire/grant/finish. Descriptor state (addr, count, rptr/wptr/cptr) lives in CSRs, which also drive the IRQ.]

Floorplan

[Figure: FPGA floorplan of the design]

Utilization

FPGA resource utilization, without and with the accelerator and traffic manager (A+TM):

Resource           w/o A+TM    w/ A+TM
Slice LUTs          17.09%      21.79%
Slice Registers      6.18%       8.01%
Memory              21.65%      63.85%

Latency Evaluation

[Figure: Four latency-percentile plots (latency in µs vs. request percentile, 0–90th), one per key popularity distribution: Uniform, Normal, Pareto, and Facebook ETC. Each plot compares requests served by memcached on the Rocket core against requests served by the accelerator.]

Conclusion

- By moving some of the keys to the accelerator and serving them directly to the NIC from hardware, we gained an order-of-magnitude speedup over memcached software running on the Rocket core (1700 µs response latency in software vs. 150 µs from the accelerator).
- The accelerator serves 40% of keys at this reduced latency for the Facebook ETC workload.
- However, we still have a long way to go before reaching production quality.

Future Work

- Place the DMA engine, traffic manager, and accelerator in faster clock domains with asynchronous FIFOs, rather than being constrained by the core frequency
- Widen I/O interfaces for greater throughput
- Investigate replacing the fixed-function traffic manager with a programmable co-processor (reminiscent of mainframe channel I/O)
- Conduct torture testing for reliability
- Explore opportunities for measuring and improving energy efficiency

University of California, Berkeley · Department of Electrical Engineering & Computer Sciences · ASPIRE Winter 2015 Retreat