


Hardware Acceleration of Key-Value Stores
Sagar Karandikar, Howard Mao, Albert Ou, Yunsup Lee
Advisor: Krste Asanović

Motivation

- In datacenter applications, the path through CPU/kernel/application accounts for 86% of total request latency
- Goal: Serve popular requests without CPU interruption
- Solution: Hardware key-value store attached to the network interface controller
- Many workloads have an access pattern suitable for a small dedicated cache
- Per a Facebook study, 10% of keys represent 90% of requests
- Most values are relatively small in size (about 1 kB)

Related Work

- A 2013 paper by Lim et al. proposed a system dubbed “Thin Servers with Smart Pipes”, which served memcached GET requests from FPGA hardware.
- However, the FPGA hardware handled GET requests by accessing DRAM, not a local SRAM cache.

Infrastructure

[Figure: Annotated photo of the board, showing the SFP cage (1 GbE SFP, not pictured), DDR3 SDRAM, and the Xilinx XC7Z045 FPGA hosting the Rocket core, accelerator, traffic manager, and DMA engine]

Xilinx ZC706 Evaluation Platform

- Zynq-7000 SoC
- Brocade 1 GbE copper SFP transceiver
- Xilinx Tri-Mode Ethernet MAC
- Xilinx 1000BASE-X PCS/PMA
- 64-bit RISC-V Rocket core (50 MHz)
- Single-issue, in-order, 6-stage pipeline
- ASIC version most nearly comparable to an ARM Cortex-A5

[Figure: Rocket core pipeline diagram with stages PC Generation, I$ Access, ITLB, Instruction Decode, Int.RF, Int.EX, DTLB, D$ Access, and Commit, plus FP.RF and FP.EX1–FP.EX3 for the floating-point path]

- No pre-existing I/O peripherals for the Rocket core
- Built the first RISC-V hardware device: a register-mapped NIC
- Programmed I/O with a custom Linux kernel driver (see the sketch below)
- First telnet/ssh session into a physical RISC-V machine
- Evolved into a DMA-based NIC for performance
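
To make the programmed-I/O path concrete, the following is a minimal sketch of how a driver might push a frame through a register-mapped NIC. The base address, register offsets, and status-bit layout are hypothetical, not the actual device map.

    /* Minimal sketch of programmed I/O to a register-mapped NIC.
     * The base address, register offsets, and status-bit layout are
     * hypothetical; the real device map is not given on the poster. */
    #include <stddef.h>
    #include <stdint.h>

    #define NIC_BASE     0x40000000UL   /* hypothetical MMIO base address      */
    #define NIC_STATUS   0x00           /* bit 0: transmitter ready (assumed)  */
    #define NIC_TX_DATA  0x08           /* payload words are written here      */
    #define NIC_TX_LEN   0x10           /* writing the length starts a send    */

    static inline volatile uint64_t *nic_reg(uintptr_t off)
    {
        return (volatile uint64_t *)(NIC_BASE + off);
    }

    /* Send one frame by pushing it word-by-word through MMIO registers.
     * Every byte crosses the CPU, which is why this path was slow and was
     * later replaced by the DMA-based NIC. */
    static void nic_send_pio(const uint64_t *frame, size_t nwords)
    {
        while (!(*nic_reg(NIC_STATUS) & 1))
            ;                               /* busy-wait until TX is ready */
        for (size_t i = 0; i < nwords; i++)
            *nic_reg(NIC_TX_DATA) = frame[i];
        *nic_reg(NIC_TX_LEN) = nwords * sizeof(uint64_t);
    }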

Software

- Manages which keys and values are stored on the accelerator
- Controls the accelerator through the RoCC co-processor interface, which provides custom instructions for setting keys and values (see the sketch below)
- Responsible for implementing cache replacement policies:
  - Identification of the most popular keys as candidates for offloading
  - Invalidation of stale entries
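
A minimal sketch of how this control software might drive the accelerator is shown below. Only the custom-0 opcode and the R-type RoCC instruction shape follow the standard convention; the funct codes and operand meanings are assumptions for illustration.

    /* Sketch of driving the key-value accelerator over the RoCC interface.
     * The funct codes and operand meanings below are hypothetical
     * placeholders, not the accelerator's real command set. */
    #include <stddef.h>
    #include <stdint.h>

    #define KV_SET_KEY    0   /* assumed: rs1 = key address,   rs2 = key length   */
    #define KV_SET_VALUE  1   /* assumed: rs1 = value address, rs2 = value length */
    #define KV_INVALIDATE 2   /* assumed: rs1 = key address,   rs2 = key length   */

    #define STR_(x) #x
    #define STR(x)  STR_(x)

    /* Issue a custom-0 RoCC instruction with two source registers and no
     * destination register (funct3 = 0b011 marks rs1 and rs2 as in use). */
    #define KV_ROCC(funct7, rs1, rs2)                                        \
        asm volatile (".insn r CUSTOM_0, 0x3, " STR(funct7) ", x0, %0, %1"   \
                      :: "r"(rs1), "r"(rs2) : "memory")

    /* Offload one key/value pair to the accelerator's local store. */
    static inline void kv_offload(const void *key, size_t klen,
                                  const void *val, size_t vlen)
    {
        KV_ROCC(KV_SET_KEY,   (uintptr_t)key, klen);
        KV_ROCC(KV_SET_VALUE, (uintptr_t)val, vlen);
    }

    /* Evict a stale entry so the accelerator stops answering for it. */
    static inline void kv_invalidate(const void *key, size_t klen)
    {
        KV_ROCC(KV_INVALIDATE, (uintptr_t)key, klen);
    }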

System Architecture

[Figure: System block diagrams.
Baseline: a Rocket scalar core with 16 KiB I$ and 32 KiB D$ in the tile; a 2:1 arbiter connects the caches to the L2 coherence agent over TileLink in the uncore, alongside HTIF and the DMA engine; the NIC attaches over AXI, and 512 MiB of SDRAM backs the system.
Enhanced: adds the key-value store accelerator to the tile via the RoCC interface and a second 2:1 arbiter, and interposes the traffic manager between the DMA engine and the NIC.
Accelerator internals: two hashers, a writer, a current-key cache, an all-keys store, key compare, value memory, a memory handler, and a controller, attached to the RoCC command and memory ports through a mux, with key-in/value-out ports facing the traffic manager.
Traffic manager internals: splitter, main buffer, defer buffer, main writer, responder, arbiters, and a controller, sitting between MAC Rx/Tx and DMA Rx/Tx, with key-out/result-in ports to the accelerator.]

- The accelerator accepts a key and computes primary and secondary hash values, which it uses to retrieve the value from its local block RAM.
- The traffic manager, interposed between the NIC and the DMA engine, implements the specialized memcached logic.
- For intercepted memcached GET requests, the traffic manager queries the accelerator and constructs the response packet if the key is found (a software model of this lookup path is sketched below).
- Unhandled frames are forwarded to the DMA engine for processing by the operating system.
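
For intuition, here is a purely software model of that lookup path. The hash functions, table size, and entry layout are invented for the sketch; the real accelerator implements this logic in hardware against its on-chip block RAM.

    /* Software model of the accelerator's two-hash GET path. Table size,
     * entry layout, and hash functions are illustrative only. */
    #include <stdint.h>
    #include <string.h>

    #define KV_SLOTS   1024   /* assumed number of cached entries     */
    #define KV_MAX_KEY   64   /* assumed maximum key length           */
    #define KV_MAX_VAL 1024   /* poster: most values are roughly 1 kB */

    struct kv_entry {
        uint8_t  key[KV_MAX_KEY];
        uint32_t key_len;
        uint8_t  val[KV_MAX_VAL];
        uint32_t val_len;
        uint8_t  valid;
    };

    static struct kv_entry table[KV_SLOTS];

    /* Two independent hash functions standing in for the hardware hashers. */
    static uint32_t hash_primary(const uint8_t *k, uint32_t n)
    {
        uint32_t h = 2166136261u;                     /* FNV-1a */
        for (uint32_t i = 0; i < n; i++) { h ^= k[i]; h *= 16777619u; }
        return h % KV_SLOTS;
    }

    static uint32_t hash_secondary(const uint8_t *k, uint32_t n)
    {
        uint32_t h = 5381u;                           /* djb2 */
        for (uint32_t i = 0; i < n; i++) h = h * 33u + k[i];
        return h % KV_SLOTS;
    }

    /* Probe the primary slot, then the secondary slot; a full key compare
     * guards against collisions. Returns the value length, or -1 when the
     * key is not cached, in which case the traffic manager would forward
     * the frame to the DMA engine for the operating system to handle. */
    static int kv_get(const uint8_t *key, uint32_t klen, uint8_t *out)
    {
        uint32_t idx[2] = { hash_primary(key, klen), hash_secondary(key, klen) };
        for (int i = 0; i < 2; i++) {
            struct kv_entry *e = &table[idx[i]];
            if (e->valid && e->key_len == klen && memcmp(e->key, key, klen) == 0) {
                memcpy(out, e->val, e->val_len);
                return (int)e->val_len;
            }
        }
        return -1;
    }

In the hardware version, both probes and the key compare run against block RAM without involving the CPU, which is what lets the traffic manager answer GET requests on its own.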

DMA Engine

- Performs uncached memory accesses via the TileLink protocol
- Transfers a 512-bit cache block per request
- Front-end/back-end decoupling allows load prefetching to hide memory latency
- Buffer descriptor rings exposed as queues through processor control registers (see the sketch below)
- Provides 250× the query throughput of programmed I/O
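
A minimal sketch of the software side of the transmit descriptor queue, under stated assumptions: the register offsets, base address, and use of plain MMIO accesses are guesses, since the poster only says that the rings are exposed as queues through processor control registers.

    /* Sketch of enqueuing transmit descriptors to the DMA engine. All
     * register offsets, the base address, and the access mechanism
     * (plain MMIO here) are hypothetical. */
    #include <stdint.h>

    #define DMA_BASE      0x40001000UL  /* hypothetical control-register base   */
    #define DMA_TX_ADDR   0x00          /* descriptor field: buffer address     */
    #define DMA_TX_LEN    0x08          /* byte count; write commits (assumed)  */
    #define DMA_TX_SPACE  0x10          /* read: free descriptor slots (assumed)*/
    #define DMA_TX_COMP   0x18          /* read: completion pointer (cptr)      */

    static inline uint64_t dma_read(uintptr_t off)
    {
        return *(volatile uint64_t *)(DMA_BASE + off);
    }

    static inline void dma_write(uintptr_t off, uint64_t v)
    {
        *(volatile uint64_t *)(DMA_BASE + off) = v;
    }

    /* Enqueue one transmit buffer. The engine's front end then fetches the
     * 512-bit cache blocks over uncached TileLink and can prefetch ahead of
     * the back end, hiding memory latency. */
    static int dma_tx_enqueue(uint64_t buf_addr, uint64_t byte_count)
    {
        if (dma_read(DMA_TX_SPACE) == 0)
            return -1;                  /* descriptor ring full; retry later */
        dma_write(DMA_TX_ADDR, buf_addr);
        dma_write(DMA_TX_LEN,  byte_count);
        return 0;
    }

    /* Poll the completion pointer instead of waiting for the IRQ. */
    static inline uint64_t dma_tx_completions(void)
    {
        return dma_read(DMA_TX_COMP);
    }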

[Figure: DMA engine block diagram. The Tx front end fetches data from memory with TileLink acquire/grant/finish transactions while the Tx back end streams the data (with a last? marker) toward the NIC; on the receive side, the Rx front end accepts incoming data and the Rx back end writes it to memory via acquire/grant/finish. Descriptor state (addr, count, rptr/wptr/cptr) lives in CSRs, which also drive the IRQ.]

Floorplan

[Figure: FPGA floorplan of the design]

Utilization

FPGA resource utilization, without and with the accelerator and traffic manager (A+TM):

Resource           w/o A+TM    w/ A+TM
Slice LUTs          17.09%      21.79%
Slice Registers      6.18%       8.01%
Memory              21.65%      63.85%

Latency Evaluation

[Figure: Four latency-percentile plots (latency in µs vs. request percentile, 0–90th), one per key popularity distribution: Uniform, Normal, Pareto, and Facebook ETC. Each plot compares requests served by memcached on the Rocket core against requests served by the accelerator.]

Conclusion

- By moving some of the keys to the accelerator and serving them directly to the NIC from hardware, we gained an order-of-magnitude speedup over memcached software running on the Rocket core (1700 µs response latency in software vs. 150 µs from the accelerator).
- The accelerator serves 40% of keys at this reduced latency for the Facebook ETC workload.
- However, we still have a long way to go before reaching production quality.

Future Work

- Place the DMA engine, traffic manager, and accelerator in faster clock domains with asynchronous FIFOs, rather than being constrained by the core frequency
- Widen I/O interfaces for greater throughput
- Investigate replacing the fixed-function traffic manager with a programmable co-processor (reminiscent of mainframe channel I/O)
- Conduct torture testing for reliability
- Explore opportunities for measuring and improving energy efficiency

University of California, Berkeley · Department of Electrical Engineering & Computer Sciences · ASPIRE Winter 2015 Retreat