Implementing Ultra Low Latency Data Center
Services with Programmable Logic
John W. Lockwood, CEO: Algo-Logic Systems, Inc.
http://Algo-Logic.com • (408) 707-3740 • 2255-D Martin Ave., Santa Clara, CA 95050
© 2015 Algo-Logic Systems Inc., All rights reserved. 2
Why the Move to Programmable Logic?
• Driving Metrics in the Data Center
– Latency:
• Reduce delay
• Avoid jitter
– Throughput
• Processing packets at line rate
• Handle 10G, 25G, 40G, and 100G
– Power:
• Driving cost of OpEx
• Field Programmable Gate Array (FPGA) logic moves into the CPU
• Microsoft accelerates Bing search with FPGA
• Intel acquires Altera
[Figure: Increasing silicon die area over time. 1971: sequential processing (CPU). 1985: micro-optimized sequential processing. 2006: multi-core (multiple CPUs). 2013: vector processing added (CPUs plus GPU). 2016-2019: logic processing added (CPU, FPGA, GPU).]
“There are large challenges in scaling the
performance of software now. The question
is: ‘What’s next?’ We took a bet on
programmable hardware.”
- Doug Burger, Microsoft
Servers Accelerated with FPGA Gateware
• FPGA Augments Existing Servers
– Can run on an expansion card (same size as a GPU)
– Or may be integrated into the CPU socket
• GDN Applications run on FPGA
– Implements low-latency, low-power, high-throughput data processing
[Figure: Accelerated server. A rack server with CPU cores, RAM, HDD, and SSD is augmented by an FPGA with its own RAM and 10G, 40G, and 100GE network ports.]
Example of Low Latency Service: Key/Value Store
• Key/Value Store (KVS)
– Simplifies implementation of large-scale distributed computation algorithms
– Data center servers exchange data over standard Ethernet
• Challenges
– Operating system delays packets and limits throughput
– Per-core processing is inefficient for high-speed packet processing
• Solutions
– Kernel bypass with DPDK
– Offload of packet processing with FPGA
Examples            | Key          | Value                   | Example Key    | Example Value
Directory           | Company      | Phone #                 | Algo-Logic     | (408) 707-3740
Forwarding Tables   | IP Address   | Interface : MAC Address | 204.2.34.5     | Eth6 : 02:33:29:F2:AB:CC
Data De-duplication | Content Hash | Storage Block ID        | XYZ            | 948830038411
Stock Trading       | Order ID     | Symbol, Side, Price     | ATY11217911101 | AAPL, B, 126.75
Graph Search        | Vertex       | Edge List               | v140           | v201, v206, v225
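The example rows above map directly onto a put/get interface. A minimal sketch in Python, using an in-memory dictionary as a stand-in for the store; the `KeyValueStore` class and its `put`/`get` method names are illustrative assumptions, not Algo-Logic's API:

```python
from typing import Any, Dict, Hashable, Optional


class KeyValueStore:
    """Toy in-memory stand-in for a KVS (hypothetical interface)."""

    def __init__(self) -> None:
        self._table: Dict[Hashable, Any] = {}

    def put(self, key: Hashable, value: Any) -> None:
        """Insert or overwrite the value stored under key."""
        self._table[key] = value

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the value for key, or None when the key is absent."""
        return self._table.get(key)


kvs = KeyValueStore()
# Directory: company -> phone number
kvs.put("Algo-Logic", "(408) 707-3740")
# Forwarding table: IP address -> (interface, MAC address)
kvs.put("204.2.34.5", ("Eth6", "02:33:29:F2:AB:CC"))
# Stock trading: order ID -> (symbol, side, price)
kvs.put("ATY11217911101", ("AAPL", "B", 126.75))
# Graph search: vertex -> edge list
kvs.put("v140", ["v201", "v206", "v225"])

print(kvs.get("204.2.34.5"))
```

Every application in the table reduces to the same two operations, which is why a single exact-match lookup engine can serve all of them.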
Mobile Application Servers Need Fast Key/Value Stores
• Scalable backend services to share data
– Sensors { location, bio, movement, .. }
– Social { status, dating, updates, multi-player games .. }
– Media { video/security, audio/music, .. }
– Communication { network status, handoff, short messages .. }
– Database { users, providers, payments, travel, authentication, .. }
• Must be able to scale
– as the number of users grows
– Scale up to provide the best latency, throughput, and power
– Scale out to increase storage capacity, throughput, and redundancy
• Example:
– Mobile location sharing
Case Study: Implementing Uber with KVS
• Uber in 2014
– 162,037 drivers in the US completed 4 or more trips
– New drivers doubled every 6 months for past 2 years
– Number of Uber Users = 8M
– Number of Cities = 290
– Total trips = 140M
– Daily Trips = 1M
• Analysis and Assumptions
– Assuming 25% of the drivers are active
– 25% of 160k drivers = 40k active cars (<48k)
– Drivers update position once per second = 40k IOW
• Implementation
– Uber on an Algo-Logic KVS card
Washington Post, Dec. 2014
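The sizing estimate above is simple arithmetic and can be checked in a few lines. This is a sketch under the slide's own assumptions (160k drivers, 25% active, one update per second); the slide does not say what the "<48k" figure means, so it is treated here as an upper bound the load must stay under:

```python
# Figures taken from the slide; 160k is the slide's rounding of 162,037.
US_DRIVERS = 160_000
ACTIVE_FRACTION = 0.25           # slide's assumption
UPDATES_PER_DRIVER_PER_SEC = 1   # slide's assumption
QUOTED_LIMIT = 48_000            # the "<48k" figure; ASSUMED to be an upper bound


def position_updates_per_second(drivers: int,
                                active_fraction: float,
                                updates_per_driver: int) -> int:
    """Write load generated by active drivers reporting their position."""
    active_cars = round(drivers * active_fraction)
    return active_cars * updates_per_driver


rate = position_updates_per_second(
    US_DRIVERS, ACTIVE_FRACTION, UPDATES_PER_DRIVER_PER_SEC
)
print(f"{rate} position updates per second")
print("under the quoted limit:", rate < QUOTED_LIMIT)
```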
Algo-Logic’s KVS solution for Mobile Applications
Scale UP for FASTER access to shared data
[Figure: An application server with KVS on FPGA.]
Algo-Logic’s KVS solution for Mobile Applications
And SCALE-OUT quickly to increase storage capacity and throughput
[Figure: App servers with KVS on FPGA, connected to multiple 2U systems, each running KVS on FPGA.]
Provisioning and Measurements with GDN-Switch
Linux Software Socket KVS
[Figure: Socket datapath. OCSM packets arrive over 10G Ethernet at an Intel 10G NIC and pass through the kernel driver to the message-processing stage. Legend: data transfer. Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU.]
Algo-Logic KVS with DPDK: Bypass the Kernel
[Figure: DPDK datapath. OCSM packets arrive over 10G Ethernet at an Intel 82598 DPDK-supported NIC and are enqueued on a receive queue into a message buffer. The message-processing stage dequeues each message, accesses the store, and hands off to response generation, which enqueues the reply on the transmit queue. Legend: data transfer, control handoff. Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU.]
Note: Message read once into CPU cache
Algo-Logic GDN-Search: KVS in FPGA
Algo-Logic’s FPGA solution for key/value search, optimized for a large number of key/value search entries
[Figure: FPGA datapath. OCSM packets enter and leave over two 10G Ethernet ports (MAC0 and MAC1). Each port has a packet handler with a request generator (packet parser, OCSM header identifier, key-value extractor), an EMSE2 key-value search, and a response generator (key-value response decoder, OCSM header reconstruct, packet reconstruct) feeding the MAC TX. MAC0 path: 32B OCSM, key 96b, value 96b. MAC1 path: 64B OCSM, key 96b, value 352b. The search engines use on-chip memory and four off-chip memories. Algo-Logic gateware on Nallatech P385 with Altera Stratix V A7 FPGA.]
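The request/response pipeline in the diagram (extract the key, run an exact-match search, rebuild the reply) can be modeled in software to make the data flow concrete. The sketch below is only a behavioral model in Python, not the gateware. The slides give the key and value widths (96 bits each on the 32B path) but not the OCSM wire format, so the 8-byte header, the field order, and the all-zero value returned on a miss are assumptions made for illustration:

```python
import struct
from typing import Dict, Optional

KEY_BYTES = 12     # 96-bit key, per the slide
VALUE_BYTES = 12   # 96-bit value, per the slide (32B OCSM path)
HEADER_BYTES = 8   # ASSUMPTION: the remainder of a 32-byte message

# ASSUMED layout: header, then key, then value, with no padding.
MESSAGE = struct.Struct(f"!{HEADER_BYTES}s{KEY_BYTES}s{VALUE_BYTES}s")


class ExactMatchTable:
    """Software stand-in for an exact-match search engine."""

    def __init__(self) -> None:
        self._entries: Dict[bytes, bytes] = {}

    def insert(self, key: bytes, value: bytes) -> None:
        if len(key) != KEY_BYTES or len(value) != VALUE_BYTES:
            raise ValueError("key and value must be fixed width")
        self._entries[key] = value

    def search(self, key: bytes) -> Optional[bytes]:
        """Return the stored value only when the whole key matches."""
        return self._entries.get(key)


def handle_request(table: ExactMatchTable, message: bytes) -> bytes:
    """Parse a request, look up its key, and rebuild a same-size response."""
    header, key, _ = MESSAGE.unpack(message)
    value = table.search(key)
    if value is None:
        value = bytes(VALUE_BYTES)  # ASSUMPTION: all-zero value signals a miss
    return MESSAGE.pack(header, key, value)


table = ExactMatchTable()
table.insert(b"driver-00042", b"37.35,-121.9")
request = MESSAGE.pack(b"HDR00001", b"driver-00042", bytes(VALUE_BYTES))
response = handle_request(table, request)
print(MESSAGE.unpack(response)[2])
```

Fixed-width fields are what make this stage a good fit for hardware: every message is parsed at known offsets, so no per-message branching on length is needed.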
Trends for Adding Storage around FPGAs
• Six banks of memory controllers
– QDR SRAM
– RLDRAM
– DDR3, DDR4
• 64 lanes of SERDES
– SATA disk and Flash
– Serial memories
• Hybrid Memory Cube (HMC)
• MoSys Bandwidth Engine (BE2, BE3)
• New Memories
– 3D XPoint with DDR4 Interface
– Potential for Terabytes of Memory on each card
Implementation of KVS with Socket I/O, DPDK, and FPGA
• Benchmark same application
– Key/Value Store (KVS)
• Running on the same PC
– Intel i7-4770k CPU, 82598 NIC, and Altera Stratix V A7 FPGA
• With three different implementations
– Socket I/O, DPDK, FPGA
[Figure: The three implementations side by side. Socket I/O: Intel 10G NIC, kernel driver, message processing. DPDK: Intel 82598 DPDK-supported NIC, receive queue, message buffer, message processing, response generation, transmit queue (message read once into CPU cache). FPGA: request generator (packet parser, OCSM header identifier, key/value extractor), Exact Match Search Engine (EMSE), and response generator (key/value search response decoder, OCSM header reconstruct, packet modifier). Software runs on Intel 82598 10GE NIC and Core i7-4770k CPU; gateware runs on Nallatech P385 with Altera Stratix V A7 FPGA.]
KVS Hardware in Data Center Rack
[Figure: Data center rack. A provision controller and an FPGA running GDN-Classify (PHYs, MACs, packet parser, key extractor, associative rule-match CAM, flow or ACL, target queues) connect over 10G and 40G links to KVS in software, KVS in DPDK, KVS in FPGA, additional KVS servers, and a rack of search servers, with UPS power.]
Load Testing of the KVS Implementations
[Figure: GEN: a traffic generator (Intel 82599 10G NIC and two Mellanox 10G NICs) drives the three KVS implementations: kernel driver on an Intel 82598 10G NIC, Intel DPDK EAL on an Intel 82598 10G NIC, and a Nallatech P385 10G card, each followed by a process-message stage.]
Full-Rate Packet Processing (100% of TX Load)
• Software drops packets
• DPDK drops packets
• Stratix V FPGA with EMSE-2 drops no packets
Sockets Power Consumption Profile (10M Packets with 40 CSM Messages/Packet)
DPDK Power Consumption Profile (10M Packets with 40 CSM Messages/Packet)
Latency Measurement with GDN-Classify
• Round-trip latency of the GDN Switch is deterministic (constant)
• Round-trip latency of the KVS Server = Total round-trip time – Round-trip time with 10G loopback on the GDN Switch
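[Figure: The GDN Switch time-stamps each packet on the way out and on the way back (T_out, T_in) across its 40G and 10G Tx/Rx ports, once through the KVS Server and once through a loopback. Measured round trip = switch latency (constant for GDN-Switch) + search latency (different for Software, DPDK, and GDN-Search).]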
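The subtraction described on this slide can be sketched as follows. Because the switch's own round trip is constant, one loopback measurement is removed from every sample. The function names and the sample values are hypothetical and are not measurements from the talk:

```python
from statistics import mean
from typing import Iterable


def kvs_latency_us(total_rtt_us: float, loopback_rtt_us: float) -> float:
    """KVS server latency = total round trip - switch loopback round trip."""
    if total_rtt_us < loopback_rtt_us:
        raise ValueError("total round trip cannot be shorter than the loopback")
    return total_rtt_us - loopback_rtt_us


def mean_kvs_latency_us(total_rtts_us: Iterable[float],
                        loopback_rtt_us: float) -> float:
    """Average server latency over a set of time-stamped round trips."""
    return mean(kvs_latency_us(t, loopback_rtt_us) for t in total_rtts_us)


# Hypothetical samples in microseconds, for illustration only.
loopback = 1.0
samples = [7.0, 7.5, 8.0]
print(mean_kvs_latency_us(samples, loopback))
```

The method only works because the switch latency is deterministic; with a jittery switch, the subtraction would fold the switch's variation into the server's result.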
KVS Latency in FPGA, DPDK, and Sockets
[Figure: Latency Comparison, 100k packets, 1 OCSM per packet, 1k pps. X axis: latency distribution [µs], 0 to 44. Y axis: percentage of observed packets, 0% to 50%. Series: RTL, Sockets, DPDK.]
• Altera Stratix V RTL average: 0.467µs
• Sockets average: 41.40µs
• DPDK average: 6.29µs
[Figure: Socket Implementation Latency Distribution with One OCSM/Packet. X axis: latency distribution [µs], 38 to 48. Y axis: percentage of packets observed [%], 0% to 0.7%. Intel i7 average: 41.54µs.]
• KVS in Software: worst latency, worst jitter
• KVS in DPDK: lowers latency, some jitter
• KVS in FPGA: best latency, no jitter
• Lower latency = faster response
• Tighter spread = less jitter
Measured Latency, Throughput, and Power Results
All Datapaths Summary
Datapath | Latency (µseconds) | Tested Throughput (CSMs/sec) | Power (µJoules/CSM)
Sockets  | 41.54              | 4.0                          | 11
DPDK     | 6.434              | 16                           | 6.6
RTL      | 0.467              | 15                           | 0.52

All Datapaths Summary
Comparison      | Latency (µseconds) | Maximum Throughput (CSMs/sec) | Power (µJoules/CSM)
GDN vs. Sockets | 88x less           | 13x                           | 21x less
GDN vs. DPDK    | 14x less           | 3.2x                          | 13x less
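The latency and power multipliers in the summary can be reproduced from the measured values with a short calculation. The sketch below uses only numbers from the slide. Throughput is left out because the measured values are labeled "tested" throughput while the multipliers refer to "maximum" throughput, so the two cannot be compared directly:

```python
# Measured values from the summary slide.
LATENCY_US = {"Sockets": 41.54, "DPDK": 6.434, "RTL": 0.467}
ENERGY_UJ_PER_CSM = {"Sockets": 11.0, "DPDK": 6.6, "RTL": 0.52}


def improvement(baseline: float, fpga: float) -> float:
    """How many times smaller the FPGA figure is than the baseline."""
    return baseline / fpga


ratios = {}
for name in ("Sockets", "DPDK"):
    latency_x = improvement(LATENCY_US[name], LATENCY_US["RTL"])
    energy_x = improvement(ENERGY_UJ_PER_CSM[name], ENERGY_UJ_PER_CSM["RTL"])
    ratios[name] = (latency_x, energy_x)
    print(f"GDN vs. {name}: latency {latency_x:.1f}x less, "
          f"energy per CSM {energy_x:.1f}x less")
```

The computed ratios land within rounding of the slide's 88x and 14x for latency and 21x and 13x for power.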
Conclusions: Programmable Hardware in the Data Center
• Lowers Latency
– 88x faster than Linux networking sockets
– 14x faster than optimized DPDK (kernel bypass)
• Increases Throughput (IOPS)
– 3x to 13x improvement in throughput
– Lowers Capital Expenditures (CapEx)
• Reduces Power
– 13x to 21x reduction in power
– Reduces Operating Expenditures (OpEx)