TRANSCRIPT
RDMA in Data Centers: Looking Back and Looking Forward
Chuanxiong Guo
ACM SIGCOMM APNet 2017
August 3, 2017
Microsoft Research
The Rise of Cloud Computing
40 Azure regions
Data Centers
• Cloud scale services: IaaS, PaaS, Search, Big Data, Storage, Machine Learning, Deep Learning
• Services are latency sensitive or bandwidth hungry, or both
• Cloud scale services need cloud scale computing and communication infrastructure
Data center networks (DCN)
• Single ownership
• Large scale
• High bisection bandwidth
• Commodity Ethernet switches
• TCP/IP protocol suite
[Figure: Clos topology with spine, leaf, and ToR switch layers; servers grouped into pods and podsets]
But TCP/IP is not doing well
TCP latency
Pingmesh measurement results: 405 µs (P50), 716 µs (P90), 2132 µs (P99). A long latency tail.
TCP processing overhead (40G)
[Figure: sender and receiver, 8 TCP connections over 40G NICs]
An RDMA renaissance story
• Virtual Interface Architecture Spec 1.0: 1997
• InfiniBand Architecture Spec 1.0: 2000, 1.1: 2002, 1.2: 2004, 1.3: 2015
• RoCE: 2010
• RoCEv2: 2014
RDMA
• Remote Direct Memory Access (RDMA): a method of accessing memory on a remote system without interrupting the processing of the CPU(s) on that system
• RDMA offloads packet processing protocols to the NIC
• RDMA in Ethernet based data centers
RoCEv2: RDMA over Commodity Ethernet
• RoCEv2 for Ethernet based data centers
• RoCEv2 encapsulates packets in UDP (see the header sketch below)
• OS kernel is not in the data path
• NIC handles network protocol processing and message DMA
[Figure: software stacks compared. TCP: application (user) → TCP/IP (kernel) → NIC driver → Ethernet NIC. RDMA: application (user) → RDMA verbs → NIC implementing the RDMA transport, IP, and Ethernet layers via DMA, running over a lossless network]
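To make the UDP encapsulation concrete, here is a minimal C sketch of the RoCEv2 wire format. The BTH layout follows the public InfiniBand/RoCEv2 specs; the struct and field names are mine, for illustration only.

    #include <stdint.h>

    /* InfiniBand Base Transport Header (BTH), 12 bytes. In RoCEv2 it rides
     * inside an ordinary UDP datagram, so commodity switches see plain IP/UDP. */
    struct roce_bth {
        uint8_t  opcode;    /* SEND, RDMA WRITE, RDMA READ, ACK, ...        */
        uint8_t  flags;     /* solicited event, migration, pad count, tver  */
        uint16_t pkey;      /* partition key                                */
        uint32_t qp;        /* 8 reserved bits + 24-bit destination QP      */
        uint32_t psn;       /* ack-request bit + 24-bit packet sequence no. */
    };

    /* On the wire:
     *   Ethernet | IP | UDP (dst port 4791) | BTH | payload | ICRC
     * The UDP source port can vary per flow, which lets ECMP spread
     * RDMA traffic across paths. */
    #define ROCEV2_UDP_DST_PORT 4791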
RDMA benefit: latency reduction
• For small messages (<32KB), OS processing latency matters
• For large messages (100KB+), speed matters
RDMA benefit: CPU overhead reduction
[Figure: sender and receiver, one ND connection over 40G NICs, 37 Gb/s goodput]
• RDMA: single QP, 88 Gb/s, 1.7% CPU
• TCP: eight connections, 30-50 Gb/s, client: 2.6% CPU, server: 4.3% CPU
Test machines: Intel(R) Xeon(R) [email protected], two sockets, 28 cores
RoCEv2 needs a lossless Ethernet network
• PFC for hop-by-hop flow control
• DCQCN for connection-level congestion control
Priority-based flow control (PFC)
• Hop-by-hop flow control, with eight priorities for HOL blocking mitigation
• The priority of a data packet is carried in the VLAN tag or DSCP
• A PFC pause frame tells the upstream device to stop sending (see the sketch below)
• PFC causes HOL blocking and collateral damage
[Figure: when an ingress queue for priority p1 reaches the XOFF threshold, the switch sends a PFC pause frame for p1 to the upstream egress port]
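A simplified sketch of the per-priority pause logic just described. The thresholds, cell accounting, and function names (send_pfc_pause, send_pfc_resume) are hypothetical; real switches also reserve headroom for packets already in flight when the pause is sent.

    void send_pfc_pause(int prio);    /* hypothetical: XOFF frame upstream  */
    void send_pfc_resume(int prio);   /* hypothetical: pause with 0 quanta  */

    struct ingress_port {
        int occupancy[8];             /* buffer cells used, per priority    */
        int paused[8];                /* 1 if upstream is paused for prio   */
    };

    void on_arrival(struct ingress_port *p, int prio, int cells, int xoff) {
        p->occupancy[prio] += cells;
        if (!p->paused[prio] && p->occupancy[prio] >= xoff) {
            send_pfc_pause(prio);     /* upstream stops only priority prio  */
            p->paused[prio] = 1;
        }
    }

    void on_departure(struct ingress_port *p, int prio, int cells, int xon) {
        p->occupancy[prio] -= cells;
        if (p->paused[prio] && p->occupancy[prio] <= xon) {
            send_pfc_resume(prio);    /* queue drained below XON: resume    */
            p->paused[prio] = 0;
        }
    }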
DCQCN
• CP (congestion point): switches use ECN for packet marking
• NP (notification point): the receiver NIC periodically checks whether ECN-marked packets arrived; if so, it notifies the sender
• RP (reaction point): the sender NIC adjusts its sending rate based on NP feedback (a condensed sketch follows)
[Figure: sender NIC (reaction point, RP) → switch (congestion point, CP) → receiver NIC (notification point, NP)]
DCQCN = keep PFC + use ECN + hardware rate-based congestion control
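A condensed sketch of the RP rate update from the published DCQCN algorithm. It merges the fast-recovery, additive-increase, and hyper-increase stages into a single branch for brevity, and the constants here are illustrative, not the deployed values.

    static double rc = 40.0;          /* current rate (Gb/s)                */
    static double rt = 40.0;          /* target rate                        */
    static double alpha = 1.0;        /* congestion estimate in [0, 1]      */
    static const double g = 1.0/256;  /* gain for the alpha moving average  */
    static const double r_ai = 0.05;  /* additive increase step, assumed    */

    void on_cnp(void) {               /* NP reported ECN-marked packets     */
        rt = rc;                      /* remember the rate before the cut   */
        rc = rc * (1.0 - alpha/2.0);  /* multiplicative decrease            */
        alpha = (1.0 - g)*alpha + g;  /* congestion estimate rises          */
    }

    void on_update_timer(void) {      /* no CNP during the last period      */
        alpha = (1.0 - g)*alpha;      /* congestion estimate decays         */
        rt += r_ai;                   /* probe for more bandwidth           */
        rc = (rc + rt) / 2.0;         /* converge toward the target rate    */
    }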
The lossless requirement causes safety and performance challenges
• RDMA transport livelock
• PFC deadlock
• PFC pause frame storm
• Slow-receiver symptom
RDMA transport livelock
[Figure: sender and receiver connected through a switch with a packet drop rate of 1/256. Under go-back-0, a NAK for packet N restarts transmission from RDMA Send 0; under go-back-N, transmission resumes from RDMA Send N]
With even this small loss rate, go-back-0 keeps restarting long transfers from the beginning and never completes them: a livelock. Go-back-N makes steady progress. The toy simulation below illustrates the gap.
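A toy simulation of the two retransmission schemes under the 1/256 drop rate shown above (hypothetical code; it ignores pipelining and NAK timing). For a 4000-packet message, go-back-N finishes with a few percent overhead, while go-back-0 effectively never completes.

    #include <stdio.h>
    #include <stdlib.h>

    #define DROP_ONE_IN 256
    #define GIVE_UP 100000000L        /* treat this many sends as "never"  */

    static long transmit(int msg_pkts, int go_back_zero) {
        long sent = 0;
        int next = 0;                 /* next packet awaiting delivery      */
        while (next < msg_pkts) {
            sent++;
            if (rand() % DROP_ONE_IN == 0)
                next = go_back_zero ? 0 : next;  /* go-back-0 restarts all  */
            else
                next++;
            if (sent > GIVE_UP)
                return -1;            /* livelock: no forward progress      */
        }
        return sent;
    }

    int main(void) {
        srand(1);
        printf("go-back-N: %ld packets sent\n", transmit(4000, 0));
        printf("go-back-0: %ld packets sent\n", transmit(4000, 1)); /* -1 */
        return 0;
    }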
PFC deadlock
• Our data centers use Clos networks; packets first travel up, then go down
• No cyclic buffer dependency for up-down routing → no deadlock
• But we did experience deadlock!
[Figure: Clos topology: spine, leaf, and ToR layers; pods and podsets of servers]
PFC deadlock
• Preliminaries:
  • ARP table: IP address to MAC address mapping
  • MAC table: MAC address to port mapping
  • If the MAC entry is missing, packets are flooded to all ports

ARP table:
| IP  | MAC  | TTL |
| IP0 | MAC0 | 2h  |
| IP1 | MAC1 | 1h  |

MAC table:
| MAC  | Port  | TTL   |
| MAC0 | Port0 | 10min |
| MAC1 | -     | -     |

[Figure: a packet destined to IP1 arrives; the ARP entry exists but the MAC table entry is missing, so the packet is flooded to all output ports]
PFC deadlock
[Figure: deadlock example. Leaves La and Lb, ToRs T0 and T1, servers S1-S5. Paths: {S1,T0,La,T1,S3}, {S1,T0,La,T1,S5}, {S4,T1,Lb,T0,S2}. A dead server causes packet flooding and packet drops; flooded lossless packets congest ingress ports, PFC pause frames (steps 1-4) propagate in a cycle across T0, La, T1, and Lb, and the switches deadlock]
PFC deadlock
• The root cause: the interaction between PFC flow control and Ethernet packet flooding
• Solution: drop the lossless packets if the ARP entry is incomplete (sketched below)
• Recommendation: do not flood or multicast lossless traffic
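A sketch of that forwarding change, with hypothetical helper names; the point is simply that for lossless traffic classes a lookup miss triggers a drop instead of a flood.

    #include <stdint.h>

    struct packet { uint64_t dst_mac; int priority; };

    int  mac_table_lookup(uint64_t mac);     /* hypothetical: port or -1  */
    int  is_lossless_class(int priority);    /* hypothetical              */
    void send_to(int port, struct packet *p);
    void flood_all_ports(struct packet *p);
    void drop(struct packet *p);

    void forward(struct packet *pkt) {
        int port = mac_table_lookup(pkt->dst_mac);
        if (port >= 0) {
            send_to(port, pkt);
        } else if (is_lossless_class(pkt->priority)) {
            drop(pkt);             /* flooded lossless packets are what
                                      creates the cyclic buffer dependency,
                                      so dropping is the safer failure mode */
        } else {
            flood_all_ports(pkt);  /* classic Ethernet behavior, lossy only */
        }
    }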
Tagger: practical PFC deadlock prevention
• The Tagger algorithm works for general network topologies
• Deployable in existing switching ASICs
• Concept: Expected Lossless Path (ELP) decouples Tagger from routing
• Strategy: move packets to a different lossless queue before a cyclic buffer dependency (CBD) forms (a simplified sketch follows)
[Figure: two example topologies (leaves L0-L3, ToRs T0-T3, servers S0 and S1) illustrating expected lossless paths]
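A heavily simplified sketch of the per-hop idea as I understand it from the Tagger design: precomputed match rules keep a packet's tag while it stays on its Expected Lossless Path, and a deviation bumps the tag so the packet moves to the next lossless queue. Since tags only grow along any path, no cyclic buffer dependency can form within a single queue. The rule layout and names here are illustrative, not the paper's actual tables.

    struct tag_rule { int in_port; int in_tag; int out_tag; };

    /* Returns the tag (and thus the lossless queue) for the next hop. */
    int next_tag(const struct tag_rule *rules, int n, int in_port, int tag) {
        for (int i = 0; i < n; i++)
            if (rules[i].in_port == in_port && rules[i].in_tag == tag)
                return rules[i].out_tag;   /* still on the ELP            */
        return tag + 1;                    /* deviated: next lossless queue */
    }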
NIC PFC pause frame storm
• A malfunctioning NIC may block the whole network
• PFC pause frame storms caused several incidents
• Solution: watchdogs at both the NIC and switch sides to stop the storm
[Figure: Clos network with servers 0-7, ToRs, leaf layer, spine layer, and podsets 0 and 1; pause frames from one malfunctioning NIC propagate up through the tree]
The slow-receiver symptom
• ToR to NIC is 40 Gb/s; NIC to server is 64 Gb/s (PCIe Gen3 x8)
• But NICs may generate a large number of PFC pause frames
• Root cause: the NIC is resource constrained; it caches MTT entries, WQEs, and QPCs in limited on-NIC memory, and misses require PCIe round trips to host DRAM
• Mitigations (see the arithmetic below):
  • Large page size for the MTT (memory translation table) entries
  • Dynamic buffer sharing at the ToR
[Figure: server with CPU and DRAM; NIC holding MTT, WQE, and QPC state; 40 Gb/s QSFP to the ToR and PCIe Gen3 x8 (64 Gb/s) to the host; the NIC emits pause frames]
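Back-of-envelope arithmetic for the MTT mitigation, with an assumed cache size (the real on-NIC capacity is not stated in the talk): the memory a NIC can translate without a PCIe round trip is cache entries times page size, so 2MB pages stretch the same cache 512x further than 4KB pages.

    #include <stdio.h>

    int main(void) {
        long long entries = 64 * 1024;                /* assumed MTT cache size */
        long long small = entries * 4096;             /* with 4KB pages         */
        long long large = entries * 2 * 1024 * 1024;  /* with 2MB pages         */
        printf("4KB pages cover %lld MB\n", small / (1024 * 1024));          /* 256 */
        printf("2MB pages cover %lld GB\n", large / (1024LL * 1024 * 1024)); /* 128 */
        return 0;
    }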
Deployment experiences and lessons learned
Latency reduction
• RoCEv2 deployed in Bing world-wide for two and a half years
• Significant latency reduction
• Incast problem solved, since there are no packet drops
RDMA throughput
• Two podsets, each with 500+ servers
• 5 Tb/s capacity between the two podsets
• Achieved 3 Tb/s inter-podset throughput
• Bottlenecked by ECMP routing
• Close to 0 CPU overhead
Latency and throughput tradeoff
• RDMA latencies increase once data shuffling starts
• Low latency vs high throughput
[Figure: two pods (leaves L0, L1; ToRs T0, T1; servers S0,0-S0,23 and S1,0-S1,23); latency (µs) before vs during data shuffling]
Lessons learned
• Providing losslessness is hard! Deadlock, livelock, and PFC pause frame propagation and storms did happen
• Be prepared for the unexpected: configuration management, latency/availability, PFC pause frame and RDMA traffic monitoring
• NICs are the key to making RoCEv2 work
What’s next?
Applications, technologies, architectures, protocols:
• RDMA for X (Search, Storage, HFT, DNN, etc.)
• Lossy vs lossless networks
• Practical, large-scale deadlock-free networks
• RDMA programming
• RDMA for heterogeneous computing systems
• RDMA virtualization
• Reducing collateral damage
• RDMA security
• Software vs hardware
• Inter-DC RDMA
Will software win (again)?
• Historically, software based packet processing won (multiple times)
  • TCP processing overhead analysis by David Clark et al.
  • None of the stateful TCP offloads took off (e.g., TCP Chimney)
• The story is different this time
  • Moore's law is ending
  • Accelerators are coming
  • Network speed keeps increasing
  • Demands for ultra-low latency are real
Is lossless mandatory for RDMA?
• There is no binding between RDMA and lossless networks
• But implementing a more sophisticated transport protocol in hardware is a challenge
RDMA virtualization for container networking
• A router acts as a proxy for the containers
• Shared memory for improved performance
• Zero copy is possible
[Figure: FreeFlow architecture. Containers 1 and 2 (IPs 1.1.1.1 and 2.2.2.2) on Host1 and container 3 (IP 3.3.3.3) on Host2, each with a vNIC, NetAPI, application, and FreeFlowNetLib; a per-host FreeFlowRouter with a control agent, IPC channel, and shared memory space fronts the physical RDMA NIC; a FreeFlow NetOrchestrator coordinates across hosts]
RDMA for DNN
• TCP does not work well for distributed DNN training
• For 16-GPU, 2-host speech training with CNTK, TCP communication dominates the training time (72%); RDMA is much faster (44%)
RDMA Programming
• How many LOC for a "hello world" communication using RDMA?
  • For TCP, it is 60 LOC for client or server code
  • For RDMA, it is complicated…
    • IB Verbs: 600 LOC
    • RDMA CM: 300 LOC
    • Rsocket: 60 LOC (a minimal sketch follows)
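To make the "Rsocket: 60 LOC" point concrete, here is a minimal client sketch using librdmacm's rsockets, which keep the BSD socket API but move data over RDMA. The address and port are placeholders, error handling is trimmed, and you would link with -lrdmacm.

    #include <rdma/rsocket.h>
    #include <arpa/inet.h>
    #include <stdio.h>

    int main(void) {
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);  /* drop-in for socket() */
        if (fd < 0) return 1;
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port   = htons(7471) }; /* placeholder */
        inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);   /* example address */
        if (rconnect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("rconnect");
            return 1;
        }
        rsend(fd, "hello world", 11, 0);            /* data moves via RDMA */
        char buf[64];
        int n = (int)rrecv(fd, buf, sizeof buf, 0);
        if (n > 0) printf("got %.*s\n", n, buf);
        rclose(fd);
        return 0;
    }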
RDMA Programming
• Make RDMA programming more accessible
  • Easy-to-set-up RDMA server and switch configurations
  • Can I run and debug my RDMA code on my desktop/laptop?
  • High quality code samples
• Loosely coupled vs tightly coupled (Send/Recv vs Write/Read); the sketch below contrasts the two
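A sketch of that contrast in libibverbs terms: two-sided Send/Recv is loosely coupled (the receiver posts buffers and is notified on arrival), while one-sided Write is tightly coupled (the sender must already hold the remote address and rkey, and the remote CPU is not involved). Setup of the QP and SGE is assumed done elsewhere.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    void post_two_sided(struct ibv_qp *qp, struct ibv_sge *sge) {
        struct ibv_send_wr wr = { 0 }, *bad;
        wr.opcode = IBV_WR_SEND;         /* receiver must have posted a recv */
        wr.sg_list = sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        ibv_post_send(qp, &wr, &bad);
    }

    void post_one_sided(struct ibv_qp *qp, struct ibv_sge *sge,
                        uint64_t remote_addr, uint32_t rkey) {
        struct ibv_send_wr wr = { 0 }, *bad;
        wr.opcode = IBV_WR_RDMA_WRITE;   /* remote CPU is not involved */
        wr.sg_list = sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;  /* learned out of band   */
        wr.wr.rdma.rkey = rkey;                /* from remote ibv_reg_mr */
        ibv_post_send(qp, &wr, &bad);
    }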
Summary: RDMA for data centers!
• RDMA is experiencing a renaissance in data centers
• RoCEv2 has been running safely in Microsoft data centers for two and a half years
• Many opportunities and interesting problems in high-speed, low-latency RDMA networking
• Many opportunities in making RDMA accessible to more developers
Acknowledgements
• Yan Cai, Gang Cheng, Zhong Deng, Daniel Firestone, Juncheng Gu, Shuihai Hu, Hongqiang Liu, Marina Lipshteyn, Ali Monfared, Jitendra Padhye, Gaurav Soni, Haitao Wu, Jianxi Ye, Yibo Zhu
• Azure, Bing, CNTK, Philly collaborators
• Arista Networks, Cisco, Dell, Mellanox partners
Questions?