Network Processors: Building Block for Programmable Networks
TRANSCRIPT
Page 1 Raj Yavatkar
Network Processors: Building Block for Programmable Networks
Raj Yavatkar, Chief Software Architect
Intel® Internet Exchange Architecture
[email protected]
Page 2
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 3
IXP Network Processors
Microengines
– RISC processors optimized for packet processing
– Hardware support for multi-threading
Embedded StrongARM/XScale
– Runs embedded OS and handles exception tasks
[Figure: IXP block diagram – microengines ME 1 … ME n, a StrongARM control processor, RAM and DRAM interfaces, and a media/fabric interface]
Page 4
IXP: A Building Block for Network Systems
Example: IXP2800
– 16 micro-engines + XScale core
– Up to 1.4 GHz ME speed
– 8 HW threads/ME
– 4K control store per ME
– Multi-level memory hierarchy
– Multiple inter-processor communication channels
NPU vs. GPU tradeoffs
– Reduced core complexity
– No hardware caching
– Simpler instructions, shallow pipelines
– Multiple cores with HW multi-threading per chip
[Figure: IXP2800 block diagram – a multi-threaded (x8) array of 16 MEv2 microengines with per-engine memory, CAM, and signals; Intel® XScale™ core; RDRAM controller; QDR SRAM controller; scratch memory; hash unit; media switch fabric interface; PCI; interconnect]
Page 5
IXP 2400 Block Diagram
Page 6
XScale Core Processor
Compliant with the ARM V5TE architecture
– support for ARM's Thumb instructions
– support for Digital Signal Processing (DSP) enhancements to the instruction set
– Intel's improvements to the internal pipeline to improve the memory-latency hiding abilities of the core
– does not implement the floating-point instructions of the ARM V5 instruction set
Page 7
Microengines – RISC Processors
IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
ME instruction set specifically tuned for processing network data
– arithmetic and logical operations that operate at bit, byte, and long-word levels
– can be combined with shift and rotate operations in single instructions
– integer multiplication provided; no division or FP operations
40-bit x 4K control store
Six-stage instruction pipeline
– an instruction takes one cycle to execute, on average
Each ME has eight hardware-assisted threads of execution
– can be configured to use either all eight threads or only four threads
The non-preemptive hardware thread arbiter swaps between threads in round-robin order
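The non-preemptive, round-robin arbiter described above can be modeled in a few lines (an illustrative Python sketch, not IXP code; thread and function names are invented): each thread runs to completion of its compute phase and voluntarily yields when it issues a memory reference, and the arbiter resumes the next ready thread in round-robin order.

```python
from collections import deque

def thread(tid, work_items):
    """Model of an ME thread: compute, then yield on each memory reference."""
    for item in work_items:
        # ... compute phase runs to completion (never preempted) ...
        yield f"thread {tid} issued memory ref for {item}"

def round_robin_arbiter(threads):
    """Non-preemptive round-robin: swap only when a thread yields."""
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))      # run until the thread yields
            ready.append(t)            # requeue behind the others
        except StopIteration:
            pass                       # thread finished all its work
    return trace

trace = round_robin_arbiter([thread(0, ["pktA", "pktB"]),
                             thread(1, ["pktC"])])
# Threads interleave in round-robin order: thread 0, thread 1, thread 0
```

The key property the sketch captures is that swaps happen only at points the programmer chooses, so a thread's state never changes under it mid-computation.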
Page 8
MicroEngine v2
[Figure: MEv2 internals – control store (4K instructions), two banks of 128 GPRs, 128 S and 128 D transfer registers in and out, 128 next-neighbor registers with to/from next-neighbor paths, local memory (640 words, 2 LM address pointers per context), local CSRs, CRC unit, a 32-bit execution datapath (multiply, find first bit, add/shift/logical, pseudo-random number), a 16-entry CAM with status and 6-bit LRU logic, timers and timestamp, and the D/S push and pull buses]
Page 9
Why Multi-threading?
Page 10
Packet processing using multi-threading within a MicroEngine
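The benefit of multi-threading within a microengine can be quantified with a back-of-the-envelope calculation (illustrative numbers, not from the slides): if a thread computes for C cycles and then stalls for L cycles on a memory reference, roughly ceil(L/C) + 1 threads keep the engine continuously busy.

```python
import math

def threads_to_hide_latency(compute_cycles, memory_latency):
    """Threads needed so one thread is always computing while the
    others wait for their memory references to complete."""
    return math.ceil(memory_latency / compute_cycles) + 1

# Example: ~25 cycles of header processing per ~150-cycle SRAM access
# (the 150-cycle figure appears in the memory table later in the talk):
threads_to_hide_latency(25, 150)   # → 7, within the 8 HW threads per ME
```

This is why eight hardware threads per ME is a reasonable design point for hiding SRAM-class latencies under typical per-packet work.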
Page 11
Registers Available to Each ME
Four different types of registers
– general purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
– also, access to many CSRs
256 32-bit GPRs
– can be accessed in thread-local or absolute mode
256 32-bit SRAM transfer registers
– used to read/write to all functional units on the IXP2xxx except the DRAM
256 32-bit DRAM transfer registers
– divided equally into read-only and write-only
– used exclusively for communication between the MEs and the DRAM
Benefit of having separate transfer registers and GPRs
– an ME can continue processing with GPRs while other functional units read and write the transfer registers
Page 12
Next-Neighbor Registers
Each ME has 128 32-bit next-neighbor registers
– makes data written in these registers available in the next microengine (numerically)
– e.g., if ME 0 writes data into a next-neighbor register, ME 1 can read the data from its next-neighbor register, and so on
In another mode, these registers are used as extra GPRs
– data written into a next-neighbor register is read back by the same microengine
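The neighbor-forwarding mode described above can be sketched as a chain of per-engine register files where each engine's NN registers are written only by its numeric predecessor (a toy Python model with invented names; real NN transfers are single register writes, not method calls):

```python
class Microengine:
    """Toy model: each ME exposes 128 next-neighbor 'registers'
    that only its numerically previous neighbor writes into."""
    def __init__(self, me_id):
        self.me_id = me_id
        self.nn_regs = [0] * 128   # written by ME (me_id - 1)

def nn_write(engines, src_id, reg, value):
    """ME src_id writes an NN register: the data lands in ME src_id + 1."""
    engines[src_id + 1].nn_regs[reg] = value

engines = [Microengine(i) for i in range(4)]
nn_write(engines, 0, 5, 0xDEADBEEF)   # ME 0 writes...
engines[1].nn_regs[5]                 # ...and ME 1 reads it locally
```

The point of the mechanism is that the handoff needs no shared memory: the producing engine writes, and the data is already local to the consumer.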
Page 13
Generalized Thread Signaling
Each ME thread has 15 numbered signals
Most accesses to functional units outside of the ME can cause a signal on any one signal number
The signal number generated for any functional-unit access is under the programmer's control
An ME thread can test for the presence or absence of any of these signals
– used to control branching on signal presence
– or, to specify to the thread arbiter that an ME thread is ready to run only after the signal is received
Benefit of the approach
– software can have multiple outstanding references to the same unit and wait for all of them to complete using different signals
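The "multiple outstanding references" idea can be sketched as a bitmask of completion signals (illustrative Python with invented names; the real mechanism is hardware signal numbers tested by the thread arbiter):

```python
def issue(refs):
    """Issue several memory references, tagging each with a
    distinct signal number (signals are numbered 1..15)."""
    return {sig: ref for sig, ref in enumerate(refs, start=1)}

def wait_all(pending, completed_mask):
    """Thread is ready only when every signal it issued has arrived."""
    want = 0
    for sig in pending:
        want |= 1 << sig
    return completed_mask & want == want

pending = issue(["sram_read A", "sram_read B", "dram_read C"])  # signals 1, 2, 3
wait_all(pending, 0b0110)   # only signals 1 and 2 arrived: keep waiting
wait_all(pending, 0b1110)   # signals 1, 2, and 3 arrived: thread can run
```

Because each reference carries its own signal, the thread can overlap all three accesses and sleep exactly until the last one completes.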
Page 14
Different Types of Memory

Type of memory    Logical width (bytes)   Size in bytes   Approx. unloaded latency (cycles)   Special notes
Local to ME       4                       2560            3                                   Indexed addressing, post incr/decr
On-chip scratch   4                       16K             60                                  Atomic ops; 16 rings w/ atomic get/put
SRAM              4                       256M            150                                 Atomic ops; 64-element q-array
DRAM              8                       2G              300                                 Direct path to/from MSF
Page 15
IXP2800 Features
Half-duplex OC-192 / 10 Gb/sec Ethernet network processor
XScale Core
– 700 MHz (half the ME clock)
– 32 KB instruction cache / 32 KB data cache
Media / Switch Fabric Interface
– 2 x 16-bit LVDS transmit & receive
– configured as CSIX-L2 or SPI-4
PCI Interface
– 64-bit / 66 MHz interface for control
– 3 DMA channels
QDR Interface (w/ parity)
– (4) 36-bit SRAM channels (QDR or co-processor)
– Network Processor Forum LookAside-1 standard interface
– using a "clamshell" topology, both memory and a co-processor can be instantiated on the same channel
RDRAM Interface
– (3) independent Direct Rambus DRAM interfaces
– supports 4i banks or 16 interleaved banks
– supports 16/32-byte bursts
Page 16
Hardware Features to Ease Packet Processing
Ring buffers
– for inter-block communication/synchronization
– producer-consumer paradigm
Next-neighbor registers and signaling
– allows single-cycle transfer of context to the next logical micro-engine, dramatically improving performance
– simple, easy transfer of state
Distributed data caching within each micro-engine
– allows all threads to keep processing even when multiple threads are accessing the same data
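The producer-consumer ring buffers above can be sketched as a fixed-size circular queue (a Python model with invented names; the hardware scratch rings provide the equivalent get/put operations atomically):

```python
class ScratchRing:
    """Toy model of an on-chip ring used for inter-block communication."""
    def __init__(self, size):
        self.buf = [None] * size
        self.head = 0           # consumer index
        self.tail = 0           # producer index
        self.count = 0

    def put(self, handle):
        """Producer block enqueues a packet handle; False if the ring is full."""
        if self.count == len(self.buf):
            return False
        self.buf[self.tail] = handle
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1
        return True

    def get(self):
        """Consumer block dequeues a packet handle; None if the ring is empty."""
        if self.count == 0:
            return None
        handle = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return handle

ring = ScratchRing(4)
ring.put("pkt0"); ring.put("pkt1")
ring.get()    # handles come out in FIFO order between the two blocks
```

Full/empty status is what synchronizes the two blocks: a full ring back-pressures the producer, an empty ring idles the consumer.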
Page 17
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 18
IXA Portability Framework - Goals
Accelerate software development for the IXP family of network processors
Provide a simple and consistent infrastructure to write networking applications
Enable reuse of code across applications written to the framework
Improve portability of code across the IXP family
Provide an infrastructure for third parties to supply code
– for example, to support TCAMs
Page 19
IXA Software Framework
[Figure: layered framework – the XScale™ core (C/C++ language) runs control plane protocol stacks, the Control Plane PDK, core components, the core component library, and the resource manager library; the microengine pipeline (Microengine C language) runs microblocks built on the microblock library, protocol library, utility library, and hardware abstraction library; external processors attach alongside]
Page 20
Software Framework on the MEv2
Microengine C compiler (language)
Optimized data plane libraries
– microcode and MicroC library for commonly used functions
Microblock programming model
– enables development of modular code building blocks
– defines the data flow model, common data structures, state sharing between code blocks, etc.
– ensures consistency and improves reuse across different apps
Core component library
– provides a common way of writing slow-path components that interact with their counterpart fast-path code
Microblocks and example applications written to the microblock programming model
– IPv4/IPv6 forwarding, MPLS, DiffServ, etc.
Page 21
Micro-engine C Compiler
C language constructs
– basic types, pointers, bit fields
In-line assembly code support
Aggregates
– structs, unions, arrays
Intrinsics for specialized ME functions
Different memory models and special constructs for data placement (e.g., __declspec(sdram) struct msg_hdr hd)
Page 22
What Is a Microblock?
Data plane packet processing on the microengines is divided into logical functions called microblocks
Coarse-grained and stateful
Examples
– 5-tuple classification
– IPv4 forwarding
– NAT
Several microblocks running on a microengine thread can be combined into a microblock group
– a microblock group has a dispatch loop that defines the dataflow for packets between microblocks
– a microblock group runs on each thread of one or more microengines
Microblocks can send and receive packets to/from an associated XScale core component
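A dispatch loop of the kind described above might be sketched as follows (a hedged Python model; real dispatch loops are written in Microengine C or microcode, and the block names here are invented for illustration):

```python
def dispatch_loop(get_packet, microblocks):
    """Run each packet through the microblock group until a block sinks it.
    Each microblock returns the name of the next block, or None to stop."""
    while True:
        pkt = get_packet()          # e.g., dequeue a handle from a scratch ring
        if pkt is None:
            return
        block = "classify"          # entry block of this group
        while block is not None:
            block = microblocks[block](pkt)

# Invented example blocks wiring classify -> ipv4_forward -> sink:
blocks = {
    "classify":     lambda pkt: "ipv4_forward" if pkt["proto"] == "ipv4" else None,
    "ipv4_forward": lambda pkt: pkt.update(next_hop=1) or "sink",
    "sink":         lambda pkt: None,   # hand off to the transmit ring
}
pkts = iter([{"proto": "ipv4"}, None])
dispatch_loop(lambda: next(pkts), blocks)
```

Because the dataflow lives in the loop rather than in the blocks, the same microblocks can be rewired into different pipelines for different applications, which is the reuse argument the slides make.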
Page 23
Core Components and Microblocks
[Figure: the XScale™ core runs core components (user-written code and Intel/3rd-party blocks) on top of the core component library, core libraries, and resource manager library; the micro-engines run the paired microblocks on top of the microblock library; each microblock communicates with its corresponding core component]
Page 24
Simplified Packet Flow (IPv6 example)
[Figure: a packet-processing microblock group – Source → Classify (2) → IPv6 (3) → Encap (4) → Sink – receives packet handles from Rx over a scratch ring (or next-neighbor registers) and queues them out the same way; packet buffers live in DRAM; buffer descriptors (offset, size, port), the route table (prefix → next-hop-id, e.g., 3FFF020304 → N), and the next-hop table (interface#, flags, DMAC, …) live in SRAM; the header cache (Ethernet + IPv6 headers) and DL state (dl_buff_handle, dl_next_block) live in local memory; meta-data (offset, size, header type, next-hop-id) is held in GPRs]
Steps in the flow:
a. Rx puts the packet in DRAM
b. Rx puts the descriptor in SRAM
c. Rx queues the handle H on the ring
d. Pull meta-data into GPRs
e. Set DL state in GPRs
f. Set next_blk = Classify
g. Get the headers into the header cache
h. Set the header type to IPv6
i. Set next_blk = IPv6
j. Get the destination address from the header cache
k. Search the route table
l. Set next-hop-id = N
m. Set next_blk = Encap
n. Get the DMAC from next-hop entry N
o. Set the Ethernet header in the header cache
p. Flush the header cache to DRAM
q. Flush the meta-data to SRAM
r. Queue the handle to the outbound ring
Page 25
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 26
What Can I Do with an IXP?
Fully programmable architecture
– implement any packet processing application
Examples from customers
– routing/switching, VPN, DSLAM, multi-service switch, storage, content processing
– intrusion detection (IDS) and RMON
– needs processing of many state elements in parallel
Use as a research platform
– experiment with new algorithms, protocols
Use as a teaching tool
– understand architectural issues
– gain hands-on experience with networking systems
Page 27
Technical and Business Challenges
Technical challenges
– shift from an ASIC-based paradigm to software-based apps
– challenges in programming an NPU (next)
– trade-off between power, board cost, and number of NPUs
– how to add co-processors for additional functions?
Business challenges
– reliance on an outside supplier for the key component
– preserving intellectual property advantages
– add value and differentiation through software algorithms in data plane, control plane, and services plane functionality
– must decrease TTM to be competitive (to NPU or not to NPU?)
Page 28
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 29
Architectural Issues
How to scale up to OC-768 and beyond?
What is the "right" architecture?
– a set of reconfigurable processing engines
– vs. carefully architected pipelined stages
– vs. a set of fixed-function blocks
Questionable hypotheses
– no locality in packet processing?
– temporal vs. spatial locality
– working set size vs. available cache capacity
– little or no dependency in packets from different flows?
Page 30
Challenges in Programming an NP
Distributed, parallel programming model
– multiple microengines, multiple threads
Wide variety of resources
– multiple memory types (latencies, sizes)
– special-purpose engines
– global and local synchronization
Significantly different from the problem seen in scientific computing
Page 31
NPU Programming Challenges
Conventional MP systems
– rely on locality of memory accesses to utilize the memory hierarchy effectively
– programmers program to a single-level memory hierarchy
– compilers are unaware of the memory levels or their performance characteristics
Network systems
– packet processing applications demonstrate little temporal locality
– minimizing memory access latencies is crucial
– compilers should manage the memory hierarchy explicitly: allocate data structures to appropriate memory levels
– allocation depends on data structure sizes, access patterns, sharing requirements, memory system characteristics, …
Programming environments for NPUs and network systems are different from those for conventional multi-processors
Automatic allocation of network system resources: memory. Memory management is more complex in network systems.
Page 32
NPU Challenges - 2
Conventional MP systems
– parallel compilers exploit loop- or function-level parallelism to utilize multiple processors to speed up execution of programs
– operating systems utilize idle processors to execute multiple programs in parallel
Network systems
– individual packet processing is inherently sequential; little loop- or function-level parallelism
– process packets belonging to different flows in parallel
– high-throughput and robustness requirements
– compilers should create efficient packet processing pipelines
– granularity of a pipeline stage depends on instruction cache size, amount of communication between stages, computational complexity of stages, sharing and synchronization requirements, …
Automatic allocation of network system resources: processors
Network applications are explicitly parallel, so concurrency extraction is simpler; but throughput and robustness requirements introduce a new problem of pipeline construction!
Page 33
Challenges (contd.)
How to enable a wide range of network applications
– TCP offload/termination
– how to distribute functionality between SA/XScale, Pentium, and microengines?
– hierarchy of compute vs. I/O capabilities
– how to allow use of multiple IXPs to solve more compute-intensive problems
Networking research
– how to take advantage of a programmable, open architecture?
– designing the "right" algorithms for LPM, range matching, string search, etc.
– QoS-related algorithms – TM4.1, WRED, etc.
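As a concrete instance of the LPM (longest-prefix match) problem mentioned above, a binary-trie lookup can be sketched in a few lines (the generic textbook algorithm, not IXP-specific code; production IXP implementations typically use multi-bit tries tuned to the memory hierarchy):

```python
class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

def insert(root, prefix_bits, next_hop):
    """Install a route given its prefix as a bit string, e.g. '10' for a /2."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Walk the trie, remembering the deepest next-hop seen (longest match)."""
    node, best = root, root.next_hop
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, "10", "hop1")       # short prefix
insert(root, "1011", "hop2")     # longer, more specific prefix
lookup(root, "10110000")         # matches both; the longer prefix ("hop2") wins
```

The memory-hierarchy question from the earlier slides shows up directly here: each trie level is a dependent memory read, so node layout and stride determine how many SRAM/DRAM round trips a lookup costs.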
Page 34
Questions?