Network Processors: Building Block for Programmable Networks
TRANSCRIPT
Page 1 Raj Yavatkar
Network Processors: Building Block for Programmable Networks
Raj Yavatkar, Chief Software Architect
Intel® Internet Exchange Architecture
[email protected]
Page 2
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 3
IXP Network Processors
Microengines
– RISC processors optimized for packet processing
– Hardware support for multi-threading
Embedded StrongARM/XScale
– Runs embedded OS and handles exception tasks
[Figure: IXP block diagram – microengines ME 1 … ME n, a StrongARM control processor, RAM and DRAM interfaces, and a media/fabric interface]
Page 4
IXP: A Building Block for Network Systems
Example: IXP2800
– 16 micro-engines + XScale core
– Up to 1.4 GHz ME speed
– 8 HW threads/ME
– 4K control store per ME
– Multi-level memory hierarchy
– Multiple inter-processor communication channels
NPU vs. GPU tradeoffs
– Reduced core complexity
– No hardware caching
– Simpler instructions, shallow pipelines
– Multiple cores with HW multi-threading per chip
[Figure: IXP2800 block diagram – a multi-threaded (x8) array of 16 MEv2 microengines with per-engine memory, CAM, and signals; Intel® XScale™ core; RDRAM controller; QDR SRAM controller; scratch memory; hash unit; media switch fabric interface; PCI; interconnect]
Page 5
IXP 2400 Block Diagram
Page 6
XScale Core Processor
Compliant with the ARM V5TE architecture
– support for ARM's Thumb instructions
– support for Digital Signal Processing (DSP) enhancements to the instruction set
– Intel's improvements to the internal pipeline to improve the memory-latency hiding abilities of the core
– does not implement the floating-point instructions of the ARM V5 instruction set
Page 7
Microengines – RISC Processors
IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
ME instruction set specifically tuned for processing network data
– arithmetic and logical operations that operate at bit, byte, and long-word levels
– can be combined with shift and rotate operations in single instructions
– integer multiplication provided; no division or FP operations
40-bit x 4K control store
Six-stage instruction pipeline
– an instruction takes one cycle to execute, on average
Each ME has eight hardware-assisted threads of execution
– can be configured to use either all eight threads or only four threads
The non-preemptive hardware thread arbiter swaps between threads in round-robin order
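The non-preemptive, round-robin arbiter described above can be modeled in a few lines (an illustrative Python sketch, not IXP code; thread and function names are invented): each thread runs to completion of its compute phase and voluntarily yields when it issues a memory reference, and the arbiter resumes the next ready thread in round-robin order.

```python
from collections import deque

def thread(tid, work_items):
    """Model of an ME thread: compute, then yield on each memory reference."""
    for item in work_items:
        # ... compute phase runs to completion (never preempted) ...
        yield f"thread {tid} issued memory ref for {item}"

def round_robin_arbiter(threads):
    """Non-preemptive round-robin: swap only when a thread yields."""
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))      # run until the thread yields
            ready.append(t)            # requeue behind the others
        except StopIteration:
            pass                       # thread finished all its work
    return trace

trace = round_robin_arbiter([thread(0, ["pktA", "pktB"]),
                             thread(1, ["pktC"])])
# Threads interleave in round-robin order: thread 0, thread 1, thread 0
```

The key property the sketch captures is that swaps happen only at points the programmer chooses, so a thread's state never changes under it mid-computation.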
Page 8
MicroEngine v2
[Figure: MEv2 internals – control store (4K instructions), two banks of 128 GPRs, 128 S and 128 D transfer registers in and out, 128 next-neighbor registers with to/from next-neighbor paths, local memory (640 words, 2 LM address pointers per context), local CSRs, CRC unit, a 32-bit execution datapath (multiply, find first bit, add/shift/logical, pseudo-random number), a 16-entry CAM with status and 6-bit LRU logic, timers and timestamp, and the D/S push and pull buses]
Page 9
Why Multi-threading?
Page 10
Packet processing using multi-threading within a MicroEngine
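The benefit of multi-threading within a microengine can be quantified with a back-of-the-envelope calculation (illustrative numbers, not from the slides): if a thread computes for C cycles and then stalls for L cycles on a memory reference, roughly ceil(L/C) + 1 threads keep the engine continuously busy.

```python
import math

def threads_to_hide_latency(compute_cycles, memory_latency):
    """Threads needed so one thread is always computing while the
    others wait for their memory references to complete."""
    return math.ceil(memory_latency / compute_cycles) + 1

# Example: ~25 cycles of header processing per ~150-cycle SRAM access
# (the 150-cycle figure appears in the memory table later in the talk):
threads_to_hide_latency(25, 150)   # → 7, within the 8 HW threads per ME
```

This is why eight hardware threads per ME is a reasonable design point for hiding SRAM-class latencies under typical per-packet work.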
Page 11
Registers Available to Each ME
Four different types of registers
– general purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
– also, access to many CSRs
256 32-bit GPRs
– can be accessed in thread-local or absolute mode
256 32-bit SRAM transfer registers
– used to read/write to all functional units on the IXP2xxx except the DRAM
256 32-bit DRAM transfer registers
– divided equally into read-only and write-only
– used exclusively for communication between the MEs and the DRAM
Benefit of having separate transfer registers and GPRs
– an ME can continue processing with GPRs while other functional units read and write the transfer registers
Page 12
Next-Neighbor Registers
Each ME has 128 32-bit next-neighbor registers
– makes data written in these registers available in the next microengine (numerically)
– e.g., if ME 0 writes data into a next-neighbor register, ME 1 can read the data from its next-neighbor register, and so on
In another mode, these registers are used as extra GPRs
– data written into a next-neighbor register is read back by the same microengine
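The neighbor-forwarding mode described above can be sketched as a chain of per-engine register files where each engine's NN registers are written only by its numeric predecessor (a toy Python model with invented names; real NN transfers are single register writes, not method calls):

```python
class Microengine:
    """Toy model: each ME exposes 128 next-neighbor 'registers'
    that only its numerically previous neighbor writes into."""
    def __init__(self, me_id):
        self.me_id = me_id
        self.nn_regs = [0] * 128   # written by ME (me_id - 1)

def nn_write(engines, src_id, reg, value):
    """ME src_id writes an NN register: the data lands in ME src_id + 1."""
    engines[src_id + 1].nn_regs[reg] = value

engines = [Microengine(i) for i in range(4)]
nn_write(engines, 0, 5, 0xDEADBEEF)   # ME 0 writes...
engines[1].nn_regs[5]                 # ...and ME 1 reads it locally
```

The point of the mechanism is that the handoff needs no shared memory: the producing engine writes, and the data is already local to the consumer.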
Page 13
Generalized Thread Signaling
Each ME thread has 15 numbered signals
Most accesses to functional units outside of the ME can cause a signal on any one signal number
The signal number generated for any functional-unit access is under the programmer's control
An ME thread can test for the presence or absence of any of these signals
– used to control branching on signal presence
– or, to specify to the thread arbiter that an ME thread is ready to run only after the signal is received
Benefit of the approach
– software can have multiple outstanding references to the same unit and wait for all of them to complete using different signals
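The "multiple outstanding references" idea can be sketched as a bitmask of completion signals (illustrative Python with invented names; the real mechanism is hardware signal numbers tested by the thread arbiter):

```python
def issue(refs):
    """Issue several memory references, tagging each with a
    distinct signal number (signals are numbered 1..15)."""
    return {sig: ref for sig, ref in enumerate(refs, start=1)}

def wait_all(pending, completed_mask):
    """Thread is ready only when every signal it issued has arrived."""
    want = 0
    for sig in pending:
        want |= 1 << sig
    return completed_mask & want == want

pending = issue(["sram_read A", "sram_read B", "dram_read C"])  # signals 1, 2, 3
wait_all(pending, 0b0110)   # only signals 1 and 2 arrived: keep waiting
wait_all(pending, 0b1110)   # signals 1, 2, and 3 arrived: thread can run
```

Because each reference carries its own signal, the thread can overlap all three accesses and sleep exactly until the last one completes.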
Page 14
Different Types of Memory

Type of memory    Logical width (bytes)   Size in bytes   Approx. unloaded latency (cycles)   Special notes
Local to ME       4                       2560            3                                   Indexed addressing, post incr/decr
On-chip scratch   4                       16K             60                                  Atomic ops; 16 rings w/ atomic get/put
SRAM              4                       256M            150                                 Atomic ops; 64-element q-array
DRAM              8                       2G              300                                 Direct path to/from MSF
Page 15
IXP2800 Features
Half-duplex OC-192 / 10 Gb/sec Ethernet network processor
XScale Core
– 700 MHz (half the ME clock)
– 32 KB instruction cache / 32 KB data cache
Media / Switch Fabric Interface
– 2 x 16-bit LVDS transmit & receive
– configured as CSIX-L2 or SPI-4
PCI Interface
– 64-bit / 66 MHz interface for control
– 3 DMA channels
QDR Interface (w/ parity)
– (4) 36-bit SRAM channels (QDR or co-processor)
– Network Processor Forum LookAside-1 standard interface
– using a "clamshell" topology, both memory and a co-processor can be instantiated on the same channel
RDRAM Interface
– (3) independent Direct Rambus DRAM interfaces
– supports 4i banks or 16 interleaved banks
– supports 16/32-byte bursts
Page 16
Hardware Features to Ease Packet Processing
Ring buffers
– for inter-block communication/synchronization
– producer-consumer paradigm
Next-neighbor registers and signaling
– allows single-cycle transfer of context to the next logical micro-engine, dramatically improving performance
– simple, easy transfer of state
Distributed data caching within each micro-engine
– allows all threads to keep processing even when multiple threads are accessing the same data
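The producer-consumer ring buffers above can be sketched as a fixed-size circular queue (a Python model with invented names; the hardware scratch rings provide the equivalent get/put operations atomically):

```python
class ScratchRing:
    """Toy model of an on-chip ring used for inter-block communication."""
    def __init__(self, size):
        self.buf = [None] * size
        self.head = 0           # consumer index
        self.tail = 0           # producer index
        self.count = 0

    def put(self, handle):
        """Producer block enqueues a packet handle; False if the ring is full."""
        if self.count == len(self.buf):
            return False
        self.buf[self.tail] = handle
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1
        return True

    def get(self):
        """Consumer block dequeues a packet handle; None if the ring is empty."""
        if self.count == 0:
            return None
        handle = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return handle

ring = ScratchRing(4)
ring.put("pkt0"); ring.put("pkt1")
ring.get()    # handles come out in FIFO order between the two blocks
```

Full/empty status is what synchronizes the two blocks: a full ring back-pressures the producer, an empty ring idles the consumer.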
Page 17
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 18
IXA Portability Framework - Goals
Accelerate software development for the IXP family of network processors
Provide a simple and consistent infrastructure to write networking applications
Enable reuse of code across applications written to the framework
Improve portability of code across the IXP family
Provide an infrastructure for third parties to supply code
– for example, to support TCAMs
Page 19
IXA Software Framework
[Figure: layered framework – the XScale™ core (C/C++ language) runs control plane protocol stacks, the Control Plane PDK, core components, the core component library, and the resource manager library; the microengine pipeline (Microengine C language) runs microblocks built on the microblock library, protocol library, utility library, and hardware abstraction library; external processors attach alongside]
Page 20
Software Framework on the MEv2
Microengine C compiler (language)
Optimized data plane libraries
– microcode and MicroC library for commonly used functions
Microblock programming model
– enables development of modular code building blocks
– defines the data flow model, common data structures, state sharing between code blocks, etc.
– ensures consistency and improves reuse across different apps
Core component library
– provides a common way of writing slow-path components that interact with their counterpart fast-path code
Microblocks and example applications written to the microblock programming model
– IPv4/IPv6 forwarding, MPLS, DiffServ, etc.
Page 21
Micro-engine C Compiler
C language constructs
– basic types, pointers, bit fields
In-line assembly code support
Aggregates
– structs, unions, arrays
Intrinsics for specialized ME functions
Different memory models and special constructs for data placement (e.g., __declspec(sdram) struct msg_hdr hd)
Page 22
What Is a Microblock?
Data plane packet processing on the microengines is divided into logical functions called microblocks
Coarse-grained and stateful
Examples
– 5-tuple classification
– IPv4 forwarding
– NAT
Several microblocks running on a microengine thread can be combined into a microblock group
– a microblock group has a dispatch loop that defines the dataflow for packets between microblocks
– a microblock group runs on each thread of one or more microengines
Microblocks can send and receive packets to/from an associated XScale core component
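A dispatch loop of the kind described above might be sketched as follows (a hedged Python model; real dispatch loops are written in Microengine C or microcode, and the block names here are invented for illustration):

```python
def dispatch_loop(get_packet, microblocks):
    """Run each packet through the microblock group until a block sinks it.
    Each microblock returns the name of the next block, or None to stop."""
    while True:
        pkt = get_packet()          # e.g., dequeue a handle from a scratch ring
        if pkt is None:
            return
        block = "classify"          # entry block of this group
        while block is not None:
            block = microblocks[block](pkt)

# Invented example blocks wiring classify -> ipv4_forward -> sink:
blocks = {
    "classify":     lambda pkt: "ipv4_forward" if pkt["proto"] == "ipv4" else None,
    "ipv4_forward": lambda pkt: pkt.update(next_hop=1) or "sink",
    "sink":         lambda pkt: None,   # hand off to the transmit ring
}
pkts = iter([{"proto": "ipv4"}, None])
dispatch_loop(lambda: next(pkts), blocks)
```

Because the dataflow lives in the loop rather than in the blocks, the same microblocks can be rewired into different pipelines for different applications, which is the reuse argument the slides make.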
Page 23
Core Components and Microblocks
[Figure: the XScale™ core runs core components (user-written code and Intel/3rd-party blocks) on top of the core component library, core libraries, and resource manager library; the micro-engines run the paired microblocks on top of the microblock library; each microblock communicates with its corresponding core component]
Page 24
Simplified Packet Flow (IPv6 example)
[Figure: a packet-processing microblock group – Source → Classify (2) → IPv6 (3) → Encap (4) → Sink – receives packet handles from Rx over a scratch ring (or next-neighbor registers) and queues them out the same way; packet buffers live in DRAM; buffer descriptors (offset, size, port), the route table (prefix → next-hop-id, e.g., 3FFF020304 → N), and the next-hop table (interface#, flags, DMAC, …) live in SRAM; the header cache (Ethernet + IPv6 headers) and DL state (dl_buff_handle, dl_next_block) live in local memory; meta-data (offset, size, header type, next-hop-id) is held in GPRs]
Steps in the flow:
a. Rx puts the packet in DRAM
b. Rx puts the descriptor in SRAM
c. Rx queues the handle H on the ring
d. Pull meta-data into GPRs
e. Set DL state in GPRs
f. Set next_blk = Classify
g. Get the headers into the header cache
h. Set the header type to IPv6
i. Set next_blk = IPv6
j. Get the destination address from the header cache
k. Search the route table
l. Set next-hop-id = N
m. Set next_blk = Encap
n. Get the DMAC from next-hop entry N
o. Set the Ethernet header in the header cache
p. Flush the header cache to DRAM
q. Flush the meta-data to SRAM
r. Queue the handle to the outbound ring
Page 25
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 26
What Can I Do with an IXP?
Fully programmable architecture
– implement any packet processing application
Examples from customers
– routing/switching, VPN, DSLAM, multi-service switch, storage, content processing
– intrusion detection (IDS) and RMON
– needs processing of many state elements in parallel
Use as a research platform
– experiment with new algorithms, protocols
Use as a teaching tool
– understand architectural issues
– gain hands-on experience with networking systems
Page 27
Technical and Business Challenges
Technical challenges
– shift from an ASIC-based paradigm to software-based apps
– challenges in programming an NPU (next)
– trade-off between power, board cost, and number of NPUs
– how to add co-processors for additional functions?
Business challenges
– reliance on an outside supplier for the key component
– preserving intellectual property advantages
– add value and differentiation through software algorithms in data plane, control plane, and services plane functionality
– must decrease TTM to be competitive (to NPU or not to NPU?)
Page 28
Outline
IXP 2xxx hardware architecture
IXA software architecture
Usage questions
Research questions
Page 29
Architectural Issues
How to scale up to OC-768 and beyond?
What is the "right" architecture?
– a set of reconfigurable processing engines
– vs. carefully architected pipelined stages
– vs. a set of fixed-function blocks
Questionable hypotheses
– no locality in packet processing?
– temporal vs. spatial locality
– working set size vs. available cache capacity
– little or no dependency in packets from different flows?
Page 30
Challenges in Programming an NP
Distributed, parallel programming model
– multiple microengines, multiple threads
Wide variety of resources
– multiple memory types (latencies, sizes)
– special-purpose engines
– global and local synchronization
Significantly different from the problem seen in scientific computing
Page 31
NPU Programming Challenges
Conventional MP systems
– rely on locality of memory accesses to utilize the memory hierarchy effectively
– programmers program to a single-level memory hierarchy
– compilers are unaware of the memory levels or their performance characteristics
Network systems
– packet processing applications demonstrate little temporal locality
– minimizing memory access latencies is crucial
– compilers should manage the memory hierarchy explicitly: allocate data structures to appropriate memory levels
– allocation depends on data structure sizes, access patterns, sharing requirements, memory system characteristics, …
Programming environments for NPUs and network systems are different from those for conventional multi-processors
Automatic allocation of network system resources: memory. Memory management is more complex in network systems.
Page 32
NPU Challenges - 2
Conventional MP systems
– parallel compilers exploit loop- or function-level parallelism to utilize multiple processors to speed up execution of programs
– operating systems utilize idle processors to execute multiple programs in parallel
Network systems
– individual packet processing is inherently sequential; little loop- or function-level parallelism
– process packets belonging to different flows in parallel
– high-throughput and robustness requirements
– compilers should create efficient packet processing pipelines
– granularity of a pipeline stage depends on instruction cache size, amount of communication between stages, computational complexity of stages, sharing and synchronization requirements, …
Automatic allocation of network system resources: processors
Network applications are explicitly parallel, so concurrency extraction is simpler; but throughput and robustness requirements introduce a new problem of pipeline construction!
Page 33
Challenges (contd.)
How to enable a wide range of network applications
– TCP offload/termination
– how to distribute functionality between SA/XScale, Pentium, and microengines?
– hierarchy of compute vs. I/O capabilities
– how to allow use of multiple IXPs to solve more compute-intensive problems
Networking research
– how to take advantage of a programmable, open architecture?
– designing the "right" algorithms for LPM, range matching, string search, etc.
– QoS-related algorithms – TM4.1, WRED, etc.
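As a concrete instance of the LPM (longest-prefix match) problem mentioned above, a binary-trie lookup can be sketched in a few lines (the generic textbook algorithm, not IXP-specific code; production IXP implementations typically use multi-bit tries tuned to the memory hierarchy):

```python
class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

def insert(root, prefix_bits, next_hop):
    """Install a route given its prefix as a bit string, e.g. '10' for a /2."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Walk the trie, remembering the deepest next-hop seen (longest match)."""
    node, best = root, root.next_hop
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, "10", "hop1")       # short prefix
insert(root, "1011", "hop2")     # longer, more specific prefix
lookup(root, "10110000")         # matches both; the longer prefix ("hop2") wins
```

The memory-hierarchy question from the earlier slides shows up directly here: each trie level is a dependent memory read, so node layout and stride determine how many SRAM/DRAM round trips a lookup costs.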
Page 34
Questions?