Intel Compiler Lab and USTC, PPoPP’08

Scalable Packet Classification Using Interpreting

-- Cross-platform Multi-core Solution

Haipeng Cheng & Bei Hua

Univ. of Science & Technology of China (USTC)

Xinan Tang

Intel Compiler Lab.


Outline

• Background

• Packet Classification Problem

• Review of the RFC Algorithm

• TIC Algorithm

• Experimental Results and Analysis

• Future Work


10GbE: Smaller, Cheaper, and Denser

                     2006     2007     2008     2009
Switch Ports         8-12     20-24    48       96
Servers with 10GbE   10%      30-40%   50-60%   >60%
Port Cost            $2-5K    $1-2K    <$400    <$250


Background (Networking)

• 10 Gbps is too much bandwidth for multi-core computers to handle

• Traffic complexity: triple-play (voice, video, and data) support is essential

• Traffic types: P2P packets account for 70% of total network traffic

• Packet classification becomes increasingly important to identify and control the traffic


Background (Multi-core)

Multi-core becomes prevalent

• Networking (Intel IXP, Cavium Octeon, RMI XLR)

• Multi-media (IBM Cell, Intel Larrabee)

• General-purpose

– Intel Core 2 Duo

– AMD Barcelona

– IBM Power5

– Sun Niagara

Comment: finding an efficient solution for one multi-core architecture is hard; finding a cross-platform solution is even harder


Classification Problem

The process of partitioning packets into “groups” is called packet classification.

• Packet classification typically uses the 5-tuple: source IP, destination IP, source port, destination port, and protocol

• Enables value-added services:

– Security: classify packets based on security policies

– QoS: sort packets and ensure that they receive an appropriate bandwidth share

– P2P management: tame P2P traffic
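
As a baseline for what "classification" means here, the following is a minimal sketch of a priority-ordered linear search over 5-tuple rules. The `Rule` layout, the field names, and the sample rules are hypothetical and are not taken from the slides; the algorithms discussed later exist precisely to avoid this per-rule scan.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional, Tuple

@dataclass
class Rule:
    name: str
    src: str                  # source prefix, e.g. "10.0.0.0/8"
    dst: str                  # destination prefix
    sport: Tuple[int, int]    # inclusive source-port range
    dport: Tuple[int, int]    # inclusive destination-port range
    proto: Optional[int]      # IP protocol number; None means wildcard

def matches(rule, pkt):
    """Return True when the packet's 5-tuple satisfies every field of the rule."""
    src, dst, sport, dport, proto = pkt
    return (ip_address(src) in ip_network(rule.src)
            and ip_address(dst) in ip_network(rule.dst)
            and rule.sport[0] <= sport <= rule.sport[1]
            and rule.dport[0] <= dport <= rule.dport[1]
            and (rule.proto is None or rule.proto == proto))

def classify(rules, pkt):
    """Return the name of the first (highest-priority) matching rule, or None."""
    for rule in rules:
        if matches(rule, pkt):
            return rule.name
    return None

# Hypothetical rule set, highest priority first.
RULES = [
    Rule("web", "0.0.0.0/0", "192.168.1.0/24", (0, 65535), (80, 80), 6),
    Rule("dns", "0.0.0.0/0", "192.168.1.53/32", (0, 65535), (53, 53), 17),
    Rule("default", "0.0.0.0/0", "0.0.0.0/0", (0, 65535), (0, 65535), None),
]
```

The cost of this scan grows with the number of rules, which is why it does not fit a per-packet time budget at line rate.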


Packet Classification Example

Packet (000, 010)

Which rule does the packet match?

How to match?


Why Is Packet Classification Hard?

• Packet classification is NP-hard

• Heuristic algorithms aim for O(1) lookup time

• At 10 Gbps (OC-192) speed, a 64-byte packet needs to be classified within 40 ns

– one DRAM access time

– 100 cycles for a 2.5 GHz CPU
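
The cycle figure follows directly from the time budget: a clock of 1 GHz delivers one cycle per nanosecond, so cycles = nanoseconds x GHz. A minimal sketch of that arithmetic (the function name is illustrative only):

```python
def cycle_budget(time_ns, clock_ghz):
    """Cycles available in time_ns on a core clocked at clock_ghz.

    1 GHz is one cycle per nanosecond, so the product is a cycle count.
    """
    return time_ns * clock_ghz

# The slide's figures: a 40 ns budget on a 2.5 GHz CPU.
print(cycle_budget(40, 2.5))
```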


Packet Classification Solutions

At 10Gbps (OC-192) speed, it is done by

• Special ASIC

• TCAM

• Algorithms (?)

– Hierarchical Tries

– Recursive Flow Classification (RFC)

– Two-stage Interpreter based Classification (TIC)


RFC Example

• Even though the search space is huge, (2^3)*(2^3)*(2^3), the number of rules matched per field for a given packet is limited

• Class bitmap can be used to describe the matched rules:

– 0001 means R4 is the matched rule

– 1101 means R1, R2, and R4 are the ones matched
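
A minimal sketch of how such class bitmaps can be decoded and combined, assuming (as in the two examples above) that the leftmost bit stands for R1. The helper names are illustrative; the bitwise AND reflects how RFC intersects the per-field match sets.

```python
def matched_rules(bitmap):
    """Decode a class bitmap string; the leftmost bit corresponds to R1."""
    return [f"R{i}" for i, bit in enumerate(bitmap, start=1) if bit == "1"]

def combine(*bitmaps):
    """Intersect per-field bitmaps: a rule survives only if every field matched it."""
    width = len(bitmaps[0])
    value = int(bitmaps[0], 2)
    for b in bitmaps[1:]:
        value &= int(b, 2)
    return format(value, f"0{width}b")

print(matched_rules("0001"))                   # ['R4']
print(matched_rules("1101"))                   # ['R1', 'R2', 'R4']
print(matched_rules(combine("1101", "0111")))  # ['R2', 'R4']
```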


RFC Example (figure only)


Recursive Flow Classification

Map an S-bit string concatenated from the d fields of the packet header to a T-bit number through multiple phases (T << S )

Header fields (S = 104 bits in total): S-IP (32b), D-IP (32b), S-Port (16b), D-Port (16b), Proto (8b)
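
The multi-phase mapping can be illustrated on a toy scale. The sketch below is a simplified two-phase reduction over two 3-bit fields, not the full RFC scheme (which splits the 104-bit header into chunks and uses more phases). The rule set is hypothetical. Phase 0 maps each field value to an equivalence-class ID; phase 1 maps each pair of IDs to the final rule.

```python
from itertools import product

BITS = 3  # toy field width, matching the 3-bit fields of the packet example

# Hypothetical rules: (field0 range, field1 range), inclusive; R1 has top priority.
RULES = [
    ((0, 1), (2, 3)),  # R1
    ((0, 7), (2, 2)),  # R2
    ((4, 7), (0, 7)),  # R3
    ((0, 7), (0, 7)),  # R4, matches everything
]

def phase0(field):
    """Map each field value to an equivalence-class ID (eqID).

    Values that match the same set of rules share one eqID.
    """
    ids, bitmaps, table = {}, [], []
    for value in range(1 << BITS):
        bitmap = tuple(r[field][0] <= value <= r[field][1] for r in RULES)
        if bitmap not in ids:
            ids[bitmap] = len(bitmaps)
            bitmaps.append(bitmap)
        table.append(ids[bitmap])
    return table, bitmaps

def phase1(bitmaps0, bitmaps1):
    """Cross-product table: (eqID0, eqID1) -> highest-priority rule number."""
    table = {}
    for (i, x), (j, y) in product(enumerate(bitmaps0), enumerate(bitmaps1)):
        both = [a and b for a, b in zip(x, y)]  # bitmap intersection
        table[(i, j)] = both.index(True) + 1 if any(both) else None
    return table

T0, B0 = phase0(0)
T1, B1 = phase0(1)
FINAL = phase1(B0, B1)

def classify(f0, f1):
    """Three table lookups; no per-rule comparison happens at lookup time."""
    return FINAL[(T0[f0], T1[f1])]

def linear(f0, f1):
    """Reference answer by priority-ordered linear search."""
    for n, ((a, b), (c, d)) in enumerate(RULES, start=1):
        if a <= f0 <= b and c <= f1 <= d:
            return n
    return None
```

All matching work is moved into table construction, which is where the fixed lookup cost comes from; the price is the size of the precomputed tables and the cost of rebuilding them when rules change.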


What’s Wrong with RFC?

• Memory explosion

• Updates are too slow in practice

• However, at 13 memory accesses per lookup, RFC is the fastest classification algorithm


Two-stage Interpreting based Classification

Domain knowledge: divide RFC into two stages:

• Search for the source-destination prefix pair

– 99.9% of the time, no more than 5 rules match a given source-destination prefix pair

• Search the list of port-range expressions

– Expressing range [2..14] as prefixes takes five entries: 001*, 01**, 10**, 110*, 1110

– Range search is instead based on calculation (<, >, =)

– Encode the type of each range expression intelligently

– Evaluate the expressions sequentially
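
The prefix expansion of [2..14] can be reproduced with the usual greedy range-to-prefix split. The sketch below is a generic illustration of that technique (the function name is ours, not from the slides); it shows why a single range check is cheaper than storing several prefixes.

```python
def range_to_prefixes(lo, hi, width):
    """Split the inclusive range [lo, hi] into the minimal list of bit prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo...
        size = lo & -lo if lo else 1 << width
        # ...and that still fits inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        free_bits = size.bit_length() - 1
        fixed = width - free_bits
        bits = format(lo >> free_bits, f"0{fixed}b") if fixed else ""
        prefixes.append(bits + "*" * free_bits)
        lo += size
    return prefixes

def in_range(port, lo, hi):
    """The calculation-based alternative: two comparisons, no expansion."""
    return lo <= port <= hi

print(range_to_prefixes(2, 14, 4))  # ['001*', '01**', '10**', '110*', '1110']
```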


TIC Main Ideas

• L2 cache size is in the range of megabytes

• Network applications are memory intensive

• Memory is best accessed sequentially

– 64-byte cache line size for Core 2 Duo

– 64-byte local memory for IXP

• Can compression be used to optimize performance?

– CISC encoding for smaller memory footprint


Putting Everything Together

• Domain knowledge: two-stage classification

• Architecture features:

– Plenty of CPU cycles

– Large L2 cache

– Block-based sequential access

– Branch prediction can eliminate infrequently executed paths


Port-Range Expressions

• There are five types of range expressions

– WC (wildcard)

– HI ([1024, 65535])

– LO ([0, 1023])

– AR (arbitrary range)

– EM (exact match)

• For (s-port, d-port, proto), there are at least 5 x 5 x 2 = 50 operators
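
A minimal sketch of how a 16-bit port range could be sorted into these five types, and of where the count of 50 comes from. The function name is illustrative, and the two protocol types (wildcard and exact match) are our assumption about the factor of 2; the slide does not spell them out.

```python
from itertools import product

PORT_TYPES = ("WC", "HI", "LO", "AR", "EM")
PROTO_TYPES = ("WC", "EM")  # assumption: protocol is either wildcard or exact

def range_type(lo, hi):
    """Classify an inclusive 16-bit port range into one of the five types."""
    if (lo, hi) == (0, 65535):
        return "WC"
    if (lo, hi) == (1024, 65535):
        return "HI"
    if (lo, hi) == (0, 1023):
        return "LO"
    if lo == hi:
        return "EM"
    return "AR"

# One operator per (s-port type, d-port type, proto type) combination.
OPERATORS = list(product(PORT_TYPES, PORT_TYPES, PROTO_TYPES))
print(len(OPERATORS))  # 50
```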


Characteristics of Range Expressions for Destination Port

Classifier   WC       HI       LO   EM       AR
seed1        30.42%   -        -    57.89%   11.6%
seed2        9.25%    13.96%   -    65.75%   11.04%
seed3        8.56%    12.15%   -    68.08%   11.21%
seed4        30.00%   4.08%    -    60.72%   5.20%
seed5        55.46%   6.52%    -    35.48%   2.53%


Encoding and Interpreting

• Eliminate WC calculation

• Introduce HI and LO operators without storing the constants

– HI ([1024, 65535])

– LO ([0, 1023])

• Store AR and EM parameters in the operand fields

• NOP for code block alignment
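
These encoding rules can be sketched as follows. The opcode numbering, the one-word-per-operand layout, and the 8-word block size are hypothetical choices for illustration; the slides do not give the actual bit-level format.

```python
# Hypothetical opcode numbering.
OP_NOP, OP_HI, OP_LO, OP_EM, OP_AR = range(5)

def encode_port(lo, hi):
    """Encode one inclusive port range as a list of words (opcode, then operands)."""
    if (lo, hi) == (0, 65535):
        return []                  # WC: always true, so nothing is emitted
    if (lo, hi) == (1024, 65535):
        return [OP_HI]             # constant is implied by the opcode
    if (lo, hi) == (0, 1023):
        return [OP_LO]             # constant is implied by the opcode
    if lo == hi:
        return [OP_EM, lo]         # one operand: the exact port
    return [OP_AR, lo, hi]         # two operands: both bounds

def encode_block(ranges, block_words=8):
    """Concatenate the encoded ranges and pad with NOPs to a block boundary."""
    code = []
    for lo, hi in ranges:
        code += encode_port(lo, hi)
    code += [OP_NOP] * (-len(code) % block_words)
    return code

print(encode_block([(0, 65535), (1024, 65535), (80, 80), (6000, 6063)]))
# [1, 3, 80, 4, 6000, 6063, 0, 0]
```

Dropping wildcards and folding the HI and LO constants into the opcode is what shrinks the encoded rule list; the NOP padding keeps each block aligned for sequential fetch.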


Can we afford to increase the number of operators?

• The interpreter is a big switch-case statement.

• The compiler stores the starting address of each case in a jump table.

• The interpreter executes two instructions per iteration:

– load an address into a register from the jump table

– jump to that address using indirect addressing

• IXP –E compiler can optimize switch-case with

– Default Case Removal

– Switch Block Packing
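
The dispatch cost is independent of how many operators exist, which a table-driven interpreter makes visible. The sketch below uses a Python list of handler functions as a stand-in for the compiler-generated jump table; the opcodes, the word layout, and the "first matching expression wins" semantics are assumptions made for illustration.

```python
# Hypothetical opcode numbering; operands follow the opcode, one word each.
OP_NOP, OP_HI, OP_LO, OP_EM, OP_AR = range(5)

# Each handler returns (matched, number_of_operand_words).
def op_nop(port, code, pc): return True, 0
def op_hi(port, code, pc): return port >= 1024, 0
def op_lo(port, code, pc): return port <= 1023, 0
def op_em(port, code, pc): return port == code[pc + 1], 1
def op_ar(port, code, pc): return code[pc + 1] <= port <= code[pc + 2], 2

# Stand-in for the jump table: opcode -> handler.
JUMP_TABLE = [op_nop, op_hi, op_lo, op_em, op_ar]

def first_match(code, port):
    """Evaluate the expressions in order; return the index of the first match."""
    pc, index = 0, 0
    while pc < len(code):
        opcode = code[pc]
        handler = JUMP_TABLE[opcode]                 # step 1: load the target
        matched, operands = handler(port, code, pc)  # step 2: indirect jump
        if opcode != OP_NOP:                         # NOP is padding only
            if matched:
                return index
            index += 1
        pc += 1 + operands
    return None

CODE = [OP_EM, 80, OP_AR, 6000, 6063, OP_HI, OP_NOP, OP_NOP]
print(first_match(CODE, 80), first_match(CODE, 6010), first_match(CODE, 2000))
```

Adding a new operator only adds one table entry and one handler; the per-iteration dispatch stays a table load followed by an indirect jump.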


Experimental Setup

• Intel Xeon 5160 (Core 2 Duo) running at 3.00 GHz with 4 MB L2 cache and a 1333 MHz system bus

• Cycle-accurate IXP2800 simulator; each microengine (ME) runs at 1.2 GHz with 8 threads

• Packet traces are generated with ClassBench; low-locality traces are used to cancel out locality effects


Space Reduction

SIZE   Classifier   #Rules   RFC (MB)   TIC (MB)
2K     DB1          1921     2.46       1.55
       DB2          2020     14.86      2.79
       DB3          2008     11.93      2.48
       DB4          1671     2.82       2.10
       DB5          2012     47.31      2.89
4K     DB6          3461     2.66       1.59
       DB7          3989     39.17      6.80
       DB8          4009     36.81      6.23
       DB9          2925     2.93       2.09
       DB10         3688     82.52      2.97
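
To read the table as reduction factors, the RFC size can be divided by the TIC size for each classifier. The figures below are copied from the table; the ratio itself is our derived quantity and does not appear on the slide.

```python
# (RFC MB, TIC MB) per classifier, copied from the Space Reduction table.
SIZES_MB = {
    "DB1": (2.46, 1.55), "DB2": (14.86, 2.79), "DB3": (11.93, 2.48),
    "DB4": (2.82, 2.10), "DB5": (47.31, 2.89), "DB6": (2.66, 1.59),
    "DB7": (39.17, 6.80), "DB8": (36.81, 6.23), "DB9": (2.93, 2.09),
    "DB10": (82.52, 2.97),
}

# How many times smaller the TIC tables are than the RFC tables.
RATIOS = {name: rfc / tic for name, (rfc, tic) in SIZES_MB.items()}

for name, ratio in RATIOS.items():
    print(f"{name:5s} {ratio:5.1f}x")
```

The spread is wide: the reduction is modest where RFC is already small and largest where the RFC tables are biggest.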


Relative Speedups on Core 2 Duo

Classifier   Algorithm   1 T      2 T      3 T      4 T
DB1          RFC         15.42    21.54    24.31    26.61
             TIC         12.89    20.52    25.28    30.02
             Imp.        -16.4%   -4.7%    3.9%     12.8%
DB2          RFC         10.89    14.59    17.09    20.49
             TIC         11.43    16.08    18.57    21.13
             Imp.        4.9%     10.2%    8.6%     3.1%
DB3          RFC         11.47    15.57    17.96    20.96
             TIC         11.72    16.38    19.73    21.27
             Imp.        2.1%     5.2%     9.8%     1.5%
DB4          RFC         14.84    19.08    22.52    24.43
             TIC         13.06    19.43    22.95    24.87
             Imp.        -12%     1.9%     1.9%     1.8%
DB5          RFC         9.05     12.14    15.19    16.44
             TIC         10.64    14.82    21.63    18.59
             Imp.        17.5%    22%      42.4%    13.1%
Ave.         RFC         12.33    16.58    19.42    21.78
             TIC         11.94    17.44    21.63    23.18
             Imp.        -3.1%    5.2%     11.4%    6.39%
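
The Imp. rows appear to be the relative difference of TIC over RFC, (TIC - RFC) / RFC. That formula is inferred from the numbers rather than stated on the slide; a minimal check against two cells of the table:

```python
def improvement(rfc, tic):
    """Relative improvement of TIC over RFC, in percent (inferred formula)."""
    return (tic - rfc) / rfc * 100

# DB1 with 1 thread and DB5 with 3 threads, values copied from the table.
print(round(improvement(15.42, 12.89), 1))  # -16.4
print(round(improvement(15.19, 21.63), 1))  # 42.4
```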


Speedups on IXP (RFC vs. TIC)

Classifier   Algorithm   1 T      2 T      4 T      8 T      16 T     32 T
DB1          RFC         1.81     3.59     6.91     11.01    21.06    35.29
             TIC         1.38     2.38     4.76     5.72     11.22    20.49
             Imp.        -23.7%   -33.7%   -31.1%   -48.1%   -46.7%   -41.9%
DB2          RFC         1.78     3.59     6.89     10.98    20.61    34.98
             TIC         1.7      3.24     6.25     9.39     18.23    29.98
             Imp.        -4.5%    -9.7%    -9.3%    -14.5%   -11.5%   -14.3%
DB3          RFC         1.77     3.57     6.86     10.89    20.43    35.01
             TIC         1.67     3.2      6.19     9.02     17.99    30.07
             Imp.        -5.6%    -10.4%   -9.8%    -17.2%   -11.9%   -14%
DB4          RFC         1.78     3.58     6.86     10.73    20.75    35.11
             TIC         1.59     3.01     5.79     9.08     17.28    26.9
             Imp.        -10.7%   -15.9%   -15.6%   -15.4%   -16.7%   -23.4%
DB5          RFC         1.75     3.51     6.84     10.9     20.59    35.03
             TIC         1.71     3.23     6.24     9.41     18.16    29.58
             Imp.        -2.3%    -7.9%    -8.8%    -13.7%   -11.8%   -15.6%


Why Is RFC Better than TIC on IXP?

Block size plays an important role in the IXP architecture, since SRAM is optimized for 32-bit access.

       #SRAM Accesses   #Words Accessed
RFC    13               13 W
TIC    7 + 1 = 8        7 + 8 = 15 W


Block Size Impacts on IXP


Future Work

• Improve TIC performance on IXP

• Improve TIC performance on firewall rules

• Improve update speeds