Parabix: Boosting the Efficiency of Text Processing on Commodity Processors
Dan Lin, Nigel Medforth, Kenneth S. Herdy, Arrvindh Shriraman, Rob Cameron. School of Computing Science, Simon Fraser University.
![Page 1: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/1.jpg)
![Page 2: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/2.jpg)
Text Processing is Important
<?xml version="1.0" encoding="ISO-8859-1"?> <current_observation>
<credit>NOAA's National Weather Service</credit><location>New Orleans, Naval Air Station, LA</location><latitude>29.833</latitude><longitude>-90.017</longitude><weather>Mostly Cloudy</weather><temperature_string>77.0 F (25.0 C)</temperature_string><temp_f>77.0</temp_f><temp_c>25.0</temp_c><relative_humidity>69</relative_humidity><wind_string>Southwest at 17.3 MPH (15 KT)</wind_string><wind_dir>Southwest</wind_dir><visibility_mi>10.00</visibility_mi>
</current_observation>
![Page 3: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/3.jpg)
Text Processing is Important
Tweets: 50 million messages per day
![Page 4: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/4.jpg)
Text Processing is Important
Tweets
Emails
![Page 5: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/5.jpg)
Text Processing is Important
Tweets
Emails
Mobile Text
A huge amount of text every second!
![Page 6: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/6.jpg)
Text Processing is Hard
• The “thirteenth dwarf” (parsers/finite state machines)
  o Hardest dwarf to parallelize and process efficiently. [Asanovic et al., Berkeley “Landscape” report]
• Large state machines for indexing/search
  o Irregular memory access.
• Parsing variable-length strings and irregular data
  o Branches in the code.
Cache Misses
Branch Mispredictions
“Nothing helps!”
![Page 7: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/7.jpg)
Simple Example of Parsing
Text input: <name> txt <error] <err)
1. Locate “<”.
2. Scan through the alphabetic characters after “<” to match “>”.
3. Report error positions for mismatches.
Traditional method for step 1: ask each byte, “Are you ‘<’?”
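The three steps above can be sketched as a conventional byte-at-a-time scan. This is a minimal illustration in Python, not the code of any particular parser: the function name `find_tag_errors` is hypothetical, and a tag is assumed to be `<`, a run of ASCII letters, then `>`.

```python
def find_tag_errors(text: str) -> list:
    """Return positions where a '<letters' run is not closed by '>'."""
    errors = []
    i = 0
    while i < len(text):
        if text[i] != "<":            # step 1: ask each character "are you '<'?"
            i += 1
            continue
        i += 1                        # move past '<'
        while i < len(text) and text[i].isascii() and text[i].isalpha():
            i += 1                    # step 2: scan through the letters
        if i >= len(text) or text[i] != ">":
            errors.append(i)          # step 3: report the mismatch position
        # Do not advance here: the character at i may itself be '<'.
    return errors


error_positions = find_tag_errors("<name> txt <error] <err)")
print(error_positions)
```

Every character costs at least one data-dependent branch in this style, which is the inefficiency the rest of the talk sets out to avoid.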
![Page 8: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/8.jpg)
Conventional XML Parser (Xerces)

    while (gotData) {
        XMLSize_t orgReader;
        const XMLTokens curToken = senseNextToken(orgReader);
        if (curToken == Token_CharData) {
            scanCharData(fCDataBuf);
            continue;
        } else if (curToken == Token_EOF) {
            if (!fElemStack.isEmpty()) { ... }
            gotData = false;
            continue;
        }
        ...
        switch (curToken) {
        case Token_CData: scanCDSection(); break;
        case Token_Comment: scanComment(); break;
        case Token_EndTag: scanEndTag(gotData); break;
        case Token_PI: scanPI(); break;
        case Token_StartTag:
            fDoNamespaces ? scanStartTagNS(gotData) : scanStartTag(gotData);
            break;
        default: fReaderMgr.skipToChar(chOpenAngle);
        }
    }

• Byte-at-a-time
• 13 branches per input byte
• Highly inefficient!
![Page 9: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/9.jpg)
Our Technology: Parabix
• Highly parallel using bitwise SIMD
  o Byte streams restructured to parallel bit streams.
  o 128 bytes at a time with 128-bit SIMD (SSE)
    • Or more, depending on architecture.
  o Almost branch free.
  o Streaming, cache-friendly model.
• Programming support
  o Character Class Compiler (CCC)
    • Marks occurrences of character classes (e.g. [<]).
  o Parallel Block Compiler (Pablo)
    • Converts Python (unbounded bitstreams) to C++ using SIMD.
  o A portable SIMD library
    • Supports many architectures.
![Page 10: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/10.jpg)
Our Results: XML Parsing
[Figure: XML parsing performance of Conventional, Parabix, and Multicore Parabix; chart annotations: 4x, 5x, 2x]
![Page 11: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/11.jpg)
Outline
• Parabix Framework:
  o Parallel Bitstream Technology
  o Parabix toolkit:
    • CCC: Character Class Compiler
    • Pablo: Parallel Block Compiler
    • Portable SIMD library
• XML Parsing with Parabix
• Performance and Energy Evaluation
• Multithreaded/Multicore Parabix
![Page 12: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/12.jpg)
![Page 13: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/13.jpg)
What are Parallel Bitstreams?
Given a byte-oriented character stream T, e.g., “b7<A<”, transpose it to 8 parallel bit streams b0, b1, ..., b7.

| Character | b | 7 | < | A | < |
|---|---|---|---|---|---|
| Byte | 01100010 | 00110111 | 00111100 | 01000001 | 00111100 |
| b0 | 0 | 0 | 0 | 0 | 0 |

b0, ..., b7 are variables that can hold unbounded bitstreams.
![Page 14: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/14.jpg)
What are Parallel Bitstreams?
Given a byte-oriented character stream T, e.g., “b7<A<”, transpose it to 8 parallel bit streams b0, b1, ..., b7.

| Character | b | 7 | < | A | < |
|---|---|---|---|---|---|
| Byte | 01100010 | 00110111 | 00111100 | 01000001 | 00111100 |
| b0 | 0 | 0 | 0 | 0 | 0 |
| b1 | 1 | 0 | 0 | 1 | 0 |
![Page 15: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/15.jpg)
What are Parallel Bitstreams?
Given a byte-oriented character stream T, e.g., “b7<A<”, transpose it to 8 parallel bit streams b0, b1, ..., b7.

| Character | b | 7 | < | A | < |
|---|---|---|---|---|---|
| Byte | 01100010 | 00110111 | 00111100 | 01000001 | 00111100 |
| b0 | 0 | 0 | 0 | 0 | 0 |
| b1 | 1 | 0 | 0 | 1 | 0 |
| b2 | 1 | 1 | 1 | 0 | 1 |
![Page 16: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/16.jpg)
What are Parallel Bitstreams?
Given a byte-oriented character stream T, e.g., “b7<A<”, transpose it to 8 parallel bit streams b0, b1, ..., b7.

| Character | b | 7 | < | A | < |
|---|---|---|---|---|---|
| Byte | 01100010 | 00110111 | 00111100 | 01000001 | 00111100 |
| b0 | 0 | 0 | 0 | 0 | 0 |
| b1 | 1 | 0 | 0 | 1 | 0 |
| b2 | 1 | 1 | 1 | 0 | 1 |
| b3 | 0 | 1 | 1 | 0 | 1 |
| b4 | 0 | 0 | 1 | 0 | 1 |
| b5 | 0 | 1 | 1 | 0 | 1 |
| b6 | 1 | 1 | 0 | 0 | 0 |
| b7 | 0 | 1 | 0 | 1 | 0 |
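The transposition can be sketched in a few lines of Python, using Python's arbitrary-precision integers as stand-ins for unbounded bitstreams. This is an illustrative bit-by-bit version only; the helper names `transpose` and `show` are hypothetical, and a real implementation would do this step with SIMD operations rather than a per-bit loop.

```python
def transpose(data: bytes) -> list:
    """Return 8 bitstreams b0..b7, where b0 is the most significant bit.

    Each stream is a Python int; bit i of the int (counting from the
    least significant end) corresponds to byte position i in `data`.
    """
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return streams


def show(stream: int, length: int) -> str:
    """Render a stream left to right, in text order."""
    return " ".join(str((stream >> i) & 1) for i in range(length))


b = transpose(b"b7<A<")
for k in range(8):
    print(f"b{k}", show(b[k], 5))
```

Each printed row gives one bit position of every byte, so row `b0` holds the most significant bits of all five characters.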
![Page 17: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/17.jpg)
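Character Bitstream Classification

| Character | b | 7 | < | A | < |
|---|---|---|---|---|---|
| Byte | 01100010 | 00110111 | 00111100 | 01000001 | 00111100 |
| b0 | 0 | 0 | 0 | 0 | 0 |
| b1 | 1 | 0 | 0 | 1 | 0 |
| b2 | 1 | 1 | 1 | 0 | 1 |
| b3 | 0 | 1 | 1 | 0 | 1 |
| b4 | 0 | 0 | 1 | 0 | 1 |
| b5 | 0 | 1 | 1 | 0 | 1 |
| b6 | 1 | 1 | 0 | 0 | 0 |
| b7 | 0 | 1 | 0 | 1 | 0 |
| [<] | 0 | 0 | 1 | 0 | 1 |

Now calculate the LAngle bitstream in parallel.
[<] = ¬b0 ∧ ¬b1 ∧ b2 ∧ b3 ∧ b4 ∧ b5 ∧ ¬b6 ∧ ¬b7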
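The LAngle equation can be checked directly with Python integers as bitstreams. This is a sketch under assumptions: the `transpose` helper and the `NOT` wrapper are hypothetical names introduced here, and the explicit `mask` only keeps the complement within the five positions in use.

```python
def transpose(data: bytes) -> list:
    """Return bitstreams b0..b7 (b0 = most significant bit of each byte)."""
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return streams


text = b"b7<A<"
b0, b1, b2, b3, b4, b5, b6, b7 = transpose(text)
mask = (1 << len(text)) - 1      # one bit per input position


def NOT(x: int) -> int:
    """Complement restricted to the positions of the input."""
    return ~x & mask


# '<' is 00111100, so b0, b1, b6, b7 must be 0 and b2..b5 must be 1.
langle = NOT(b0) & NOT(b1) & b2 & b3 & b4 & b5 & NOT(b6) & NOT(b7)

marks = "".join("1" if (langle >> i) & 1 else "0" for i in range(len(text)))
print(marks)
```

All positions are classified by the same handful of bitwise operations, with no per-character branch.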
![Page 18: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/18.jpg)
Character Bitstream Classification
• Minimum number of operations?
  o [<]: 7 ops
  o [<] + [>]: 10 ops
  o [<] + [>] + [a-zA-Z]: 21 ops
• Larger set of character classes?
  o e.g. XML parsing: about 30 character classes.
Reactions as the set grows: “Easy.” “Not so hard.” “Well… I can handle.” “Help!”
![Page 19: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/19.jpg)
![Page 20: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/20.jpg)
Parabix Tool Chain
8 parallel bitstreams
CCC input (character class definitions):
LAngle = [<]
RAngle = [>]
Alpha = [a-zA-Z]
Pablo input (bitstream equations):
L0 = Advance(LAngle)
L1 = ScanThru(L0, Alpha)
E1 = L1 &~ RAngle
Output: C++ code, built on the Portable SIMD library
![Page 21: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/21.jpg)
Character Class Compiler
Programmer defined:

    LAngle = [<]
    RAngle = [>]
    Alpha = [a-zA-Z]

Generated by CCC:

    temp1 = (bit0 | bit1)
    temp2 = (bit2 & bit3)
    temp3 = (temp2 &~ temp1)
    temp4 = (bit4 & bit5)
    temp5 = (bit6 | bit7)
    temp6 = (temp4 &~ temp5)
    LAngle = (temp3 & temp6)
    temp7 = (bit6 &~ bit7)
    temp8 = (temp4 & temp7)
    RAngle = (temp3 & temp8)
    temp9 = (bit6 & bit7)
    temp10 = (bit5 | temp9)
    temp11 = (bit4 & temp10)
    temp12 = (~temp11)
    temp13 = (bit4 | bit5)
    temp14 = (temp13 | temp5)
    temp15 = ((bit3 & temp12) | (~(bit3) & temp14))
    temp16 = (bit1 &~ bit0)
    Alpha = (temp15 & temp16)
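The generated equations can be run almost verbatim in Python, since Python integers behave as unbounded bitstreams. In this sketch the `transpose` and `render` helpers are hypothetical additions, the slide's `&~` is written as `& ~`, and the input is the sample text used later in the talk.

```python
def transpose(data: bytes) -> list:
    """Return bitstreams bit0..bit7 (bit0 = most significant bit)."""
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return streams


def render(stream: int, length: int) -> str:
    """Render a stream in text order, '1' for set bits and '.' otherwise."""
    return "".join("1" if (stream >> i) & 1 else "." for i in range(length))


text = b"<name> txt <error] <err)"
bit0, bit1, bit2, bit3, bit4, bit5, bit6, bit7 = transpose(text)

temp1 = bit0 | bit1
temp2 = bit2 & bit3
temp3 = temp2 & ~temp1
temp4 = bit4 & bit5
temp5 = bit6 | bit7
temp6 = temp4 & ~temp5
LAngle = temp3 & temp6
temp7 = bit6 & ~bit7
temp8 = temp4 & temp7
RAngle = temp3 & temp8
temp9 = bit6 & bit7
temp10 = bit5 | temp9
temp11 = bit4 & temp10
temp12 = ~temp11
temp13 = bit4 | bit5
temp14 = temp13 | temp5
temp15 = (bit3 & temp12) | (~bit3 & temp14)
temp16 = bit1 & ~bit0
Alpha = temp15 & temp16

for name, s in (("Alpha", Alpha), ("RAngle", RAngle), ("LAngle", LAngle)):
    print(f"{name:7}", render(s, len(text)))
```

The shared temporaries (`temp3`, `temp4`, `temp5`) are what keep the total operation count low as more classes are added.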
![Page 22: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/22.jpg)
Simple Parsing

    Source Text              <name> txt <error] <err)
    Alpha                    .1111..111..11111...111.
    RAngle                   .....1..................
    LAngle                   1..........1.......1....
    L0 = Advance(LAngle)     .1..........1.......1...
    L1 = ScanThru(L0, Alpha) .....1...........1.....1
    E1 = L1 ∧ ¬RAngle        .................1.....1

Advance(LAngle): LAngle >> 1
ScanThru(L0, Alpha): (L0 + Alpha) ∧ ¬Alpha
16-bit register
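A minimal Python sketch of Advance and ScanThru reproduces the streams above. It assumes Python integers as unbounded bitstreams with text position i stored in bit i; under that convention moving a marker one position forward in the text is a left shift of the integer, even though the slide draws it as `>> 1` in left-to-right text order. The helper names are hypothetical.

```python
def stream(text: str, pred) -> int:
    """Build a bitstream with bit i set where pred(text[i]) holds."""
    s = 0
    for i, ch in enumerate(text):
        if pred(ch):
            s |= 1 << i
    return s


def advance(s: int) -> int:
    """Move every marker one position forward in the text."""
    return s << 1


def scan_thru(markers: int, chars: int) -> int:
    """Move each marker past the run of `chars` it sits on.

    The addition ripples a carry through each run of 1 bits, so the
    marker lands on the first position after the run.
    """
    return (markers + chars) & ~chars


def render(s: int, length: int) -> str:
    return "".join("1" if (s >> i) & 1 else "." for i in range(length))


text = "<name> txt <error] <err)"
alpha = stream(text, lambda c: c.isascii() and c.isalpha())
langle = stream(text, lambda c: c == "<")
rangle = stream(text, lambda c: c == ">")

L0 = advance(langle)
L1 = scan_thru(L0, alpha)
E1 = L1 & ~rangle

print(render(E1, len(text)))
error_positions = [i for i in range(len(text)) if (E1 >> i) & 1]
```

The single addition in `scan_thru` handles every tag name in the input at once, which replaces the inner scanning loop of a byte-at-a-time parser.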
![Page 23: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/23.jpg)
![Page 24: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/24.jpg)
Parallel Block Compiler (Pablo)
Programmers write in Python:

    L0 = Advance(LAngle)
    L1 = ScanThru(L0, Alpha)
    E1 = L1 &~ RAngle

C++ generated by Pablo:

    CarryInit(carryQ, 2); }
    void do_block(Lex & lex) {
        BitBlock L0, L1;
        L0 = Advance_ci_co(C2, carryQ, 0);
        L1 = ScanThru_ci_co(L0, C0, carryQ, 1);
        E1 = simd_andc(L1, C1);
        CarryQ_Adjust(carryQ, 2);
    }
    CarryDeclare(carryQ, 2);
![Page 25: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/25.jpg)
Parallel Block Compiler (Pablo)
In the generated Advance_ci_co and ScanThru_ci_co calls:
• ci: carry in from the previous block
• co: carry out to the next block
• Implemented differently with different ISAs
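The role of carry in and carry out can be illustrated by processing the sample text in fixed-size blocks. The sketch below is an assumption-laden model in Python, not the generated C++: `advance_block` and `scan_thru_block` are hypothetical names, the block width of 16 bits is chosen only to force a tag name to straddle a block boundary, and carries are kept in plain variables rather than a carry queue.

```python
BLOCK = 16                      # illustrative block width in bits
MASK = (1 << BLOCK) - 1


def advance_block(x: int, carry_in: int):
    """Advance one block by one position; returns (result, carry_out)."""
    return ((x << 1) | carry_in) & MASK, x >> (BLOCK - 1)


def scan_thru_block(markers: int, chars: int, carry_in: int):
    """ScanThru on one block; returns (result, carry_out)."""
    total = markers + chars + carry_in
    return total & ~chars & MASK, total >> BLOCK


text = "<name> txt <error] <err)"
langle = sum(1 << i for i, c in enumerate(text) if c == "<")
rangle = sum(1 << i for i, c in enumerate(text) if c == ">")
alpha = sum(1 << i for i, c in enumerate(text) if c.isascii() and c.isalpha())

n_blocks = (len(text) + BLOCK - 1) // BLOCK
carry_adv = carry_scan = 0      # one carry per carry-generating operation
errors = 0
for k in range(n_blocks):
    shift = k * BLOCK
    la = (langle >> shift) & MASK
    ra = (rangle >> shift) & MASK
    al = (alpha >> shift) & MASK
    L0, carry_adv = advance_block(la, carry_adv)
    L1, carry_scan = scan_thru_block(L0, al, carry_scan)
    errors |= (L1 & ~ra & MASK) << shift

error_positions = [i for i in range(len(text)) if (errors >> i) & 1]
print(error_positions)
```

Here the name `error` occupies positions 12 to 16, so the scan started in the first block only completes in the second one because the addition's carry is passed along.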
![Page 26: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/26.jpg)
![Page 27: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/27.jpg)
Portable SIMD Library
• Our SIMD library supports all power-of-2 field widths up to the full SIMD register width on a target machine.
• Instruction sets supported are:
  o 128-bit Altivec
  o 128-bit SSE
  o 256-bit AVX
  o 128-bit Neon (ARM)
  o 128-bit SPU (Cell)
Programmers don’t need to know any details of the ISAs!
![Page 28: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/28.jpg)
Outline
• Parabix Framework:
  o Parallel Bitstream Technology: novel application of SIMD
  o Parabix toolkit:
    • CCC: Character Class Compiler
    • Pablo: Parallel Block Compiler
    • Portable SIMD library
• XML Parsing with Parabix
• Performance and Energy Evaluation
• Multithreaded/Multicore Parabix
![Page 29: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/29.jpg)
XML Parsing with Parabix
• XML character classes → CCC
• XML parsing with unbounded bitstreams in Python → Pablo
![Page 30: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/30.jpg)
Performance Study: Benchmark Files
Input Document Characteristics

| File Name | dew | jaw | roads | po | soap |
|---|---|---|---|---|---|
| File Type | doc | doc | data | data | data |
| File Size (kB) | 66240 | 7343 | 11584 | 76450 | 2717 |
| Markup Density | 0.07 | 0.13 | 0.57 | 0.76 | 0.87 |

Document-oriented instances often contain information intended for publication.
Data-oriented instances are typically used for the exchange of database records.
Markup Density = markup bytes / the total document size.
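As a rough illustration of the markup density formula, the sketch below counts every byte from a `<` through its matching `>` as markup. That counting rule is an assumption made here for the example (it ignores comments, CDATA sections, and entity references), and `markup_density` is a hypothetical helper, not the tool used in the study.

```python
def markup_density(doc: bytes) -> float:
    """Fraction of bytes that lie inside '<...>' tags, brackets included."""
    markup = 0
    in_tag = False
    for byte in doc:
        if byte == ord("<"):
            in_tag = True
        if in_tag:
            markup += 1
        if byte == ord(">"):
            in_tag = False
    return markup / len(doc) if doc else 0.0


sample = b"<temp_f>77.0</temp_f>"
density = markup_density(sample)     # 17 markup bytes out of 21
print(f"{density:.2f}")
```

A record-like fragment such as this one is mostly tags, which is why data-oriented files sit at the high end of the density range in the table.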
![Page 31: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/31.jpg)
Experimental Setup
• 3 parsers:
  o Parabix, Xerces (IBM, Apache) and Expat (SourceForge)
• Platforms:
  o Core2Duo, Core i3 (baseline), SandyBridge
  o ARM A8 1 GHz, Neon ISA (Samsung tablet)
• Metrics: cycles/byte, ns/byte, nJ/byte
  o Performance counters
• Power measurement
  o Fluke i410 current clamp
  o Agilent 34410a digital multimeter
![Page 32: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/32.jpg)
Performance Results: Latency & Energy
[Figure: latency and energy of Parabix, Expat, and Xerces against markup density on Core i3; chart annotations: 2x, 7x, 4x]
• Markup density has a substantial influence on the performance of traditional parsers.
• Parabix achieves more speedup on higher markup density inputs.
• Parabix saves 4x energy on average.
![Page 33: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/33.jpg)
Performance Factors
Cache misses per kB of input data:

| | Parabix | Expat | Xerces |
|---|---|---|---|
| L1 | 4.1 | 31.7 (↑ 8x) | 104.2 (↑ 26x) |
| L2 | 0.1 | 12.0 (↑ 120x) | 1.7 (↑ 17x) |

[Figure: branch mispredictions per kB of input]
Misprediction penalty: 700 × 10 = 7000 cycles/kB = processing time of Parabix
![Page 34: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/34.jpg)
Parabix on Mobile (ARM Neon)
[Figure: parser latency on Core i3 and on ARM Neon]
• Latency on ARM Neon is greater than on Core i3.
• The advantage of Parabix only shows up for higher density files.
![Page 35: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/35.jpg)
Parabix on AVX (Advanced Vector Extensions)
[Figure: results on SandyBridge; chart annotations: 11%, 20%]
• Three-operand form
• Wider register width
Two operand: a = a + b
Three operand: c = a + b
![Page 36: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/36.jpg)
Parabix on AVX
[Figure: SIMD instruction counts; chart annotations: -34%, -32%]
• The number of “Other SIMD” instructions is reduced by using 128-bit AVX.
• The number of “Bitwise SIMD” instructions is reduced by using 256-bit AVX.
![Page 37: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/37.jpg)
Multicore Parabix (Pipeline)
• Stage 1: 1.97 cycles/byte
• Stage 2: 1.22 cycles/byte
• Stage 3: 2.03 cycles/byte
• Stage 4: 1.32 cycles/byte
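The stage costs above allow a back-of-the-envelope bound on pipeline speedup. The calculation below is an idealized sketch, not a result from the study: it assumes one stage per core, that the single-core cost is the sum of the stage costs, and that communication and synchronization between stages are free.

```python
# Per-stage cost in cycles/byte, taken from the slide.
stage_cost = [1.97, 1.22, 2.03, 1.32]

single_core = sum(stage_cost)        # all stages run back to back on one core
bottleneck = max(stage_cost)         # a pipeline is paced by its slowest stage
ideal_speedup = single_core / bottleneck

print(f"single core: {single_core:.2f} cycles/byte")
print(f"pipeline:    {bottleneck:.2f} cycles/byte")
print(f"ideal speedup on 4 cores: {ideal_speedup:.2f}x")
```

Under these assumptions the bound is a little over 3x rather than 4x, because the stages are not perfectly balanced; any real overhead lowers it further.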
![Page 38: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/38.jpg)
Multicore Parabix (Pipeline)
[Figure: single-core versus multicore Parabix; chart annotations: 2.1x, 2.5x]
• 4-core Parabix achieves >2x speedup over single core.
• Core workloads are better balanced for high-density files.
  o Better performance and energy utilization.
![Page 39: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/39.jpg)
Summary
• Parabix: a software toolchain and runtime framework for high-performance text processing, exploiting the SIMD data units found on commodity processors.
• Parabix XML parser
  o 2x to 7x improvement in performance.
  o 4x average improvement in energy.
• Multicore Parabix
  o Further 2x improvement in performance (4 cores).
• Parabix allowed us to perform studies on AVX and Neon without having to change the application source.
![Page 40: Parabix : Boosting the Efficiency of Text Processing on Commodity Processors](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813715550346895d9e9f70/html5/thumbnails/40.jpg)
Questions?