![Page 1: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/1.jpg)
Efficient and Flexible Information Retrieval Using MonetDB/X100
Sándor Héman, CWI, Amsterdam
Marcin Zukowski, Arjen de Vries, Peter Boncz
January 08, 2007
![Page 2: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/2.jpg)
Background
Process query-intensive workloads over large datasets efficiently within a DBMS
Application Areas
• Information Retrieval
• Data mining
• Scientific data analysis
![Page 3: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/3.jpg)
MonetDB/X100 Highlights
• Vectorized query engine
• Transparent, light-weight compression
![Page 4: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/4.jpg)
Keyword Search
Inverted index: TD(termid, docid, score)
    TopN(
      Project(
        MergeJoin(
          RangeSelect( TD1=TD, TD1.termid=10 ),
          RangeSelect( TD2=TD, TD2.termid=42 ),
          TD1.docid = TD2.docid),
        [docid = TD1.docid, score = TD1.scoreQ + TD2.scoreQ]),
      [score DESC], 20)
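The plan above can be read as three steps: select the posting lists of two terms, merge-join them on docid while summing the scores, and keep the 20 best results. The following C sketch illustrates the last two steps on in-memory arrays. The `Posting` type and the helper names `merge_join_sum` and `top_n` are hypothetical and are not taken from MonetDB/X100; the sketch assumes both posting lists are sorted by docid, which is what makes a merge join applicable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One posting of TD(termid, docid, score) for a fixed termid (hypothetical type). */
typedef struct { uint32_t docid; uint32_t score; } Posting;

/* Merge-join two docid-sorted posting lists; for every docid present in both,
 * emit (docid, a.score + b.score). `out` needs room for min(na, nb) entries.
 * Returns the number of matches. */
size_t merge_join_sum(const Posting *a, size_t na,
                      const Posting *b, size_t nb, Posting *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i].docid < b[j].docid) {
            i++;
        } else if (a[i].docid > b[j].docid) {
            j++;
        } else {
            out[k].docid = a[i].docid;
            out[k].score = a[i].score + b[j].score;
            k++; i++; j++;
        }
    }
    return k;
}

/* Move the n highest-scoring entries to the front of r, in descending score
 * order (partial selection sort, chosen for brevity, not speed).
 * Returns the number of entries kept. */
size_t top_n(Posting *r, size_t len, size_t n)
{
    if (n > len) n = len;
    for (size_t i = 0; i < n; i++) {
        size_t best = i;
        for (size_t j = i + 1; j < len; j++)
            if (r[j].score > r[best].score) best = j;
        Posting tmp = r[i]; r[i] = r[best]; r[best] = tmp;
    }
    return n;
}
```

Calling `merge_join_sum` on the two selected lists and then `top_n(out, k, 20)` mirrors the MergeJoin, Project and TopN operators of the plan.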
![Page 5: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/5.jpg)
![Page 6: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/6.jpg)
![Page 7: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/7.jpg)
![Page 8: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/8.jpg)
Vectorized Execution [CIDR05]
Volcano based iterator pipeline
Each next() call returns collection of column-vectors of tuples
• Amortize overheads
• Introduce parallelism
• Stay in CPU Cache
Vectors
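As a rough illustration of the vector-at-a-time pipeline, the C sketch below has a scan operator whose next() hands out slices of two columns, and a project operator that calls one primitive per vector instead of once per tuple. All names (`Scan`, `scan_next`, `map_add_u32`, `project_add`) and the vector length of 1000 are assumptions made for this example, not the X100 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 1000   /* assumed vector length, small enough to stay in cache */

/* Vectorized primitive: a tight loop over arrays, with no per-tuple function
 * call and no branches, so the compiler can pipeline it. */
void map_add_u32(size_t n, uint32_t *res, const uint32_t *a, const uint32_t *b)
{
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] + b[i];
}

/* Hypothetical scan operator over two in-memory columns of equal length. */
typedef struct {
    const uint32_t *col_a, *col_b;
    size_t len, pos;
} Scan;

/* next(): expose up to VECTOR_SIZE tuples as two column vectors.
 * Returns the number of tuples in the vector, 0 at end of input. */
size_t scan_next(Scan *s, const uint32_t **a, const uint32_t **b)
{
    assert(s->pos <= s->len);
    size_t n = s->len - s->pos;
    if (n > VECTOR_SIZE) n = VECTOR_SIZE;
    *a = s->col_a + s->pos;
    *b = s->col_b + s->pos;
    s->pos += n;
    return n;
}

/* Project operator: one scan_next() call and one primitive call per vector,
 * so the interpretation overhead is amortized over up to VECTOR_SIZE tuples. */
size_t project_add(Scan *s, uint32_t *out)
{
    const uint32_t *a, *b;
    size_t n, total = 0;
    while ((n = scan_next(s, &a, &b)) > 0) {
        map_add_u32(n, out + total, a, b);
        total += n;
    }
    return total;
}
```

For brevity the sketch writes the results into one large output array; in a real pipeline the project operator would itself return one result vector per next() call.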
![Page 9: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/9.jpg)
![Page 10: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/10.jpg)
![Page 11: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/11.jpg)
![Page 12: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/12.jpg)
![Page 13: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/13.jpg)
Light-Weight Compression
Compressed buffer-manager pages:
• Increase I/O bandwidth
• Increase BM capacity
Favor speed over compression ratio
• CPU-efficient algorithms
• >1 GB/s decompression speed
Minimize main-memory overhead
• RAM-CPU Cache decompression
![Page 14: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/14.jpg)
Naïve Decompression
1. Read and decompress page
2. Write back to RAM
3. Read for processing
![Page 15: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/15.jpg)
RAM-Cache Decompression
1. Read and decompress page at vector granularity, on-demand
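The difference from the naïve scheme can be sketched in C as follows: the page stays compressed in RAM, and only one vector at a time is decompressed into a small buffer that is consumed immediately. The codec used here (8-bit offsets from a base value) is a stand-in chosen to keep the example short, and the names `decompress_vector` and `sum_compressed` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 1000   /* assumed vector length */

/* Stand-in codec (an assumption for this sketch): 8-bit offsets from `base`. */
void decompress_vector(const uint8_t *src, size_t n, uint32_t base, uint32_t *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = base + src[i];
}

/* Sum a compressed column, decompressing on demand at vector granularity.
 * Only `vec` (about 4 KB) ever holds uncompressed values, so the decompressed
 * data is never written back to RAM as a full page. */
uint64_t sum_compressed(const uint8_t *page, size_t n, uint32_t base)
{
    assert(page != NULL || n == 0);
    uint32_t vec[VECTOR_SIZE];
    uint64_t sum = 0;
    for (size_t pos = 0; pos < n; pos += VECTOR_SIZE) {
        size_t len = (n - pos < VECTOR_SIZE) ? n - pos : VECTOR_SIZE;
        decompress_vector(page + pos, len, base, vec);   /* RAM -> cache */
        for (size_t i = 0; i < len; i++)                 /* consume while cached */
            sum += vec[i];
    }
    return sum;
}
```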
![Page 16: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/16.jpg)
![Page 17: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/17.jpg)
![Page 18: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/18.jpg)
![Page 19: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/19.jpg)
![Page 20: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/20.jpg)
![Page 21: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/21.jpg)
2006 TREC TeraByte Track
X100 compared to custom IR systems
Others prune index

| System | #CPUs | P@20 | Throughput (q/s) | Throughput/CPU |
|---|---|---|---|---|
| X100 | 16 | 0.47 | 186 | 13 |
| X100 | 1 | 0.47 | 13 | 13 |
| Wumpus | 1 | 0.41 | 77 | 77 |
| MPI | 2 | 0.43 | 34 | 17 |
| Melbourne Univ | 1 | 0.49 | 18 | 18 |
![Page 22: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/22.jpg)
Thanks!
![Page 23: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/23.jpg)
MonetDB/X100 in Action
Corpus:
• 25M text documents, 427GB
• docid + score: 28GB, 9GB compressed
Hardware:
• 3GHz Intel Xeon
• 4GB RAM
• 10 disk RAID, 350 MB/s
![Page 24: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/24.jpg)
MonetDB/X100 [CIDR’05]
Vector-at-a-time instead of tuple-at-a-time Volcano
Vector = Array of Values (100-1000)
Vectorized Primitives
• Array Computations
• Loop Pipelinable, very fast
• Less Function call overhead
Vectors are Cache Resident
RAM considered secondary storage
![Page 25: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/25.jpg)
![Page 26: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/26.jpg)
![Page 27: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/27.jpg)
Vector Size vs Execution Time
![Page 28: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/28.jpg)
Compression
docid: PFOR-DELTA
• Encode deltas as a b-bit offset from an arbitrary base value:
• deltas within $[\mathit{base},\ \mathit{base} + 2^b)$ get encoded
• deltas outside the range are stored as uncompressed exceptions
score: Okapi -> quantize -> PFOR compress
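A minimal C sketch of the encoding side may help: it computes the deltas between consecutive sorted docids, stores `delta - base` for deltas inside the b-bit range, and collects the others as uncompressed exceptions. The `PforBlock` layout, the block size of 128, and the function name are assumptions for illustration. For readability the codes are kept one per byte (so b is limited to 8) instead of being bit-packed, and exception positions are tracked with a separate flag array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAX 128   /* assumed block size for this sketch */

/* Hypothetical, simplified block: one code per byte instead of bit-packing. */
typedef struct {
    uint8_t  code[BLOCK_MAX];    /* delta - base for in-range deltas */
    uint8_t  is_exc[BLOCK_MAX];  /* 1 where code[i] is only a placeholder */
    uint32_t exc[BLOCK_MAX];     /* out-of-range deltas, uncompressed */
    size_t   n, n_exc;
} PforBlock;

/* Encode n sorted docids. `prev` is the docid preceding the block (0 for the
 * first block); deltas in [base, base + 2^b) become b-bit codes. */
void pfor_delta_encode(const uint32_t *docid, size_t n, uint32_t prev,
                       uint32_t base, unsigned b, PforBlock *blk)
{
    assert(n <= BLOCK_MAX && b >= 1 && b <= 8);
    blk->n = n;
    blk->n_exc = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t delta = docid[i] - prev;   /* sorted input: never negative */
        prev = docid[i];
        if (delta >= base && delta - base < (1u << b)) {
            blk->code[i] = (uint8_t)(delta - base);
            blk->is_exc[i] = 0;
        } else {
            blk->code[i] = 0;               /* placeholder code word */
            blk->is_exc[i] = 1;
            blk->exc[blk->n_exc++] = delta; /* stored as an exception */
        }
    }
}
```

With base 1 and b = 2, the docids 3, 4, 6, 306, 307 give the deltas 3, 1, 2, 300, 1: four of them fit into 2-bit codes and only the delta 300 becomes an exception.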
![Page 29: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/29.jpg)
![Page 30: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/30.jpg)
Compressed Block Layout
• Forward-growing section of bit-packed b-bit code words
• Backward-growing exception list
![Page 31: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/31.jpg)
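Naïve Decompression
Mark exception positions with a reserved code word (written MARKER here; the marker symbol did not survive in the slide text)

    for (i = 0; i < n; i++) {
        if (in[i] == MARKER) {
            out[i] = exc[--j];        /* exception list grows backwards */
        } else {
            out[i] = DECODE(in[i]);
        }
    }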
![Page 32: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/32.jpg)
![Page 33: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/33.jpg)
![Page 34: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/34.jpg)
Patched Decompression
Link exceptions into patch-list

Decode:

    for (i = 0; i < n; i++) {
        out[i] = DECODE(in[i]);
    }

Patch:

    for (i = first_exc; i < n; i += in[i]) {
        out[i] = exc[--j];
    }
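To make the two loops concrete, here is a self-contained C sketch of the patch-list idea. It is a simplified variation on the slide code: the code words are held one per byte rather than bit-packed, the patch loop is bounded by the exception count instead of `i < n`, and every gap between consecutive exceptions is assumed to fit into a b-bit code word (a complete implementation would have to handle larger gaps). The names `link_exceptions` and `patched_decode` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE(c, base) ((uint32_t)(c) + (base))   /* code word -> value */

/* Encoder side: reuse the code slot of each exception to store the distance
 * to the next exception. Returns first_exc, the position of the first one. */
size_t link_exceptions(uint8_t *in, const size_t *exc_pos, size_t n_exc, unsigned b)
{
    assert(b >= 1 && b <= 8);
    for (size_t k = 0; k + 1 < n_exc; k++) {
        size_t gap = exc_pos[k + 1] - exc_pos[k];
        assert(gap < ((size_t)1 << b));      /* assumption: gap fits in b bits */
        in[exc_pos[k]] = (uint8_t)gap;
    }
    if (n_exc > 0)
        in[exc_pos[n_exc - 1]] = 0;          /* last link is never followed */
    return n_exc > 0 ? exc_pos[0] : 0;
}

void patched_decode(const uint8_t *in, size_t n, uint32_t base,
                    const uint32_t *exc, size_t n_exc, size_t first_exc,
                    uint32_t *out)
{
    /* Decode: branch-free loop over all code words, exceptions included. */
    for (size_t i = 0; i < n; i++)
        out[i] = DECODE(in[i], base);

    /* Patch: follow the linked list and overwrite the wrongly decoded slots.
     * The exception list grows backwards, hence exc[--j]. */
    size_t j = n_exc;
    size_t i = first_exc;
    for (size_t k = 0; k < n_exc; k++) {
        assert(i < n);
        out[i] = exc[--j];
        i += in[i];
    }
}
```

Splitting the work this way keeps the data-dependent branch of the naïve loop out of the hot decode loop; the patch loop only visits the exceptions. For PFOR-DELTA the output holds deltas, so a running sum over `out` would still be needed to recover the docids.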
![Page 35: Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56649ea95503460f94bad949/html5/thumbnails/35.jpg)
Patch Bandwidth