secompax a bitmap index compression algorithm
TRANSCRIPT
SECOMPAX-A bitmap index compression algorithm
Y. Wen, Z. Chen, G. Ma, G. Peng, J. Cao, Wenliang Huang*
China Unicom Groups*Research Institute of Information
Technology(RIIT)Tsinghua National Lab of Information Science and
technology(TNLIST)
Introduction of bitmap index Features:
– Fast and flexible for searching– Flexible query function – Huge room cost
Bitmap Coding Schemes:– Reduce room cost– Accelerate indexing & searching speed
Applications:– Scientific data management, – network traffic monitoring– Bioinformatics – High-throughput Virtual Screening– … …
Bitmap Database is under research
Former Algorithms 0-RLE: Run-Length Encoding 1-BBC: Byte-aligned Bitmap Code 2-WAH: Word-Aligned Hybrid 3-PLWAH: Position List Word-Aligned Hybrid
4-COMPAX: COMPressed Adaptive indeX format
4
Usually Occurred Terms Chunk: An unit with 31 bits Word: An unit with 32 bits (used in 32-bit computer)
Fill: A word/chunk with all of its bits 0 or 1
Literal: A word/chunk not with identical bits– Literal means exactly the same with original information at this stage.
5
1-BBC (Byte-aligned Bitmap Code)
Bitmap index coding scheme of Oracle Database
“The bitmap bytes are classified as gaps containing only zeros or only ones and maps containing a mixture of both. Continuous gaps are encoded by their byte length and a fill bit differentiating between zero and one gaps. Stretches of map bytes encode themselves and are accompanied with the stretch length in bytes. A pair (gap, map) is encoded into a single atom composed of a control byte followed by optional gap length and map. Single map byte having only one bit different than the others is encoded directly into a control byte using a “different bit” position within this map byte.” 6
2-WAH(Word-Aligned Hybrid)
7
Proposed by K. Wu et al. Divide a sequence by 31 bits (a chunk) and use chunk as the compression unit
Chunk types: fill (0-fill & 1-fill), literal
Combine adjacent chunks with the same type
2-WAH Advantages:
– Easy and fast for indexing and querying Disadvantages:
– Not fully compressed. Has a lot of room to improve.
Further research: PLWAH/COMPAX
8
3-PLWAH(Position List Word-Aligned Hybrid) PLWAH F. Deli`ege and T. B. Pedersen
1. Introduce concept of nearly identical and piggyback
2. Secondary Compression– Merge literal(s) with one dirty bit into the former
fill word9
3-PLWAH Advantages:
– Saves the total room cost compared with WAH
Drawbacks:– Every “piggyback” needs 5 bits (out of 32), so…– Less bits for counter (more apparent with big piggyback amounts)
11
4-COMPAX(COMPressed Adaptive indeX) Raised by F.Fusco .etc based on WAH 1. Introduce concept of dirty byte 2. Secondary compression with codebook– Use byte as the secondary unit– Add a codebook for secondary compression
12
4-COMPAX Advantages:
– Takes more conditions into consider– Higher compression ratio than WAH– Enhanced version-COMPAX2
• Add 1-fill based on 0-fill and take relevant combination types into consideration
Further research: SECOMPAX
14
SECOMPAX-design Full name: Scope Extended COMPAX Designed based on COMPAX 1. Expand the concept of dirty byte
– Modify the concept of dirty byte and nearly identical literal (NI-L) 2. Add combination types of LFL and FLF
– LFL: Except for 0-NI-L + F + 0-NI-L, we add three combination types: 0-NI-L + F + 1-NI-L , 1-NI-L + F + 0-NI-L and 1-NI-L + F + 1-NI-L;
– FLF: Except for 0F + 0-NI-L + 0F and 1F+0-NI-L+1F , we add six other types: 0F + 1-NI-L + 0F, 1F + 1-NI-L + 1F ; 1F + 1-NI-L + 0F , 0F + 1-NI-L + 1F ; 1F + 0-NI-L + 0F , 0F + 0-NI-L + 1F;
15
SECOMPAX-Automation
16
• A: Fill• B: Nearly identical fill
• C: Literal
• This can be easily implemented with FPGA Hardware
SECOMPAX-Advantages Advantages:
– Expand the concepts of LFL and FLF from COMPAX
– Has better compression ratio compared with COMPAX
– More obvious improvement when the amount of 0 bit and 1 bit are comparable
19
Evaluation
20
Used anonymous Internet data trace set from CAIDA in 2014.
Higher compression ratio than PLWAH and COMPAX (using COMPAX2 in experiment)
Evaluation
21
Source IP: Reduces index size by 6.74% compared with PLWAH and 4.01% compared with COMPAX
Destination IP: Reduces index size by 6.05% compared with PLWAH and 3.97% compared with PLWAH
Applications Bioinformatics (DNA aligned/query) Transit Closure for Reachability of Graph
Traffic Forensic for Network security
a lot of other applications …
29
TIFAflow
TIFA architecture Data flow
学学学学 学学学学
31
[1] Z. Chen et al., TIFAflow: enhancing traffic archiving system with flow granularity for forensic analysis in network security. Tsinghua Science and Technology, vol.18, No.4, 2013.[2] W. Huang, Z. Chen, W. Dong, H. Li, Mobile Internet big data platform in China Unicom, Tsinghua Science and Technology, Volume 19, Issue 1, pp. 95-101, Feb. 2014.[3] J. Li et al., TIFA: Enabling Real-Time Querying and Storage of Massive Stream Data. Proc. of International Conference on Networking and Distributed Computing (ICNDC), 2011.分 分 分 分
分 分 分 分
PACP分 分
分 分FastBit
Tim e M achineFLOW
FastBit分 分 分 分
分 分分 分
分 分 分 分 分
分 分 分 分PCAP分 分
分 分 分
分 分PCAP分 分
DATATim e M achine分 分 分 分
Conclusion SECOMPAX is a new bitmap coding scheme
SECOMPAX has better compression ratio and comparable encoding time
SECOMPAX has been integrated into FastBit
35
Future directions
36
Raw Bitm ap Index
W AH
PLW AH COM PAX/COM PAX2
SECOM PAX
ICX
+fill word
+piggyback +FLF/LFL
extension
+LF/NI2-LF
Run-length coding
M ASC
+carrier,-literal
References[1] Wu, Ming-Chuan, Alejandro P. Buchmann, and P. Larson. Encoded Bitmap Indexes and Their Use for Data Warehouse Optimization. Shaker, 2001.[2] Wu, Kesheng, Ekow J. Otoo, and Arie Shoshani. "Optimizing bitmap indices with efficient compression." ACM Transactions on Database Systems (TODS) 31, no. 1 (2006): 1-38.[3] F. Deli`ege and T. B. Pedersen. Position list word aligned hybrid: optimizing space and performance for compressed bitmaps. Proc. of the 13th Int. Conf. on Extending Database Technology, EDBT ‘10, 2010.[4] Fusco, Francesco, Michail Vlachos, and Marc Ph Stoecklin. "Real-time creation of bitmap indexes on streaming network data." The VLDB Journal-The International Journal on Very Large Data Bases 21, no. 3 (2012): 287-307.[5] Fusco, Francesco, Michail Vlachos, and Xenofontas Dimitropoulos. "RasterZip: compressing network monitoring data with support for partial decompression." Proceedings of the 12th ACM SIGCOMM Conference on Internet measurement, IMC’12, 2012.[6] Fusco, F., Dimitropoulos, X., Vlachos, M., & Deri, L. pcapIndex: an index for network packet traces with legacy compatibility. ACM SIGCOMM Computer Communication Review, 42(1), 47-53, 2012.[7] Papadogiannakis, Antonis, Michalis Polychronakis, and Evangelos P. Markatos. "Scap: stream-oriented network traffic capture and analysis for high-speed networks." In Proceedings of the 2013 conference on Internet measurement conference, pp. 441-454. ACM, 2013.
37
Patents [1] BBC, Byte aligned data compression, DEC/Oracle, www.google.com/patents/US5363098.
[2] WAH, Word aligned bitmap compression method, data structure, and apparatus, UC Berkeley, www.google.com/patents/US6831575.
[3] COMPAX, Network analysis, IBM, www.google.com/patents/US20120054160.
38