secompax a bitmap index compression algorithm

39
SECOMPAX- A bitmap index compression algorithm Y. Wen, Z. Chen, G. Ma, G. Peng, J. Cao, Wenliang Huang* China Unicom Groups* Research Institute of Information Technology(RIIT) Tsinghua National Lab of Information Science and

Upload: tsinghua

Post on 31-Jan-2023

3 views

Category:

Documents


0 download

TRANSCRIPT

SECOMPAX-A bitmap index compression algorithm

Y. Wen, Z. Chen, G. Ma, G. Peng, J. Cao, Wenliang Huang*

China Unicom Groups*Research Institute of Information

Technology(RIIT)Tsinghua National Lab of Information Science and

technology(TNLIST)

Outline Introduction Background SECOMPAX Evaluation Applications Conclusion

Introduction of bitmap index Features:

– Fast and flexible for searching– Flexible query function – Huge room cost

Bitmap Coding Schemes:– Reduce room cost– Accelerate indexing & searching speed

Applications:– Scientific data management, – network traffic monitoring– Bioinformatics – High-throughput Virtual Screening– … …

Bitmap Database is under research

Former Algorithms 0-RLE: Run-Length Encoding 1-BBC: Byte-aligned Bitmap Code 2-WAH: Word-Aligned Hybrid 3-PLWAH: Position List Word-Aligned Hybrid

4-COMPAX: COMPressed Adaptive indeX format

4

Usually Occurred Terms Chunk: An unit with 31 bits Word: An unit with 32 bits (used in 32-bit computer)

Fill: A word/chunk with all of its bits 0 or 1

Literal: A word/chunk not with identical bits– Literal means exactly the same with original information at this stage.

5

1-BBC (Byte-aligned Bitmap Code)

Bitmap index coding scheme of Oracle Database

“The bitmap bytes are classified as gaps containing only zeros or only ones and maps containing a mixture of both. Continuous gaps are encoded by their byte length and a fill bit differentiating between zero and one gaps. Stretches of map bytes encode themselves and are accompanied with the stretch length in bytes. A pair (gap, map) is encoded into a single atom composed of a control byte followed by optional gap length and map. Single map byte having only one bit different than the others is encoded directly into a control byte using a “different bit” position within this map byte.” 6

2-WAH(Word-Aligned Hybrid)

7

Proposed by K. Wu et al. Divide a sequence by 31 bits (a chunk) and use chunk as the compression unit

Chunk types: fill (0-fill & 1-fill), literal

Combine adjacent chunks with the same type

2-WAH Advantages:

– Easy and fast for indexing and querying Disadvantages:

– Not fully compressed. Has a lot of room to improve.

Further research: PLWAH/COMPAX

8

3-PLWAH(Position List Word-Aligned Hybrid) PLWAH F. Deli`ege and T. B. Pedersen

1. Introduce concept of nearly identical and piggyback

2. Secondary Compression– Merge literal(s) with one dirty bit into the former

fill word9

3-PLWAH

10

3-PLWAH Advantages:

– Saves the total room cost compared with WAH

Drawbacks:– Every “piggyback” needs 5 bits (out of 32), so…– Less bits for counter (more apparent with big piggyback amounts)

11

4-COMPAX(COMPressed Adaptive indeX) Raised by F.Fusco .etc based on WAH 1. Introduce concept of dirty byte 2. Secondary compression with codebook– Use byte as the secondary unit– Add a codebook for secondary compression

12

4-COMPAX

13

4-COMPAX Advantages:

– Takes more conditions into consider– Higher compression ratio than WAH– Enhanced version-COMPAX2

• Add 1-fill based on 0-fill and take relevant combination types into consideration

Further research: SECOMPAX

14

SECOMPAX-design Full name: Scope Extended COMPAX Designed based on COMPAX 1. Expand the concept of dirty byte

– Modify the concept of dirty byte and nearly identical literal (NI-L) 2. Add combination types of LFL and FLF

– LFL: Except for 0-NI-L + F + 0-NI-L, we add three combination types: 0-NI-L + F + 1-NI-L , 1-NI-L + F + 0-NI-L and 1-NI-L + F + 1-NI-L;

– FLF: Except for 0F + 0-NI-L + 0F and 1F+0-NI-L+1F , we add six other types: 0F + 1-NI-L + 0F, 1F + 1-NI-L + 1F ; 1F + 1-NI-L + 0F , 0F + 1-NI-L + 1F ; 1F + 0-NI-L + 0F , 0F + 0-NI-L + 1F;

15

SECOMPAX-Automation

16

• A: Fill• B: Nearly identical fill

• C: Literal

• This can be easily implemented with FPGA Hardware

SECOMPAX-Some examples

17

SECOMPAX-Codebook Uses at most 4 bits for instruction bits

18

SECOMPAX-Advantages Advantages:

– Expand the concepts of LFL and FLF from COMPAX

– Has better compression ratio compared with COMPAX

– More obvious improvement when the amount of 0 bit and 1 bit are comparable

19

Evaluation

20

Used anonymous Internet data trace set from CAIDA in 2014.

Higher compression ratio than PLWAH and COMPAX (using COMPAX2 in experiment)

Evaluation

21

Source IP: Reduces index size by 6.74% compared with PLWAH and 4.01% compared with COMPAX

Destination IP: Reduces index size by 6.05% compared with PLWAH and 3.97% compared with PLWAH

Evaluation-compression

22

Evaluation

23

Evaluation

24

Evaluation

25

26

Evaluation-encoding time

26

Evaluation

27

Evaluation-codeword distribution

28

Applications Bioinformatics (DNA aligned/query) Transit Closure for Reachability of Graph

Traffic Forensic for Network security

a lot of other applications …

29

Columnar Storage

30

TIFAflow

TIFA architecture Data flow

学学学学 学学学学

31

[1] Z. Chen et al., TIFAflow: enhancing traffic archiving system with flow granularity for forensic analysis in network security. Tsinghua Science and Technology, vol.18, No.4, 2013.[2] W. Huang, Z. Chen, W. Dong, H. Li, Mobile Internet big data platform in China Unicom, Tsinghua Science and Technology, Volume 19, Issue 1, pp. 95-101, Feb. 2014.[3] J. Li et al., TIFA: Enabling Real-Time Querying and Storage of Massive Stream Data. Proc. of International Conference on Networking and Distributed Computing (ICNDC), 2011.分 分 分 分

分 分 分 分

PACP分 分

分 分FastBit

Tim e M achineFLOW

FastBit分 分 分 分

分 分分 分

分 分 分 分 分

分 分 分 分PCAP分 分

分 分 分

分 分PCAP分 分

DATATim e M achine分 分 分 分

PcapIndex

32

NET-Fli

33

RasterZip

34

Conclusion SECOMPAX is a new bitmap coding scheme

SECOMPAX has better compression ratio and comparable encoding time

SECOMPAX has been integrated into FastBit

35

Future directions

36

Raw Bitm ap Index

W AH

PLW AH COM PAX/COM PAX2

SECOM PAX

ICX

+fill word

+piggyback +FLF/LFL

extension

+LF/NI2-LF

Run-length coding

M ASC

+carrier,-literal

References[1] Wu, Ming-Chuan, Alejandro P. Buchmann, and P. Larson. Encoded Bitmap Indexes and Their Use for Data Warehouse Optimization. Shaker, 2001.[2] Wu, Kesheng, Ekow J. Otoo, and Arie Shoshani. "Optimizing bitmap indices with efficient compression." ACM Transactions on Database Systems (TODS) 31, no. 1 (2006): 1-38.[3] F. Deli`ege and T. B. Pedersen. Position list word aligned hybrid: optimizing space and performance for compressed bitmaps. Proc. of the 13th Int. Conf. on Extending Database Technology, EDBT ‘10, 2010.[4] Fusco, Francesco, Michail Vlachos, and Marc Ph Stoecklin. "Real-time creation of bitmap indexes on streaming network data." The VLDB Journal-The International Journal on Very Large Data Bases 21, no. 3 (2012): 287-307.[5] Fusco, Francesco, Michail Vlachos, and Xenofontas Dimitropoulos. "RasterZip: compressing network monitoring data with support for partial decompression." Proceedings of the 12th ACM SIGCOMM Conference on Internet measurement, IMC’12, 2012.[6] Fusco, F., Dimitropoulos, X., Vlachos, M., & Deri, L. pcapIndex: an index for network packet traces with legacy compatibility. ACM SIGCOMM Computer Communication Review, 42(1), 47-53, 2012.[7] Papadogiannakis, Antonis, Michalis Polychronakis, and Evangelos P. Markatos. "Scap: stream-oriented network traffic capture and analysis for high-speed networks." In Proceedings of the 2013 conference on Internet measurement conference, pp. 441-454. ACM, 2013.

37

Patents [1] BBC, Byte aligned data compression, DEC/Oracle, www.google.com/patents/US5363098.

[2] WAH, Word aligned bitmap compression method, data structure, and apparatus, UC Berkeley, www.google.com/patents/US6831575.

[3] COMPAX, Network analysis, IBM, www.google.com/patents/US20120054160.

38

Thank you!

39