Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases
Post on 04-Apr-2018
7/30/2019 Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases
1/79
Globally Recorded binary encoded Domain Compression
algorithm in Column Oriented Databases
A Dissertation
Submitted in partial fulfillment
For the award of the Degree of
Master of Technology
In Department of Information Technology
(With specialization in Information Communication)

Supervisor: Mr. Santosh Kumar Singh, Associate Prof.
Submitted By: Mehul Mahrishi, Enrollment No.: SGVU091543463
Suresh Gyan Vihar University
Candidate's Declaration

I hereby declare that the work being presented in the dissertation entitled
"Globally Recorded Binary Encoded Domain Compression Algorithm in Column
Oriented Databases", in partial fulfillment for the award of the Degree of Master of
Technology in the Department of Information Technology with specialization in
Information Communication, submitted to the Department of Information
Technology, Suresh Gyan Vihar University, is a record of my own investigations
carried out under the guidance of Mr. S.K. Singh, Department of Information
Technology.

I have not submitted the matter presented in this dissertation anywhere for
the award of any other degree.

(Name and Signature of Candidate)
Mehul Mahrishi
Information Communication (M. Tech IC)
Enrolment No.: SGVU091543463

Counter Signed by:
Mr. Santosh Kumar Singh, Supervisor
DETAILS OF CANDIDATE, SUPERVISOR(S) AND EXAMINER

Name of Candidate: Mehul Mahrishi. Roll No.: 104511
Deptt. of Study: M. Tech. (Information Communication)
Enrolment No.: SGVU091543463
Thesis Title: Globally Recorded Binary Encoded Domain Compression Algorithm in
Column Oriented Databases

Supervisor(s) and Examiners Recommended
(with office address including contact numbers, email ID)

Supervisor:
Co-Supervisor:
Internal Examiner: 1  2  3
Signature with Date:
Programme Coordinator:
Dean / Principal:
This certifies that the thesis entitled
Globally Recorded binary encoded Domain Compression algorithm
in Column Oriented Databases
is submitted by
Mehul Mahrishi (SGVU091543463),
IV semester, M.Tech (IC), in the year 2011, in partial fulfillment of the
Degree of Master of Technology in Information Communication,
SURESH GYAN VIHAR UNIVERSITY, JAIPUR.
Signature of Supervisor
Date:
Place:
Acknowledgement
Foremost, I would like to express my sincere gratitude to my advisor and mentor Mr.
S.K. Singh for the continuous support of my study and research, for his patience,
motivation, enthusiasm, and knowledge. His guidance helped me in all the time of
research and writing of this thesis. Besides my advisor, I would like to thank the rest
of my thesis committee, especially Mr. Vibhakar Pathak for their encouragement,
insightful comments, and hard questions.
My sincere thanks also go to Dr. S.L. Surana (Principal, SKIT), Dr. C.M.
Choudhary (HOD CS, SKIT) and Dr. Anil Chaudhary (HOD IT, SKIT) for supporting
my advanced studies, providing opportunities in their groups, and leading me to
work on diverse, exciting projects. My special thanks to Mr. Mukesh Gupta
(Reader, SKIT) for his invaluable advice, which helped me take this decision.
I thank my fellow students Anita Shrotriya, Devendra Kr. Sharma, Vipin Jain, the
Singh Brothers, and Kamal Hiran for the stimulating discussions, for the sleepless
nights we spent working together before deadlines, and for all the fun we have had
in the last two years.
Last but not least, I would like to thank my family members: my parents (Mukesh
& Madhulika Mahrishi), uncle & aunt (Pushpanshu & Seema Mahrishi), brothers
(Mridul & Harshit) and my grandmothers for their faith and for supporting me
throughout my life.
(Mehul Mahrishi)
Contents
List of Tables iv
List of Figures v
Notations vi
Abstract vii
CHAPTER 1 Introduction 1-4
1.1 Introduction 1
1.2 Objective 1
1.3 Motivation 2
1.4 Research Contribution 3
1.5 Dissertation Outline 3
CHAPTER 2 Theories 5-23
2.1 Introduction 5
2.1.1 On-Line Transaction Processing 6
2.1.2 Query Intensive Applications 7
2.2 The Rise of Columnar Database 8
2.3 Definitions 10
2.4 Row Oriented Execution 12
2.4.1 Vertical Partitioning 12
2.4.2 Index-Only Plans 12
2.4.3 Materialized Views 13
2.5 Column Oriented Database 13
2.5.1 Compression 13
2.5.2 Late Materialization 14
2.5.3 Block Iteration 14
2.5.4 Invisible Joins 14
2.6 Query Execution in Row vs. Column Oriented Database 15
2.7 Compression 17
2.8 Conventional Compression 18
2.8.1 Domain Compression 19
2.8.2 Attribute Compression 20
2.9 Layout of Compressed Tuples 21
CHAPTER 3 Methodology 24-31
3.1 Introduction 24
3.2 Reasons for Data Compression 25
3.3 Compression Scheme 28
3.4 Query Execution 30
3.5 Decompression 30
3.6 Prerequisites 30
CHAPTER 4 Results & Discussions 32-44
4.1 Introduction 32
4.2 Anonymization 33
4.2.1 Problem Definition & Contribution 34
4.2.2 Quality Measure of Anonymization 36
4.2.3 Conclusion 36
4.3 Domain Compression through Binary Conversion 36
4.3.1 Encoding of Distinct Values 36
4.3.2 Paired Encoding 38
4.4 Add-ons on Compression 40
4.4.1 Functional Dependencies 40
4.4.2 Primary Keys 42
4.4.3 Few Distinct Values 42
4.5 Limitations 43
4.6 Conclusion 43
CHAPTER 5 Conclusion & Future Work 45-47
5.1 Conclusion 45
5.2 Future Work 46
APPENDIX I Infobright 48-62
References & Bibliography 63-67
List of Tables
TABLES TITLE PAGE
2.1 A typical Row-oriented Database 6
2.2 Table representing Column storing of data 10
3.1 Employee table with type and cardinality 28
3.2 Code Table Example 29
3.3 Query execution 30
4.1 Published Table 34
4.2 View of published table by Global recording 35
4.3 An instance of relation Student 37
4.4 Representing Stage 1 of compression technique 38
4.5 Representing Stage 1 with binary compression 38
4.6 Representing Stage 2 compression 39
4.7 Representing Stage 2 compression coupling 40
4.8 Representing functional dependency based coupling 41
4.9 Number of distinct values in each column 41
4.10 Representing test case 1 42
4.11 Representing test case 2 42
List of Figures & Graphs
FIGURE TITLE PAGE
Figure 2.1 OLTP Access 6
Figure 2.2 OLAP Access 7
Figure 2.3 Column based data storage 11
Figure 2.4 Layout of Compressed Tuple 23
Graph I.1 Representing Load time comparison 61
Graph I.2 Representing Table size comparison 61
Graph I.3 Representing query execution comparison 61
Notations
DBMS : Database Management System
RDBMS : Relational Database Management System
OLTP : Online Transactional Processing
SQL : Structured Query Language
ICE : Infobright Community Edition
IEE : Infobright Enterprise Edition
TB : Terabytes
Abstract
Warehouses contain a lot of data, and hence any leak or illegal publication of
information risks individuals' privacy. This research work proposes the
compression and abstraction of data using existing compression algorithms. Although
the technique is general and simple, it is my strong belief that it is particularly
advantageous for data warehousing. Through this study, we propose two algorithms.
The first algorithm describes the concept of compression of domains at the attribute
level, and we call it Attribute Domain Compression. This algorithm can be
implemented on both row and columnar databases. The idea behind the algorithm is to
reduce the size of large databases so as to store them optimally. The second algorithm
is also applicable to both kinds of databases but works best for columnar
databases. The idea behind the algorithm is to generalize the tuple domains by giving
each a value, say n, such that all other n-1 tuples, or at least as many as possible,
can be identified.
Chapter 1
Introduction
1.1 Introduction
Large operational data and information is stored by different vendors and
organizations in warehouses. Most of this data is useful only when it is shared and
analyzed together with other related data. However, this kind of data often contains
personal details which must be hidden from users with limited privileges. The data
can only be released when individuals are unidentifiable.

Moreover, business intelligence and analytical application queries are generally
based on the selection of particular attributes of a database. The simplicity and
performance characteristics of the columnar approach provide a cost-effective
implementation.
1.2 Objective
The main aim of the research is to propose a compression algorithm that is based on
the concepts of attribute domain compression. The data is recorded globally so that
the concept of data abstraction can be preserved.
We will use the concepts of two existing algorithms:

The first algorithm describes the concept of compression of domains at the
attribute level, and we call it Attribute Domain Compression. This algorithm can be
implemented on both row and columnar databases. The idea behind the algorithm is
to reduce the size of large databases so as to store them optimally.

The second algorithm is also applicable to both kinds of databases but works best
for columnar databases. The idea behind the algorithm is to generalize the tuple
domains by giving each a value, say n, such that all other n-1 tuples, or at least as
many as possible, can be identified.
1.3 Motivation

Data compression has been a very popular topic in the research literature, and there is
a large amount of work on this subject. The most obvious reason to consider
compression in a database context is to reduce the space required on disk.
However, the motivation behind this research is whether the processing time of
queries can be improved by reducing the amount of data that needs to be read from
disk using a compression technique.

Recently, there has been a revival of interest in employing compression techniques to
improve performance in databases, which also helped me choose this as my topic of
study. Data compression currently exists in the main database engines, with a
different approach adopted in each one of them.
1.4 Research Contribution

In order to evaluate the performance speedup obtained with compression, a subset
of the queries was executed with the following configurations:

1. No compression
2. Proposed compression
3. Categories compression and descriptions compression

We then study the two major compression algorithms present in row-oriented
databases, i.e., n-anonymization and domain encoding by binary compression.

Finally, the report studies two complex algorithms and combines them to form a final
optimal algorithm for domain compression. The report also presents examples that
were performed practically on a column-oriented platform named Infobright.
1.5 Dissertation Outline

This research work focuses on the development of a compression algorithm for
columnar databases using the tool Infobright. We start in Chapter 2 by documenting
the theories that are relevant for understanding columnar databases and how
compression is implemented on databases by various existing techniques. In Chapter
3, we study a compression technique and implement it by query execution over a
MySQL database. This work concludes Dissertation Part I. Chapter 4 discusses the
framework to facilitate the development of the algorithm for columnar databases and
introduces two concepts: global recording anonymization and binary encoded domain
compression. We conclude this chapter by developing a compression algorithm by
combining these two concepts. After successful implementation of the compression
algorithm, it is tested and the output is displayed graphically. Finally, Chapter 5
illustrates familiarity with the tool Infobright. Some basic queries and their
execution are demonstrated on an existing columnar database. Infobright is not just a
database but contains an inbuilt platform for compression algorithms that can be
implemented on a DB.
Chapter 2
Theories
2.1. Introduction

Most information systems available today are implemented using commercially
available database management system (DBMS) products. A DBMS is software that
manages the data stored in an information system, provides privacy and privileges to
users, facilitates concurrent access by multiple users, and provides recovery from
system failures without loss of system integrity. The relational database is the most
commonly used kind of DBMS; it organizes the data into different relations.

Each relational database is a collection of inter-related data which is organized in a
matrix with rows and columns. Each column represents an attribute of the particular
entity that is converted into the database table, while each row of the matrix,
generally called a tuple, represents one record: a set of related values, one for each
attribute. Every row in the table has the same structure.

For example, in a table that represents employees, each row would represent a single
employee. Columns might represent things like employee name, employee street
address, his SSN etc. In a table that represents the relationship of employees with
departments, each row would relate one employee with one department.
Table 2.1 A Typical Row oriented Database
Column 1 Column 2 Column 3
Row 1 Row1 & Column 1 Row1 & Column 2 Row1 & Column 3
Row 2 Row2 & Column 1 Row2 & Column 2 Row2 & Column 3
2.1.1 On-Line Transactional Processing

The popularity of RDBMSs is mainly due to their support of on-line transactional
processing (OLTP). Typical OLTP systems include student management systems,
bank databases, etc. The queries include operations such as inserting a new record
for a subject that is assigned to a student. These applications involve little or no
analysis of data and serve the use of an information system for data preservation and
querying. An OLTP query is of short duration and requires minimal database
resources. [3]

Figure 2.1 represents an OLTP process in which two queries, insert and lookup, are
executed on a student table.
Figure 2.1 OLTP Access
2.1.2 Query Intensive Applications

In the mid-1990s a new era of data management arose which was query specific and
involved large, complex data volumes. Examples of such query-specific DBMSs are
OLAP and data mining.

OLAP

This tool summarizes data from large data volumes and presents query results using
2-D or 3-D graphics to visualize the answer. An OLAP query looks like: "Give the %
comparison between the marks of all students in B. Tech and in M. Tech." The
answer to this query would generally be in the form of a graph or chart. Such 3-D
and 2-D visualizations of data are called data cubes.

Figure 2.2 represents the access pattern of OLAP, which requires only a few
attributes to be processed but access to a huge volume of data. It must be noted that
the number of queries executed per second in OLAP is much lower than in OLTP.
Figure 2.2 OLAP Access
Data Mining

Data mining is now a more demanding application of databases. It is also known as
"repeated OLAP". The objective of data mining is to locate subgroups, which
requires mean values or statistical analysis of the data to get a result. A typical
example of a data mining query is: "Find the dangerous drivers in a car insurance
customer database." It is left to the data mining tool to determine what the
characteristics of that dangerous customer group are [3]. This is done typically by
combining statistical analysis and automated search techniques similar to those of
artificial intelligence.
2.2. The Rise of Columnar Database

The roots of column-store DBMSs can be traced to the 1970s, when transposed files
were first studied, followed by investigations of vertical partitioning as a form of
table attribute clustering. By the mid-1980s, the advantages of a fully decomposed
storage model (DSM, a predecessor to column stores) over NSM (traditional
row-based storage) were documented. [4]
The relational databases present today are designed predominantly to handle online
transactional processing (OLTP) applications. A transaction (e.g., an online purchase
of a laptop through an internet dealer) typically maps to one or more rows in a
relational database, and all traditional RDBMS designs are based on a per-row
paradigm. For transaction-based systems, this architecture is well suited to handle
the input of incoming data.

Data warehouses are used in almost every large organization, and research states that
their size doubles every three years. Moreover, the hourly workload of these
warehouses is huge: approximately 20 lakh (2 million) SQL statements are
encountered hourly. [7]

Warehouses contain a lot of data, and hence any leak or illegal publication of
information risks individuals' privacy. However, for applications that are very read
intensive and selective in the information being requested, the OLTP database design
isn't a model that typically holds up well. [6] Business intelligence and analytical
application queries often analyze selected attributes in a database. The simplicity and
performance characteristics of the columnar approach provide a cost-effective
implementation.

A column-oriented database, generally known as a "columnar database", reinvents
how data is stored in databases. Storing data in such a fashion increases the
probability of storing adjacent records on disk and hence the odds of effective
compression. This architecture suggests a different model in which inserting and
deleting transactional data are done by a row-based system, but selective queries that
are only interested in a few columns of a table are handled by the columnar approach.
Different methodologies such as indexing, materialized views, horizontal
partitioning, etc. are provided by row-oriented databases, and these are rather better
ways of query execution, but they also have some disadvantages of their own. For
example, in business intelligence/analytic environments, the ad-hoc nature of the
workload makes it nearly impossible to predict which columns will need indexing,
so tables end up either being over-indexed (which causes load and maintenance
issues) or not properly indexed, and so many queries end up running much slower
than desired.
2.3. Definitions

"A column-oriented DBMS is a database management system (DBMS) that stores its
content by column rather than by row." — Wikipedia [23]

It must always be remembered that a columnar database is only an approach to how
data is stored in memory; it doesn't define any architectural implementation of the
database; rather, it follows the traditional database architecture.
Table 2.2 Table representing Column storing of data
SNO SNAME SSN CITY
S1 MEHUL 200 JAIPUR
S2 VIPIN 201 HINDON
S3 DEVENDRA 300 KEKRI
S4 ANITA 302 BHILWARA
The data would be stored on disk or in memory something like:

S1 S2 S3 S4 S5 | MEHUL VIPIN DEVENDRA ANITA PALWIN | 200 201 300 302 202 |
JAIPUR HINDON KEKRI AJMER GANGANAGAR

This is in contrast to a traditional row-based approach, in which the data looks more
like this:

S1 MEHUL 200 JAIPUR | S2 VIPIN 201 HINDON | S3 DEVENDRA 300 KEKRI |
S4 ANITA 302 AJMER | S5 PALWIN 202 GANGANAGAR
The above example also shows that a columnar database can be highly compressed;
moreover, it is self-indexed, and hence aggregate functions such as MIN, MAX,
AVG, and COUNT can be performed efficiently.
Figure 2.3 Column based data storage
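The contrast between the two physical layouts can be sketched in a few lines of Python. This is an illustration only: a real engine stores typed binary pages, not concatenated strings, but the ordering of values on disk is the point.

```python
# Sample relation from Table 2.2 (extended with the S5 row used in the text).
rows = [
    ("S1", "MEHUL",    "200", "JAIPUR"),
    ("S2", "VIPIN",    "201", "HINDON"),
    ("S3", "DEVENDRA", "300", "KEKRI"),
    ("S4", "ANITA",    "302", "AJMER"),
    ("S5", "PALWIN",   "202", "GANGANAGAR"),
]

def row_layout(rows):
    # Row store: all attributes of one record are adjacent on disk.
    return "".join("".join(r) for r in rows)

def column_layout(rows):
    # Column store: all values of one attribute are adjacent on disk.
    return "".join("".join(col) for col in zip(*rows))

print(row_layout(rows))     # S1MEHUL200JAIPURS2VIPIN201HINDON...
print(column_layout(rows))  # S1S2S3S4S5MEHULVIPINDEVENDRAANITA...
```

Note how the column layout places all SNO values together, then all names, and so on, which is exactly the serialization shown above.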
As is clear, the goal of a columnar database is to perform write and read operations
to and from hard disk storage efficiently in order to speed up the time it takes to
return a query. In the above example, all the column 1 values are physically together,
followed by all the column 2 values, etc. The data is stored in record order, so the
100th entry for column 1 and the 100th entry for column 2 belong to the same input
record [1]. This allows individual data elements, such as customer name for instance,
to be accessed in columns as a group, rather than individually row-by-row.
2.4. Row Oriented Execution
In this section, we discuss several different techniques that can be used to implement
a column-database design in a commercial row-oriented DBMS.
2.4.1 Vertical Partitioning

The most straightforward way to emulate a column-store approach in a row-store is
to fully vertically partition each relation. This approach creates one physical table for
each column in the logical schema, where the i-th table has two columns: one with
values from column i of the logical schema and one with the corresponding value of
the position column. Queries are then rewritten to perform joins on the position
attribute when fetching multiple columns from the same relation.
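The emulation above can be sketched with SQLite in-memory tables. The table and column names (`student_sname`, `student_city`, `pos`) are invented for illustration; they are not taken from the thesis.

```python
import sqlite3

# One two-column table (position, value) per logical column of Student.
con = sqlite3.connect(":memory:")
cur = con.cursor()
for col in ("sname", "city"):
    cur.execute(f"CREATE TABLE student_{col} (pos INTEGER, value TEXT)")

data = [("MEHUL", "JAIPUR"), ("VIPIN", "HINDON"), ("DEVENDRA", "KEKRI")]
for pos, (sname, city) in enumerate(data):
    cur.execute("INSERT INTO student_sname VALUES (?, ?)", (pos, sname))
    cur.execute("INSERT INTO student_city VALUES (?, ?)", (pos, city))

# A query touching two columns is rewritten as a join on the position attribute.
result = cur.execute(
    """SELECT n.value, c.value
       FROM student_sname n JOIN student_city c ON n.pos = c.pos
       WHERE c.value = 'JAIPUR'"""
).fetchall()
print(result)  # [('MEHUL', 'JAIPUR')]
```

This also makes the two drawbacks discussed next concrete: the `pos` column is stored once per partition, and each stored row still carries the row-store's per-tuple overhead.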
2.4.2 Index-Only Plans

The vertical partitioning approach has two problems. Firstly, it requires the position
attribute to be stored in every column, which wastes space and disk bandwidth, and
secondly, most row-stores store a relatively large header on every tuple, which
further wastes space. [7] Therefore, to remove these problems we use another
approach, called
index-only plans. In this approach, the base relations are stored using a standard,
row-oriented design, but an additional unclustered B+Tree index is added on every
column of every table.
2.4.3 Materialized Views

The third approach we consider uses materialized views. In this approach, we create
an optimal set of materialized views for every query flight in the workload, where
the optimal view for a given flight has only the columns needed to answer queries in
that flight. We do not pre-join columns from different tables in these views.
2.5 Column Oriented Execution

In this section, we review four common optimizations used to improve performance
in column-oriented database systems.
2.5.1 Compression

Compressing data using column-oriented compression algorithms, and keeping the
data in this compressed format as it is operated upon, has been shown to improve
query performance by up to an order of magnitude. Storing data in columns allows
all of the names to be stored together, all of the phone numbers together, etc.
Certainly phone numbers are more similar to each other than to surrounding fields
like e-mail addresses or names. Further, if the data is sorted by one of the columns,
that column will be super-compressible.
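To illustrate why a sorted column is super-compressible, consider a simple run-length encoder. This is a generic textbook technique that column stores commonly exploit, not the compression scheme proposed later in this thesis.

```python
def rle_encode(column):
    """Run-length encode a column into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

# Sorting groups equal values into long runs, so the encoding shrinks sharply.
city = sorted(["JAIPUR", "KEKRI", "JAIPUR", "JAIPUR", "KEKRI", "HINDON"])
print(rle_encode(city))  # [['HINDON', 1], ['JAIPUR', 3], ['KEKRI', 2]]
```

Six stored values collapse to three runs; on a real warehouse column with millions of rows and few distinct values, the saving is far larger.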
2.5.2 Late Materialization

In a column-store, information about a logical entity (e.g., a person) is stored in
multiple locations on disk (e.g., name, e-mail address, phone number, etc. are all
stored in separate columns), whereas in a row-store such information is usually
co-located in a single row of a table. [7]

At some point in most query plans, data from multiple columns must be combined
into rows of information about an entity. Consequently, this join-like materialization
of tuples (also called tuple construction) is an extremely common operation in a
column store.
2.5.3 Block Iteration
In order to process a series of tuples, row-stores first iterate through each tuple, and
then need to extract the needed attributes from these tuples through a tuple
representation interface.
In contrast to row-stores, in all column-stores, blocks of values from the same column
are sent to an operator in a single function call. Further, no attribute extraction is
needed, and if the column is fixed-width, these values can be iterated through directly
as an array. Operating on data as an array not only minimizes per-tuple overhead, but
it also exploits potential for parallelism on modern CPUs, as loop-pipelining
techniques can be used. [2-5]
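The difference between tuple-at-a-time and block-at-a-time processing can be sketched as follows. This is illustrative Python with invented data; real engines operate on typed binary arrays rather than lists of dictionaries.

```python
# Tuple-at-a-time: each row passes through an attribute-extraction interface.
rows = [{"sno": "S1", "marks": 70},
        {"sno": "S2", "marks": 80},
        {"sno": "S3", "marks": 90}]
total = 0
for row in rows:                 # one operator call per tuple
    total += row["marks"]        # attribute extraction on every tuple

# Block-at-a-time: the fixed-width column arrives as one array and is
# iterated directly, with no per-tuple extraction.
marks_column = [70, 80, 90]
total_block = sum(marks_column)  # one operator call over the whole block

print(total, total_block)        # 240 240
```

Both loops compute the same aggregate, but the block form pays the per-call overhead once per column block instead of once per tuple, which is what enables loop pipelining on modern CPUs.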
2.5.4 Invisible Joins

Queries over data warehouses often have the following structure:
- Restrict the set of tuples in the fact table using selection predicates on one (or
many) dimension tables.
- Then, perform some aggregation on the restricted fact table, often grouping by
other dimension table attributes.

Thus, joins between the fact table and dimension tables need to be performed for
each selection predicate and for each aggregate grouping.
As an alternative to these query plans, we introduce a technique we call the invisible
join that can be used in column-oriented databases for foreign-key/primary-key joins.
It works by rewriting joins into predicates on the foreign key columns in the fact
table. These predicates can be evaluated either by using a hash lookup (in which case
a hash join is simulated), or by using more advanced methods which are beyond the
scope of our study. [1]
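A minimal sketch of the hash-lookup variant described above, using an invented student fact/dimension schema (all table and column names here are hypothetical, chosen only for illustration):

```python
# Dimension table with the selection predicate applied: students in Jaipur.
dim_student = [("S1", "JAIPUR"), ("S2", "HINDON"), ("S3", "JAIPUR")]
qualifying_keys = {sno for sno, city in dim_student if city == "JAIPUR"}

# Fact table columns, stored separately as in a column store.
fact_student_fk = ["S1", "S2", "S3", "S1"]
fact_marks      = [70,   80,   90,   60]

# The join is rewritten as a hash-lookup predicate on the foreign-key
# column; the qualifying positions then drive the aggregation.
positions = [i for i, fk in enumerate(fact_student_fk)
             if fk in qualifying_keys]
total = sum(fact_marks[i] for i in positions)
print(total)  # 220
```

No tuple of the dimension table is ever materialized into the fact-table scan; the join has become an ordinary column predicate, which is why it is called "invisible".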
2.6. Query Execution in Row vs. Column Oriented Database

When talking about the performance of databases, query execution is the most
important factor, one which can by itself determine the performance of the database,
whether it is row based or column based. We can understand the concept through a
simple example.

Suppose there are 1000 rows in a database table and the following query is executed
over it.
Until no more {
    Get a row out of the buffer manager
    Evaluate the row
    Pass it onward if it satisfies the predicate
}
Notice that the inner loop of the executor is called 1000 times for our query above,
once per row. Since the overhead of the inner loop largely determines performance, a
row-store executor will take CPU time proportional to the number of rows required
to evaluate the query.
In contrast, in a column store executor the inner loop is:
Until no more {
    Pick up a column
    Evaluate the column
    Pass on a row range
}
Notice that the inner loop is called once per column, not once per row. Also, notice
that the algorithmic complexity of processing a row is about the same as processing a
column. [17]

Hence, the column store will consume vastly less CPU, because its inner loop is
executed once per column, and there are far fewer columns than rows involved in
evaluating a typical query.
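The two inner loops above can be contrasted concretely. This is a toy sketch with invented data; the counters merely make the number of inner-loop iterations visible.

```python
# A 4-row table with two columns, stored column-wise.
table = {"marks": [55, 72, 90, 64], "city": ["JAIPUR"] * 4}
n_rows = 4

# Row-store executor: the inner loop body runs once per row.
calls_row = 0
for i in range(n_rows):                  # one iteration per row
    calls_row += 1
    row = (table["marks"][i], table["city"][i])
    keep = row[0] > 60                   # evaluate the predicate on the row

# Column-store executor: the inner loop body runs once per referenced column.
calls_col = 0
row_range = list(range(n_rows))
for col in ("marks",):                   # one iteration per column
    calls_col += 1
    row_range = [i for i in row_range if table[col][i] > 60]

print(calls_row, calls_col, row_range)
```

For this query the row executor's loop body runs 4 times and the column executor's runs once, yet both identify the same qualifying rows; scale 4 up to millions of rows and the difference in loop overhead dominates.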
2.7. Compression

Data compression in databases has always been a very popular and interesting topic
for database researchers, and there is a lot of work in this area. The most obvious
reason for compression in any context is to reduce the space required on disk, and so
it is in databases. However, another important goal is to improve the processing time
of queries by reducing the amount of data that needs to be read from disk.

Long after the evolution of databases, there has been a revival in the field of
compression to improve the quality and performance of databases. Data compression
currently exists in the main database engines, with a different approach adopted in
each one of them. It is generally accepted that, due to the greater similarity and
redundancy of data within columns, column stores provide superior compression,
and therefore require less storage hardware and perform faster because, among other
things, they read less data from the disk [17]. Moreover, the compression ratio is
higher in a columnar database because the entries within a column are similar to
each other.
Both Huffman encoding and arithmetic encoding are based on the statistical
distribution of the frequencies of symbols appearing in the data. Huffman coding
assigns a shorter compression code to a frequent symbol and a longer compression
code to an infrequent symbol. For example, if there are four symbols a, b, c, and d,
with probabilities 13/16, 1/16, 1/16, and 1/16, then 2 bits are needed to represent
each symbol without compression.
A possible Huffman coding is the following:
a = 0, b = 10, c = 110, d = 111.
As a result, the average length of a compressed symbol equals:
1 × 13/16 + 2 × 1/16 + 3 × 1/16 + 3 × 1/16 = 21/16 ≈ 1.31 bits.
Arithmetic encoding is similar to Huffman encoding except that it assigns an interval
to the whole input string based on the statistical distribution [7].
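To make the Huffman construction concrete, the following is a minimal Python sketch (the function name and the heap-of-dictionaries representation are ours, not from the cited literature). Note that a Huffman tree is not unique, so the exact codes produced may differ from the ones listed above, while the average code length is the same:

```python
import heapq

def huffman_code(freqs):
    """Build a prefix-free Huffman code for {symbol: probability}."""
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The tie_breaker keeps heapq from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # lightest subtree
        w2, _, c2 = heapq.heappop(heap)   # second lightest
        # Prefix the lighter subtree's codes with 0, the heavier with 1
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

probs = {"a": 13/16, "b": 1/16, "c": 1/16, "d": 1/16}
codes = huffman_code(probs)
avg = sum(probs[s] * len(codes[s]) for s in probs)
# avg is 21/16 = 1.3125 bits per symbol, versus 2 bits uncompressed
```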
2.8 Conventional Compression
Database compression techniques are applied to gain performance by decreasing
database size and increasing input/output and query throughput. The basic idea behind
compression is that it limits the storage used and keeps related data adjacent to each
other, thereby reducing both the size of the data and the number of disk transfers.
This section presents two classes of database compression:
a. Domain Compression
b. Attribute Compression
Both classes are equally applicable to column-oriented and row-oriented databases.
Queries executed over compressed data are often more efficient than queries executed
over a decompressed database [8]. The sections below discuss each class in detail.
2.8.1 Domain Compression
Under this class we discuss three compression techniques: numeric compression in
the presence of NULL values, string compression, and dictionary-based compression.
All three techniques operate on the domain of the attributes' values.
Numeric Compression in the presence of NULL values
This technique compresses attributes of numeric type, such as integers, whose
domains contain NULL values. The basic idea is that consecutive zeros or blanks in a
tuple are removed, and a description of how many there were and where they occurred
is appended at the end [13]. To eliminate the difference in attribute size caused by
NULL values, it is sometimes recommended to encode the data bitwise, e.g. a 4-byte
integer is replaced by 4 bits.
For example:
Bit value for 1 = 0001
Bit value for 2 = 0011
Bit value for 3 = 0111
Bit value for 4 = 1111
and all 0s for the value 0.
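The idea of removing zeros/NULLs and recording where they were can be sketched as follows (a minimal null-suppression sketch; the function names are ours, and for simplicity it restores both NULL and 0 as 0):

```python
def suppress_nulls(values):
    """Keep only the present values, plus a bitmap recording which
    positions held a zero/NULL (the 'description' mentioned above)."""
    bitmap = [0 if v in (None, 0) else 1 for v in values]
    packed = [v for v, bit in zip(values, bitmap) if bit]
    return bitmap, packed

def restore(bitmap, packed):
    """Rebuild the original row from the bitmap and packed values."""
    it = iter(packed)
    return [next(it) if bit else 0 for bit in bitmap]

row = [7, 0, None, 3, 0, 12]
bitmap, packed = suppress_nulls(row)
assert packed == [7, 3, 12]                       # only 3 values stored
assert restore(bitmap, packed) == [7, 0, 0, 3, 0, 12]
```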
String Compression
Strings in databases are represented by the char data type, and a form of string
compression is already provided in SQL through the varchar data type. This technique
extends conventional string compression: after converting the char type to varchar,
the value is further compressed in a second stage by a general-purpose compression
algorithm such as Huffman coding or the LZW algorithm [24].
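The two-stage idea can be sketched as below. This is only an illustration: zlib's DEFLATE stands in for the Huffman/LZW stage, and for very short strings the stage-2 overhead can exceed the gain, so real systems compress many values together or use shared dictionaries:

```python
import zlib

def compress_char(value: str) -> bytes:
    # Stage 1: CHAR -> VARCHAR, i.e. drop the trailing blank padding
    trimmed = value.rstrip(" ")
    # Stage 2: general-purpose compression (zlib stands in for
    # Huffman coding or LZW)
    return zlib.compress(trimmed.encode("utf-8"))

def decompress_char(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")

padded = "JAIPUR" + " " * 14        # a CHAR(20) value
blob = compress_char(padded)
assert decompress_char(blob) == "JAIPUR"   # round-trips without padding
```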
Dictionary Encoding
This encoding technique uses a special data structure called a dictionary. It is most
effective when a database column takes a limited set of values that repeat many times
[14]. The dictionary encoding algorithm first calculates the number of bits, X, needed
to encode a single value of the column (which can be computed directly from the
number of unique values of the attribute). It then calculates how many of these X-bit
encoded values fit in 1, 2, 3, or 4 bytes. For example, if an attribute has 32 distinct
values, each can be encoded in 5 bits, so 1 of these values fits in 1 byte, 3 in 2 bytes,
4 in 3 bytes, and 6 in 4 bytes.
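The width calculation above can be sketched in a few lines (function names are ours); it reproduces the 32-value example:

```python
from math import ceil, log2

def code_width(num_distinct: int) -> int:
    """X: bits needed for one dictionary code."""
    return max(1, ceil(log2(num_distinct)))

def codes_per_n_bytes(x: int):
    """How many X-bit codes fit in 1, 2, 3 and 4 bytes."""
    return [(8 * n) // x for n in (1, 2, 3, 4)]

x = code_width(32)                            # 32 distinct values
assert x == 5                                 # -> 5-bit codes
assert codes_per_n_bytes(x) == [1, 3, 4, 6]   # as in the example above
```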
2.8.2 Attribute Compression
Most compression techniques are designed especially for data warehouses, which
store huge amounts of data usually composed of a large number of textual attributes
with low cardinality. In this section, however, we present techniques that can also be
used in conventional databases such as MySQL and SQL Server [5].
The main objective of this technique is to use encoding to reduce the space occupied
by dimension tables with many rows, reducing the total space occupied and leading
to consequent gains in performance.
Under this class we discuss two compression techniques: compression of categories
and compression of comments.
Compression of Categories
Categories are textual attributes with low cardinality. Examples of category attributes
are city, country, type of product, etc.
Category coding is done through the following steps:
1. The data in the attribute is analysed and a frequency histogram is built.
2. The table of codes is built from the frequency histogram: the most frequent values
are encoded with a one-byte code; the least frequent values are encoded with a
two-byte code. In principle, two bytes are enough, but a third byte could be used if
needed.
3. The codes table and the necessary metadata are written to the database.
4. The attribute is updated, replacing the original values by the corresponding codes
(the compressed values).
2.9 Layout of Compressed Tuples
Figure 2.4 shows the overall layout of a compressed tuple [7]. As the figure shows, a
tuple can be composed of up to five parts:
1. The first part of a tuple keeps the (compressed) values of all fields that are
compressed using dictionary-based compression or any other fixed-length
compression technique [5-7].
2. The second part keeps the encoded length information of all fields compressed
using a variable-length compression technique, such as the numerical
compression techniques described above.
3. The third part contains the values of uncompressed fields of fixed length,
e.g. integers, doubles, and CHARs, but not VARCHARs or CHARs that were
turned into VARCHARs as a result of compression.
4. The fourth part contains the compressed values of fields that were compressed
using a variable-length compression technique; for example, compressed
integers, doubles, or dates. The fourth part would also contain the compressed
value of the size of a VARCHAR field if this value was chosen to be
compressed. (If the size information of a VARCHAR field is not compressed,
then it is stored in the third part of a tuple as a fixed-length, uncompressed
integer value.)
5. The fifth part of a tuple, finally, contains the string values (compressed or not
compressed) of VARCHAR fields.
While all this sounds quite complicated, the separation into five parts is very
natural. First of all, it makes sense to separate the fixed-size and variable-size parts of
tuples, and this separation is standard in most database systems today. The first three
parts of a tuple are fixed-size, which means that they have the same size for every
tuple of a table. As a result, compression information and/or the value of a field can
be retrieved directly from these parts without further address calculations [24]. In
particular, uncompressed integer, double, and date fields can be accessed directly
regardless of whether other fields are compressed [5]. Furthermore, it makes
sense to pack all the length codes of compressed fields together, because our fast
decoding algorithm exploits this bundling, as we will see soon.
Figure 2.4 Layout of Compressed Tuple
Finally, we separate small variable-length (compressed) fields from potentially large
variable-length string fields, because the length information of small fields can be
encoded in less than a byte, whereas the length information of large fields is encoded
in a two-step process. Obviously, not every tuple of the database consists of all five
parts [5]. For example, tuples that have no compressed fields consist only of the third
and, possibly, the fifth part. Furthermore, keep in mind that all tuples of the same table
have the same layout and consist of the same number of parts, because all the tuples of
a table are compressed using the same techniques.
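The five-part layout described above can be summarized as a simple structure (an illustrative sketch only; field names are ours, and a real system would store raw byte offsets rather than Python objects):

```python
from dataclasses import dataclass

@dataclass
class CompressedTuple:
    # Fixed-size parts (identical size for every tuple of a table):
    dict_codes: bytes    # 1. fixed-length dictionary/fixed-width codes
    length_codes: bytes  # 2. encoded lengths of variable-length fields
    fixed_fields: bytes  # 3. uncompressed fixed-length values (ints, dates, ...)
    # Variable-size parts:
    var_fields: bytes    # 4. variable-length compressed values
    strings: bytes       # 5. (possibly compressed) VARCHAR payloads

    def fixed_prefix_len(self) -> int:
        # Fields in parts 1-3 can be located without per-tuple
        # address calculations, since this prefix has constant size
        return len(self.dict_codes) + len(self.length_codes) + len(self.fixed_fields)

t = CompressedTuple(b"\x06", b"\x02", b"\x00\x00\x00\x2a", b"", b"JAIPUR")
assert t.fixed_prefix_len() == 6
```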
Chapter 3
Methodology
3.1 Introduction
When the compression techniques discussed in Chapter 2 are applied, queries are
executed on a platform where query rewriting and data decompression are performed
only when necessary. Although the changes to query execution are small, the result is
considerably better than executing the same queries over uncompressed data on the
same platform. This chapter demonstrates the different compression methods applied
to the tables and then compares the results graphically as well as in tabular form.
It must be noted that only queries with a WHERE clause on compressed attributes
need to be rewritten, because plain selection and projection operations do not require
searching for a particular tuple of a particular attribute.
Although data storage capacity has grown dramatically, disk access speed has not
kept pace. Meanwhile, the speed of RAM and CPUs has improved substantially. This
technological trend has led to the use of data compression, trading some execution
overhead (to compress and decompress data) for a reduction in the space occupied
by data.
Compression techniques work both statically and dynamically, i.e. data may be
compressed when it is written to disk or when queries are executed. In databases, and
particularly in warehouses, the reduction in data size obtained by compression
normally yields a gain in speed, as the extra execution time (to compress and
decompress the data) is compensated by the reduction in the amount of data that has
to be read from or stored on disk [1].
3.2 Reasons for Data Compression
Data compression in data warehouses is particularly interesting for two main reasons:
1) The quantity of data in a warehouse is huge, so compression is more suitable
there than in ordinary databases.
2) Data warehouses are used for querying only (i.e., only read accesses, as data
warehouse updates are done offline).
This means that the compression overhead is largely irrelevant. Furthermore, if data is
compressed using techniques that allow searching over the compressed data, the
gains in performance can be quite significant, as decompression is done only when
strictly necessary.
In spite of the potential advantages of compression in databases, most commercial
relational database management systems (DBMSs) either provide no compression or
provide data compression only at the physical layer (i.e., database blocks), which is
not flexible enough to be a real advantage. Flexibility in database compression is
essential, as data that can be advantageously compressed is frequently mixed in the
same table with data whose compression is not particularly helpful. Nonetheless,
recent work on attribute-level compression methods has shown that compression can
improve the performance of database systems in read-intensive environments such as
data warehouses [18].
Data compression and data coding techniques transform a given set of data into a new
set containing the same information but occupying less space than the original data
(ideally, the minimum space possible). Data compression is heavily used in data
transmission and data storage: reducing the amount of data to be transmitted (or
stored) is equivalent to increasing the bandwidth of the transmission channel (or the
size of the storage device).
The first data compression proposals appeared in the 1940s and 1950s, notably
D. Huffman's coding scheme, and these early proposals have evolved dramatically
since then [7]. The main emphasis of previous work has been on the compression of
numerical attributes, where coding techniques reduce the length of integers,
floating-point numbers, and dates. However, string attributes (i.e., attributes of type
CHAR(n) or VARCHAR(n) in SQL) often comprise a large portion of database
records and thus have a significant impact on query performance.
The compression of data in databases offers two main advantages:
1. less space occupied by data, and
2. potentially better query response time.
While the storage benefit is easily understandable, the performance gain is less
obvious. It comes from the fact that less data has to be read from storage, which is
clearly the most time-consuming operation during query processing. The most
interesting use of data compression and coding techniques in databases is surely in
data warehouses, given the huge amount of data normally involved and their clear
orientation towards query processing. Since in a data warehouse all insertions and
updates are done during the update window, when the warehouse is not available to
users, offline compression algorithms are more adequate, as the gain in query
response time usually compensates for the extra cost of encoding the data before it is
loaded into the warehouse. In fact, offline compression algorithms optimize
decompression time, which normally implies higher costs in the compression process.
The technique presented in this report follows these ideas, as it takes advantage of the
specific features of data warehouses to optimize the use of traditional text
compression techniques.
In addition to the observations regarding when to use each of the various compression
schemes, our results also illustrate the following important points:
- Physical database design should be aware of the compression subsystem.
Performance is improved by compression schemes that take advantage of data
locality. Queries on columns in projections with secondary and tertiary sort
orders perform well, and it is generally beneficial to have low-cardinality
columns serve as the leftmost sort orders in the projection (to increase the
average run lengths of columns to the right). The more order and locality in a
column, the better. It is a good idea to operate directly on compressed data.
- The optimizer needs to be aware, in its cost models, of the performance
implications of operating directly on compressed data. Further, cost models that
take only I/O costs into account will likely perform poorly in the context of
column-oriented systems, since CPU cost is often the dominant factor.
3.3 Compression Scheme
Compression is done through the following steps:
1. The attributes are analyzed and a frequency histogram is built.
2. The table of codes is built from the frequency histogram: the most frequent values
are encoded with a one-byte code; the least frequent values are encoded with a
two-byte code. In principle, two bytes are enough, but a third byte could be used if
needed [5].
3. The codes table and the necessary metadata are written to the database.
4. The attribute is updated, replacing the original values by the corresponding codes
(the compressed values).
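The steps above can be sketched as follows (a minimal sketch of steps 1, 2 and 4; the function names are ours, and writing the codes table and metadata to the database, step 3, is omitted):

```python
from collections import Counter

def build_code_table(values):
    """Steps 1-2: build a frequency histogram, then assign 1-byte codes
    to the 256 most frequent values and 2-byte codes to the rest."""
    ranked = [v for v, _ in Counter(values).most_common()]
    table = {}
    for rank, value in enumerate(ranked):
        if rank < 256:
            table[value] = bytes([rank])          # one-byte code
        else:
            hi, lo = divmod(rank - 256, 256)
            table[value] = bytes([hi, lo])        # two-byte code
    return table

def compress_column(values, table):
    """Step 4: replace the original values by their codes."""
    return [table[v] for v in values]

cities = ["DELHI"] * 3 + ["MUMBAI"] * 2 + ["JAIPUR"]
table = build_code_table(cities)
assert len(table["DELHI"]) == 1      # most frequent value -> shortest code
```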
The example of an employee table below illustrates the compression technique:
Table 3.1 Employee table with type and cardinality
Attribute name Attribute Type Cardinality
SSN TEXT 1000000
EMP_NAME VARCHAR(20) 500
EMP_ADD TEXT 200
EMP_SEX CHAR 2
EMP_SAL INTEGER 5000
EMP_DOB DATE 50
EMP_CITY TEXT 95000
EMP_REMARKS TEXT 600
Table 3.1 presents an example of typical attributes of an employee dimension in a data
warehouse, which may be a large dimension in many businesses (e.g., e-business).
Several attributes are candidates for coding, such as EMP_NAME, EMP_ADD,
EMP_SEX, EMP_SAL, EMP_DOB, EMP_CITY, and EMP_REMARKS.
Table 3.2 Code Table Example
City name | City Postal Code | Code
DELHI | 011 | 00000010
MUMBAI | 022 | 00000100
KOLKATA | 033 | 00000110
CHENNAI | 044 | 00001000
BANGALORE | 080 | 00001000 00001000
JAIPUR | 0141 | 00000110 00000110
COIMBATORE | 0422 | 00001000 00001000 00001000
COCHIN | 0484 | 00010000 00010000 00010000
Assuming that we want to code the EMP_CITY attribute, a possible resulting codes
table is shown in Table 3.2. The codes are represented in binary to better convey the
idea. As the attribute has more than 256 distinct values, we have one-byte codes to
represent the 256 most frequent values (e.g. Delhi and Mumbai) and two-byte codes
to represent the least frequent values (e.g. Jaipur and Bangalore). The values shown in
Table 3.2 (represented in binary) are the ones stored in the database, instead of the
larger original values. For example, instead of storing Jaipur, which occupies 6 ASCII
characters, we store just two bytes with the binary code 00000110 00000110.
3.4 Query Execution
Query rewriting is necessary for queries in which coded attributes are used in the
WHERE clause for filtering. In these queries, the values used to filter the result must
be replaced by the corresponding coded values. Below is a simple example of the type
of query rewriting needed: the value JAIPUR is replaced by its code, fetched from the
codes table shown in Table 3.2.
Table 3.3 Query execution
Original Query:
Select EMP_NAME
From EMPLOYEE
Where EMP_CITY = 'JAIPUR'
Modified Query:
Select EMP_NAME
From EMPLOYEE
Where EMP_CITY = '00000110 00000110'
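The rewriting step can be sketched as a textual substitution over the SQL (an illustrative sketch only; a real engine would rewrite the query plan rather than the query string, and the function name is ours):

```python
def rewrite_where(query: str, column: str, code_table: dict) -> str:
    """Replace a string literal filtered on a coded column by its
    stored code, rendered here as a hex blob literal."""
    for value, code in code_table.items():
        literal = f"{column} = '{value}'"
        if literal in query:
            return query.replace(literal, f"{column} = x'{code.hex()}'")
    return query

codes = {"JAIPUR": bytes([0b00000110, 0b00000110])}   # the two-byte code
sql = "SELECT EMP_NAME FROM EMPLOYEE WHERE EMP_CITY = 'JAIPUR'"
rewritten = rewrite_where(sql, "EMP_CITY", codes)
# the filter now compares against the stored code instead of 'JAIPUR'
```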
3.5 Decompression
Attributes are decompressed only when coded attributes appear in the query select
list. In these cases the query is executed, and afterwards the result set is processed to
decompress the attributes containing compressed values. As typical data warehousing
queries return small result sets, decompression represents a very small fraction of the
total query execution time.
3.6 Prerequisites
The goal of the experiments performed is to measure the gains in storage and
performance obtained with the proposed technique.
The experiments were divided into two phases. In the first phase, only category
compression was used. In the second phase, we used category compression in
conjunction with description compression.
Chapter 4
Results & Discussions
4.1 Introduction
In recent decades, improvements in CPU speed have outpaced improvements in disk
access rates by orders of magnitude, inspiring new data compression techniques in
database systems that trade reduced disk I/O against additional CPU overhead for
compressing and decompressing data.
Building on the compression technique of Chapter 3, I propose a compression
algorithm that integrates domain and attribute compression, based on dictionary-based
anonymization and implementing global recoding generalization.
In this chapter, I demonstrate how to compress data so as to achieve better performance
than conventional database systems. We address the following two issues.
First, we implement a newly proposed N-anonymization technique embedded with
global recoding generalization. After evaluating it, the report presents the algorithm
for data compression and finally demonstrates that our approach gives results
comparable to existing algorithms.
Second, we use binary-encoded pairing of attributes for data compression, building on
the string compression discussed in the previous chapter and modifying it so that it
intelligently selects the most effective compression method for string-valued
attributes.
Moreover, we apply the concepts of data hiding and equivalent sets before
compressing the data, so that users' private information is not revealed publicly.
4.2 Anonymization
Warehouses contain a great deal of data, so any leak or illegal publication of
information risks individuals' privacy. N-anonymity is a major technique for
de-identifying a data set. The idea behind the technique is to choose a value n such
that every tuple is identical, on the identifying attributes, to at least n-1 other tuples,
or at least to as many tuples as possible.
The strength of protection increases with n. One way to produce n identical tuples
within the identifiable attributes is to generalize values within the attributes, for
example removing city and street information from an address attribute [6].
There are many ways to de-identify data, and one of the most appropriate approaches
is generalization. Generalization techniques include global recoding generalization,
multidimensional recoding generalization, and local recoding generalization [15].
Global recoding generalization maps the current domain of an attribute to a more
general domain. For example, ages are mapped from years to 10-year intervals.
Multidimensional recoding generalization maps a set of values to another set of
values, some or all of which are more general than the corresponding pre-mapping
values. For example, {male, 32, divorced} is mapped to {male, [30, 40), unknown}.
Local recoding generalization modifies some values in one or more attributes to
values in more general domains [6].
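Global recoding of an age attribute into 10-year intervals, as mentioned above, can be sketched as follows (the function name is ours):

```python
def recode_age(age: int, width: int = 10) -> str:
    """Globally recode an exact age into its fixed 10-year interval."""
    lo = (age // width) * width
    return f"[{lo}, {lo + width})"

# Global recoding: every occurrence of the same value in the column
# is replaced by the same generalized interval.
ages = [32, 38, 45, 32]
recoded = [recode_age(a) for a in ages]
assert recoded == ["[30, 40)", "[30, 40)", "[40, 50)", "[30, 40)"]
```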
4.2.1 Problem Definition and Contribution
As stated from the outset, our objective is to make every tuple of a published table
identical to at least n-1 other tuples. Identity-related attributes are those that can
potentially identify individuals in a table. For example, the record of an old male in a
rural area with the postcode 302033 is unique in Table 4.1, and hence his problem of
asthma may be revealed if the table is published. To preserve his privacy, we may
generalize the Gender and Postcode attribute values so that each tuple in the attribute
set {Gender, Age, Postcode} has at least two occurrences.
Table 4.1 Published Table
No. Gender Age Postcode Problem
01 Male Young 302020 Heart
02 Male Old 302033 Asthma
03 Female Young 302015 Obesity
04 Female Young 302015 Obesity
A view after this generalization is given in Table 4.2. Since different countries use
different postcode schemes, we adopt a simplified scheme in which the hierarchy
{302033, 3020*, 30**, 3***, *} corresponds to {rural, city, region, state, unknown},
respectively.
Table 4.2 View of published table by Global recording
No. Gender Age Postcode Problem
01 * Young 3020* Heart
02 * Old 3020* Asthma
03 * Young 3020* Obesity
04 * Young 3020* Obesity
Identifier attribute set: A set of attributes that potentially identifies the individuals in
a table. For example, the attribute set {Gender, Age, Postcode} in Table 4.1 is an
identifier attribute set.
Equivalent set: An equivalent set of a table with respect to an attribute set is a set of
all tuples in the table containing identical values for that attribute set. Tuples 03 and
04 of Table 4.1 form an equivalent set with respect to the attributes {Gender, Age,
Postcode, Problem}. Table 4.2 is a 2-anonymity view of Table 4.1 to the extent that
each combination of identifier-attribute values occurs at least twice in the published
table.
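Checking the n-anonymity condition amounts to counting the occurrences of each identifier-value combination (a minimal sketch with hypothetical generalized rows; the function name is ours):

```python
from collections import Counter

def is_n_anonymous(rows, identifier_attrs, n):
    """True iff every identifier-value combination occurs >= n times."""
    counts = Counter(tuple(r[a] for a in identifier_attrs) for r in rows)
    return min(counts.values()) >= n

view = [  # hypothetical generalized rows
    {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
    {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
    {"Gender": "*", "Age": "Old",   "Postcode": "3020*"},
    {"Gender": "*", "Age": "Old",   "Postcode": "3020*"},
]
assert is_n_anonymous(view, ["Gender", "Age", "Postcode"], 2)
```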
4.2.2 Quality Measure of Anonymization
From the discussion above we can conclude that the larger the equivalent sets, the
easier the compression, and that the cost of anonymization is a function of
equivalent-set size. On this basis, the quality of anonymization can be measured by
the normalized average equivalent-set size:
C_AVG = (total records / number of equivalent sets) / n    (4.1)
4.2.3 Conclusion
Another name for global recoding is domain generalization, because generalization
happens at the domain level: a specific domain is replaced by a more general domain.
There are no mixed values from different domains in a table generalized by global
recoding. When an attribute value is generalized, every occurrence of that value is
replaced by the new generalized value. A global recoding method may, however,
over-generalize a table. An example of global recoding is given in Table 4.2, where
the two attributes Gender and Postcode are generalized. All gender information has
been lost, yet it was not necessary to generalize the Gender and Postcode attributes as
a whole; we therefore say that the global recoding method over-generalizes this table.
4.3 Domain Compression through Binary Conversion
We integrate two key methods, namely binary encoding of distinct values and
pairwise encoding of attributes, to build our compression technique.
4.3.1 Encoding of Distinct Values
This compression technique is based on the assumption that the published table
contains few distinct values per attribute domain, and that these values repeat over the
huge number of tuples present in the database. Therefore, binary-encoding the distinct
values of each attribute, and then representing the tuple values in each column of the
relation by the corresponding encoded values, transforms the entire relation into bits
and thus compresses it [16].
We determine the number of distinct values in each column and encode the data into
bits accordingly. For example, consider the instance below, which shows the two
main attributes of a relation Patients.
Table 4.3 An instance of the relation Patients
Age Problem
10 Cough & Cold
20 Cough & Cold
30 Obesity
50 Diabetes
70 Asthma
Now, if we adopt the concept of N-anonymization with global recoding (see 4.2), we
can map the current domain of the attributes to a more general domain. For example,
Age can be mapped into 10-year intervals, as shown in Table 4.4.
To examine the compression benefit of this method, assume that Age is of integer
type. If there are 50 patients, the total storage required by the Age attribute is
50 × sizeof(int) = 50 × 4 = 200 bytes [9].
With our compression technique, if Age takes up to 9 distinct interval values, we need
the upper bound ⌈log2(9)⌉ = 4 bits to represent each value in the Age field. We would
therefore need 50 × 4 bits = 200 bits = 25 bytes, which is considerably less [9].
We call this Stage 1 of our compression, which merely transforms one column into
bits. If we apply this compression to all columns of the table, the result is significant.
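Stage 1 can be sketched as below; the example reproduces the 9-interval, 50-row calculation above (function and variable names are ours):

```python
from math import ceil, log2

def bit_encode(column):
    """Stage 1: give each distinct value a fixed-width bit code."""
    distinct = sorted(set(column))
    width = max(1, ceil(log2(len(distinct))))
    code = {v: format(i, f"0{width}b") for i, v in enumerate(distinct)}
    return [code[v] for v in column], width

intervals = [f"[{10*i}, {10*i + 10})" for i in range(1, 10)]  # 9 intervals
column = [intervals[i % 9] for i in range(50)]                # 50 patients
encoded, width = bit_encode(column)
assert width == 4                      # ceil(log2(9)) = 4 bits
assert width * len(column) == 200      # 200 bits = 25 bytes, vs 200 bytes of ints
```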
Table 4.4 Representing Stage 1 of compression technique
Age Problem
10-20 Cough & Cold
30-40 Obesity
50-60 Diabetes
70-100 Asthma
Table 4.5 Representing Stage 1 with binary compression
Age Problem
00 Cough & Cold
01 Obesity
10 Diabetes
11 Asthma
4.3.2 Paired Encoding
The example above shows that, besides optimizing the memory requirement of the
relation, the encoding technique also helps reduce redundancy (repeated values) in
the relation. Moreover, it is likely that there are few distinct values of the pair
(column1, column2) taken together, not just of column1 or column2 individually. We
can then represent the two columns together as a single column whose pair values are
transformed according to the encoding. This constitutes Stage 2 of our compression,
in which we use the bit-encoded database from Stage 1 as input and compress it
further by coupling columns in pairs of two, applying the distinct-pairs technique
outlined.
To examine the further compression advantage, suppose that we couple the Age and
Problem columns. In Table 4.3 there are 5 distinct pairs, (10, Cough & Cold),
(20, Cough & Cold), (30, Obesity), (50, Diabetes), (70, Asthma), so the upper bound
is ⌈log2(5)⌉ = 3 bits. After the Stage 1 generalization of Table 4.4, however, only 4
distinct pairs remain, so 2 bits suffice. Table 4.6 shows the result of Stage 2
compression.
Table 4.6 Representing Stage 2 compression
Age Problem
00 00
01 01
10 10
11 11
After the attributes are compressed, pairing (coupling) of attributes is done: all
columns are coupled in pairs of two in a similar manner. If the database contains an
even number of columns this is straightforward; if the number of columns is odd, we
can choose one column to remain uncompressed.
Table 4.7 Representing Stage 2 compression coupling
Age- Problem
00
01
10
11
After this compression technique is applied, we can calculate the space required:
Before compression: 5 × 4 + 4 × 4 = 36 bytes
After compression and coupling: 4 × 2 = 8 bits.
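Stage 2 can be sketched in the same style as Stage 1, using the generalized columns of Table 4.4 (function names are ours):

```python
from math import ceil, log2

def pair_encode(col_a, col_b):
    """Stage 2: assign one bit code per distinct (a, b) pair."""
    pairs = list(zip(col_a, col_b))
    distinct = sorted(set(pairs))
    width = max(1, ceil(log2(len(distinct))))
    code = {p: format(i, f"0{width}b") for i, p in enumerate(distinct)}
    return [code[p] for p in pairs], width

age = ["10-20", "30-40", "50-60", "70-100"]                 # from Table 4.4
problem = ["Cough & Cold", "Obesity", "Diabetes", "Asthma"]
encoded, width = pair_encode(age, problem)
assert width == 2          # 4 distinct pairs -> 2 bits, as in Table 4.6
```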
4.4 Add-ons to Compression
After performing compression over relations and domains, some conclusions were
derived by varying how attributes are coupled with each other. The following points
illustrate some of these possibilities.
4.4.1 Functional Dependencies
A functional dependency between attributes states that, given a relation R, a set of
attributes Y in R is functionally dependent on another set of attributes X if and only if
each value of X is associated with at most one value of Y. This implies that the
attributes in X determine the values of the attributes in Y [15]. By rearranging the
attributes, we find that coupling columns related in ways similar to functional
dependencies produces better compression results.
Table 4.8 shows an example of functional-dependency-based compression.
Table 4.8 Representing functional dependency based coupling
Name Gender Age Problem
Harshit M 10 Cough & Cold
Naman M 20 Cough & Cold
Aman M 30 Obesity
Rajiv M 50 Diabetes
Rajni F 70 Asthma
Two different test cases were used to check the level of compression. Test case 1
couples the attributes {(Name, Age), (Gender, Problem)}; the individual and coupled
distinct values are then counted, as shown in Tables 4.9 and 4.10. In test case 2,
coupling is done with the attributes {(Name, Gender), (Age, Problem)}.
Table 4.9 representing the number of distinct values in each column
Column name Distinct values
Name 19
Gender 2
Age 19
Problem 19
Table 4.10 representing test case 1
Column name Distinct values
Name, Age 285
Gender, Problem 35
Table 4.11 representing test case 2
Column name Distinct values
Name, Gender 22
Age, Problem 312
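The coupling comparison can be reproduced on the five sample rows of Table 4.8 (the counts in Tables 4.9-4.11 come from a larger 19-row test set, so the numbers below differ). A minimal sketch, counting distinct coupled values for each candidate pairing:

```python
def coupled_distincts(table, pairing):
    """Number of distinct coupled values for each pair of columns."""
    return {pair: len({(row[pair[0]], row[pair[1]]) for row in table})
            for pair in pairing}

# The five sample rows of Table 4.8
rows = [
    {"Name": "Harshit", "Gender": "M", "Age": 10, "Problem": "Cough & Cold"},
    {"Name": "Naman",   "Gender": "M", "Age": 20, "Problem": "Cough & Cold"},
    {"Name": "Aman",    "Gender": "M", "Age": 30, "Problem": "Obesity"},
    {"Name": "Rajiv",   "Gender": "M", "Age": 50, "Problem": "Diabetes"},
    {"Name": "Rajni",   "Gender": "F", "Age": 70, "Problem": "Asthma"},
]

case1 = coupled_distincts(rows, [("Name", "Age"), ("Gender", "Problem")])
case2 = coupled_distincts(rows, [("Name", "Gender"), ("Age", "Problem")])
```

Fewer distinct coupled values means shorter codes, which is why pairings that exploit dependencies (here Gender with Problem) compress better.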
4.4.2 Primary Key
A primary key is an attribute that uniquely identifies a row in a table. The
observation regarding the primary key is that coupling the primary-key column with a
column having a large number of distinct values is advantageous: because each
primary-key value is unique, the number of distinct tuples of the combination of the
two will always be equal to the number of primary-key values (i.e. rows) in the
table.
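This observation can be checked with a small sketch (the column values are hypothetical): since every primary-key value is unique, coupling the key with any other column produces exactly one distinct pair per row.

```python
def distinct_pairs(key_col, other_col):
    """Count distinct coupled values of two columns paired row-wise."""
    return len(set(zip(key_col, other_col)))

ids  = [1, 2, 3, 4, 5]                       # primary key: all values unique
city = ["JPR", "JPR", "DEL", "JPR", "DEL"]   # repeats do not matter

# Distinct pairs always equal the row count when one side is a key.
n = distinct_pairs(ids, city)
```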
4.4.3 Few distinct values
Sometimes a database contains columns with very few distinct values. For example, a
Gender attribute will always contain either male or female as its domain. It is
therefore recommended that such attributes be coupled with attributes that contain a
large number of distinct values. For example, consider four attributes {name,
gender, age, problem} with distinct counts name = 200, gender = 2, age = 200,
problem = 20. With the coupling {gender, name} and {age, problem}, the result is
200*2 + 200*20 = 4400 distinct tuples, whereas with the coupling {gender, problem}
and {name, age}, the result is 2*20 + 200*200 = 40040 distinct tuples.
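The arithmetic above can be verified with a short sketch (the distinct counts are the ones assumed in the text):

```python
def coupling_cost(pairings):
    """Upper bound on distinct coupled tuples: the product of the two
    per-column distinct counts, summed over all couples."""
    return sum(a * b for a, b in pairings)

# Distinct counts assumed in the text: name=200, gender=2, age=200, problem=20
good = coupling_cost([(2, 200), (200, 20)])   # {gender,name} + {age,problem}
bad  = coupling_cost([(2, 20), (200, 200)])   # {gender,problem} + {name,age}
```

The ten-fold gap between the two pairings is why low-cardinality columns should absorb a high-cardinality partner rather than pair with each other.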
4.5 Limitations
Two of the most often cited disadvantages of our approach are write operations and
tuple construction. Write operations are generally considered problematic for two
reasons:
Inserted tuples have to be broken up into their component attributes, and each
attribute must be written separately; and
determined, i.e. we need to decide the point at which the extra compression achieved
is not worth the performance overhead involved.
Chapter 5
Conclusion & Future Work
5.1 Conclusion
In this thesis we studied how compression techniques can be used to improve database
performance. After comparing existing techniques, we also proposed an algorithm for
compressing columnar databases. We studied the following research issues:
Compressing different domains of databases: We studied how different domains of a
database, such as varchar, int and NULL values, can be dealt with while compressing
a database. Compared to existing compression methods, our approach considers the
heterogeneous nature of string attributes and uses a comprehensive strategy to choose
the most effective encoding level for each string attribute. Our experimental results
show that using HDE methods achieves a better compression ratio than using any single
existing method, and that HDE also achieves the best balance between I/O saving and
decompression overhead.
Compression-aware query optimization: We observed that when to decompress string
attributes is a crucial issue for query performance. A traditional optimizer enhanced
with a cost model that takes both the I/O benefits of compression and the CPU
overhead of decompression into account does not necessarily produce good plans. Our
experiments show that the combination of effective compression methods and
compression-aware query optimization is crucial for query performance; the use of our
compression methods and optimization algorithms achieves up to an order-of-magnitude
improvement in query performance over existing techniques. This significant gain
suggests that a compressed database system should have its query optimizer modified
for better performance.
Compressing query results: We showed how to use domain knowledge about the query to
improve the effect of compression on query results. Our approach uses a combination
of compression methods, and we represented such combinations using an algebraic
framework.
5.2 Future Work
There are several interesting future dimensions for this research work.
Compression-aware query optimization: First, it would be interesting to study how
caching of intermediate (decompressed) results can reduce the overhead of transient
decompression. Second, we plan to study how our compression techniques can handle
updates. Third, we will study the impact of hash join on our query optimization work.
Result compression: We plan to explore the joint optimization problem of query plans
and compression plans. Currently, the compression optimization is based on the query
plan returned by the query optimization. However, the overall cost of a combination of a
query plan and a compression plan is different from the cost of the query plan. For
instance, a more expensive query plan may sort the result in an order such that the
sorted-normalization method can be applied and the overall cost will be lower.
APPENDIX I
Infobright
I.1 Introduction
The demand for business analytics and intelligence has grown dramatically across all
industries, outpacing the technical expertise and budgets available to implement it
successfully. Infobright helps solve these problems by providing a solution that
implements and manages a scalable analytic database.
Infobright offers two versions of their software: Infobright Community Edition (ICE)
and Infobright Enterprise Edition (IEE). ICE is an open source product that can be
freely downloaded. IEE is the commercial version of the software. It offers enhanced
features that are often necessary for production and operational support.
The Infobright database is designed as an analytic database. It can handle business
driven, ad-hoc queries in a fraction of the time the same queries would take on a
transaction database. Infobright achieves its high analytic performance by organizing
the data in columns instead of rows.
Infobright combines a columnar database with its Knowledge Grid architecture to
deliver a self-managing, self-tuning database optimized for analytics. Infobright
eliminates the need to create indexes, partition data, or do any manual tuning to
achieve fast response for queries and reports.
The Infobright database resolves complex analytic queries without the need for
traditional indexes, data partitioning, projections, manual tuning or specific schemas.
Instead, the Knowledge Grid architecture automatically creates and stores the
information needed to quickly resolve these queries. Infobright organizes the data into
2 layers: the compressed data itself that is stored in segments called Data Packs, and
information about the data which comprises the components of the Knowledge Grid.
For each query, the Infobright Granular Engine uses the information in the
Knowledge Grid to determine which Data Packs are relevant to the query before
decompressing any data.
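A greatly simplified sketch of this pruning idea (illustrative only, not Infobright's actual implementation): each Data Pack keeps min/max statistics in its Data Pack Node, and a range predicate decompresses only the packs those statistics cannot rule out.

```python
def build_packs(column, pack_size):
    """Split a column into packs, recording min/max per pack
    (a stand-in for Data Pack Node statistics)."""
    packs = []
    for i in range(0, len(column), pack_size):
        chunk = column[i:i + pack_size]
        packs.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return packs

def relevant_packs(packs, lo, hi):
    """Indexes of packs that may contain values in [lo, hi];
    all other packs are skipped without decompression."""
    return [i for i, p in enumerate(packs)
            if not (p["max"] < lo or p["min"] > hi)]

packs = build_packs(list(range(100)), pack_size=25)   # 4 packs: 0-24, 25-49, ...
hits = relevant_packs(packs, 30, 40)                  # only the second pack
```

The real engine maintains much richer metadata (histograms, character maps, pack-to-pack relationships), but the principle is the same: answer or narrow the query from statistics before touching compressed data.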
Infobright technology is based on the following concepts:
Column orientation
Data Packs
Knowledge Grid
The Granular Computing Engine
I.2 Infobright Architecture
Column Orientation
Infobright is, at its core, a highly compressed column-oriented database. This
means that instead of the data being stored row-by-row, it is stored column-by-
column. There are many advantages to column orientation, including the ability to do
more efficient data compression because each column stores a single data type (as
opposed to rows, which typically contain several data types), allowing compression
to be optimized for each particular data type. Infobright, which organizes each
column into Data Packs (as described below), achieves greater compression than other
column-oriented databases because it applies a compression algorithm based on the
content of each Data Pack, not just each column.
Most queries only involve a subset of the columns of the tables and so a column-
oriented database focuses on retrieving only the data that is required.
Data Packs and the Knowledge Grid
Data is stored in Data Packs of 65,536 rows each. Data Pack Nodes contain a set of statistics about the
data that is stored and compressed in each of the Data Packs. Knowledge Nodes
provide a further set of metadata related to Data Packs or column relationships.
Together, Data Pack Nodes and Knowledge Nodes form the Knowledge Grid. Unlike
traditional database indexes, they are not manually created, and require no ongoing
"care and feeding". Instead, they are created and managed automatically by the
system. In essence, they create a high level view of the entire content of the database.
This is what makes Infobright so well-suited for ad hoc analytics, unlike other
databases that require pre-work such as indexes, projections, partitioning or aggregate
tables in order to deliver fast query performance.
Granular Computing Engine
The Granular Engine uses the Knowledge Grid information to optimize query
processing. The goal is to eliminate or significantly reduce the amount
of data that needs to be decompressed and accessed to answer a query. IEE can often
Infobright is compatible with major Business Intelligence tools such as
Jaspersoft, Actuate/BIRT, Cognos, Business Objects, Microstrategy,
Pentaho and others.
High performance and scalability
Infobright loads data extremely fast - up to 280GB/hour.
Infobright's columnar approach results in fast response times for
complex analytic queries.
As your database grows, query and load performance remain
constant.
Infobright scales up to 50TB of data.
Low Cost
The cost of Infobright is very low compared to closed source,
proprietary solutions.
Using Infobright eliminates the need for complex hardware
infrastructure.
Infobright runs on low cost, industry standard servers. A single server
can scale to support 50TB of data.
Infobright's industry-leading data compression (10:1 up to 40:1)
significantly reduces the amount of storage required.
I.4 MySQL Integration
MySQL is the world's most popular open source database software, with over 11
million active installations. Infobright brings scalable analytics to MySQL users
through its integration as a MySQL storage engine. If your MySQL database is
growing and query performance is suffering, Infobright is the ideal choice.
Many users of MySQL turn to Infobright as their data volumes and analytic needs
grow, since Infobright offers exceptional query performance for analytic applications
against large amounts of data. Migrating from MySQL's MyISAM storage engine, or
other MySQL storage engines, to the Infobright column-oriented analytic database
is quite straightforward.
Infobright contains a bundled version of MySQL and installing Infobright installs a
new instance of MySQL along with Infobright's Optimizer, Knowledge Grid, the
Infobright Loader and the underlying columnar storage architecture. This installation
also includes MySQL's MyISAM storage engine. Unlike other storage engines that
work with MySQL, it is not necessary to have an existing MySQL installation nor can
Infobright be added to an existing MySQL Server installation. When installing
Infobright, the assumption is that any previously existing MySQL or MyISAM
database will exist in a separate installation of MySQL, installed in a different
directory with a unique data path, configuration files, socket and port values.
In the data warehouse marketplace, the database must integrate with a variety of tools.
By integrating with MySQL, Infobright leverages the extensive tool connectivity
provided by MySQL connectors (C, JDBC, ODBC, .NET, Perl, etc.).
It also enables MySQL users to leverage the mature, tested BI tools with which
they're already familiar. You'll also benefit from MySQL's legendary ease of use and
low maintenance requirements.
Infobright-MySQL integration includes the following features:
Industry standard interfaces that include ODBC, JDBC, C API, PHP,
Visual Basic, Ruby, Perl and Python;
Comprehensive management services and utilities;
Robust connectivity with BI tools such as Actuate/BIRT, Business
Objects, Cognos, Microstrategy, Pentaho, Jaspersoft and SAS.
I.5 Practical Implementation
Infobright neither needs nor allows the manual creation of performance structures
with duplicated data such as indexes or table partitioning based on expected usage
patterns of the data. When preparing the MySQL schema definition for execution in
Infobright, the first thing to do is simplify the schema. This means removing all
references to indexes and other constraints expressed as indexes, including PRIMARY
and FOREIGN KEYs, and UNIQUE and CHECK constraints.
In addition, due to Infobright's extremely high query performance levels on large
volumes of data, one should consider removing all aggregate, reporting and summary
tables that may be in the data model as they are unnecessary.
I have done some work with an existing airline database whose tables have many
columns. Basic SQL queries are executed to check the performance of the database;
these are ad-hoc queries, i.e. any column may be accessed by them.
The airline database was then tested with two existing database management systems,
Infobright and MySQL. I created a table with a large number of columns (around 50)
with different data types, then loaded the data into the columns using LOAD DATA
INFILE rather than individual INSERT statements.
Creating table airline_info
CREATE TABLE `airline_info` (
`Year` year(4) DEFAULT NULL,
`Quarter` tinyint(4) DEFAULT NULL,
`Month` tinyint(4) DEFAULT NULL,
`DayofMonth` tinyint(4) DEFAULT NULL,
`DayOfWeek` tinyint(4) DEFAULT NULL,
`FlightDate` date DEFAULT NULL,
`UniqueCarrier` char(7) DEFAULT NULL,
`AirlineID` int(11) DEFAULT NULL,
`Carrier` char(2) DEFAULT NULL,
`TailNum` varchar(50) DEFAULT NULL,
`FlightNum` varchar(10) DEFAULT NULL,
`Origin` char(5) DEFAULT NULL,
`OriginCityName` varchar(100) DEFAULT NULL,
`OriginState` char(2) DEFAULT NULL,
`OriginStateFips` varchar(10) DEFAULT NULL,
`OriginStateName` varchar(100) DEFAULT NULL,
`OriginWac` int(11) DEFAULT NULL,
`Dest` char(5) DEFAULT NULL,
`DestCityName` varchar(100) DEFAULT NULL,
`DestState` char(2) DEFAULT NULL,
`DestStateFips` varchar(10) DEFAULT NULL,
`DestStateName` varchar(100) DEFAULT NULL,
`DestWac` int(11) DEFAULT NULL,
`CRSDepTime` int(11) DEFAULT NULL,
`DepTime` int(11) DEFAULT NULL,
`DepDelay` int(11) DEFAULT NULL,
`DepDelayMinutes` int(11) DEFAULT NULL,
`DepDel15` int(11) DEFAULT NULL,
`DepartureDelayGroups` int(11) DEFAULT NULL,