cheetah:data warehouse on top of mapreduce
DESCRIPTION
Cheetah: A High Performance, Custom DWH onTop of MapReduceTRANSCRIPT
![Page 1: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/1.jpg)
Cheetah: A High Performance, Custom DWH on
Top of MapReduce
Tilani Gunawardena
![Page 2: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/2.jpg)
Content• Introduction• Background• Schema design,Query language• Architecture• Data Storage and Compression• Query plan and execution• Query Optimization• Integration• Performance Evaluation• Conclusions & Future Works
![Page 3: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/3.jpg)
Introduction• Hadoop-impressive scalability & flexibility to
handle structured as well as unstructured data• A shared-nothing MPP architecture built upon
cheap, commodity hardware– MPP relational data warehouses• Aster-Data, DATAllegro, Infobright, Greenplum,
ParAccel, Vertica
– MapReduce system
![Page 4: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/4.jpg)
• Relational data ware-houses– Highly optimized for storing and querying relational data.– Hard to scale to 100s,1000s of nodes
• MapReduce– Handles failures & scale to 1000s node– Lacks a declarative query inter-face
• Users have to write code to access the data• May result in redundant code• Requires a lot of effort & technical skills.
• Recently we can see convergence of these system– AsterData, GreenPlum- are extended with some MapReduce capabilities.– Number of data analytics tools have been built on top of Hadoop
• Pig - translates a high-level data flow language into MapReduce jobs• HBase - similar to BigTable provides random read and write access to a huge distributed
(key, value) store• Hive & HadoopDB - support SQL-like query language.
![Page 5: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/5.jpg)
Background and Motivation• www.turn.com- use platform to more effectively,
scale, optimize performance & centrally manage campaigns
• Data management challenges– Data : schema changes, not relational data, daata size– Simple yet Powerful Query Language– Data Mining Applications:simple yet efficient data
access method.– Performance
![Page 6: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/6.jpg)
Cheetah• A custom data warehouse system, built on top of
Hadoop.– Succinct Query Language
• Virtual views• Clients with no prior SQL knowledge are able to quickly grasp
– High Performance– Seamless integration of MapReduce and Data
Warehouse• Advantage of the power of both MapReduce (massive
parallelism & scalability) & data warehouse (easy & efficient data access) technologies
![Page 7: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/7.jpg)
Similar Works
• HadoopDB -PostgreSQL as the underlying storage and query processor for Hadoop.
• Cheetah-new data warehouse on top of Hadoop– Specical purpose– Most data are not relational and fast evelving
schema
• Hive -is the most similar work to cheetah
![Page 8: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/8.jpg)
Schema Design, Query Language
• Virtual View over Warehouse Schema• Query Language• Security and Anonymity
![Page 9: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/9.jpg)
Virtual View over Warehouse SchemaVirtual view on top of the star or snow-flake schema
• Virtual view – contains attributes from fact
tables, dimension tables– are exposed to the users for query.
• At runtime, only the tables that the query referred to are accessed & joined
• Users no longer have to understand the underlying schema design
• There may be multiple fact tables and subsequently multiple virtual views
![Page 10: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/10.jpg)
Handle Big Dimension Tables• Filtering,grouping and aggregation operators easily & efficiently
supported by Hadoop• Implementation of JOIN operator on top of MapReduce is not as
straight-forward • 2 ways
– Reduce phase:more general & applicable to all scenarios• Partitions the input tables in Map Phase & Actual Join in Reduce Phase
– Map phase :• one of join tables is small (m - small tables with one big table)
– load the small table into memory & perform a hash-lookup while scanning the other big table
• Input tables are already partitioned on the join column– Read the same partitions from both tables and perform the join ????
» Problem :HDFS no facility to force two data blocks with the same partition key to be stored on the same node
– One of the join tables still has to transferred over the network Cost ???
• Denormalize big dimension tables.– Assume big dimension tables are either insertion-only or slowly changing
dimensions
![Page 11: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/11.jpg)
• Handle Schema Changes – Schema versioned table -to efficiently support schema
changes on the fact table
• Handle Nested Tables– single user may have different type of events
– Cheetah support fact tables with a nested relational data model.
– Define a nested relational virtual view– Query language that is much simpler than the standard
nested relational query
![Page 12: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/12.jpg)
Schema Design, Query Language
• Virtual View over Warehouse Schema• Query Language• Security and Anonymity
![Page 13: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/13.jpg)
Query Language
• Cheetah supports single block, SQL-like query
– Fact tables / virtual views-Impressions, Clicks and Actions.
• Cheetah supports Multi-Block Query
• Query language – Fairly simple
• Users do not have to understand the underlying schema design• They do not have to repetitively write any join predicates.
![Page 14: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/14.jpg)
Schema Design, Query Language
• Virtual View over Warehouse Schema• Query Language• Security and Anonymity
![Page 15: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/15.jpg)
Security and Anonymity
• Supports row level security based on virtual views.
• Provides ways to anonymize data
![Page 16: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/16.jpg)
Architecture• Simple yet efficient• Open:also provide a simple, non-SQL interface
![Page 17: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/17.jpg)
Query MR Job• Issue query through either Web UI, or CLI or Java
code via JDBC• Query is sent to the node that runs Query Driver• Query Driver
Query MapReduce job
• Each node in the Hadoop cluster provides a data access primitive (DAP) interface
Ad-hoc MR Job• Can issue a similar API call for fine-grained data
access
![Page 18: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/18.jpg)
Data Storage & Compression
• Storage Format• Columnar Compression
![Page 19: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/19.jpg)
Storage Format
• Text (in CSV format)– Simplest storage format & commonly used in web access logs.
• Serialized java object• Row-based binary array– Commonly used in row-oriented database systems
• Columnar binary array
Storage format -huge impact on both compression ratio and query performance.
In Cheetah, we store data in columnar format whenever possible
![Page 20: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/20.jpg)
Data Storage & Compression
• Storage Format• Columnar Compression
![Page 21: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/21.jpg)
Columnar Compression
• Compression type for each column set is dynamically determined based on data in each cell
• ETL phase- best compression method is chosen• After one cell is created, it is further compressed
using GZIP.
![Page 22: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/22.jpg)
Query Plan & Execution
• Input Files• Map Phase Plan• Reduce Phase Plan
![Page 23: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/23.jpg)
Input Files
• Input files to the MapReduce job are always fact tables.
• Fact tables are partitioned by date• Fact tables are further partitioned by a
dimension key attribute, referred as DID
![Page 24: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/24.jpg)
Query Plan & Execution
• Input Files• Map Phase Plan• Reduce Phase Plan
![Page 25: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/25.jpg)
Map Phase Plan
• Each node in the cluster stores some portion of the fact table data blocks and (small) dimension files
• Query contains two operators scanner and aggregation.
![Page 26: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/26.jpg)
• Scanner operator has an interface which resembles a SELECT followed by PROJECT operator over the virtual view
• Scanner operator translates the request to an equivalent SPJ query to pick up the attributes on the dimension tables.
• As optimization-Dimensions are loaded into in-memory hash tables only once if different map tasks share the same JVM.
• Hash-based implementation of group by operator-Default
![Page 27: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/27.jpg)
Query Plan & Execution
• Input Files• Map Phase Plan• Reduce Phase Plan
![Page 28: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/28.jpg)
Reduce Phase Plan• First performs global aggregation over the results from map phase.• Then it evaluates any residual expressions over the aggregate values and/or
the HAVING clause• If the ORDER BY columns are group by columns
– They are already sorted by Hadoop framework during the reduce phase.• If the ORDER BY columns are the aggregation columns
– Then we sort the results within each reduce task & merge final results after MapReduce job completes.
![Page 29: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/29.jpg)
Query Optimization
• MapReduce Job Configuration• MultiQuery Optimization• Exploiting Materialized Views• LowLatency Query Optimization
![Page 30: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/30.jpg)
MapReduce Job Configuration
• # of map tasks - based on the # of input files & number of blocks per file.
• # of reduce tasks -supplied by the job itself & has a big impact on performance.
• query output – Small:map phase dominates total cost.– Large:it is mandatory to have sufficient number of reducers to partition
the work.
• Heuristics– #of reducers is proportional to the number of group by columns in the
query.– if the group by column includes some column with very large cardinality,
we increase # of reducers as well.
![Page 31: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/31.jpg)
Query Optimization
• MapReduce Job Configuration• MultiQuery Optimization• Exploiting Materialized Views• LowLatency Query Optimization
![Page 32: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/32.jpg)
MultiQuery Optimization
• In Cheetah allow users to simultaneously submit multiple queries & execute them in a single batch, as long as these queries have the same FROM and DATES clauses
![Page 33: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/33.jpg)
Map Phase• Shared scanner-shares the scan of the fact tables
& joins to the dimension tables• Scanner will attach a query ID to each output row• Output from different aggregation operators will
be merged into a single output stream.
![Page 34: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/34.jpg)
Reduce Phase• Split the input rows based on their query Ids• Send them to the corresponding query operators.
![Page 35: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/35.jpg)
Query Optimization
• MapReduce Job Configuration• MultiQuery Optimization• Exploiting Materialized Views• LowLatency Query Optimization
![Page 36: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/36.jpg)
Exploiting Materialized Views(1)
• Definition of Materialized Views– Each materialized view only includes the columns in
the face table, i.e., excludes those on the dimension tables.
– It is partitioned by date
• Both columns referred in the query reside on the fact table, Impressions
• Resulting virtual view has two types of columns - group by columns & aggregate columns.
![Page 37: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/37.jpg)
Exploiting Materialized Views(2)• View Matching and Query Rewriting– To make use of materialized view
• Refer virtual view that corresponds to same fact table that materialized view is defined upon.
• Non-aggregate columns referred in the SELECT and WHERE clauses in the query must be a subset of the materialized view’s group by columns
• Aggregate columns must be computable from the materialized view’s aggregate columns.
![Page 38: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/38.jpg)
• Replace the virtual view in the query with the matching materialized view
![Page 39: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/39.jpg)
Query Optimization
• MapReduce Job Configuration• MultiQuery Optimization• Exploiting Materialized Views• Low Latency Query Optimization
![Page 40: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/40.jpg)
LowLatency Query Optimization
• Current Hadoop implementation has some non-trivial overhead itself– Ex:job start time,JVM start time
Problem :For small queries, this becomes a significant extra overhead.– In query translation phase: if size of the input file is
small it may choose to directly read the file from HDFS and then process the query locally.
![Page 41: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/41.jpg)
Integration• Cheetah provides JDBC interface -user program
can submit query & iterate through the output results.
• If query results are too big for a single program to consume, user can write a MapReduce job to analyze the query output files which are stored on HDFS
• Cheetah provides a non-SQL interface that can be easily integrated into any ad-hoc MapReduce jobs
![Page 42: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/42.jpg)
• Achieve ad-hoc MapReduce– Specify the input files, which can be one or more fact
table partitions.– In the Map program, it needs to include a scanner to
access individual raw record– After that, user has complete freedom to decide what
to do with the data.
![Page 43: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/43.jpg)
• Advantages to have low-level, non-SQL interface open to external applications– It provides ad-hoc MapReduce programs efficient and more
importantly local data access– The virtual view-based scanner hides most of the schema
complexity from MapReduce developers– The data is well compressed and the access method is fairly
optimized inside the scanner operator
Summery :• Ad-hoc MapReduce programs can now
automatically take full advantage of both MapReduce (massive parallelism and scalability) and data warehouse (easy and efficient data access) technologies.
![Page 44: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/44.jpg)
Performance Evaluation
• Implementation• Storage Format• Small .vs. Big Queries• MultiQuery Optimization
![Page 45: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/45.jpg)
Implementation
Important :• Query engine must have low CPU overhead.– choosing the right data format – efficient implementation of various components on the data
processing path.• Efficient hashing method
• All the experiments are performed on a cluster with 10 nodes.
• Each node has two quad core, 8GBytes memory and 4x1TBytes 7200RPM hard disks
• Cloudera’s Hadoop distribution version 0.20.1.• The size of the data blocks for fact tables is
512MBytes.
![Page 46: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/46.jpg)
Performance Evaluation
• Implementation• Storage Format• Small .vs. Big Queries• MultiQuery Optimization
![Page 47: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/47.jpg)
Storage Format
• We store one fact table partition into four files formats,– Text (in CSV format)– Java object (equivalent to row-based binary array),– Column-based binary array, – Column-based binary array with compressions
• Each file is further compressed by Hadoop’s GZIP library at block level
![Page 48: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/48.jpg)
![Page 49: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/49.jpg)
• We run a simple aggregate query over three different data formats, namely, Text, Java Object and Columnar (with compression).
• We use iostat to monitor the CPU utilization and IO throughput at each node
![Page 50: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/50.jpg)
Performance Evaluation
• Implementation• Storage Format• Small .vs. Big Queries• MultiQuery Optimization
![Page 51: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/51.jpg)
Small .vs. Big Queries
• Impact of query complexity on query performance.• We create a test query – two joins – one predicate, – 3 group by columns – 7 aggregate functions.
![Page 52: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/52.jpg)
Performance Evaluation
• Implementation• Storage Format• Small .vs. Big Queries• MultiQuery Optimization
![Page 53: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/53.jpg)
MultiQuery Optimization
• Randomly pick 40 queries from our query workload.
• The number of output rows for these queries ranges from few hundred to 1M.
• We compare the time for executing these queries in a single batch to the time for executing them separately.
![Page 54: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/54.jpg)
Conclusions• Cheetah -data warehouse system built on top
of the MapReduce technology.• The virtual view abstraction plays a central
role in designing the Cheetah system.• Multi-query optimization.
![Page 55: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/55.jpg)
Future Work
• Current IO throughput 130MBytes (Figure 12) has not reached the maximum possible speed of hard disks.
• Current multi-query optimization only exploits shared data scan and shared joins.– further explore predicate sharing and aggregation
sharing • Previous query results for answering similar
queries later.
![Page 56: Cheetah:Data Warehouse on Top of MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022070316/55581e7ad8b42a25588b4a83/html5/thumbnails/56.jpg)
Thank You!