yang, lin cse hkust

YANG, Lin

COMP 6311 Spring 2012

Outline Background

Overview of Big Data Management

Comparison btw PDB and MR

DB Solution on MapReduce

Conclusion

Data-driven World Science

Data bases from astronomy, genomics, environmental data, transportation data, …

Humanities and Social Sciences Scanned books, historical documents, social interactions

data, …

Business & Commerce Corporate sales, stock market transactions, census,

airline traffic, …

Entertainment Internet images, Hollywood movies, MP3 files, …

Medicine MRI & CT scans, patient records, …

Statistics from 2009

350 Million Named users

175 Million Active users in one day

35 Million Users updating status each day

55 Million Status each day

2.5 Billion Photos/Month

1.6 Million Active pages

Growth: 12 TB/day, 2 PB/year

Global data volume: 8.7 PB

A Case of Big Data

What can we do with these data? What can we do?

Scientific breakthroughs Business process

efficiencies Realistic special effects Improve quality-of-life:

healthcare, transportation, environmental disasters, daily life, …

Could We Do More?

YES: but need major advances in our capability to analyze those data

Challenges

Demand

Capacity

Demand

Capacity

Basic Requirements

Capacity Elasticity

Challenges(Cont.) More Requirements

High performance

Fault tolerance

Load Balance

Cost-efficient parallelization

High availability and disaster recovery

Outline Background

Conclusion

The State of The Art Parallel DBMS technologies

Proposed in the late eighties

Matured over the last two decades

Multi-billion dollar industry: Proprietary DBMS Engines intended as Data Warehousing solutions for very large enterprises

MapReduce

pioneered by Google

popularized by Yahoo! (Hadoop)

Parallel DBMS Popularly used for more than two decades

Research Projects: Gamma, Grace, …

Commercial: Multi-billion dollar industry but access to only a privileged few

Relational Data Model

Indexing

SQL interface

Advanced query optimization

MapReduce Dean et al., OSDI’04

Key contribution:

Propose a simple but powerful programming model

Parallelization

Fault-tolerance

Load balancing

Implement the framework of this model and provide simple interface to users.

MapReduce(Cont.) Data flow of MR

Input Data M splits M pieces of intermediate output

R splits R pieces of outputs

Partition_1

Partition_2 Reduce

M = input size / 64 MB R = #machine * 2

MapReduce(Cont.) Execution overview

MapReduce(Cont.) Word counter example

14 Credit: Martijn van Groningen. Introduction to Hadoop.

MapReduce(Cont.) Parallelization

Map and Reduce

Fault tolerance Master pings workers periodically Master write periodic checkpoint Re-execute the unfinished Reduce and all Map tasks of failed

machine

Load balance Break tasks into small granularity, schedule by master

Optimization Locality: Try to schedule Map task to the node has the

corresponding data Backup Tasks: When the job is close to completion, run backup

executions of remaining tasks, the first completed one wins

Introduction to Hadoop is a popular open-source map-reduce

implementation which is being used as an alternative to store and process extremely large data sets on commodity hard-ware.

Inspired by Google MapReduce and GFS

16 Architecture of Hadoop

Trend of Big Data Management

17 Credit: Big Philippe Julio. Big Data Architecture.

Outline Background

Conclusion

Comparison btw PDB & MR Pavlo et al., SIGMOD’09

Candidates

Hadoop: an open-source implementation of MapReduce.

DBMS-X: a parallel relational database.

Vertica: a parallel DBMS which stores data in column-base format.

Task 1

Grep specific-pattern string from large scale of records

5.6 million records, 100 bytes per record

Test 1: Fixes the size of the data per node as 535MB

Test 2: Fixes the total dataset size as 1TB

Comparison(Cont.) Data Loading

Observation: 1. Hadoop outperforms both of PDB 2. For DBMS-X, the data was actually loaded on each node sequentially, while the additional housekeeping can be done in parallel across nodes;

Comparison(Cont.) Grep

Observations: 1. For Figure 4, such little data is being

processed that the Hadoop start-up costs become the limiting factor in its performance( 10–25 seconds);

2. For Figure5, as the data volume increased, Hadoop performs almost as fast as PDB;

Comparison(Cont.) Task 2: analytical works

Data Schema & Volume

HTML Doc(600,000 records)

Attributes Type

url VARCHAR(100)

Contents TEXT

UserVisits(155 M records, 20GB/Node)

Attributes Type

srcIP VARCHAR(16)

dstURL VARCHAR(100)

visitDate DATE

adRevenue FLOAT

userAgent VARCHAR(64)

countryCode VARCHAR(3)

langCode VARCHAR(6)

searchWord VACHAR(32)

Duration INT

Rankings(18 M records, 1GB/node)

Attributes Type

pageURL VARCHAR(100)

pageRank INT

avgDuration INT

Comparison(Cont.) Data loading Selection

Find the pageURL in the Rankings table with a pageRank above a user-defined threshold.

Observations: 1. With the help of index, PDB

outperform hadoop significantly; 2. As the number of node increased,

the start-up cost would increase too.

Comparison(Cont.) Aggregation

Calculate the total adRevenue generated for each srcIP in the UserVisits table (20GB/node), grouped by the srcIP (Fig.7) or 7-character prefix of srcIP(Fig.8)

Observations: 1. PDB outperform hadoop; 2. For PDB, communication cost

dominate the execution time 3. Since Vertica is column-store, it

could perform better from not reading unused parts of table.

Comparison(Cont.) Join

Observations: 1. PDB outperforms Hadoop again! 2. PDB use indexes while Hadoop

can only perform complete scan. 3. UserVisits and Rankings in PDB

are partitioned by the join key, so PDB could do the join locally on each node without any network overhead

Find the srcIP that generated the most revenue within a particular date range, then calculate the average pageRank of all the pages visited during this interval.

SQL MR

SELECT INTO Temp srcIP, AVG(pageRank) as AvgRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;

1. Filter on UserVisits and join wth Rankings

2. Compute the total adRevenue and avgRank based on srcIP

3. Find the one with largest adRevenu

Comparison(Cont.) UDF Aggregation

Scan the HTML documents and search for all the URLs appeared, and count the reference number across the entire set for each unique URL

Observations: 1. Bottom segment is the time to

execute the UDF and upper segment represent the query time;

2. Both DBMS-X and Hadoop have approximately performance

SQL MR

SELECT INTO Temp F(contents) FROM Documents; SELECT url, SUM(value) FROM Temp GROUPBY url; F is a user-defined function, which parses the contents of each record in the Documents table and emits URLs into the database.

Similar to Grep task

Quick Summary At the scale of experiences above, parallel DBMS perform

much better than Hadoop

the result of a number of technologies developed over the past 25 years: B-Tree index, column-store, compression algorithm, sophisticated query engine and etc.

Hadoop get a better fault tolerance with performance penalty.

Hadoop is more easy to set up and use the PDB at large scale of nodes.

PDB don’t do a good job in UDF aggregation.

The trend of both system is to move toward each other

Outline Background

Overview of big data management

Conclusion

Hive Thusoo et al., VLDB’09

Problem MapReduce requires developers to write custom

programs which are hard to maintain and reuse.

Analyst are familiar with SQL

Key contribution Proposed an open-source data warehouse solution built

on top of Hadoop.

Hive provides HiveQL, a SQL-like query language which support select, join, aggregate, union all and sub-queries in from clause.

Hive(Cont.) Metastore: system catalog.

Thrift Server: a framework for cross-language services.

Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.

Hive(Cont.) FROM (SELECT a.status, b.school, b.gender FROM status_updates a JOIN profiles b ON (a.userid = b.userid and a.ds='2009-03-20' ) ) subq1 INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20') SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20') SELECT subq1.school, COUNT(1) GROUP BY subq1.school

Hive(Cont.) Future work

Make HiveQL subsume SQL

Add cost-based optimizer

Columnar storage and more intelligent data placement

Performance enhancement

Integrate with commercial BI tools

Multi-query optimization and generic n-way joins in a single MR job

HadoopDB Azza Abouzeid et al. VLDB’09

Problem

It’s the age of big data.

Properties for large data analysis :

Performance

Flexible query interface

Fault tolerance

Load balance

Parallel database

MapReduce

Can we have both of them?

HadoopDB(Cont.) Key contribution:

To build a hybrid system to archive all the required properties.

Basic idea:

Connect multiple single-node database using Hadoop as task coordinator and net communication layer (Fault tolerance & Load balance)

Queries are parallelized across nodes using MapReduce framework

Queries are pushed inside of corresponding database note as much as possible (Performance & Multi-interface)

HadoopDB(Cont.)

Translate HiveQL to MapReduce, and translate some of MR back to SQL

Repartition data on a given key or breaking apart single node data into chunks

maintains meta information about the databases

Connect to database and execute SQL and return key/value pairs

HadoopDB(Cont.) SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY

YEAR(saleDate);

HadoopDB(Cont.)

Outline Background

Overview of big data management

Conclusion

Conclusion It is the age of big Data now

Major technologies for big data management

Parallel DBMS: Well studied and wildly used

MapReduce: Attracting new technology with bright future while exist lots of drawbacks, especially in the aspect of performance

Hive and HadoopDB give a good paradigm for build DB solution on the top of MapReduce

There are still lots of work could be done in the field of big data management.

Reference Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data

Processing on Large Clusters. OSDI’04

Andrew Pavlo, Erik Paulson, Alexander Rasin. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD’09

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB’09

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi. HadoopDB: An Architectural Hybrid of MapReduce and

DBMS Technologies for Analytical Workloads. VLDB’09

Divyakant Agrawal, Sudipto Das, AmrEl Abbadi. Big Data and Cloud Computing: Current State and Future Opportunities. EDBT’11

yang, lin cse hkust

Documents

1 sept 1, 2009 comp660l fall 2009 hkust lin gu...

mscit 5210/mscbd 5002: knowledge discovery and data...

using machine learning trading suggestion system hkust cse...

hkust cse dept. research text mining comp630p selected paper...

9 february 2011 - hkust

1 knowledge management with documents qiang yang hkust...

hkust in action

venue signage hkust

final year proejct - hkust talk 2014 april.pdf ·...

newsletter issue 15 hkust

hkust iprs - student guide

information-agnostic flow scheduling for commodity...

hkust alumni

enabling wide-spread communications on optical fabric...

localization technology - hkust

1 sept 3, 2009 comp660l fall 2009 hkust lin gu...

hkust campus map - hkust ug admissions · title: hkust...

hkust business school shenzhen mba pt class of 2013 yearbook...

selective sampling on probabilistic labels peng peng,...

comp4911: it entrepreneurship › ~dekai › 4911_2016q3 ›...