CS525: Big Data Analytics
MapReduce Languages, Fall 2013
Elke A. Rundensteiner


Page 1: CS525 : Big  Data  Analytics

1

CS525: Big Data Analytics
MapReduce Languages

Fall 2013

Elke A. Rundensteiner

Page 2: CS525 : Big  Data  Analytics

2

Languages for Hadoop

• Java: Hadoop’s Native Language

• Pig (Yahoo): Query and Workflow Language
  • unstructured data

• Hive (Facebook): SQL-Based Language
  • structured data / data warehousing

Page 3: CS525 : Big  Data  Analytics

3

Java is Hadoop’s Native Language

• Hadoop itself is written in Java

• Provides Java APIs:
  • For mappers, reducers, combiners, partitioners
  • Input and output formats

• Other languages, e.g., Pig or Hive, convert their queries to Java MapReduce code (a minimal sketch of these APIs follows)

Page 4: CS525 : Big  Data  Analytics

4

Levels of Abstraction

(from a more map-reduce view to a more DB view)

Java:   write map-reduce functions
Pig:    query and workflow language
Hive:   SQL-like language
HBase:  queries against tables

Page 5: CS525 : Big  Data  Analytics

5

Apache Pig

Page 6: CS525 : Big  Data  Analytics

6

What is Pig?

• High-level language and associated platform for expressing data analysis programs
• Compiles down to MapReduce jobs
• Developed by Yahoo, but open-source

Page 7: CS525 : Big  Data  Analytics

8

Pig Components

Two main components:

• High-level language (Pig Latin)
  • Set of commands

• Two execution modes
  • Local: reads/writes to the local file system
  • MapReduce: connects to a Hadoop cluster and reads/writes to HDFS

Two modes of use:

• Interactive mode
  • Console

• Batch mode
  • Submit a script

Page 8: CS525 : Big  Data  Analytics

9

Why Pig?

• Common design patterns as keywords (joins, distinct, counts)
• Data flow analysis (a script can map to multiple map-reduce jobs)
• Avoids Java-level errors (for non-Java experts)
• Interactive mode (issue commands and get results)

Page 9: CS525 : Big  Data  Analytics

10

Example

raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);

clean1 = FILTER raw BY id > 20 AND id < 100;

clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitize(query) as query;

user_groups = GROUP clean2 BY (user, query);

user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);

STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');

• LOAD: read a file from HDFS; PigStorage('\t') sets the input format (text, tab-delimited); AS (...) defines the run-time schema
• FILTER: filter the rows on predicates
• FOREACH ... GENERATE: for each row, apply some transformation
• GROUP / FOREACH: group the records and compute an aggregation for each group
• STORE: store the output in a file (text, comma-delimited)

Page 10: CS525 : Big  Data  Analytics

11

Pig Language

• Keywords
  • Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, …

• Aggregations
  • Count, Avg, Sum, Max, Min

• Schema
  • Defined at query time, not when files are loaded

• Extension of logic
  • UDFs (a sketch of a Java eval UDF follows)

• Data
  • Packages for common input/output formats
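The slides do not show a UDF; as a hedged sketch of the extension point, a Pig eval function is a Java class that extends org.apache.pig.EvalFunc. The hypothetical UPPER function below upper-cases a chararray argument:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF: upper-cases its single chararray argument; returns null on empty input.
public class UPPER extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

After registering the jar in a script (REGISTER myudfs.jar; the jar name is illustrative), it can be called like a built-in: B = FOREACH A GENERATE UPPER(name);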

Page 11: CS525 : Big  Data  Analytics

12

A Parameterized Template

A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int);

B = group A by name parallel 10;

C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2;

D = filter C by c0 > 100 and c1 > 100 and c2 > 100;

store D into '$out';

• The script can take arguments ($widerow, $out)
• Data are “ctrl-A” ('\u0001') delimited
• The AS clause defines the types of the columns
• "parallel 10" specifies the need for 10 parallel tasks

Page 12: CS525 : Big  Data  Analytics

13

Run independent jobs in parallel

D1 = load 'data1' …

D2 = load 'data2' …

D3 = load 'data3' …

C1 = join D1 by a, D2 by b

C2 = join D1 by c, D3 by d

C1 and C2 are two independent jobs that can run in parallel.

Page 13: CS525 : Big  Data  Analytics

14

Pig Latin vs. SQL

• Pig Latin is a dataflow programming model (step-by-step)
• SQL is declarative (set-based approach)

(The slide shows the same query written side by side in SQL and in Pig Latin.)

Page 14: CS525 : Big  Data  Analytics

15

Pig Latin vs. SQL

• In Pig Latin:
  • An execution plan can be explicitly defined (user hints, but no clever optimizer)
  • Data can be stored at any point during the pipeline
  • Schema and data types are lazily defined at run-time
  • Lazy evaluation (data are not processed prior to the STORE command)

• In SQL:
  • Query plans are decided by the system (powerful optimizer)
  • Intermediate data are not stored (or at least are not user-accessible)
  • Schema and data types are defined at creation time

Page 15: CS525 : Big  Data  Analytics

Logical Plan

A=LOAD 'file1' AS (x, y, z);

B=LOAD 'file2' AS (t, u, v);

C=FILTER A by y > 0;

D=JOIN C BY x, B BY u;

E=GROUP D BY z;

F=FOREACH E GENERATE group, COUNT(D);

STORE F INTO 'output';

LOAD (file1) -> FILTER --\
                          JOIN -> GROUP -> FOREACH -> STORE
LOAD (file2) ------------/

Page 16: CS525 : Big  Data  Analytics

18

Physical Plan

• Mostly 1:1 correspondence with the logical plan
• Some optimizations available

Page 17: CS525 : Big  Data  Analytics

19

Hive

Page 18: CS525 : Big  Data  Analytics

Apache Hive (Facebook)

• A data warehouse infrastructure built on top of Hadoop for providing data summarization, retrieval, and analysis

• Hive provides:
  • Structure
  • ETL
  • Access to different storage (HDFS or HBase)
  • Query execution via MapReduce

• Key principles:
  • SQL is a familiar language
  • Extensibility: types, functions, formats, scripts
  • Performance

20

Page 19: CS525 : Big  Data  Analytics

Hive Data Model: Structured

Three levels: tables, partitions, buckets

• Table: maps to an HDFS directory
  • Table R: users all over the world
• Partition: maps to sub-directories under the table directory
  • Partition R by country name
• Bucket: maps to files under each partition
  • Divide a partition into buckets based on a hash function

22

Page 20: CS525 : Big  Data  Analytics

Hive DDL Commands

23

CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);

SHOW TABLES '.*s';

DESCRIBE sample;

ALTER TABLE sample ADD COLUMNS (new_col INT);

DROP TABLE sample;

• Each table in Hive is an HDFS directory in Hadoop
• The schema is known at creation time (like a DB schema)
• Partitioned tables have “sub-directories”, one for each partition

Page 21: CS525 : Big  Data  Analytics

HiveQL: Hive DML Commands

LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample;

LOAD DATA INPATH '/user/falvariz/hive/sample.txt' INTO TABLE partitioned_sample PARTITION (ds='2012-02-24');

24

• LOAD DATA LOCAL INPATH: load data from the local file system; OVERWRITE deletes the previous data in the table
• LOAD DATA INPATH: load data from HDFS; without OVERWRITE, it is appended to the existing data
• A specific partition must be defined for partitioned tables

Page 22: CS525 : Big  Data  Analytics

Query Examples

SELECT MAX(foo) FROM sample;

SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;

FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;

SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id);

25

Page 23: CS525 : Big  Data  Analytics

User-Defined Functions

26
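The slide carries no code; as a hedged sketch of the usual pattern (not from the deck), a simple Hive UDF is a Java class extending org.apache.hadoop.hive.ql.exec.UDF with one or more evaluate() methods:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF: lower-cases a string column; returns null for null input.
public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toLowerCase());
  }
}

It is then wired in from HiveQL, e.g., ADD JAR /path/to/my_udfs.jar; CREATE TEMPORARY FUNCTION my_lower AS 'Lower'; SELECT my_lower(bar) FROM sample; (the jar path and function name are illustrative).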

Page 24: CS525 : Big  Data  Analytics

Hadoop Streaming Utility

• Hadoop Streaming is a utility to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer
  • C, Python, Java, Ruby, C#, Perl, shell commands

• Map and Reduce functions can be written in different languages (see the example invocation below)

27
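As a hedged illustration (not from the slides; the jar location varies across Hadoop versions and all paths are illustrative), a typical streaming invocation names the mapper and reducer executables on the command line:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Here /bin/cat and /usr/bin/wc stand in for any executable or script; the -file option can ship a local script to the cluster nodes.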

Page 25: CS525 : Big  Data  Analytics

28

Summary: Languages

• Java: Hadoop’s Native Language

• Pig (Yahoo): Query/Workflow Language
  • unstructured data

• Hive (Facebook): SQL-Based Language
  • structured data