

“Introducing Hadoop on Azure: hello Map-Reduce!”

Joe Hummel, PhD

Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago

Materials: http://www.joehummel.net/downloads.html
Email: [email protected]

Agenda

A little history…
Why Hadoop?
How it works
Demos
Summary

A little history…

Map-Reduce is from functional programming

    // function returns 1 if i is prime, 0 if not:
    let isPrime(i) = ...

    // sums 2 numbers:
    let sum(x, y) = return x + y

    // count the number of primes in 1..N:
    let countPrimes(N) =
      let L = [ 1 .. N ]         // [ 1, 2, 3, 4, 5, 6, ... ]
      let T = map isPrime L      // [ 0, 1, 1, 0, 1, 0, ... ]
      let count = reduce sum T   // 42
      return count
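The pseudocode above translates almost directly into plain JavaScript. The following is a minimal sketch, not the author's code: the body of isPrime is elided on the slide, so the trial-division version here is an assumption chosen for brevity.

```javascript
// Returns 1 if i is prime, 0 if not (simple trial division; an assumption,
// since the slide leaves the body of isPrime out).
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the number of primes in 1..N using map and reduce.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [1, 2, 3, ..., N]
  const T = L.map(isPrime);                             // [0, 1, 1, 0, 1, 0, ...]
  return T.reduce(sum, 0);
}

console.log(countPrimes(100)); // prints 25
```

The map step touches each element independently, which is what makes it easy to spread across machines; only the reduce step has to combine results.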

A little more history…

Created by Google to drive internet search
◦ BIG data: scalable to TBs and beyond
◦ Parallelism: to get the performance
◦ Data partitioning: to drive the parallelism
◦ Fault tolerance: at this scale, machines are going to crash, a lot…

[Figure: BIG data → page hits]

Who’s using Hadoop

Search engines: Google, Yahoo, Bing
Facebook
Twitter
Financials
Health industry
Insurance
Credit card companies
Just about any company collecting user data…

Hadoop today

Freely-available framework for big data
◦ http://hadoop.apache.org/

Based on concept of Map-Reduce:

[Figure: BIG data is split across parallel Map tasks (map function); Reduce combines the intermediate results into R]

Massively-parallel

[Figure: 18 Mappers running in parallel, feeding 6 Reducers]

Workflow

Data → Map → Sort → Merge → Reduce → R (several Map/Sort tasks run in parallel)

Map output:    [ <key1,value>, <key4,value>, <key2,value>, … ]
Sort output:   [ <key1,value>, <key1,value>, … ]
Merge output:  [ <key1, [value,value,…]>, <key2, [value,value,…]>, … ]
Reduce output: [ <key1, value>, <key2, value>, … ]  (result R)
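The workflow above can be imitated on a single machine to see what each stage contributes. This is only an in-memory sketch under stated assumptions: runMapReduce and its callback signatures are hypothetical names for illustration, and real Hadoop distributes these stages across many nodes and adds fault tolerance.

```javascript
// Runs map -> sort -> merge -> reduce on a single machine, in memory.
// `records` is an array of input lines; `mapFn(record)` returns an array of
// [key, value] pairs; `reduceFn(key, values)` returns one value per key.
// (All names here are hypothetical, not part of Hadoop.)
function runMapReduce(records, mapFn, reduceFn) {
  // Map: every record yields zero or more <key, value> pairs.
  const pairs = records.flatMap(mapFn);

  // Sort: order the pairs by key so equal keys become adjacent.
  pairs.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));

  // Merge: collapse adjacent pairs into <key, [value, value, ...]>.
  const merged = [];
  for (const [key, value] of pairs) {
    const last = merged[merged.length - 1];
    if (last && last[0] === key) {
      last[1].push(value);
    } else {
      merged.push([key, [value]]);
    }
  }

  // Reduce: one <key, value> result per key.
  return merged.map(([key, values]) => [key, reduceFn(key, values)]);
}

// Example: word count over two input lines.
const result = runMapReduce(
  ["b a", "a c a"],
  (line) => line.split(" ").map((word) => [word, 1]),
  (key, values) => values.reduce((x, y) => x + y, 0)
);
console.log(JSON.stringify(result)); // [["a",3],["b",1],["c",1]]
```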

Example

Netflix data-mining…

[Figure: Netflix Movie Reviews (.txt) → Netflix Data Mining App → Average rating…]

    movieid,userid,rating,date
    1,2390087,3,2005-09-06
    217,5567801,5,2006-01-03
    42,1121098,3,2006-03-25
    1,8972234,5,2003-12-02
    ...

Netflix Workflow

Data → Map → Sort → Merge → Reduce → R (several Map/Sort tasks run in parallel)

Map output:    [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, … ]
Sort output:   [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, … ]
Merge output:  [ <1, [3,5]>, <42, [3,1]>, <134, [2, …]>, <217, [5, …]>, … ]
Reduce output: [ <1, 4>, <42, 2>, <134, ?>, … ]  (result R)

Netflix map / reduce functions?

To compute average rating for every movie:

    // JavaScript version:

    var map = function (key, value, context) {
      var values = value.split(",");
      // field 0 contains movieid, field 2 the rating:
      context.write(values[0], values[2]);
    };

    var reduce = function (key, values, context) {
      var sum = 0;
      var count = 0;

      while (values.hasNext()) {
        count++;
        sum += parseInt(values.next());
      }
      context.write(key, sum/count);
    };
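To try these two functions without a cluster, one option is a small local harness. The sketch below is an assumption-laden stand-in, not the Hadoop on Azure runtime: makeContext and makeIterator are hypothetical helpers that mimic only the context.write, hasNext and next calls used on the slide, and the last two sample rows use made-up user ids and dates so that the input matches the pairs shown on the workflow slide.

```javascript
// The map and reduce functions from the slide (an explicit radix is added to parseInt).
var map = function (key, value, context) {
  var values = value.split(",");
  context.write(values[0], values[2]); // movieid, rating
};

var reduce = function (key, values, context) {
  var sum = 0;
  var count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};

// Hypothetical stand-ins for what the framework normally provides.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

const lines = [
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02",
  "134,1000001,2,2005-01-01", // made-up userid and date
  "42,1000002,1,2005-01-02",  // made-up userid and date
];

// Map phase.
const mapped = makeContext();
lines.forEach((line, offset) => map(offset, line, mapped));

// Sort + merge phase: group ratings by movieid.
const groups = new Map();
for (const [movieid, rating] of mapped.pairs) {
  if (!groups.has(movieid)) groups.set(movieid, []);
  groups.get(movieid).push(rating);
}

// Reduce phase, in ascending movieid order.
const reduced = makeContext();
[...groups.keys()]
  .sort((a, b) => Number(a) - Number(b))
  .forEach((id) => reduce(id, makeIterator(groups.get(id)), reduced));

console.log(JSON.stringify(reduced.pairs));
// [["1",4],["42",2],["134",2],["217",5]]
```

Note that the sample input omits the movieid,userid,rating,date header row: fed to this mapper, a header would produce a "movieid" key whose average is NaN.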

Traditional use of Hadoop

Upload data to HDFS
◦ Hadoop Distributed File System

Write map / reduce functions
◦ default is to use Java
◦ most languages supported: C, C++, C#, JavaScript, Python, …

Compile and upload code
◦ For Java, you upload .jar file
◦ For others, .exe or script

Submit MapReduce job

Wait for job to complete

When to use Hadoop?

Queries against big datasets
Embarrassingly-parallel problems
◦ Solution must fit into map-reduce framework
Non-real-time demands

Hadoop is not for:
◦ Small datasets (< 1GB?)
◦ Sub-second / real-time needs (though clearly Google makes it work)

Data set for demo

We’ll be working with Chicago crime data…
◦ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
◦ http://www.cityofchicago.org/city/en/narr/foia/CityData.html

Size: 1 GB, 5M rows

Goal?

Compute top-10 crimes…

    IUCR    Count
    0486    366903
    0820    308074
    ...
    0890    166916

IUCR = Illinois Uniform Crime Reporting codes
https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

Hadoop on Azure…

Supports traditional Hadoop usage
◦ Upload data
◦ Write MapReduce program
◦ Submit job

Additional features:
◦ Allows access to persistent data from Azure Storage Vault
◦ Provides interactive JavaScript console
◦ Built-in higher-level query languages (PIG, HIVE)

Demo: Hadoop on Azure

Demo: map reduce functions

    // JavaScript version:

    var map = function (key, value, context) {
      var values = value.split(",");
      // field 4 is used as the key (the IUCR code); emit a count of 1:
      context.write(values[4], 1);
    };

    var reduce = function (key, values, context) {
      var sum = 0;
      // add up the 1s emitted for this key:
      while (values.hasNext()) {
        sum += parseInt(values.next());
      }
      context.write(key, sum);
    };

Output:

    0486    366903
    0820    308074
    ...
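Real CSV exports often carry a header row or blank lines, which the mapper above would count as if they were crime codes. The following is a hedged sketch of a more defensive variant, run locally on made-up rows. Whether the demo file actually contains a header is not stated in the slides, and makeContext and makeIterator are hypothetical stand-ins for the framework.

```javascript
// Assumption: the IUCR code is in field 4, as in the slide's map function.
const IUCR_FIELD = 4;

// Defensive variant of the slide's mapper: skips rows that are too short
// and a header row whose field 4 is literally "IUCR".
function map(key, value, context) {
  const values = value.split(",");
  if (values.length <= IUCR_FIELD) return;    // malformed or blank line
  const iucr = values[IUCR_FIELD].trim();
  if (iucr === "" || iucr === "IUCR") return; // missing code or header row
  context.write(iucr, 1);
}

// Same logic as the slide's reducer: sums the 1s emitted for each code.
function reduce(key, values, context) {
  let sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
}

// Hypothetical local stand-ins for the framework's context and iterator.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

// Made-up rows; only field 4 matters here.
const rows = [
  "ID,Case Number,Date,Block,IUCR,Primary Type",
  "1,HX1,01/01/2012,001XX W A ST,0486,BATTERY",
  "2,HX2,01/02/2012,002XX W B ST,0820,THEFT",
  "3,HX3,01/03/2012,003XX W C ST,0486,BATTERY",
  "",
];

// Map phase.
const mapped = makeContext();
rows.forEach((row, i) => map(i, row, mapped));

// Group the emitted 1s by code (stands in for sort + merge).
const groups = new Map();
for (const [code, one] of mapped.pairs) {
  groups.set(code, [...(groups.get(code) || []), one]);
}

// Reduce phase.
const counts = makeContext();
for (const [code, ones] of groups) {
  reduce(code, makeIterator(ones), counts);
}

console.log(JSON.stringify(counts.pairs)); // [["0486",2],["0820",1]]
```

A plain split(",") also mis-parses quoted fields that contain commas; that is harmless here only as long as no such field appears before field 4.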

Demo: PIG command

    // interactive PIG with explicit Map-Reduce functions:

    pig.from("asv://datafiles/CC-from-2001.txt").
        mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
        orderBy("Count DESC").
        take(10).
        to("output-from-2001")

    // visualize the results:

    file = fs.read("output-from-2001/part-r-00000")
    data = parse(file.data, "IUCR, Count:long")
    graph.bar(data)
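What the orderBy("Count DESC") and take(10) steps do can be shown in plain JavaScript. This is an illustrative sketch only, not the console's pig API: parseCounts and topN are hypothetical names, and the input is assumed to be tab-separated key/value lines (Hadoop's default text output). The counts for 0486, 0820 and 0890 come from the slides; the 9999 row is made up.

```javascript
// Parses reducer output lines of the form "IUCR<TAB>Count" into records.
function parseCounts(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [iucr, count] = line.split("\t");
      return { IUCR: iucr, Count: Number(count) };
    });
}

// Equivalent of orderBy("Count DESC") followed by take(n).
// Copies the array first so the caller's data is not reordered.
function topN(records, n) {
  return [...records].sort((a, b) => b.Count - a.Count).slice(0, n);
}

const sample = "0890\t166916\n0486\t366903\n9999\t12\n0820\t308074\n";
const top3 = topN(parseCounts(sample), 3);
console.log(top3.map((r) => r.IUCR).join(",")); // prints 0486,0820,0890
```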

Hadoop on Azure

Microsoft is offering free access to Hadoop
◦ Request invitation @ http://www.hadooponazure.com/

Hadoop connector for Excel
◦ Process data using Hadoop, analyze/visualize using Excel


That’s it!


Summary

Hadoop is all about big data processing
◦ Scalable, parallel, fault-tolerant

Easy to understand programming model
◦ Map-Reduce
◦ But then solution must fit into this framework…

Rich ecosystem developing around Hadoop
◦ Technologies: PIG, HIVE, HBase, …
◦ Companies: Cloudera, Hortonworks, MapR, …