Post on 11-May-2015
Introduction to Map/Reduce Data Transformations
Tasso Argyros
CTO and Co-Founder
Aster Data Systems
tasso@asterdata.com
A Brief History of MapReduce
2 Confidential and proprietary. Copyright © 2008 Aster Data Systems
What is MapReduce?
It’s the simplest API you have ever seen
It has just two functions: (1) Map() and (2) Reduce()
Plus: it’s language independent (Java, Perl, Python, …)
Why is MapReduce Useful?
It simplifies distributed applications…
…by abstracting the details of data distribution (where is the data I need?) and process distribution (where should I run this process?)…
…behind two simple functions.
But let’s see an example
[Figure: four servers (A, B, C and D) connected through a switch, holding the following documents]
The quick brown fox jumps over the lazy dog.
To be or not to be: that is the question.
The world only needs five computers.
Hello world.
In-Database MapReduce is the future.
MapReduce is a very powerful programming paradigm.
Goal: We Want to Count the # of Times Each Word Occurs
1st Approach: No MapReduce
[Figure: Servers A, B, C and D, connected through a switch, parse their documents into word lists and send every list to a single server]
Documents:
The quick brown fox jumps over the lazy dog
To be or not to be: that is the question.
The world only needs five computers.
Hello world.
In-Database MapReduce is the future.
MapReduce is a very powerful concept.
Word lists (one per document):
the quick brown fox jumps over the lazy dog
in database mapreduce is the future
the world only needs five computers
hello world
mapreduce is a very powerful concept
to be or not to be that is the question
Combined word list on the receiving server:
the quick brown fox jumps over the lazy dog in database mapreduce is the future the world only needs five computers hello world mapreduce is a very powerful concept to be or not to be that is the question
Server 4 Final Result File
the 5
is 3
mapreduce 2
… …
What Did We Do?
1. Write a script to parse the documents and output word lists
2. FTP all the word lists to Server 4
3. Write another script to count each word on Server 4
Problem: (2) and (3) do not scale!
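The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `parse_document` and `count_words` are hypothetical, the FTP transfer is simulated by collecting all word lists in one in-memory list, and the documents are the ones from the example figure.

```python
import re
from collections import Counter

def parse_document(text):
    """Step 1: turn one document into a lowercase word list."""
    return re.findall(r"[a-z]+", text.lower())

def count_words(word_lists):
    """Step 3: count every word on a single machine."""
    counts = Counter()
    for words in word_lists:
        counts.update(words)
    return counts

documents = [
    "The quick brown fox jumps over the lazy dog",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful concept.",
]

# Step 2 (the FTP transfer) is simulated: all word lists end up in one place.
word_lists = [parse_document(doc) for doc in documents]
final_counts = count_words(word_lists)
print(final_counts.most_common(3))
```

Every word list and the whole counting job land on one machine, which is why steps (2) and (3) stop scaling as the number of documents grows.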
2nd Approach: No MapReduce, Fully Distributed
[Figure: Servers A, B, C and D, connected through a switch, parse their documents into word lists and then exchange words so that all copies of a word end up on the same server]
Documents:
The quick brown fox jumps over the lazy dog
To be or not to be: that is the question.
The world only needs five computers.
Hello world.
In-Database MapReduce is the future.
MapReduce is a very powerful concept.
Word lists (one per document):
the quick brown fox jumps over the lazy dog
in database mapreduce is the future
the world only needs five computers
hello world
mapreduce is a very powerful concept
to be or not to be that is the question
Word lists after redistribution (one per server):
the the the the the database database future
world world powerful lazy brown
mapreduce mapreduce be be to jumps computers hello
is is is question over a that
Server 1 Final Result File
the 5
… …
Server 2 Final Result File
world 2
… …
Server 3 Final Result File
mapreduce 2
… …
Server 4 Final Result File
is 3
… …
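One common way to decide which server owns each word in a hand-built distributed version is to hash the word and take the result modulo the number of servers. The sketch below assumes that scheme; the slides do not say how words were assigned, and the names `server_for` and `redistribute` are hypothetical. The network exchange is simulated with in-memory lists.

```python
import zlib
from collections import Counter

NUM_SERVERS = 4  # assumption: matches the four servers in the example

def server_for(word, num_servers=NUM_SERVERS):
    """Pick the server that owns a word, using a stable hash of the word."""
    return zlib.crc32(word.encode("utf-8")) % num_servers

def redistribute(word_lists, num_servers=NUM_SERVERS):
    """Send every word to its owning server (simulated with in-memory lists)."""
    inboxes = [[] for _ in range(num_servers)]
    for words in word_lists:
        for word in words:
            inboxes[server_for(word, num_servers)].append(word)
    return inboxes

word_lists = [
    "the quick brown fox jumps over the lazy dog".split(),
    "in database mapreduce is the future".split(),
    "the world only needs five computers".split(),
    "hello world".split(),
    "mapreduce is a very powerful concept".split(),
    "to be or not to be that is the question".split(),
]

inboxes = redistribute(word_lists)
# Each server counts only the words it owns and writes its own result file.
per_server_counts = [Counter(inbox) for inbox in inboxes]
for server_id, counts in enumerate(per_server_counts, start=1):
    print(f"Server {server_id}: {dict(counts)}")
```

Because the hash is deterministic, every copy of a word reaches the same server, so each per-server count is already final. The actual placement of words depends on the hash function and will not necessarily match the figure.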
2nd Approach: No MapReduce, Distributed
Does it work? Yes
Is it a pain? Yes!!
Does it take lots of time? Yes!
Would you do it? No!!!
Moreover…
Who will manage your files?
What if nodes fail?
What if you want to add more nodes?
What if…
What if…
What if…
Map()
Input: Any file (e.g. documents)
Output: Stream of <key, value> pairs (e.g. <word, count> pairs)
Data Redistribution and Grouping
Reduce()
Input: All <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = “the”)
Output: Anything (e.g. sum of counts for a specific word)
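For word count, the two functions described above might look like the following in Python. This is a minimal sketch, not the API of any particular MapReduce framework: the names `map_fn` and `reduce_fn` and their signatures are assumptions made for illustration.

```python
import re

def map_fn(document):
    """Map(): emit a <word, 1> pair for every word in one document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce(): receive all counts for one word and emit their sum."""
    return (word, sum(counts))

pairs = list(map_fn("The quick brown fox jumps over the lazy dog"))

# In a real system the framework groups pairs by key; here it is done by hand.
the_counts = [count for word, count in pairs if word == "the"]
print(reduce_fn("the", the_counts))  # ('the', 2)
```

Neither function knows how many servers exist or where the data lives; redistribution and grouping happen between the two calls.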
Map() and Redistribution Phase
The quick brown fox jumps over the lazy dog
Map()
<the, 1><quick, 1><brown,1><fox,1><jumps,1><over,1><the,1><lazy,1><dog,1>
In-Database MapReduce is the future.
Map()
<in, 1><database, 1><mapreduce,1><is,1><the,1><future,1>
[Figure: the pairs travel through the switch so that each of Servers A, B, C and D receives all pairs for its keys]
Pairs after redistribution (one line per server):
<the, 1><the, 1><the, 1><the, 1><the, 1><database,1><database,1><future,1>
<world,1><world,1><powerful,1><lazy,1><brown,1>
<mapreduce,1><mapreduce,1><be,1><be,1><to,1><jumps,1><computers,1><hello,1>
<is,1><is,1><is,1><question,1><over,1><a,1><that,1>
Grouping and Reduce() Phase (on Server 1)
Pairs received by Server 1:
<the, 1><the, 1><the, 1><the, 1><the, 1><database,1><database,1><future,1>
Grouped by key (Reduce() runs once per group):
<the, 1><the, 1><the, 1><the, 1><the, 1>
<database,1><database,1>
<future,1>
Server 1 Final Result File
the 5
database 2
future 1
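The phases shown above (Map(), redistribution, grouping, Reduce()) can be simulated on one machine. The sketch below is an assumption-laden toy, not how any production framework is implemented: `run_mapreduce` is a hypothetical driver, servers are plain dictionaries, and keys are assigned to servers with a CRC32 hash.

```python
import re
import zlib
from collections import defaultdict

def map_fn(document):
    """Map(): emit <word, 1> for every word in the document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce(): sum all values that share one key."""
    return (key, sum(values))

def run_mapreduce(documents, map_fn, reduce_fn, num_servers=4):
    """Toy driver: map, redistribute by key hash, group, then reduce."""
    # One dict per simulated server; each maps key -> list of values.
    servers = [defaultdict(list) for _ in range(num_servers)]
    for document in documents:
        for key, value in map_fn(document):
            owner = zlib.crc32(key.encode("utf-8")) % num_servers
            servers[owner][key].append(value)  # redistribution + grouping
    # Each server reduces its own groups into its own result "file".
    return [
        dict(reduce_fn(key, values) for key, values in groups.items())
        for groups in servers
    ]

documents = [
    "The quick brown fox jumps over the lazy dog",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful concept.",
]

results = run_mapreduce(documents, map_fn, reduce_fn)
for server_id, result in enumerate(results, start=1):
    print(f"Server {server_id} final result: {result}")
```

Only `map_fn` and `reduce_fn` are specific to word count; the driver can be reused for any job, and changing `num_servers` requires no change to the two functions.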
What Just Happened?
By writing two small scripts with a few lines of code…
… we achieved exactly the same result!
Plus, our code did not have to care about:
• the # of servers on the system (4 or 400?)
• which server to send each word to
• any network communication aspects
• any fault tolerance aspects
• …
Word Count was Only an Example!
Google does all web indexing on MapReduce
“The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce.”
Google 2004 MapReduce paper
Word Count was Only an Example!
Published work from Stanford University showed that even extremely complex Data Mining algorithms can fit in this very simple model
“We adapt Google’s MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN).”
Stanford 2006 AI Lab paper
Result?
MapReduce makes writing parallel programs extremely easy…
…and can accommodate everything from trivial to very complex algorithms…
…thus enabling the processing of petabytes of data with a few lines of code!
But…
Today MapReduce is used only by hardcore coders/programmers/hackers
Changes in MapReduce queries require changes in the MapReduce code itself
• Constantly keep coding
Using MapReduce with database data is hard and cumbersome…
…when most of the structured data in the enterprise is stored in databases!
Beyond SQL and MapReduce
SQL vs MapReduce: Two different worlds?
SQL
Declarative
• Specifies what needs to happen
Execution plans optimized dynamically
Input/output is structured
Data redistribution inferred from SQL statement (in MPP Databases)
MapReduce
Procedural
• Specifies how it needs to happen
Code compiled once; MapReduce plans are static
Input/output is unstructured
Data redistribution based on <keys> in Reduce() phase
Implementing MR in the Database
Uses polymorphic SQL operators to embed MapReduce functions into SQL
Introduces a “PARTITION BY” clause to specify data redistribution
Introduces a “SEQUENCE BY” clause to specify ordering of data flows to the MR functions
Best of both worlds
• Planning is still dynamic
• MapReduce functions can be used like custom SQL operators
• MapReduce functions can implement any algorithm or transformation
• Code Once – Use Many (through SQL) model
The SQL/MR Process
SQL/MR Function: Syntax
SELECT …                        (5) Select output (e.g. count)
FROM MR_Function (              (4) Java/Python/… MR function
    ON source_data              (1) Source table or sub-select
    [ PARTITION BY column ]     (2) <key> for data redistribution
    [ ORDER BY column ]         (3) Sort before the MR function
    [ Function Arguments ]      Optional MR_Function arguments
)
WHERE … GROUP BY … HAVING … ORDER BY … LIMIT …;     Optional conditions & filters
Example 1: Tokenization
Demo #1: Only Map (Tokenization) in SQL/MR
    SELECT word, count(*) AS wordcount
    FROM Tokenize( ON blogs )
    GROUP BY word
    ORDER BY wordcount DESC
    LIMIT 20;
Demo #2: Map (Tokenization) and Reduce (WordCount) in SQL/MR
    SELECT key AS word, value AS wordcount
    FROM WordCountReduce (
        ON Tokenize ( ON blogs )
        PARTITION BY key
    )
    ORDER BY wordcount DESC
    LIMIT 20;
Demo #3: Why do Reduce when you have SQL?
    SELECT word, count(*) AS wordcount
    FROM Tokenize( ON blogs )
    GROUP BY word
    ORDER BY wordcount DESC
    LIMIT 20;
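To see why Demo #2 and Demo #3 give the same answer, here is a rough Python analogue. It is only a sketch: `tokenize` and `word_count_reduce` are hypothetical stand-ins for the Tokenize and WordCountReduce functions (their real implementations are not shown in the slides), and `blogs` is a made-up list of rows.

```python
import re
from collections import Counter
from itertools import groupby

# Hypothetical stand-in for the blogs table: one text value per row.
blogs = ["MapReduce is simple", "SQL is declarative", "MapReduce meets SQL"]

def tokenize(rows):
    """Map step: emit one lowercase word per output row."""
    for text in rows:
        for word in re.findall(r"[a-z]+", text.lower()):
            yield word

def word_count_reduce(words):
    """Reduce step: sorting brings equal keys together (the PARTITION BY key
    effect), then each group is collapsed into a (key, value) row."""
    for key, group in groupby(sorted(words)):
        yield (key, sum(1 for _ in group))

def top(pairs, limit=20):
    """ORDER BY wordcount DESC LIMIT n, with ties broken alphabetically."""
    return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))[:limit]

# Demo #2 analogue: explicit reduce function after the map step.
demo2 = top(word_count_reduce(tokenize(blogs)))
# Demo #3 analogue: let a built-in aggregate (like GROUP BY + count) do the work.
demo3 = top(Counter(tokenize(blogs)).items())
print(demo2 == demo3)
```

When the reduce step is a plain aggregate such as a count, the built-in grouping does the same job, which is the point Demo #3 makes.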
Example 2: Sessionization
What Is Sessionize?
An example Aster SQL/MR function
Leverages Aster’s Java library API
What Does It Do?
User specifies a column (e.g. timestamp) and a session timeout value (in seconds)
Spits out unique session identifiers (sessionid column)
Usage
    CREATE TABLE sessionized_clicks AS
    SELECT ts, userid, sessionid, ...
    FROM Sessionize(
        ON clicks
        PARTITION BY userid
        ORDER BY ts
        TIMEOUT 60
    );
Example 2: Sessionization
Clickstream (Session Timeout = 60 seconds)
INPUT
timestamp   userid
10:00:00    Shawn1
00:58:24    PrezBush
10:00:24    Shawn1
02:30:33    PrezBush
10:01:23    Shawn1
10:02:40    Shawn1
OUTPUT
timestamp   userid      sessionid
10:00:00    Shawn1      0
10:00:24    Shawn1      0
10:01:23    Shawn1      0
10:02:40    Shawn1      1
timestamp   userid      sessionid
00:58:24    PrezBush    0
02:30:33    PrezBush    1
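The sessionization logic in this example can be sketched in Python as follows. This is not Aster's Sessionize implementation: the function name and signature are hypothetical, and it assumes that a new session starts when the gap between consecutive clicks of the same user is strictly greater than the timeout.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

def sessionize(clicks, timeout_seconds=60):
    """Assign a per-user session id to (timestamp, userid) rows.

    Assumption: a gap strictly greater than timeout_seconds starts a new session.
    """
    rows = []
    # Equivalent of PARTITION BY userid ORDER BY ts.
    ordered = sorted(clicks, key=itemgetter(1, 0))
    for userid, user_clicks in groupby(ordered, key=itemgetter(1)):
        session_id = 0
        previous = None
        for ts, _ in user_clicks:
            current = datetime.strptime(ts, "%H:%M:%S")
            if previous is not None:
                gap = (current - previous).total_seconds()
                if gap > timeout_seconds:
                    session_id += 1
            previous = current
            rows.append((ts, userid, session_id))
    return rows

clicks = [
    ("10:00:00", "Shawn1"),
    ("00:58:24", "PrezBush"),
    ("10:00:24", "Shawn1"),
    ("02:30:33", "PrezBush"),
    ("10:01:23", "Shawn1"),
    ("10:02:40", "Shawn1"),
]

for row in sessionize(clicks, timeout_seconds=60):
    print(row)
```

For Shawn1 the gaps are 24, 59 and 77 seconds, so only the last click opens a new session, matching the OUTPUT table above.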
MR Applications in the Database
ELT
• Text and data transformations, in-parallel, in-database
Queries that become too complex for SQL
• E.g. Sessionize(), customer segmentation, predictive analytics, …
Queries that SQL inherently cannot handle well
• Time series analytics
• Aster has a set of pre-defined SQL/MR functions for this
Data structures that do not fit the relational model well
• Time series (again)
• Graphs, spatial data
Any analytical or reporting application that requires more performance and data proximity!
Summary
Growing challenges in scaling analytical applications and reporting
MapReduce is driving a data revolution (see: Google)
In-Database MapReduce will open up databases to a host of new applications
tasso@asterdata.com (Questions, Comments)
asterdata.com/blog (Lots of technical details)
1.888.Aster.Data (Any other information)