Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu
![Page 2: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/2.jpg)
Previously …
• (Traditional) Databases are not Swiss-Army knives
• Large data problems require radically different solutions
• Exploit the power of parallel I/O and computation
• MapReduce as a framework for building reliable distributed data processing applications
• Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)
![Page 3: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/3.jpg)
Previously …
• HDFS : A reliable open source distributed file system
• HBase : A sorted multi-dimensional map for record oriented data
  – Not Relational
  – No query language other than map semantics (Get and Put)
![Page 4: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/4.jpg)
MapReduce is great but …
Got to write all this for a WordCount!!!
![Page 5: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/5.jpg)
MapReduce
• Development cycles too long
  – Writing code
  – Packaging code
• JOINs on large data too hard to implement in MapReduce
• Today’s class: Keeping it Simple
  – Can we abstract users from MapReduce?
![Page 6: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/6.jpg)
Pig
• Started in Fall 2007 at Yahoo!
• Simplify MapReduce by capturing common data processing patterns
  – Results in improved productivity
  – Lowers barrier to entry for large data processing
• Today: Runs 40% of Yahoo!’s large data jobs
• Who else: Twitter, LinkedIn, AOL, …
• Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)
![Page 7: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/7.jpg)
Pig = Query Language + Interpreter
• Language: Pig Latin
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Interpreter: Grunt
  – An execution environment to convert Pig Latin to MapReduce
• Two modes
  – Local : JVM
  – Distributed: via Hadoop
![Page 8: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/8.jpg)
Pig Latin
Example from Pittsburgh Hadoop Users Group
![Page 9: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/9.jpg)
Equivalent MapReduce code
![Page 10: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/10.jpg)
Pig Latin from an Example
• Find users who visit “good” pages
(Example courtesy: Yahoo! Research)
![Page 11: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/11.jpg)
Conceptual Dataflow
![Page 12: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/12.jpg)
Pig Latin script
![Page 13: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/13.jpg)
Pig Latin: The Language
• Structure
  – Collection of STATEMENTS
  – Statement has an OPERATOR and ends in ‘;’
![Page 14: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/14.jpg)
Summary of Pig Latin Operators

| Category | Operator |
| --- | --- |
| Loading and Storing | LOAD, STORE, DUMP |
| Filtering | FILTER, DISTINCT, FOREACH … GENERATE, STREAM |
| Grouping and Joining | JOIN, COGROUP, CROSS |
| Sorting | ORDER, LIMIT |
| Combining and Splitting | UNION, SPLIT |
![Page 15: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/15.jpg)
LOAD/STORE and Schemas
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> records = LOAD 'input/sample.txt';
grunt> STORE records INTO 'output/sample.out';
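To make the difference between the two LOAD forms concrete, here is a minimal Python sketch (not Pig itself) of what declaring a schema amounts to: naming and casting the fields of each line. The helper `load`, the `SCHEMA` list, and the sample lines are hypothetical, and the sketch assumes tab-delimited input, which is what Pig's default loader expects.

```python
# Hypothetical schema mirroring (year:int, temperature:int, quality:int).
SCHEMA = [("year", int), ("temperature", int), ("quality", int)]


def load(lines, schema=None):
    """Split tab-delimited lines into records; name and cast fields if a schema is given."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if schema is None:
            # No schema: fields stay untyped and can only be addressed by position.
            records.append(tuple(fields))
        else:
            # With a schema: each field gets a name and is cast to its declared type.
            records.append(
                {name: cast(value) for (name, cast), value in zip(schema, fields)}
            )
    return records


# Made-up sample input, two lines of three tab-separated fields each.
sample = ["1950\t0\t1\n", "1949\t111\t-1\n"]
typed = load(sample, SCHEMA)   # like LOAD ... AS (...)
untyped = load(sample)         # like LOAD without AS
```

With the schema, later statements can refer to `quality` by name and compare it as a number; without it, fields are only reachable by position, as with `$0` and `$2` in Pig Latin.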
![Page 16: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/16.jpg)
FILTER
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> bad_records = FILTER records BY quality < 0;
grunt> bad_years = FOREACH bad_records GENERATE year;
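As a rough analogy in plain Python (a sketch of the semantics, not of how Pig executes them), FILTER keeps whole tuples that satisfy a predicate and FOREACH … GENERATE projects fields out of every tuple. The sample records are made up for illustration.

```python
# Hypothetical records with the same three fields as the slide's schema.
records = [
    {"year": 1950, "temperature": 0, "quality": 1},
    {"year": 1949, "temperature": 111, "quality": -1},
    {"year": 1951, "temperature": 22, "quality": -3},
]

# FILTER records BY quality < 0: keep only the tuples matching the predicate.
bad_records = [r for r in records if r["quality"] < 0]

# FOREACH bad_records GENERATE year: project one field from each tuple.
bad_years = [r["year"] for r in bad_records]
```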
![Page 17: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/17.jpg)
STREAM
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> projected = FOREACH records GENERATE $0, $2;
grunt> projected = STREAM records THROUGH `cut -f 1,3`;
![Page 18: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/18.jpg)
JOIN
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> sales = LOAD 'input/sales.txt'
>> AS (year:int, profit:float);
grunt> combined = JOIN records BY year, sales BY year;
grunt> profit_year = FOREACH combined GENERATE profit, records::year;
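For contrast with the one-line JOIN, the sketch below shows one common way to phrase an inner join in MapReduce terms, often called a reduce-side join. It is an illustration in plain Python, not necessarily the plan Pig generates; the function `reduce_side_join` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right, key):
    """Inner-join two lists of dict records on `key`, MapReduce style."""
    # "Map" + "shuffle": tag every record with its source and bucket it by join key.
    buckets = defaultdict(lambda: {"left": [], "right": []})
    for rec in left:
        buckets[rec[key]]["left"].append(rec)
    for rec in right:
        buckets[rec[key]]["right"].append(rec)

    # "Reduce": within each key, pair every left record with every right record.
    # A key present on only one side yields no pairs (inner-join behaviour).
    joined = []
    for k in sorted(buckets):
        sides = buckets[k]
        joined.extend(product(sides["left"], sides["right"]))
    return joined


# Made-up sample data.
records = [
    {"year": 1949, "temperature": 111, "quality": 1},
    {"year": 1950, "temperature": 22, "quality": 1},
]
sales = [{"year": 1950, "profit": 10.5}, {"year": 1951, "profit": 3.0}]

combined = reduce_side_join(records, sales, "year")
# Equivalent of projecting profit and year from the joined tuples.
profit_year = [(s["profit"], r["year"]) for r, s in combined]
```

Even this toy version needs tagging, bucketing, and a per-key cross product, which is the bookkeeping the JOIN operator hides.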
![Page 19: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/19.jpg)
GROUP
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality;
grunt> combined = GROUP records BY (quality < 0 ? 'bad' : 'good');
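GROUP does not aggregate by itself: it yields one tuple per distinct key, holding the key and a bag of all input tuples that share it. A minimal Python sketch of that shape follows; `group_by` and the sample tuples are hypothetical.

```python
from collections import defaultdict


def group_by(relation, key_fn):
    """Return (group, bag) pairs: one per distinct key, with all matching tuples in the bag."""
    bags = defaultdict(list)
    for record in relation:
        bags[key_fn(record)].append(record)
    return list(bags.items())


# Made-up (year, temperature, quality) tuples.
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 4)]

# GROUP records BY quality: the key function picks the third field.
combined = group_by(records, lambda r: r[2])
```

Because the key is computed by a function, the same sketch covers grouping by an expression: pass a different `key_fn`.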
![Page 20: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/20.jpg)
ORDER
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = ORDER records BY year, quality DESC;
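In the ORDER statement, DESC applies only to the field it follows, so `year` sorts ascending and `quality` descending within each year. A small Python sketch of the same ordering, using made-up tuples:

```python
# Hypothetical (year, temperature, quality) tuples.
records = [(1950, 22, 1), (1949, 111, 4), (1950, 0, 5), (1949, 78, 1)]

# ORDER records BY year, quality DESC:
# year ascending (the default), ties broken by quality descending.
# Negating the numeric quality field flips its sort direction.
combined = sorted(records, key=lambda r: (r[0], -r[2]))
```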
![Page 21: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/21.jpg)
Parallelism
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality PARALLEL 50;
Can use the PARALLEL keyword on any operator that starts a reduce phase (e.g. GROUP, COGROUP, JOIN, CROSS, DISTINCT, ORDER)
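What `PARALLEL 50` asks for is 50 reduce tasks. The sketch below illustrates the general idea of hash partitioning, where every tuple with the same key lands on the same reducer. It is written in plain Python under that assumption; `partition` and the sample tuples are hypothetical, and the exact partitioner a real job uses may differ.

```python
def partition(relation, key_fn, num_reducers):
    """Assign each record to one of `num_reducers` buckets by hashing its key."""
    partitions = [[] for _ in range(num_reducers)]
    for record in relation:
        # Same key -> same hash -> same reducer, so each group stays together.
        partitions[hash(key_fn(record)) % num_reducers].append(record)
    return partitions


# Made-up (year, temperature, quality) tuples.
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 4), (1949, 78, 9)]

# GROUP records BY quality PARALLEL 50: key on the third field, 50 buckets.
partitions = partition(records, lambda r: r[2], 50)
```

With only a handful of distinct keys, most of the 50 buckets stay empty, which is why the useful degree of parallelism depends on the data as well as on the number requested.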
![Page 22: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/22.jpg)
User Defined Functions
• Unlike SQL, can invoke custom defined functions in query
  – Proprietary solutions like PL/SQL allow that
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
grunt> combined = GROUP records BY MyFunc(quality);
![Page 23: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/23.jpg)
PIG LATIN Review

| Category | Operator |
| --- | --- |
| Loading and Storing | LOAD, STORE, DUMP |
| Filtering | FILTER, DISTINCT, FOREACH … GENERATE, STREAM |
| Grouping and Joining | JOIN, COGROUP, CROSS |
| Sorting | ORDER, LIMIT |
| Combining and Splitting | UNION, SPLIT |
![Page 24: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/24.jpg)
Revisiting WordCount
grunt> sentences = LOAD 'input/*.txt'
>> USING TextLoader() AS (sentence: chararray);
grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
grunt> word_kinds = GROUP words BY word;
grunt> word_count = FOREACH word_kinds
>> GENERATE group, COUNT(words);
grunt> STORE word_count INTO 'output/wordcount';
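Each statement in the script produces an intermediate relation. The Python sketch below mirrors those steps one by one so the shape of each relation is visible. It is an analogy rather than Pig's implementation: the `tokenize` helper is a hypothetical stand-in that splits on whitespace only (a simplification of TOKENIZE), and the sentences are made up.

```python
from collections import defaultdict


def tokenize(sentence):
    """Simplified stand-in for TOKENIZE: split on whitespace only."""
    return sentence.split()


# sentences: one chararray per input line (made-up sample).
sentences = ["the quick brown fox", "the lazy dog", "the fox"]

# words: TOKENIZE gives a bag per sentence; FLATTEN unnests it to one word per tuple.
words = [word for sentence in sentences for word in tokenize(sentence)]

# word_kinds: GROUP words BY word gives (group, bag of matching tuples).
word_kinds = defaultdict(list)
for word in words:
    word_kinds[word].append(word)

# word_count: GENERATE group, COUNT(words) gives the size of each bag.
word_count = {group: len(bag) for group, bag in word_kinds.items()}
```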
![Page 25: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/25.jpg)
No more this …
![Page 26: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/26.jpg)
Related Project: Hive
• Started at Facebook, now open source
• Like PIG but supports SQL
• Trend : Move towards in-database MapReduce
• Allows existing DB applications to scale up
• Makes MapReduce capabilities easily accessible
• Business opportunity: www.vertica.com
![Page 27: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/27.jpg)
Summary (this and last class)
• MapReduce as a radically different solution to large data problems
• Exploit the power of parallel I/O and computation
• Need to think from the “ground up”
  – Filesystem: HDFS
  – Table store: HBase
• Basic MapReduce too complicated for DB end users
![Page 28: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/28.jpg)
Summary (this and last class)
• Efforts to simplify MapReduce based data processing
• PIG from Yahoo!
• Pig Latin: a not-so-SQL-like language
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Facebook Hive supports direct SQL interface
• Emerging trend: Fusion of MapReduce and DB technologies
![Page 29: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062718/56649eb15503460f94bb7c04/html5/thumbnails/29.jpg)
Happy Thanksgiving!