Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Post on 21-Oct-2014
Slides from presentation on using Hadoop and Hive as a new data analysis platform. Presented at the ChicagoDB user group on February 21st, 2011.
![Page 1: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/1.jpg)
Introduction to Data Analysis with Hadoop and Hive
Jonathan Seidman
ChicagoDB
February 21, 2011
![Page 2: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/2.jpg)
About Me
• Lead Engineer on Business Intelligence/Data Infrastructure team at Orbitz, former member of Machine Learning team
• Co-organizer/founder of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/)
• Recovering Java developer
• [email protected]
• @jseidman
• @OrbitzTalent
![Page 3: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/3.jpg)
Why Hadoop and Hive?
![Page 4: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/4.jpg)
Some Hadoop “Clichés” (Which are still true…)
Hadoop allows you to store and process data that was previously impractical because of cost, technical issues, etc.
![Page 5: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/5.jpg)
Utterly redonkulous amounts of money
$ per managed TB
![Page 6: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/6.jpg)
Utterly redonkulous amounts of money
More reasonable amounts of money
$ per managed TB
![Page 7: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/7.jpg)
Adding data to our data warehouse also requires a lengthy plan/implement/deploy cycle.
Because of the expense and time involved, our data teams need to be very judicious about which data gets added. This means that potentially valuable data may not be saved.
![Page 8: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/8.jpg)
Hadoop brings our cost per TB down to $1500 (or even less)
![Page 9: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/9.jpg)
Hadoop Distributed File System
HDFS provides economical, reliable, fault-tolerant, and scalable storage of very large datasets across machines in a cluster.
![Page 10: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/10.jpg)
Some Hadoop “Clichés” (Which are still true…)
Hadoop places no constraints on how data is processed.
![Page 11: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/11.jpg)
Some Hadoop “Clichés” (Which are still true…)
Hadoop makes it relatively easy to efficiently process all the data stored in HDFS.
MapReduce is a programming model for efficient distributed processing, designed to reliably perform computations on large volumes of data in parallel.
MapReduce removes much of the burden of writing distributed computations.
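The programming model can be illustrated in a few lines of plain Python. This is only a single-process sketch of the map, shuffle, and reduce phases for a word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not part of any Hadoop API; on a real cluster the map and reduce calls would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["to be or not to be", "to do"]))
```

The developer writes only the map and reduce functions; grouping by key, distribution, and retries are what the framework takes over.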
![Page 12: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/12.jpg)
The Problem with MapReduce
    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }
![Page 13: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/13.jpg)
Hive Overview
Hive is an open-source data warehousing solution built on top of Hadoop which allows for easy data summarization, ad-hoc querying and analysis of large datasets stored in Hadoop.
Developed at Facebook to provide a structured data model over Hadoop data.
Simplifies Hadoop data analysis – users can use a familiar SQL model rather than writing low-level custom code.
Hive queries are compiled into Hadoop MapReduce jobs.
Designed for scalability, not low latency.
![Page 14: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/14.jpg)
Hive provides the basis for a new data analysis infrastructure.
We currently run Hive 0.6.0 with Cloudera CDH2 (Hadoop 0.20.1)
![Page 15: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/15.jpg)
Hive Architecture (Simplified)
![Page 16: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/16.jpg)
Hive Overview – Comparison to Traditional DBMS Systems
Although Hive uses a model familiar to database users, it does not support a full relational model and only supports a subset of SQL.
Schema on read vs. schema on write
What Hadoop/Hive offers is highly scalable and fault-tolerant processing of very large data sets.
However, Hive is moving more and more towards being a parallel DBMS.
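The "schema on read" point can be made concrete with a small sketch. Here raw tab-delimited lines are stored untouched, and column names and types are applied only when a query reads them. The `SCHEMA` list and the `read_with_schema` helper are hypothetical names chosen for this illustration and do not mirror Hive internals; turning an unparseable value into `None` is meant to mimic how a mismatched field typically surfaces as NULL at query time.

```python
# Hypothetical schema: (column name, type converter) pairs applied at read time.
SCHEMA = [("session_id", str), ("number_of_guests", int), ("number_of_rooms", int)]


def read_with_schema(raw_lines, schema=SCHEMA):
    """Interpret raw tab-delimited lines using the schema; nothing was checked on write."""
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        row = {}
        for (name, convert), value in zip(schema, fields):
            try:
                row[name] = convert(value)
            except ValueError:
                # A value that does not fit the declared type reads as None (NULL).
                row[name] = None
        yield row


if __name__ == "__main__":
    stored = ["abc123\t2\t1\n", "def456\ttwo\t1\n"]  # written with no validation
    for row in read_with_schema(stored):
        print(row)
```

With schema on write, the second line would have been rejected at load time; with schema on read, loading is cheap and the mismatch only appears when the data is queried.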
![Page 17: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/17.jpg)
Hive - Data Model
Tables – analogous to tables in a standard RDBMS.
Partitions and buckets – Allow Hive to prune data during query processing.
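A rough sketch of why partitions and buckets let Hive prune data: if each partition value maps to its own directory, a query that filters on the partition column only has to read the matching directories. The `PARTITIONS` dictionary, the `scan` function, and the `bucket_for` helper below are illustrative assumptions, and CRC32 merely stands in for whatever hash function Hive actually uses for bucketing.

```python
import zlib

# Hypothetical layout: one entry per partition directory, keyed by the dt value.
PARTITIONS = {
    "2011-02-19": ["s1\thotel_a", "s2\thotel_b"],
    "2011-02-20": ["s3\thotel_a"],
    "2011-02-21": ["s4\thotel_c", "s5\thotel_a"],
}


def scan(partitions, dt=None):
    """Return (rows, partitions_read); a dt filter skips every other partition."""
    selected = [key for key in sorted(partitions) if dt is None or key == dt]
    rows = [row for key in selected for row in partitions[key]]
    return rows, len(selected)


def bucket_for(key, num_buckets):
    """Assign a row to a bucket by hashing its key (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    print(scan(PARTITIONS, dt="2011-02-20"))
    print(bucket_for("s1", 4))
```

Filtering on `dt` touches one of the three partitions instead of all of them, which is the pruning the slide refers to.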
![Page 18: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/18.jpg)
Not Yet, But Soon
Multiple databases
Views
Indexes
![Page 19: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/19.jpg)
Hive – Data Types
Supports primitive types such as int, double, and string.
Also supports complex types such as structs, maps (key/value tuples), and arrays (indexable lists).
![Page 20: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/20.jpg)
Extensible Storage Model
Row formats determine how records are stored.
Row format is defined by a SerDe (Serializer-Deserializer).
Container format is determined by the file format.
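To illustrate the split between row format and container format, the sketch below shows a toy serializer/deserializer pair for delimited rows. It is a conceptual illustration only: the `DelimitedSerDe` class and its methods are invented for this example and are not Hive's SerDe interface, which is implemented in Java.

```python
class DelimitedSerDe:
    """Toy row format: fields joined by a delimiter (illustrative, not Hive's API)."""

    def __init__(self, columns, delimiter="\t"):
        self.columns = columns
        self.delimiter = delimiter

    def serialize(self, row):
        """Turn a dict of column values into one stored line."""
        return self.delimiter.join(str(row[name]) for name in self.columns)

    def deserialize(self, line):
        """Turn one stored line back into a dict keyed by column name."""
        return dict(zip(self.columns, line.split(self.delimiter)))


if __name__ == "__main__":
    serde = DelimitedSerDe(["hotel_id", "position"])
    line = serde.serialize({"hotel_id": "1", "position": "3"})
    print(repr(line), serde.deserialize(line))
```

The file format then decides how such serialized rows are packed into files, independently of how each row is encoded.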
![Page 21: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/21.jpg)
Hive – Hive Query Language
HiveQL – Supports basic SQL-like operations such as select, join, aggregate, union, sub-queries, etc.
HiveQL queries are compiled into MapReduce jobs.
Supports embedding custom MapReduce scripts.
Built-in support for standard relational, arithmetic, and boolean operators.
Supports aggregate functions, including statistical functions (avg, standard deviation, covariance, percentiles).
![Page 22: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/22.jpg)
Hive – User Defined Functions
HiveQL is extensible through user-defined functions (UDFs) implemented in Java.
Also supports user-defined aggregation functions (UDAFs).
Provides user-defined table functions (UDTFs) when more than one value needs to be returned.
![Page 23: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/23.jpg)
Hive – User Defined Functions
Example UDF – Find hotel’s position in an impression list:
    package com.orbitz.hive;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    /**
     * returns hotel_id's position given a hotel_id and impression list
     */
    public final class GetPos extends UDF {
        public Text evaluate(final Text hotel_id, final Text impressions) {
            if (hotel_id == null || impressions == null)
                return null;
            String[] hotels = impressions.toString().split(";");
            String position;
            String id = hotel_id.toString();
            int begin=0, end=0;
            for (int i=0; i<hotels.length; i++) {
                begin = hotels[i].indexOf(",");
                end = hotels[i].lastIndexOf(",");
                position = hotels[i].substring(begin+1,end);
                if (id.equals(hotels[i].substring(0,begin)))
                    return new Text(position);
            }
            return null;
        }
    }
![Page 24: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/24.jpg)
Hive – User Defined Functions
    hive> add jar path-to-jar/pos.jar;
    hive> create temporary function getpos as 'com.orbitz.hive.GetPos';
    hive> select getpos('1', '1,3,100.00;2,1,100.00');
    …
    hive> 3
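For readers who want to check the lookup logic without a Hive installation, here is a Python port of the GetPos idea. It assumes, based only on the example call, that each impression entry has the layout hotel_id,position,price and that entries are separated by semicolons. The `get_pos` name is chosen for this sketch, and unlike the Java version it skips malformed entries instead of raising an exception.

```python
def get_pos(hotel_id, impressions):
    """Return the position string for hotel_id, or None if it is not listed."""
    if hotel_id is None or impressions is None:
        return None
    for entry in impressions.split(";"):
        begin = entry.find(",")
        end = entry.rfind(",")
        # Skip entries that do not have at least two commas (assumed layout:
        # hotel_id,position,price).
        if begin == -1 or begin == end:
            continue
        if entry[:begin] == hotel_id:
            return entry[begin + 1:end]
    return None


if __name__ == "__main__":
    print(get_pos("1", "1,3,100.00;2,1,100.00"))
```

Called with the same arguments as the Hive session, the function returns the string "3", matching the result shown on the slide.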
![Page 25: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/25.jpg)
Hive MapReduce
Allows analysis not possible through standard HiveQL queries.
Can be implemented in any language.
![Page 26: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/26.jpg)
Hive MapReduce
    #!/usr/bin/python
    import sys

    # Emit one hotel<TAB>position row per impression in each input line.
    for line in sys.stdin:
        line = line.strip().replace(';', '|')
        impressions = line.split('|')
        for impression in impressions:
            fields = impression.split(',')
            print("%s\t%s" % (fields[0], fields[1]))

    hive> ADD FILE /home/jseidman/parse_impressions.py;
    hive> FROM
        > hotel_searches
        > SELECT
        > TRANSFORM(impressions)
        > USING
        > 'parse_impressions.py'
        > AS
        > hotel, pos;
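A TRANSFORM script only has to read rows from standard input and write tab-separated rows to standard output, so its logic can be exercised locally before it is shipped to the cluster. The sketch below wraps the same parsing in a function that takes explicit streams. The `transform` name and the sample input are assumptions for illustration, and entries with fewer than two fields are skipped here as a defensive choice.

```python
import io


def transform(instream, outstream):
    """Read one impressions string per line; write hotel<TAB>position rows."""
    for line in instream:
        for impression in line.strip().replace(";", "|").split("|"):
            fields = impression.split(",")
            if len(fields) >= 2:
                outstream.write("%s\t%s\n" % (fields[0], fields[1]))


if __name__ == "__main__":
    out = io.StringIO()
    transform(io.StringIO("1,3,100.00;2,1,100.00\n"), out)
    print(out.getvalue(), end="")
```

Passing `sys.stdin` and `sys.stdout` instead of the in-memory streams would give the same behavior as a standalone streaming script.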
![Page 27: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/27.jpg)
Processing Web Analytics Logs
Hive provides the infrastructure to support analysis of web analytics logs stored in Hadoop
Used to support analysis for machine learning tasks, cache optimization, keyword performance, etc.
![Page 28: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/28.jpg)
Processing Flow – Step 1
![Page 29: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/29.jpg)
Processing Flow – Step 2
![Page 30: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/30.jpg)
Processing Flow – Step 3
![Page 31: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/31.jpg)
Processing Flow – Step 4
![Page 32: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/32.jpg)
Processing Flow – Step 5
![Page 33: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/33.jpg)
Processing Flow – Step 6
![Page 34: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/34.jpg)
Importing Prepared Data to Hive
    $HIVE_HOME/bin/hive -e "LOAD DATA INPATH \
    '/output/part-00000' OVERWRITE INTO
    TABLE hotel_searches PARTITION(dt='$YEAR-$MONTH-$DAY')"

    CREATE TABLE hotel_searches(
      session_id STRING, host STRING, visitors_ip STRING,
      search_date STRING, search_time STRING,
      dept_date STRING, ret_date STRING,
      destination STRING, location_id STRING,
      number_of_guests INT, number_of_rooms INT,
      impressions STRING)
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
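When the daily load is driven from a script rather than typed by hand, the statement can be assembled from a date value. The helper below is a minimal sketch: `load_statement` is a hypothetical name, and it only builds the HiveQL string; how it is handed to the Hive command line is left to the surrounding job.

```python
import datetime


def load_statement(hdfs_path, table, day):
    """Build a LOAD DATA statement that targets the dt partition for one day."""
    return (
        "LOAD DATA INPATH '%s' OVERWRITE INTO TABLE %s PARTITION(dt='%s')"
        % (hdfs_path, table, day.strftime("%Y-%m-%d"))
    )


if __name__ == "__main__":
    print(load_statement("/output/part-00000", "hotel_searches",
                         datetime.date(2011, 2, 21)))
```

Formatting the date in one place keeps the partition value consistent with the dt STRING column declared in the table definition.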
![Page 35: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/35.jpg)
Exporting Data from Hive Tables
    hive> INSERT OVERWRITE LOCAL DIRECTORY
        > '/tmp/searches.dat'
        > SELECT * FROM hotel_searches;
![Page 36: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/36.jpg)
Analyzing Prepared Data
Example - Find the Position of Each Booked Hotel in Search Results:
    CREATE TABLE positions(
      session_id STRING,
      booked_hotel_id STRING,
      position INT);

    INSERT OVERWRITE TABLE positions
    SELECT h.session_id, h.booked_hotel_id, i.position
    FROM hotel_impressions i JOIN hotel_bookings h
    ON (h.booked_hotel_id = i.hotel_id and h.session_id = i.session_id);
![Page 37: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/37.jpg)
Analyzing Prepared Data
Example - Aggregate Booking Position by Location by Day:
    CREATE TABLE position_aggregate_by_day(
      location_id STRING,
      booking_date STRING,
      position INT,
      pcount INT);

    INSERT OVERWRITE TABLE
      position_aggregate_by_day
    SELECT
      h.location_id, h.booking_date, i.position, count(1)
    FROM
      hotel_bookings h JOIN hotel_impressions i
    ON
      (i.hotel_id = h.booked_hotel_id and i.session_id = h.session_id)
    GROUP BY
      h.location_id, h.booking_date, i.position;
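The join-then-aggregate query can be traced on a handful of in-memory rows. The sketch below mirrors its semantics in plain Python: an inner join on session and hotel, followed by a count per (location, date, position) group. The sample rows are invented for illustration; only the column names come from the queries.

```python
from collections import Counter

# Hypothetical sample rows; column names follow the HiveQL queries.
hotel_bookings = [
    {"session_id": "s1", "booked_hotel_id": "h1", "location_id": "CHI", "booking_date": "2011-02-20"},
    {"session_id": "s2", "booked_hotel_id": "h2", "location_id": "CHI", "booking_date": "2011-02-20"},
    {"session_id": "s3", "booked_hotel_id": "h1", "location_id": "NYC", "booking_date": "2011-02-21"},
]
hotel_impressions = [
    {"session_id": "s1", "hotel_id": "h1", "position": 1},
    {"session_id": "s1", "hotel_id": "h2", "position": 2},
    {"session_id": "s2", "hotel_id": "h2", "position": 1},
    {"session_id": "s3", "hotel_id": "h1", "position": 3},
]


def position_aggregate_by_day(bookings, impressions):
    """Inner join on (session_id, hotel id), then count rows per group."""
    by_key = {}
    for i in impressions:
        by_key.setdefault((i["session_id"], i["hotel_id"]), []).append(i["position"])
    counts = Counter()
    for b in bookings:
        for position in by_key.get((b["session_id"], b["booked_hotel_id"]), []):
            counts[(b["location_id"], b["booking_date"], position)] += 1
    return dict(counts)


if __name__ == "__main__":
    for group, pcount in sorted(position_aggregate_by_day(hotel_bookings, hotel_impressions).items()):
        print(group, pcount)
```

The impression for hotel h2 in session s1 drops out of the result because that hotel was shown but not booked, which is exactly what the inner join condition expresses.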
![Page 38: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/38.jpg)
Hive vs. Pig
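Both are high-level languages that compile to MapReduce jobs, but HiveQL is declarative and SQL-like, while Pig Latin is a procedural data-flow scripting language.
Hive requires an explicit schema; in Pig the schema can be left implicit.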
Hive metadata can be accessed by external tools.
![Page 39: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/39.jpg)
Hive vs. HBase
HBase is a column-oriented key-value store and does not offer an SQL query model.
HBase offers lower latency and random access to data.
Hive/HBase integration was recently released, allowing Hive queries to be executed over HBase tables.
![Page 40: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/40.jpg)
Hive – Lessons Learned
Job scheduling – Default Hadoop scheduling is FIFO. Consider using something like the fair scheduler.
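Multi-user Hive – Default install is single user, using an embedded Derby metastore. Multi-user installs require an external relational database for the metastore.
set mapred.reduce.tasks is your friend when the default number of reducers does not suit a query.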
Migrating Hive between clusters is not fun.
Documentation is still a little sparse.
![Page 41: Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54466afdb1af9fe83a8b45fd/html5/thumbnails/41.jpg)
References
• Hadoop project: http://hadoop.apache.org/
• Hive project: http://hadoop.apache.org/hive/
• Hive – A Petabyte Scale Data Warehouse Using Hadoop: http://i.stanford.edu/~ragho/hive-icde2010.pdf
• Hadoop The Definitive Guide, Second Edition, Tom White, O’Reilly Press, 2011
• Hive Evolution, John Sichi, November 2010: http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010