Hive – A Warehousing Solution Over a MapReduce Framework
Bingbing Liu
2009-12-12
Outline
• Introduction
• Data Model
• Architecture
• HiveQL
What is Hive?
• A system for managing and querying structured data built on top of Hadoop
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files
• Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility – Types, Functions, Formats, Scripts
– Scalability and Performance
Hive/Hadoop Usage @ Facebook
• Types of Applications:
– Reporting
    • E.g., daily/weekly aggregations of impression/click counts
    • Complex measures of user engagement
– Ad hoc Analysis
    • E.g., how many group admins, broken down by state/country
– Data Mining (assembling training data)
    • E.g., user engagement as a function of user attributes
– Spam Detection
    • Anomalous patterns for Site Integrity
    • Application API usage patterns
– Ad Optimization
– Too many to count ..
• 700 terabytes of data
• 5000 queries/day
• More than 100 users
Data Warehousing at Facebook Today
[Diagram. Components shown: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]
Data Model
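• How Hive organizes data:
– Tables: conceptually similar to a table in an RDBMS; in storage, each table corresponds to an HDFS directory.
– Partitions: each table has one or more partitions, which determine how the data is distributed across subdirectories.
– Buckets: within each partition, data is assigned to buckets by hashing a column; each bucket is a file.
For example, to partition the data by column ds:
Create table sc (sno int) partitioned by (ds string)
Rows with ds=2009-12-08 are then stored in the partition subdirectory
/sc/ds=2009-12-08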
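The mapping from partition values to subdirectories, and from a column hash to a bucket, can be sketched in a few lines of Python. This is only an illustration of the idea on this slide: the helper names `partition_dir` and `bucket_index` are hypothetical, and the toy hash (an integer hashes to itself) is an assumption, not a statement about Hive's actual hash function.

```python
def partition_dir(table_dir, partition_spec):
    """Build the subdirectory for one partition, e.g. /sc/ds=2009-12-08."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_index(column_value, num_buckets):
    """Pick a bucket for a row from the hash of its bucketing column."""
    # Assumption: an integer hashes to itself; a real system defines its own hash.
    return column_value % num_buckets


print(partition_dir("/sc", [("ds", "2009-12-08")]))  # /sc/ds=2009-12-08
print(bucket_index(7, 4))                            # 3
```

Because the partition value is encoded in the directory name, a query restricted to one ds value only needs to read that one subdirectory.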
Data Model
[Diagram: storage layout of table sc, showing Logical Partitioning and Hash Partitioning]
HDFS: /hive/sc, /hive/sc/ds=2009-12-08, /hive/sc/ds=2009-12-08/sc.txt, …
MetaStore: Metastore DB with Tables (student, course), Data Location, Bucketing Info, Partitioning Cols
Metastore
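• Stored locally or in a traditional RDBMS (not in HDFS).
• Database
– The namespace for all tables; the default is "default".
• Table
– Holds the column list with column types, plus storage and serialization/deserialization (SerDe) information.
– Storage covers the underlying location of the data, the data format (type), and bucket information.
• Partition
– Each partition can have its own columns, SerDe information, and storage information.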
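As a rough picture of the metadata described above, the three kinds of entries can be modeled as plain records. This is a sketch only: the class and field names, the "text" data format, and the SerDe name are all hypothetical placeholders, not the real metastore schema. The two locations are the paths from the Data Model slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class Storage:
    location: str         # underlying location of the data
    data_format: str      # data format (type); value below is a placeholder
    num_buckets: int = 0  # bucket information


@dataclass
class Partition:
    values: Dict[str, str]  # partition column -> value
    storage: Storage
    serde: str              # serialization/deserialization info
    columns: List[Column] = field(default_factory=list)


@dataclass
class Table:
    name: str
    columns: List[Column]
    storage: Storage
    serde: str
    database: str = "default"  # namespace; "default" unless stated otherwise
    partitions: List[Partition] = field(default_factory=list)


sc = Table(
    name="sc",
    columns=[("sno", "int"), ("cno", "int"), ("grade", "int")],
    storage=Storage(location="/hive/sc", data_format="text"),
    serde="hypothetical-text-serde",
)
sc.partitions.append(
    Partition(
        values={"ds": "2009-12-08"},
        storage=Storage(location="/hive/sc/ds=2009-12-08", data_format="text"),
        serde="hypothetical-text-serde",
    )
)
print(sc.database, sc.partitions[0].storage.location)
```

Keeping storage and SerDe information on each partition, not only on the table, is what lets a partition differ from its table in format or location.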
Architecture
[Diagram: Hive architecture. Components shown: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, Parser, Planner, Optimizer, Execution, SerDe (Thrift, Jute, JSON ..), MetaStore, DB, Map Reduce, HDFS]
HiveQL – Hive Query Language
• Supports:
– Select, project, aggregate, union all
– Load data into a table from a local or HDFS directory
– Equi-joins
– Subqueries in the FROM clause
– Multi-table Insert
– Multi-group-by
Example
• Student (sno int, sname string, class int)
• Course (cno int, cname string)
• Sc (sno int, cno int, grade int) partitioned by (ds string)
The traditional row-level insert, e.g. Insert into table test( 1 , 1 , 1); is not supported.
HiveQL- Join
• SQL:
INSERT OVERWRITE TABLE test
SELECT t1.sname,t2.cno
FROM student t1 JOIN sc t2 ON (t1.sno = t2.sno);
student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

sc
Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

test (student X sc = test)
sname  cno
Wang   1
Wang   2
Zhang  1
Zhang  2
HiveQL – Join in Map Reduce
Input:
student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

sc
Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

Map (key = sno; value = <table tag, column>, with tag 0 for student and tag 1 for sc):
student
key  value
1    <0,Wang>
2    <0,Zhang>
3    <0,Zhou>
4    <0,Chen>

sc
key  value
1    <1,1>
1    <1,2>
2    <1,1>
2    <1,2>

Shuffle/Sort (values grouped by key):
key  value
1    <0,Wang>
1    <1,1>
1    <1,2>

key  value
2    <0,Zhang>
2    <1,1>
2    <1,2>

key  value
3    <0,Zhou>
4    <0,Chen>

Reduce
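The three phases on this slide can be imitated in plain Python on the same sample rows. This is a toy, single-process sketch of a tagged reduce-side join, not Hive's implementation; the function names are made up for illustration.

```python
from collections import defaultdict

student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]
sc = [(1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)]


def map_phase():
    # Emit (join key, <table tag, column>): tag 0 = student, tag 1 = sc.
    for sno, sname, _class in student:
        yield sno, (0, sname)
    for sno, cno, _grade in sc:
        yield sno, (1, cno)


def shuffle_sort(pairs):
    # Group values by key; within a key, student rows (tag 0) come first.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: sorted(v, key=lambda t: t[0]) for k, v in sorted(groups.items())}


def reduce_phase(groups):
    # For each key, pair every student value with every sc value.
    for values in groups.values():
        snames = [v for tag, v in values if tag == 0]
        cnos = [v for tag, v in values if tag == 1]
        for sname in snames:
            for cno in cnos:
                yield sname, cno


test_rows = list(reduce_phase(shuffle_sort(map_phase())))
print(test_rows)  # [('Wang', 1), ('Wang', 2), ('Zhang', 1), ('Zhang', 2)]
```

Keys 3 and 4 reach the reducer with only a student value, so they produce no output rows, which matches the inner-join result in table test.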
Query plan
Map:
TableScanOperator
    Table: student
    [sno int, sname string, class int]
ReduceSinkOperator
    Partition cols: col[0]
    [0 int, 1 string, 2 int]
TableScanOperator
    Table: sc
    [sno int, cno int, grade int]
ReduceSinkOperator
    Partition cols: col[0]
    [0 int, 1 int, 2 int]
Reduce:
JoinOperator
    Predicate: col[0,0] = col[1,0]
    [0 int, 1 string, 2 int, 3 int, 4 int, 5 int]
SelectOperator
    Expressions: [col[1], col[4]]
    [0 string, 1 int]
FileOutputOperator
    Table: test
    [0 string, 1 int]
Hive QL – Group By
SELECT student.class, count(1)
FROM student
GROUP BY student.class;
student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

Result:
Class  count
1      2
2      2
Hive QL – Group By in Map Reduce
Input (student, in two splits):
Sno  Sname  Class
1    Wang   1
2    Zhang  1

Sno  Sname  Class
3    Zhou   2
4    Chen   2

Map (key = class, value = 1):
key  value
1    1
1    1

key  value
2    1
2    1

Shuffle/Sort:
key  value
1    1
1    1

key  value
2    1
2    1

Reduce:
class  count
1      2

class  count
2      2
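The same flow can be imitated in plain Python on the sample rows. This is a toy, single-process sketch of count-by-key in the map/shuffle/reduce style, not Hive's implementation; the split boundaries follow the slide and the function names are made up for illustration.

```python
from collections import defaultdict

# The student table, divided into two input splits as on the slide.
splits = [
    [(1, "Wang", 1), (2, "Zhang", 1)],
    [(3, "Zhou", 2), (4, "Chen", 2)],
]


def map_phase(split):
    # Emit (class, 1) for every row in the split.
    return [(cls, 1) for _sno, _sname, cls in split]


def shuffle_sort(mapped):
    # Bring all values for the same key together, ordered by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    # count(1) per key is the sum of the emitted ones.
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle_sort(map_phase(s) for s in splits))
print(counts)  # {1: 2, 2: 2}
```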
Query plan
TableScanOperator
    Table: student
    [sno int, sname string, class int]
ReduceSinkOperator
    Partition cols: col[2]
    [0 int, 1 string, 2 int]
GroupByOperator
    Aggregations: [count[2]]
    Keys: [col[2]]
    [0 int, 1 bigint]
FileOutputOperator
    Table: tmp1
    [0 int, 1 bigint]
TableScanOperator
    Table: tmp1
    [0 int, 1 bigint]
ReduceSinkOperator
    Partition cols: col[0]
    [0 int, 1 bigint]
SelectOperator
    Expressions: [col[0], col[1]]
    [0 int, 1 bigint]
Note on the aggregation key: what if the query groups by sno, class? Would column 0 become <int, int>?
Multi group by