1 Analyzing Twitter Data with Hadoop Gwen Shapira, Software Engineer @Gwenshap ©2012 Cloudera, Inc.


Post on 22-Dec-2015


Page 1:

Analyzing Twitter Data with Hadoop
Gwen Shapira, Software Engineer
@Gwenshap

©2012 Cloudera, Inc.

Page 2:

IOUG SIG Meetings at OpenWorld

All meetings located in Moscone South - Room 208

Monday, September 29
• Exadata SIG: 2:00 p.m. - 3:00 p.m.
• BIWA SIG: 5:00 p.m. - 6:00 p.m.

Tuesday, September 30
• Internet of Things SIG: 11:00 a.m. - 12:00 p.m.
• Storage SIG: 4:00 p.m. - 5:00 p.m.
• SPARC/Solaris SIG: 5:00 p.m. - 6:00 p.m.

Wednesday, October 1
• Oracle Enterprise Manager SIG: 8:00 a.m. - 9:00 a.m.
• Big Data SIG: 10:30 a.m. - 11:30 a.m.
• Oracle 12c SIG: 2:00 p.m. - 3:00 p.m.
• Oracle Spatial and Graph SIG: 4:00 p.m. (*OTN lounge)

Page 3:

COLLABORATE 15 – IOUG Forum
April 12-16, 2015
Mandalay Bay Resort and Casino
Las Vegas, NV

COLLABORATE 15 Call for Speakers ends October 10

The IOUG Forum Advantage

• Save more than $1,000 on education offerings like pre-conference workshops
• Access the brand-new, specialized IOUG Strategic Leadership Program
• Priority access to the hands-on labs with Oracle ACE support
• Advance access to supplemental session material and presentations
• Special IOUG activities with no "ante in" needed - evening networking opportunities and more

www.collaborate.ioug.org

Follow us on Twitter at @IOUG or via the conference hashtag #C15LV!

Page 4:

©2014 Cloudera, Inc. All rights reserved.

I have 15 years of experience in moving data around

Page 5:

In my spare time…

• Oracle ACE Director
• Member of Oak Table
• Blogger
• Presenter – Hotsos, IOUG, OOW, OSCON
• NoCOUG board
• Contributor to Apache Oozie, Sqoop, Kafka
• Author – Hadoop Application Architectures

Page 6:

Analyzing Twitter Data with Hadoop

BUILDING A HADOOP APPLICATION

Page 7:

Page 8:

High Level Architecture

[Diagram components: Data Source, Flume, HDFS, Hive + Oozie, Impala / Oracle]

Page 9:

Analyzing Twitter Data with Hadoop

AN EXAMPLE USE CASE

Page 10:

Analyzing Twitter

• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
• Which topics are trending?

Page 11:

Analyzing Twitter Data with Hadoop

HOW DO WE ANSWER THESE QUESTIONS?

Page 12:

Techniques

• Bring Data with Flume
• Complex data
  • Deeply nested
  • Variable schema
• Clean, Standardize, Partition, etc.
• SQL
  • Filtering
  • Aggregation
  • Sorting

Page 13:

Analyzing Twitter Data with Hadoop

FLUME

Page 14:

Flume Agent design

Page 15:

In our case…

• Twitter source
  • Pulls JSON format files from Twitter
• Memory Channel
• HDFS Sink – directory per hour

Page 16:

What is JSON?

{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        { "text": "Crowdsourcing", "indices": [0, 14] },
        { "text": "bigdata", "indices": [129, 137] }
      ],
      "user_mentions": []
    }
  }
}
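To make the "deeply nested" point concrete, here is a minimal Python sketch that walks the tweet structure shown on this slide and pulls out the hashtag texts. The `hashtags` helper and the `RAW_TWEET` constant are names made up for this illustration, and the tweet text is shortened; none of this is part of the original deck.

```python
import json

# Trimmed copy of the slide's tweet; the "text" value is shortened here.
RAW_TWEET = """
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing ... #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {"text": "Crowdsourcing", "indices": [0, 14]},
        {"text": "bigdata", "indices": [129, 137]}
      ],
      "user_mentions": []
    }
  }
}
"""


def hashtags(tweet):
    """Return hashtag texts from the nested retweeted_status, or [] if absent."""
    status = tweet.get("retweeted_status") or {}
    entities = status.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags", [])]


tweet = json.loads(RAW_TWEET)
print(hashtags(tweet))  # hashtags of the retweeted tweet
print(hashtags({}))     # a record without retweeted_status yields an empty list
```

The defensive `.get(...) or {}` calls reflect the variable schema: not every record carries every nested object.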

Page 17:

But Wait! There’s More!

• Many sources – directory, files, log4j, net, JMS
• Interceptors – process data in flight
• Selectors – choose which sink
• Many channels – Memory, file
• Many sinks – HDFS, HBase, Solr

Page 18:

High Level Pipeline Architecture

[Diagram components: Web App with Flume Avro Client (x8), Flume Agent (x4), HDFS, Spark Streaming, HBase, Report App]

Diagram annotations:

• Fan-in pattern
• Multiple agents for failover and rolling restarts
• Spark Streaming data is a subset of the whole event stream
• ML Map/Reduce jobs
• Batch report updates
• Pull near-real-time results
• Query with the HBase API or Impala
• Client provides multi-threading, compression, encryption, and batching

Page 19:

Configuration

# One source, one channel, one sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source; the API credentials are left blank on the slide
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume, sqoop, oracle, oow

# HDFS sink; the path escapes create one directory per hour
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.serializer = text

TwitterAgent.channels.MemChannel.type = memory
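The `%Y/%m/%d/%H` escapes in `hdfs.path` are what give the "directory per hour" layout. As a rough illustration only (Flume does this expansion itself, from a timestamp associated with each event), the same escapes happen to be valid `strftime` codes, so the resulting directory can be previewed in Python. The `bucket_for` helper is a hypothetical name, and the host is written here as `quickstart:8020`.

```python
from datetime import datetime

# Path template as read from the slide's sink configuration.
HDFS_PATH = "hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/"


def bucket_for(event_time):
    """Expand the year/month/day/hour escapes for one event timestamp."""
    return event_time.strftime(HDFS_PATH)


first = bucket_for(datetime(2014, 9, 30, 16, 5))
second = bucket_for(datetime(2014, 9, 30, 16, 59))
later = bucket_for(datetime(2014, 9, 30, 17, 0))

print(first)
print(first == second)  # same hour, same directory
print(first == later)   # next hour, new directory
```

Hourly directories line up naturally with hourly Hive partitions, which is presumably why the sink is configured this way.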

Page 20:

Analyzing Twitter Data with Hadoop

FLUME DEMO

Page 21:

Analyzing Twitter Data with Hadoop

HIVE

Page 22:

What is Hive?

• Created at Facebook
• HiveQL
  • SQL like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client

Page 23:

Hive Details

• Metastore contains table definitions
  • Stored in a relational database
  • Basically a data dictionary
• SerDes parse data and convert it to a table/column structure
• SerDe:
  • CSV, XML, JSON, Avro, Parquet, ORC files
  • Or write your own (We created one for CopyBook)
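Conceptually, the deserializing half of a SerDe turns one stored record into one row with a value per column. The Python sketch below only mimics that idea for JSON lines; real Hive SerDes are Java classes, and the `COLUMNS` list and `deserialize` function are hypothetical names chosen for this illustration.

```python
import json

# Hypothetical column list; a real table definition lives in the Hive metastore.
COLUMNS = ["id", "text", "user.screen_name"]


def deserialize(line, columns=COLUMNS):
    """Turn one JSON line into a row (tuple), one value per column.

    Dotted column names walk into nested objects; missing fields become None.
    """
    record = json.loads(line)
    row = []
    for column in columns:
        value = record
        for key in column.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        row.append(value)
    return tuple(row)


full = deserialize('{"id": 1, "text": "hi", "user": {"screen_name": "gwenshap"}}')
sparse = deserialize('{"id": 2}')
print(full)
print(sparse)
```

Returning `None` for absent fields mirrors how a missing value surfaces as NULL in a query, which matters for data with a variable schema.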

Page 24:

Complex Data

SELECT
  t.retweet_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweet_screen_name,
        retweeted_status.text,
        max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
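The two-level grouping in this query can be traced in plain Python. This is a sketch of the logic only, not of how Hive executes it: the inner step keeps the highest retweet count seen per (user, tweet text), and the outer step sums those per user, sorts, and applies the limit. The `SAMPLE` rows and the `top_retweeted` function are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (screen_name, text, retweet_count).
# The same retweeted tweet can appear several times with a growing count,
# which is why the inner query keeps only the max per tweet.
SAMPLE = [
    ("alice", "hadoop rocks", 5),
    ("alice", "hadoop rocks", 9),
    ("alice", "flume too", 2),
    ("bob", "impala is fast", 7),
]


def top_retweeted(rows, limit=10):
    # Inner query: max(retweet_count) GROUP BY screen_name, text
    per_tweet = {}
    for name, text, count in rows:
        key = (name, text)
        per_tweet[key] = max(per_tweet.get(key, 0), count)

    # Outer query: sum(retweets), count(*) GROUP BY screen_name
    totals = defaultdict(lambda: [0, 0])
    for (name, _text), retweets in per_tweet.items():
        totals[name][0] += retweets
        totals[name][1] += 1

    # ORDER BY total_retweets DESC LIMIT n
    ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
    return [(name, total, tweets) for name, (total, tweets) in ranked[:limit]]


result = top_retweeted(SAMPLE)
print(result)
```

Each result tuple is (screen name, total retweets, tweet count), matching the three columns selected by the outer query.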

Page 25:

Analyzing Twitter Data with Hadoop

HIVE DEMO

Page 26:

Analyzing Twitter Data with Hadoop

IT’S A TRAP

Page 27:

Not a Database

                    | RDBMS                  | Hive                                           | Impala
Language            | Generally >= SQL-92    | Subset of SQL-92 plus Hive-specific extensions | Subset of SQL-92
Update capabilities | INSERT, UPDATE, DELETE | Bulk INSERT, UPDATE, DELETE                    | Insert, truncate
Transactions        | Yes                    | Yes                                            | No
Latency             | Sub-second             | Minutes                                        | Sub-second
Indexes             | Yes                    | Yes                                            | No
Data size           | Few terabytes          | Petabytes                                      | Lots of terabytes

Page 28:

Analyzing Twitter Data with Hadoop

DATA FORMATS

Page 29:

I don’t like our data

• Lots of small files
• JSON – requires parsing
• Can’t compress
• Sensitive to changes

Page 30:

I’d rather use Avro

• Few large files containing records
• Schema in file
• Schema evolution
• Can compress
• Well supported in Hadoop
• Clients in other languages
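"Schema evolution" is the least obvious item in this list, so here is a rough illustration. It does not use Avro at all; it only mimics, in plain Python, two of Avro's schema-resolution rules: a field that the reader expects but the writer never wrote takes the reader's default, and a field the reader does not know about is ignored. `READER_SCHEMA` and `resolve` are hypothetical names.

```python
# Hypothetical reader schema: field name -> default value.
READER_SCHEMA = {"id": None, "text": "", "lang": "unknown"}


def resolve(record, reader_schema=READER_SCHEMA):
    """Project a record written with an older or newer schema onto the reader's fields."""
    return {
        field: record.get(field, default)
        for field, default in reader_schema.items()
    }


old_record = {"id": 1, "text": "hello"}                        # written before "lang" existed
new_record = {"id": 2, "text": "hi", "lang": "en", "geo": 1}   # writer later added "geo"

print(resolve(old_record))
print(resolve(new_record))
```

This is the property that makes Avro less "sensitive to changes" than raw JSON: old and new files can be read side by side with one table definition.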

Page 31:

Let’s convert

• Create table AVRO_TWEETS
• Insert into Avro_tweets select …. From tweets

Page 32:

Analyzing Twitter Data with Hadoop

IMPALA ASIDE

Page 33:

Cloudera Impala
Real-Time Query for Data Stored in Hadoop.

• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Enterprise RTQ
• FLEXIBLE: Supports multiple storage engines & file formats

Page 34:

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop

SPEED TO INSIGHT
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos

COST SAVINGS
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas

FULL FIDELITY ANALYSIS
• All data available for interactive queries
• No loss of fidelity from fixed data schemas

DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Page 35:

Cloudera Impala Details

[Diagram components: SQL App, ODBC, Hive Metastore, HDFS NN, YARN, State Store, and three nodes each with Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, and HBase]

Diagram annotations:

• Fully MPP, distributed
• Local direct reads
• Common Hive SQL and interface
• Unified metadata and scheduler
• Low-latency scheduler and cache (low-impact failures)

Page 36:

LOAD DATA TO ORACLE

Page 37:

Oracle Connectors for Hadoop

• Oracle Loader for Hadoop
• Oracle SQL Connector for Hadoop
• Big Data SQL

Page 38:

Oracle Loader for Hadoop

• Load data from Hadoop into Oracle
• Map-Reduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
• Supports Avro and compression

Page 39:

Oracle SQL Connector for Hadoop

• Run a Java app
• Creates an external table
• Runs MapReduce when external table is queried
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression

Page 40:

Big Data SQL

• Also external table
• Can also use Hive metastore for schema
• But …. NO MapReduce
• Instead – an agent will do SMART SCANS
  • Bloom filters
  • Storage indexes
  • Filters
• Supports any Hadoop data format
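For readers unfamiliar with Bloom filters, the general technique can be sketched in a few lines of Python. This is a textbook illustration and says nothing about how Big Data SQL implements its smart scans; the class name, sizes, and hash choice are assumptions made for the example. A Bloom filter answers "definitely not present" or "possibly present", which lets a scan skip rows that cannot match a join key.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bloom = BloomFilter()
for screen_name in ["gwenshap", "cloudera"]:
    bloom.add(screen_name)

print(bloom.might_contain("gwenshap"))         # an added key is always reported
print(BloomFilter().might_contain("anyone"))   # an empty filter rejects everything
```

The appeal for a storage-side agent is that the filter is small enough to ship to where the data lives, so non-matching rows are discarded before they cross the network.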

Page 41:

Analyzing Twitter Data with Hadoop

PUTTING IT ALL TOGETHER

Page 42:

High Level Architecture

[Diagram components: Data Source, Flume, HDFS, Hive + Oozie, Impala / Oracle]

Page 44: