Basic Sentiment Analysis using Hive


Upload: qubole

Post on 10-May-2015


Category:

Technology



DESCRIPTION

Slide deck from a hands-on workshop. It covers the following:
1. Learn what Sentiment Analysis is and how it can be used
2. Perform pre-processing and post-processing of textual data using Hive
3. Use the n-gram language model built into Hive to perform sentiment analysis
4. Learn how to use Hive extensibility to plug in other language models

TRANSCRIPT

Page 1: Basic Sentiment Analysis using Hive

Sentiment Analysis using Hive: Secrets From the Pros

We will be starting at 11:03 PDT
Use the Chat Pane in GoToWebinar to Ask Questions!

Assess your level and learn new stuff
This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts)

Page 2: Basic Sentiment Analysis using Hive

News Cycle for “Mortgage” 2008-09
Mortgage: Crisis, Foreclosures, Fraud

[Chart: Crisis, Foreclosure, and Fraud series, each with a linear trend line; y-axis 0 to 90; x-axis tick labels 6/12/04 through 5/28/05]

# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links

Table: MemeTracker

36GB of JSON Data

Page 3: Basic Sentiment Analysis using Hive

AGENDA

This Webinar provides tips on doing basic sentiment analysis on large data sets using Hive:

• Overview of Sentiment Analysis (SA)
• Hive UDFs useful for SA
• Demo, Guided Tutorial
• Developing advanced, custom SA Engines

Page 4: Basic Sentiment Analysis using Hive

Sentiment Analysis: Applications

Gather Customer Feedback
• Direct: call center logs, emails, chat logs
• Indirect: social media, forums, review websites

Sentiment Analysis
• Over time, geography
• By customer, market segments

Use for Decision Making
• Product / service decisions
• Customer support
• Marketing: messaging, offers
• Customer retention, upsell

Page 5: Basic Sentiment Analysis using Hive

Sentiment Analysis: How to operationalize a Sentiment Analysis App

1. Crawl, Scrape, API calls, collect

2. Create “Documents”

3. Pre-process Data

4. Apply Language Model, Extract Sentiment

5. Integrate with marketing automation, CRM, CCA, etc.

OLTP

6. Improve Product, Better CS, Targeted Offers

Page 6: Basic Sentiment Analysis using Hive

Pre and Post Processing: Hive Built-In Functions

Goal: Tokenization
Input Data: (“Hello There! How are you?”)
Output Data: ( (“Hello”, “There”), (“How”, “are”, “you”))
Use this Hive UDF: sentences

Goal: Column (array) to rows
Input Data: [1, 2, 3]
Output Data: three rows: 1, 2, 3
Use this Hive UDF: explode

Goal: Navigating documents, extracting fields
Input Data: {"store": {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email":"[email protected]", "owner":"amy"}
Output Data: {"weight":8,"type":"apple"}
Use this Hive UDF: get_json_object(src_json.json, '$.store.fruit[0]')
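To make the three rows of the table concrete, here is a small Python sketch of what each built-in roughly computes. It is not Hive code and does not reproduce Hive's exact tokenization or JSON path rules; the helper names `split_sentences`, `explode`, and `get_path` are made up for this illustration.

```python
import json
import re


def split_sentences(text):
    """Rough stand-in for sentences(): a list of sentences, each a list of words."""
    # Split on sentence-ending punctuation, then keep only word-like tokens.
    parts = re.split(r"[.!?]+", text)
    return [re.findall(r"[A-Za-z0-9']+", p) for p in parts if p.strip()]


def explode(rows, column):
    """Rough stand-in for explode(): one output row per array element."""
    for row in rows:
        for item in row[column]:
            yield {**row, column: item}


def get_path(doc, keys):
    """Rough stand-in for get_json_object(): follow keys and indexes in order."""
    value = json.loads(doc)
    for key in keys:
        value = value[key]
    return value


tokens = split_sentences("Hello There! How are you?")
rows = list(explode([{"id": "a", "nums": [1, 2, 3]}], "nums"))
doc = '{"store": {"fruit": [{"weight": 8, "type": "apple"}, {"weight": 9, "type": "pear"}]}}'
first_fruit = get_path(doc, ["store", "fruit", 0])

print(tokens)
print(rows)
print(first_fruit)
```

In Hive itself these steps are single function calls inside a query; the sketch only mirrors the input and output shapes listed in the table.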

Page 7: Basic Sentiment Analysis using Hive

N-Gram Language Models

Q: What is a language model?
A: A mathematical model that assigns probability to a sequence of m words

Q: What is an “n-gram” model?
A: Probabilistic language model for predicting next word in a sequence of words

Q: What is an n-gram?
A: A contiguous sequence of n items from a given sequence of text
E.g.: “Mary had a little lamb”
Bi-grams: “Mary had”, “had a”, “a little”, “little lamb”
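The bi-gram example can be reproduced with a few lines of Python. This is a minimal sketch that splits on whitespace only; the function name `ngrams` here is a local helper, not the Hive UDF of the same name.

```python
def ngrams(words, n):
    """Return every contiguous run of n words, in order of appearance."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "Mary had a little lamb".split()
bigrams = ngrams(words, 2)
print(bigrams)  # ['Mary had', 'had a', 'a little', 'little lamb']
```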

Page 8: Basic Sentiment Analysis using Hive

N-Gram Language Model: Hive Built-In Functions

Goal: Find important topics using a stop word list, trending topics
Input Data: Collection of sentences
Output Data: k most frequently occurring n-grams
Use this Hive UDF: ngrams

Goal: Extract intelligence around certain keywords, pre-compute search look aheads
Input Data: Collection of sentences
Output Data: k most frequently occurring n-grams around a “context” word. E.g.: “Government shutdown”
Use this Hive UDF: context_ngrams
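The two goals in the table can be sketched in Python as exact counts over tokenized sentences. This is only an illustration: Hive's `ngrams` and `context_ngrams` return frequency estimates and have their own argument lists, and the function names, the `None`-slot pattern (loosely echoing the NULL placeholders used in a Hive context array), and the toy sentences below are assumptions made for this example.

```python
from collections import Counter


def top_k_ngrams(sentences, n, k):
    """Exact top-k n-gram counts over already-tokenized sentences."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)


def top_k_context_ngrams(sentences, context, k):
    """Top-k fillers for the None slots of a fixed-word context pattern."""
    counts = Counter()
    size = len(context)
    for words in sentences:
        for i in range(len(words) - size + 1):
            window = words[i:i + size]
            # A window matches when every fixed (non-None) word lines up.
            if all(c is None or c == w for c, w in zip(context, window)):
                fillers = tuple(w for c, w in zip(context, window) if c is None)
                counts[fillers] += 1
    return counts.most_common(k)


# Toy, made-up sentences (already tokenized and lower-cased).
sentences = [
    ["government", "shutdown", "looms"],
    ["government", "shutdown", "ends"],
    ["markets", "fear", "government", "shutdown"],
    ["government", "spending", "rises"],
]

top_bigrams = top_k_ngrams(sentences, 2, 1)
after_government = top_k_context_ngrams(sentences, ["government", None], 2)
print(top_bigrams)
print(after_government)
```

On this toy input the most frequent bi-gram is "government shutdown", and the words most often following "government" are "shutdown" and then "spending".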

Page 9: Basic Sentiment Analysis using Hive

Dataset Used: MemeTracker
How MemeTracker.org creates the dataset

90 Million sources
900K news stories / day
Track 17M memes

# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links

Table: MemeTracker

6GB of Data / month

Crawl, Scrape

Create Documents

Extract “Memes”

Page 10: Basic Sentiment Analysis using Hive

Analyze Sentiment on “Mortgage”

By Tracking How Memes spread, using Hive

What is a Meme? “Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc

# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links

Table: MemeTracker

36GB of JSON Data

Prepare Data

Apply language model, Extract sentiment
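The deck does not include the demo queries, so the following Python sketch is only a guess at the shape of such an analysis: count, per month, the memes that mention “mortgage” together with a keyword such as “crisis”, “foreclosure”, or “fraud”. The records are toy, made-up data shaped like the table columns above, and `monthly_keyword_counts` is a hypothetical helper.

```python
from collections import Counter

# Toy, made-up records shaped like the MemeTracker columns on the slide.
records = [
    {"timestamp": "2008-09-03", "memes": ["mortgage crisis deepens"]},
    {"timestamp": "2008-09-17", "memes": ["mortgage fraud probe", "green eggs and ham"]},
    {"timestamp": "2008-10-02", "memes": ["mortgage crisis spreads", "foreclosure wave"]},
]
KEYWORDS = ("crisis", "foreclosure", "fraud")


def monthly_keyword_counts(rows, topic, keywords):
    """Count memes per month that mention the topic alongside each keyword."""
    counts = Counter()
    for row in rows:
        month = row["timestamp"][:7]  # keep "YYYY-MM"
        for meme in row["memes"]:
            words = meme.lower().split()
            if topic not in words:
                continue  # only memes about the topic are counted
            for keyword in keywords:
                if keyword in words:
                    counts[(month, keyword)] += 1
    return counts


counts = monthly_keyword_counts(records, "mortgage", KEYWORDS)
for key in sorted(counts):
    print(key, counts[key])
```

Plotting such per-month counts for each keyword would give a trend chart per series.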

Page 11: Basic Sentiment Analysis using Hive

Demo

Page 12: Basic Sentiment Analysis using Hive

Hive’s Extensibility Framework

• There are many UDFs built into Hive

• For more advanced users Hive allows many ways to extend the language
– SERDEs
– UDFs, UDAFs, and UDTFs
– Hive Streaming
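As an illustration of the Hive Streaming option, a script in any language can read rows as tab-separated text and write tab-separated rows back. The Python sketch below assumes two input columns (a URL and a meme) and uses a tiny made-up lexicon in place of a real language model; the names `LEXICON`, `score`, and `transform` are hypothetical.

```python
# Hypothetical toy lexicon; a real model or third-party library would go here.
LEXICON = {"crisis": -1, "fraud": -1, "foreclosure": -1, "recovery": 1}


def score(text):
    """Sum the lexicon weights of the words in text (0 if none match)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def transform(lines):
    """Map 'url<TAB>meme' input lines to 'url<TAB>score' output lines."""
    for line in lines:
        url, meme = line.rstrip("\n").split("\t", 1)
        yield f"{url}\t{score(meme)}"


# Demo on in-memory lines; as a streaming script this would iterate over
# sys.stdin and write each result line to sys.stdout instead.
sample = [
    "http://example.com/a\tmortgage crisis and fraud\n",
    "http://example.com/b\thousing recovery\n",
]
output = list(transform(sample))
print("\n".join(output))
```

In Hive, a script like this is typically shipped with ADD FILE and invoked through a SELECT TRANSFORM ... USING clause; the exact query depends on the table layout.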

Page 13: Basic Sentiment Analysis using Hive

How to access this Tutorial

• Create a free Qubole Account (www.qubole.com)
• Login
• Click on “Analyze”
• Look for “Tutorials” tab at top of page

Page 14: Basic Sentiment Analysis using Hive

Summary

• Pre and post processing
– Use Hive

• Language Models
– Use pre-existing language models codified as Hive UDFs such as ngrams and context_ngrams
– UDFs: Build your own language model in Java using the Hive UDF framework
– Hive Streaming: Plug in your existing language models or 3rd party libraries

• Visualization
– Use a spreadsheet / BI reporting tool

Page 15: Basic Sentiment Analysis using Hive

THANK YOU

Managed Cluster, Built-In Connectors, Friendly User-Interface, Dedicated Support

• 100% Managed Hadoop Cluster in the Cloud
• Auto-Scaling Cluster. Full Life-cycle Management
• +12 Connectors to Applications and Data Sources
• 14-Day Free Trial (free account available)
• 24/7 Customer Support

What’s Included?

www.qubole.com/try