A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal Greenplum Database

© Copyright 2013 Pivotal. All rights reserved.

Big Data Pipeline for Topic and Sentiment Analysis, with Applications

Srivatsan Ramanujam (@being_bayesian)
Senior Data Scientist, Pivotal

11 Jan 2014


Agenda

Introduction

The Problem

The Platform

The Pipeline

Live Demo: Topic and Sentiment Analysis Engine

Applications in real world customer engagements


Pivotal: A New Platform for a New Era

...ETC

Cloud Fabric: “The new OS”

Data Fabric: “The new Database”

App Fabric: “The new Middleware”

“The new Hardware”

Pivotal Data Science Labs

Data-Driven Application Development


The Problem


Make sense of large volumes of unstructured text and integrate it with structured sources of data to make better predictions.

Approaches
– Topic Analysis
– Sentiment Analysis


The Platform


Pivotal Greenplum MPP DB

Think of it as multiple PostgreSQL servers

Segments/Workers

Master

Rows are distributed across segments by a particular field (or randomly)
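The distribution rule above can be pictured with a small sketch. It is only an illustration of the idea, not Greenplum's actual hash function or placement logic: `segment_for` and `distribute` are hypothetical names, CRC32 stands in for the real hash, and round-robin stands in for random distribution.

```python
import zlib
from collections import defaultdict

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment id (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, num_segments, key=None):
    """Assign rows to segments by hashing `key`, or round-robin if key is None."""
    segments = defaultdict(list)
    for i, row in enumerate(rows):
        if key is not None:
            seg = segment_for(row[key], num_segments)
        else:
            seg = i % num_segments
        segments[seg].append(row)
    return dict(segments)

rows = [
    {"customer_id": 1, "state": "CA"},
    {"customer_id": 2, "state": "NY"},
    {"customer_id": 3, "state": "CA"},
    {"customer_id": 4, "state": "TX"},
]
# Rows with the same state always land on the same segment.
by_state = distribute(rows, num_segments=2, key="state")
```

Choosing the distribution field matters because every row sharing a key value is co-located, which is what makes per-group work on a segment possible without moving data.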


Pivotal Hadoop

• The pipeline in this talk can be run on Pivotal Hadoop + HAWQ


Data Parallelism Vs. Task Parallelism

Data Parallelism: Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.

– Ex: Build one Churn model for each state in the US simultaneously, when customer data is distributed by state code.

Task Parallelism: Split the problem into independent sub-tasks which can be executed in parallel.

– Ex: Build one Churn model in parallel for the entire US, though customer data is distributed by state code.
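The data-parallel case (one model per state) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the "model" is just a churn rate, a thread pool stands in for the database segments, and `fit_churn_rate` and `fit_per_group` are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def fit_churn_rate(rows):
    """Stand-in 'model': the fraction of customers in the group who churned."""
    return sum(r["churned"] for r in rows) / len(rows)

def fit_per_group(rows, group_key, workers=4):
    """Data parallelism: fit one independent model per group value."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_key]].append(r)
    keys = sorted(groups)
    # Each group is fitted on its own; no task needs data from another group.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        models = list(pool.map(fit_churn_rate, (groups[k] for k in keys)))
    return dict(zip(keys, models))

customers = [
    {"state": "CA", "churned": 1},
    {"state": "CA", "churned": 0},
    {"state": "NY", "churned": 0},
    {"state": "NY", "churned": 0},
    {"state": "TX", "churned": 1},
]
models = fit_per_group(customers, "state")
```

Because the groups never communicate, the work scales with the number of groups, which is why data distributed by state code suits this pattern.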


User-Defined Functions (UDFs)

PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.

Simple UDFs are SQL queries with calling arguments and return types.

Definition:

CREATE FUNCTION times2(INT) RETURNS INT
AS $$ SELECT 2 * $1 $$ LANGUAGE sql;

Execution:

SELECT times2(1);
 times2
--------
      2
(1 row)


PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

• Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages

• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster

• Data Parallelism: PL/X piggybacks on Greenplum’s MPP architecture

[Figure: Greenplum architecture. SQL arrives at the Master Host (with a Standby Master), which is connected through the Interconnect to the Segment Hosts, each running several Segments.]


Going Beyond Data Parallelism

Data-parallel computation via PL/Python libraries only allows us to run ‘n’ models in parallel.

This works great when we are building one model for each value of the group-by column, but to build a single model on all the available data we need parallelized algorithms.

For this, we use MADlib – an open source library of parallel in-database machine learning algorithms.
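One common way to build a single model over data spread across segments is to keep a small mergeable state per segment and combine the states at the end. The sketch below shows that general pattern for simple least-squares regression; it is not MADlib's code, and the names `transition`, `merge` and `final` are chosen here for illustration.

```python
def transition(state, x, y):
    """Fold one row into a segment-local state of sufficient statistics."""
    n, sx, sy, sxx, sxy = state
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

def merge(a, b):
    """Combine the states computed on two segments."""
    return tuple(u + v for u, v in zip(a, b))

def final(state):
    """Turn the merged state into (slope, intercept) by least squares."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

EMPTY = (0, 0.0, 0.0, 0.0, 0.0)

# Rows of (x, y) as they might sit on three segments; here y = 2x + 1 exactly.
segments = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]

# Each segment scans only its own rows.
partials = []
for seg_rows in segments:
    state = EMPTY
    for x, y in seg_rows:
        state = transition(state, x, y)
    partials.append(state)

# The partial states are merged, then finalized into one model.
merged = EMPTY
for p in partials:
    merged = merge(merged, p)

slope, intercept = final(merged)
```

Only the five-number states travel between segments, never the rows themselves, which is what lets one model be fitted on all the data in parallel.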


Scalable, in-database ML

• Open Source! https://github.com/madlib/madlib
• Works on Greenplum DB and PostgreSQL
• Active development by Pivotal
  - Latest Release : 1.4 (Dec 2013)
• Downloads and Docs: http://madlib.net/



MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics

Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)

Correlation

Summary

Support Modules

Array Operations
Sparse Vectors
Random Sampling
Probability Functions


Architecture

[Figure: MADlib architecture. Layers shown:]
• User Interface
• “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
• High-level Abstraction Layer (iteration controller, ...)
• RDBMS Built-in Functions
• Functions for Inner Loops (for streaming algorithms)
• Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
• RDBMS Query Processing (Greenplum, PostgreSQL, …)

Implementation languages labeled in the figure: Python; Python with templated SQL; SQL, generated from specification; C++


MADlib on Hadoop

• A subset of the MADlib algorithms available on Pivotal Greenplum DB works out of the box on HAWQ.

• Other functions are being ported.

• With the general availability of, and support for, User-Defined Functions in HAWQ, MADlib on HAWQ will attain full parity with MADlib on GPDB.


The Pipeline


The Pipeline

• Tweet Stream
• Stored on HDFS
• (gpfdist) Loaded as external tables into GPDB
• Parallel Parsing of JSON and extraction of fields using PL/Python
• Topic Analysis through MADlib pLDA
• Sentiment Analysis through custom PL/Python functions
• D3.js
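The JSON parsing step of the pipeline can be sketched as an ordinary Python function. In the database, logic like this would form the body of a PL/Python UDF applied to each row of the external table; it is written as plain Python here so that it runs anywhere. The function name `parse_tweet` and the choice of fields are assumptions made for illustration.

```python
import json

def parse_tweet(raw):
    """Extract a few fields from one tweet's JSON text; None if unusable."""
    try:
        tweet = json.loads(raw)
    except ValueError:
        # Malformed line: skip it rather than fail the whole load.
        return None
    if not isinstance(tweet, dict) or "text" not in tweet:
        # Not a tweet payload (for example, a stream control message).
        return None
    user = tweet.get("user") or {}
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "screen_name": user.get("screen_name"),
        "text": tweet["text"],
    }

raw_lines = [
    '{"id": 1, "created_at": "Sat Jan 11 10:00:00 +0000 2014", '
    '"user": {"screen_name": "alice"}, "text": "Loving my new phone!"}',
    'not json at all',
    '{"delete": {"status": {"id": 2}}}',
]
parsed = [p for p in (parse_tweet(line) for line in raw_lines) if p is not None]
```

Since each row is parsed independently, the work spreads across segments with no coordination, which is the data-parallel pattern described earlier in the talk.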


Topic Analysis – MADlib pLDA

Prepare dataset for Topic Modeling

MADlib Topic Model

Social Media

Tokenizer

Align Data

Stemming, frequency filtering

Natural Language Processing - GPText

Filter relevant content

Topic composition

Topic Clouds

Topic Graph
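The dataset preparation for topic modeling can be sketched as follows: tokenize each tweet, drop tokens that appear in too few documents, and emit rows of (document id, word id, count), a layout that LDA implementations commonly take as input. This is a minimal sketch rather than the GPText-based tokenizer used in the talk; stemming is left out, and the regular expression, the `min_doc_freq` threshold and the function names are assumptions.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[#@]?\w+")

def tokenize(text):
    """Lower-case a tweet and split it into word, hashtag and mention tokens."""
    return TOKEN_RE.findall(text.lower())

def build_term_counts(docs, min_doc_freq=2):
    """Return (vocabulary, rows) where rows are (doc_id, word_id, count)."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many tweets does each token occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    kept = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    vocab = {tok: i for i, tok in enumerate(kept)}
    rows = []
    for doc_id, toks in enumerate(tokenized):
        counts = Counter(t for t in toks if t in vocab)
        rows.extend((doc_id, vocab[t], c) for t, c in sorted(counts.items()))
    return vocab, rows

tweets = [
    "Battery life on this phone is great",
    "My phone battery died again #fail",
    "Great coverage, great battery",
]
vocab, rows = build_term_counts(tweets)
```

Frequency filtering keeps the vocabulary small, which matters because the cost of topic modeling grows with the number of distinct words.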


Sentiment Analysis

We don’t have labeled data for our problem (Tweets aren’t tagged with Sentiment).

Semi-Supervised Sentiment Prediction can be achieved by dictionary look-ups of tokens in a Tweet, but without Context, Sentiment Prediction is futile!

“Unpredictable”

“Breakthrough”
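A small sketch shows why a bare dictionary look-up falls short. The lexicon and its polarity values are a hypothetical toy, and the two example sentences are illustrations added here, not data from the talk.

```python
import re

# Hypothetical toy lexicon: +1 for positive tokens, -1 for negative ones.
LEXICON = {"great": 1, "love": 1, "breakthrough": 1,
           "bad": -1, "terrible": -1, "unpredictable": -1}

def dictionary_score(text, lexicon=LEXICON):
    """Sum the lexicon polarities of the tokens in a tweet."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(tok, 0) for tok in tokens)

# The same word gets the same score regardless of what it describes.
plot = dictionary_score("The plot of this movie is unpredictable")
steering = dictionary_score("The steering of this car is unpredictable")
```

Both sentences receive the same score, although an unpredictable plot is usually praise and unpredictable steering is a complaint. A token-level dictionary cannot tell the two apart.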


Sentiment Analysis – PL/X Functions

• Part-of-speech tagger [1]: break up Tweets into tokens and tag their parts-of-speech
• Phrase Extraction
• Semi-Supervised Sentiment Classification
• Phrasal Polarity Scoring
• Use learned phrasal polarities to score sentiment of new tweets
• Sentiment Scored Tweets

1: Parts-of-speech Tagger : Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)

http://www.ark.cs.cmu.edu/TweetNLP/gimpel+etal.acl11.pdf
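The phrase extraction step can be sketched as a scan over part-of-speech-tagged tokens. The talk does not say which tag patterns were used, so the patterns below (adjective + noun, adverb + adjective, adverb + verb) are one plausible choice, not the author's method. The single-letter tags follow the coarse style of the ARK Twitter tagger and are assumed here.

```python
# Tag pairs treated as candidate sentiment phrases (an assumption for this sketch).
PHRASE_PATTERNS = {("A", "N"), ("R", "A"), ("R", "V")}

def extract_phrases(tagged_tokens, patterns=PHRASE_PATTERNS):
    """Return two-word phrases whose consecutive tags match a pattern."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            phrases.append(f"{w1.lower()} {w2.lower()}")
    return phrases

# Hypothetical tagger output: (token, tag) pairs for one tweet.
tagged = [("The", "D"), ("new", "A"), ("phone", "N"), ("is", "V"),
          ("really", "R"), ("slow", "A")]
phrases = extract_phrases(tagged)
```

Scoring phrases rather than single tokens is what supplies the context that a plain dictionary look-up lacks.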





Live Demo

http://10.110.122.108:8081/gp/topic/home

http://10.110.122.108:8081/gp/senti/home


Real World Applications


Churn Models for Telecom Industry

Goal
– Identify customers who are likely to churn and prevent them from leaving.

Challenges
– Cost of acquiring new customers is high
– Cost of customer acquisition is hard to recoup if the customer is not retained long enough
– Lower barrier to switching for subscribers
– With mobile number portability, barrier to switching is even lower

Good News
– Cost of retaining existing customers is lower!


Structured Features for Churn Models

The problem is extensively studied, with a rich set of approaches in the literature.

Features: Device, Texting Stats, Call Stats, Rate Plans, Customer Demographics

These features are great, but the models soon hit a plateau with structured features!


Blending the Unstructured with the Structured

What other sources of previously untapped data could we use?

Are our customers happy? Where? What segments?

What are the common topics in their conversations online?


Sentiment Analysis and Topic Models

[Figure: components shown]
• Unstructured Data (External, Internal)
• Sentiment Analysis Engine (Classifier)
• Topic Engine (LDA)
• Topic Dashboard
• Structured Data: EDW
• Result: MORE ACCURATE LIKELIHOOD TO CHURN


Predicting Commodity Futures through Twitter

Customer

A major agri-business cooperative

Business Problem

Predict price of commodity futures through Twitter

Challenges

Language on Twitter does not adhere to rules of grammar and has poor structure

No domain-specific labeled corpus of tweet sentiment – problem is semi-supervised

Solution

Built Sentiment Analysis and Text Regression algorithms to predict commodity futures from Tweets

Established the foundation for blending the structured data (market fundamentals) with unstructured data (tweets)


The Approach

• Tweets alone had significant predictive power for the commodity of interest to us. When blended with structured features such as weather data, we expect to see much better results.
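Blending can be pictured as building one feature row per day that joins counts of vocabulary words in that day's tweets with that day's structured values. The sketch below is an assumption about the general shape only: the vocabulary, the feature names `rainfall_mm` and `temp_c`, and all values are invented for illustration and do not come from the engagement.

```python
import re
from collections import Counter

def blend_features(day_tweets, structured, vocab, structured_keys):
    """Concatenate token counts over `vocab` with structured feature values."""
    counts = Counter(
        tok for t in day_tweets for tok in re.findall(r"\w+", t.lower())
    )
    text_part = [counts[w] for w in vocab]
    structured_part = [float(structured[k]) for k in structured_keys]
    return text_part + structured_part

vocab = ["drought", "harvest", "rain"]
row = blend_features(
    ["Drought fears grow", "No rain again, drought worsens"],
    {"rainfall_mm": 1.5, "temp_c": 31},
    vocab,
    ["rainfall_mm", "temp_c"],
)
```

A regression model trained on rows like this can weigh the tweet-derived counts and the market or weather features together.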


What’s in it for me?


Pivotal Open Source Contributions

http://gopivotal.com/pivotal-products/open-source-software

• PyMADlib – Python wrapper for MADlib: https://github.com/gopivotal/pymadlib

• PivotalR – R wrapper for MADlib: https://github.com/madlib-internal/PivotalR

• Part-of-speech tagger for Twitter via SQL: http://vatsan.github.io/gp-ark-tweet-nlp/

Questions? @being_bayesian




BUILT FOR THE SPEED OF BUSINESS
