A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal Greenplum Database
DESCRIPTION
Unstructured data is everywhere: in the form of posts, status updates, bloglets or news feeds in social media, or in the form of customer interactions in call center CRM systems. While many organizations study and monitor social media to track brand value and target specific customer segments, in our experience blending the unstructured data with structured data to supplement data science models has been far more effective than working with it independently. In this talk we will showcase an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models that predict commodity futures from tweets and to enhance churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
TRANSCRIPT
© Copyright 2013 Pivotal. All rights reserved.
Big Data Pipeline for Topic and Sentiment Analysis, with Applications
Srivatsan Ramanujam (@being_bayesian), Senior Data Scientist, Pivotal
11 Jan 2014
Agenda
Introduction
The Problem
The Platform
The Pipeline
Live Demo: Topic and Sentiment Analysis Engine
Applications in real world customer engagements
Pivotal: A New Platform for a New Era
...ETC
Cloud Fabric: “The new OS”
Data Fabric: “The new Database”
App Fabric: “The new Middleware”
“The new Hardware”
Pivotal Data Science Labs
Data-Driven Application Development
The Problem
The Problem
Make sense of large volumes of unstructured text and integrate it with structured sources of data to make better predictions
Approaches
– Topic Analysis
– Sentiment Analysis
The Platform
Pivotal Greenplum MPP DB
Think of it as multiple PostgreSQL servers
Segments/Workers
Master
Rows are distributed across segments by a particular field (or randomly)
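The row-distribution idea can be sketched in a few lines of Python. This is only an illustration: the `segment_for` and `distribute` helpers, the use of CRC32 as the hash, and the segment count of 4 are assumptions made for the example, not Greenplum's actual hashing scheme.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumed cluster size, for illustration only


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment index (hypothetical helper)."""
    # CRC32 stands in for the database's internal hash function.
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_field]), num_segments)].append(row)
    return dict(segments)


rows = [  # made-up rows
    {"customer_id": 1, "state": "CA"},
    {"customer_id": 2, "state": "NY"},
    {"customer_id": 3, "state": "CA"},
    {"customer_id": 4, "state": "TX"},
]
placement = distribute(rows, "state")
```

Rows that share a key value always land on the same segment, which is what makes per-key work (for example, one model per state) local to a segment.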
Pivotal Hadoop
• The pipeline in this talk can be run on Pivotal Hadoop + HAWQ
Data Parallelism Vs. Task Parallelism
Data Parallelism: Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
– Ex: Build one Churn model for each state in the US simultaneously, when customer data is distributed by state code.
Task Parallelism: Split the problem into independent sub-tasks which can be executed in parallel.
– Ex: Build one Churn model in parallel for the entire US, though customer data is distributed by state code.
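The data-parallel case can be sketched as follows. The "model" here is just a churn rate, the rows are made up, and a thread pool stands in for the database segments; all of these are assumptions chosen to keep the example small.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fit_churn_rate(rows):
    """Toy 'model': the fraction of customers in `rows` who churned."""
    return sum(r["churned"] for r in rows) / len(rows)


def fit_one_model_per_state(rows):
    """Data parallelism: one independent model per state, no communication."""
    by_state = defaultdict(list)
    for r in rows:
        by_state[r["state"]].append(r)
    states = sorted(by_state)
    with ThreadPoolExecutor() as pool:
        # Each group is fitted on its own; no task needs another task's data.
        models = list(pool.map(fit_churn_rate, (by_state[s] for s in states)))
    return dict(zip(states, models))


customers = [  # made-up rows
    {"state": "CA", "churned": 1},
    {"state": "CA", "churned": 0},
    {"state": "NY", "churned": 0},
    {"state": "NY", "churned": 0},
    {"state": "TX", "churned": 1},
]
per_state = fit_one_model_per_state(customers)
```

Because the data is already grouped by state code, each fit only touches rows that live together, so the work splits without any coordination.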
User-Defined Functions (UDFs)
PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Simple UDFs are SQL queries with calling arguments and return types.
Definition:
CREATE FUNCTION times2(INT) RETURNS INT
AS $$ SELECT 2 * $1 $$ LANGUAGE sql;
Execution:
SELECT times2(1);
 times2
--------
      2
(1 row)
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
• Data Parallelism: PL/X piggybacks on Greenplum’s MPP architecture
[Diagram: SQL; Master Host (Master, Standby Master); Interconnect; four Segment Hosts, each with two Segments]
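For the Python case, the body of a PL/Python function is ordinary Python, so it can be written and tested outside the database first. The sketch below is a hypothetical PL/Python counterpart of the `times2` SQL UDF; the function name `times2_py`, the wrapper shown in the comment, and the two-segment layout are assumptions for illustration.

```python
# Hypothetical PL/Python counterpart of the times2 SQL UDF. Inside the
# database, the wrapper would look roughly like this (assumed, not run here):
#
#   CREATE FUNCTION times2_py(x INT) RETURNS INT
#   AS $$ return 2 * x $$ LANGUAGE plpythonu;
#
# The text between the $$ markers is plain Python, so the same logic can be
# written and unit-tested outside the database as a normal function.


def times2_py(x: int) -> int:
    """Body of the UDF: runs once per row, on whichever segment holds the row."""
    return 2 * x


# Simulate SELECT times2_py(val) FROM t; over rows held by two segments.
segment_rows = {0: [1, 2], 1: [3]}
results = {seg: [times2_py(v) for v in vals] for seg, vals in segment_rows.items()}
```

Each segment applies the function to its own rows, which is the sense in which PL/X inherits data parallelism from the MPP architecture.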
Going Beyond Data Parallelism
Data Parallel computation via PL/Python libraries only allows us to run ‘n’ models in parallel.
This works great when we are building one model for each value of the group by column, but we need parallelized algorithms to be able to build a single model on all the available data.
For this, we use MADlib – an open source library of parallel in-database machine learning algorithms.
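One common way to build a single model over distributed data is to express it as an aggregate with transition, merge and final steps, so that only small summaries move between segments. The sketch below does this for a one-variable linear regression; it is an illustration under that assumption, not MADlib's code, and the function names and data are made up.

```python
from functools import reduce


def transition(state, row):
    """Fold one (x, y) row into a segment-local running state."""
    n, sx, sy, sxx, sxy = state
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)


def merge(a, b):
    """Combine the states of two segments; only five numbers move."""
    return tuple(p + q for p, q in zip(a, b))


def final(state):
    """Turn the merged state into (slope, intercept) of y = slope*x + intercept."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


EMPTY = (0, 0.0, 0.0, 0.0, 0.0)
segments = [  # made-up data on the line y = 2x + 1, split across 3 segments
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
local_states = [reduce(transition, rows, EMPTY) for rows in segments]
slope, intercept = final(reduce(merge, local_states, EMPTY))
```

The result is the same as fitting on all rows in one place, because the sufficient statistics are additive across segments.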
Scalable, in-database ML
• Open Source! https://github.com/madlib/madlib
• Works on Greenplum DB and PostgreSQL
• Active development by Pivotal
- Latest Release : 1.4 (Dec 2013)
• Downloads and Docs: http://madlib.net/
MADlib In-Database Functions
Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
Architecture
User Interface
“Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
High-level Abstraction Layer (iteration controller, ...)
Functions for Inner Loops (for streaming algorithms)
Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
RDBMS Built-in Functions
RDBMS Query Processing (Greenplum, PostgreSQL, …)
Implementation languages: SQL (generated from specification), Python with templated SQL, Python, C++
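The split between a "driver" function and an inner-loop function can be illustrated with a toy one-dimensional k-means. This is a sketch of the pattern only, not MADlib's implementation: the function names are hypothetical, and a plain Python loop stands in for what would be a single streaming pass over the table.

```python
def assign_and_sum(points, centroids):
    """Inner loop: one streaming pass producing per-cluster (sum, count)."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in points:
        k = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        sums[k] += p
        counts[k] += 1
    return sums, counts


def kmeans_driver(points, centroids, max_iter=20, tol=1e-9):
    """Driver: outer loop that re-runs the pass until centroids stop moving."""
    for _ in range(max_iter):
        sums, counts = assign_and_sum(points, centroids)
        updated = [s / c if c else old for s, c, old in zip(sums, counts, centroids)]
        if max(abs(a - b) for a, b in zip(updated, centroids)) < tol:
            return updated
        centroids = updated
    return centroids


centers = kmeans_driver([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 12.0])
```

The driver holds only the small model state between iterations, while the pass over the data is the part that benefits from running close to the rows.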
MADlib on Hadoop
• A subset of algorithms from MADlib on Pivotal Greenplum DB works out of the box on HAWQ.
• Other functions are being ported.
• With the general availability of and support for User Defined Functions in HAWQ, MADlib will attain full parity with GPDB.
The Pipeline
The Pipeline
Tweet Stream
Stored on HDFS
(gpfdist) Loaded as external tables into GPDB
Parallel Parsing of JSON and extraction of fields using PL/Python
Topic Analysis through MADlib pLDA
Sentiment Analysis through custom PL/Python functions
D3.js
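The JSON parsing stage might look roughly like the function below, which could serve as the body of a PL/Python UDF applied to each raw row. The field names (`id`, `created_at`, `user.screen_name`, `text`) are assumptions for the example; the actual GNIP payload may name them differently.

```python
import json


def parse_tweet(raw_json: str):
    """Extract a few fields from one raw tweet; return None if unparseable."""
    try:
        tweet = json.loads(raw_json)
    except ValueError:
        return None  # malformed rows are skipped rather than failing the query
    if not isinstance(tweet, dict) or "text" not in tweet:
        return None
    return {
        "id": tweet.get("id"),
        "posted_time": tweet.get("created_at"),
        "user": (tweet.get("user") or {}).get("screen_name"),
        "text": tweet["text"],
    }


raw_rows = [  # made-up input lines
    '{"id": 1, "created_at": "2014-01-11", "user": {"screen_name": "a"}, "text": "hello #world"}',
    'not json at all',
    '{"id": 2, "text": "no user field"}',
]
parsed = [p for p in (parse_tweet(r) for r in raw_rows) if p is not None]
```

Since each row is parsed independently, this step is data parallel: every segment can parse the rows it holds without talking to the others.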
Topic Analysis – MADlib pLDA
Prepare dataset for Topic Modeling
MADlib Topic Model
Social Media
Tokenizer
Align Data
Stemming, frequency filtering
Natural Language Processing - GPText
Filter relevant content
Topic composition
Topic Clouds
Topic Graph
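The dataset preparation step can be sketched as tokenizing, filtering by frequency, and emitting (document id, word id, count) rows, a shape that topic-model implementations commonly accept. Everything here is an assumption for illustration: the stop-word list is tiny, stemming is left out, the documents are made up, and the exact input format of any given LDA implementation should be checked against its documentation.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to", "of", "and", "in", "for", "on"}  # tiny assumed list


def tokenize(text):
    """Lowercase and keep word-like tokens, including #hashtags and @mentions."""
    return [t for t in re.findall(r"[#@]?\w+", text.lower()) if t not in STOP_WORDS]


def build_term_counts(docs, min_doc_freq=2):
    """Return (vocabulary, rows) where rows are (doc_id, word_id, count)."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Frequency filtering: keep words that occur in at least min_doc_freq docs.
    doc_freq = Counter(tok for toks in tokenized.values() for tok in set(toks))
    kept = sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)
    vocab = {w: i for i, w in enumerate(kept)}
    rows = []
    for doc_id in sorted(tokenized):
        counts = Counter(t for t in tokenized[doc_id] if t in vocab)
        rows.extend((doc_id, vocab[w], c) for w, c in sorted(counts.items()))
    return vocab, rows


docs = {  # made-up tweets
    1: "Corn prices rally on drought fears",
    2: "Drought hits corn and wheat",
    3: "Wheat prices steady",
}
vocab, rows = build_term_counts(docs)
```

Rare words are dropped before modeling because they add vocabulary size without contributing much to topic structure.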
Sentiment Analysis
We don’t have labeled data for our problem (Tweets aren’t tagged with Sentiment)
Semi-Supervised Sentiment Prediction can be achieved by dictionary look-ups of tokens in a Tweet, but without Context, Sentiment Prediction is futile!
“Unpredictable”
“Breakthrough”
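A small sketch shows why token-level dictionary look-ups fall short. The lexicon and the two example sentences are invented for illustration; the slide itself only lists the words.

```python
# Toy lexicon with assumed polarities; not taken from any real dictionary.
POLARITY = {"breakthrough": 1, "great": 1, "unpredictable": -1, "broken": -1}


def lexicon_score(text):
    """Sum the polarity of every token found in the lexicon; ignores context."""
    return sum(POLARITY.get(tok, 0) for tok in text.lower().split())


a = lexicon_score("the plot was unpredictable")      # arguably praise
b = lexicon_score("the steering was unpredictable")  # arguably a complaint
```

Both sentences receive the same score because the look-up sees only the word, not what it describes.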
Sentiment Analysis – PL/X Functions
Part-of-speech tagger [1]: Break up Tweets into tokens and tag their parts-of-speech
Phrase Extraction
Semi-Supervised Sentiment Classification
Phrasal Polarity Scoring
Sentiment Scored Tweets: Use learned phrasal polarities to score sentiment of new tweets
[1] Parts-of-speech Tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
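The phrase extraction and scoring steps could be sketched as below. This is a minimal illustration, not the talk's implementation: the adjective-noun pattern, the coarse tag letters, the tokens, and the polarity values are all assumptions, and the polarity table is taken as given rather than learned.

```python
def extract_phrases(tagged_tokens):
    """Return adjacent (adjective, noun) pairs from [(token, tag), ...]."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "A" and t2 == "N":
            phrases.append((w1.lower(), w2.lower()))
    return phrases


def score_tweet(tagged_tokens, phrase_polarity):
    """Average the polarity of known phrases; 0.0 when none are known."""
    scores = [
        phrase_polarity[p]
        for p in extract_phrases(tagged_tokens)
        if p in phrase_polarity
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Assumed coarse tags: A = adjective, N = noun, V = verb, D = determiner.
# Tokens, tags and polarity values are all made up for illustration.
phrase_polarity = {
    ("unpredictable", "plot"): 0.8,
    ("unpredictable", "steering"): -0.9,
}
tweet = [("Loved", "V"), ("the", "D"), ("unpredictable", "A"), ("plot", "N")]
score = score_tweet(tweet, phrase_polarity)
```

Scoring phrases instead of single tokens lets the same adjective carry a different polarity depending on the noun it modifies.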
Live Demo
Real World Applications
Churn Models for Telecom Industry
Goal
– Identify customers who are likely to churn and prevent them from leaving.
Challenges
– Cost of acquiring new customers is high
– Recouping the cost of customer acquisition is hard if the customer is not retained long enough
– Lower barrier to switching providers
– With mobile number portability, barrier to switching even lower
Good News
– Cost of retaining existing customers is lower!
Structured Features for Churn Models
The problem is extensively studied with a rich set of approaches in the literature
Features: Device, Texting Stats, Call Stats, Rate Plans, Customer Demographics
These features are great, but the models soon hit a plateau with structured features!
Blending the Unstructured with the Structured
What other sources of previously untapped data could we use?
Are our customers happy? Where? What segments?
What are the common topics in their conversations online?
Sentiment Analysis and Topic Models
Sentiment Analysis Engine (Classifier)
Structured Data: EDW
MORE ACCURATE LIKELIHOOD TO CHURN
Topic Engine (LDA)
Topic Dashboard
Unstructured Data: External, Internal
Predicting Commodity Futures through Twitter
Customer
A major agri-business cooperative
Business Problem
Predict price of commodity futures through Twitter
Challenges
Language on Twitter does not adhere to rules of grammar and has poor structure
No domain-specific labeled corpus of tweet sentiment – problem is semi-supervised
Solution
Built Sentiment Analysis and Text Regression algorithms to predict commodity futures from Tweets
Established the foundation for blending the structured data (market fundamentals) with unstructured data (tweets)
The Approach
• Tweets alone had significant predictive power for the commodity of interest to us. When blended with structured features like weather data we expect to see much better results.
What’s in it for me?
Pivotal Open Source Contributions
http://gopivotal.com/pivotal-products/open-source-software
• PyMADlib – Python Wrapper for MADlib
- https://github.com/gopivotal/pymadlib
• PivotalR – R wrapper for MADlib
- https://github.com/madlib-internal/PivotalR
• Part-of-speech tagger for Twitter via SQL
- http://vatsan.github.io/gp-ark-tweet-nlp/
Questions? @being_bayesian
BUILT FOR THE SPEED OF BUSINESS