big data science: a path forward
DESCRIPTION
This talk uses a case study to demonstrate the core Big Data science capabilities, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events on a consumer-oriented website, the discussion is aimed at a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data science
- Hiring managers will expand their knowledge of the skills required to bring business value with data
TRANSCRIPT
June 2013
BIG DATA SCIENCE: A PATH FORWARD
CONFIDENTIAL | 2
linkedin.com/in/danmallinger/
@danmallinger
www.thinkbiganalytics.com
Data Science Lead @ Think Big
Product/Brand Obsessive
Teacher
Occasional Engineer
TODAY
A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
Understand our organizational needs for data science
Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.
INFRASTRUCTURE, TALENT, & CAPABILITIES
Infrastructure: Hadoop, NoSQL, Analytics, SQL/MPP, Real Time
Talent: Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math
Capabilities: Visualization, Clustering, Categorization, Continuous Models, Text Analysis
Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.
You will need toolkits to solve unique problems
but smart techniques make that easier.
Boxed solutions are limited
but can be a good source of early velocity.
ANALYTICS TOOLS
DATA
Presenter:
Gigabytes from Stack Overflow
Questions from users
With metadata
Users have reputations
Questions open or closed
Audience:
Follow along
Thinking about your data
To learn in a familiar context
And plan
select count(1) as total
, sum(has_code)
, avg(body_count)
, stddev_samp(body_count)
, corr(reputation, owner_questions)
, histogram_numeric(body_count, 10)
from questions
;
STEP 1: EXPLORE
Patterns through Hive
Patterns through Tableau
Summaries of unstructured data
Time-since metrics
select transform(…)
using 'python …'
Clustering: Browsing cohorts
/bin/mahout canopy
STEP 2: FEATURE BUILDING
SQL Windowing Cross-Record Features
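Hive's TRANSFORM clause streams rows to an external script as tab-separated lines on standard input and reads tab-separated lines back from standard output. A minimal sketch of the kind of Python logic such a script might hold is below. The input layout (question id, then body), the rule for has_code, and the definition of body_count are assumptions made for illustration; the slides do not specify them.

```python
def transform(lines):
    """Turn tab-separated (question_id, body) rows into summary features.

    Yields tab-separated (question_id, has_code, body_count) rows.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        question_id, body = line.split("\t", 1)
        # Assumption: code in a question body is wrapped in <code> tags.
        has_code = 1 if "<code>" in body else 0
        # Assumption: body_count is the number of whitespace-separated tokens.
        body_count = len(body.split())
        yield "\t".join([question_id, str(has_code), str(body_count)])


# Local check with two made-up rows; a streaming script would instead loop
# over sys.stdin and print each yielded row.
sample_rows = [
    "42\tHow do I <code>print</code> this?\n",
    "43\tplz send answer\n",
]
summaries = list(transform(sample_rows))
```

Keeping the row logic in a plain generator makes it easy to test on a laptop before it is wired into a Hive query.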
• Sample (don’t parallelize)
• Naturally parallel
  • SVD
  • Random Forests
• Estimators and Ensembles
  • Bootstrapping
  • Localizing
• Advanced Parallelization
  • Linear models with SGD
  • Neural networks
PARALLEL MODELS IN HADOOP
Single R model
run many times
over samples
and aggregated
m <- C5.0(status ~ …)
STEP 3: STRUCTURED MODEL (BAGGING)
Mapper 1:
Define n reducer keys
Send each record to reducer i with
probability p
Reducer 1:
Key: Id of sample
Value: List of records
Perform analysis over records
Reducer 2:
Key: One
Value: List of models
Aggregate the models (e.g. average)
Bagging a Model
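The mapper and reducer steps above can be sketched in plain Python as a single-process simulation. This is an illustration under stated assumptions, not the talk's implementation: a one-split "stump" stands in for the C5.0 tree, the dictionary of samples stands in for Hadoop's shuffle, the records are synthetic, and all function names are hypothetical. Aggregation is done by averaging the models' votes at prediction time.

```python
import random
from collections import defaultdict


def mapper(records, n_samples, p, rng):
    """Mapper 1: each record joins each of the n samples with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def fit_stump(records):
    """Reducer 1 body: fit a one-split model on (x, label) pairs.

    Returns the threshold t with the fewest errors for the rule
    "predict 1 when x >= t". A stand-in for a real tree learner.
    """
    best_t, best_errors = None, None
    for t in sorted({x for x, _ in records}):
        errors = sum((x >= t) != bool(label) for x, label in records)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def bagged_predict(thresholds, x):
    """Reducer 2 body: average the votes of all fitted models."""
    votes = [1 if x >= t else 0 for t in thresholds]
    return sum(votes) / len(votes)


def run_bagging(records, n_samples=5, p=0.6, seed=0):
    rng = random.Random(seed)
    samples = defaultdict(list)  # stands in for the shuffle/sort phase
    for sample_id, record in mapper(records, n_samples, p, rng):
        samples[sample_id].append(record)
    return [fit_stump(recs) for recs in samples.values()]


# Synthetic data: the label is 1 whenever x is at least 50.
records = [(x, 1 if x >= 50 else 0) for x in range(0, 100, 5)]
thresholds = run_bagging(records)
score = bagged_predict(thresholds, 70)  # fraction of models voting 1
```

Because each sample is fitted independently, the per-sample work is what Hadoop spreads across reducers; only the small list of fitted models reaches the final aggregation step.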
WHERE ARE WE?
We’ve created a structured model
to flag questions that won’t be closed
using Big Data.
But we haven’t used unstructured data.
TEXT ANALYSIS
• Is “the big dog” really different from “dog is big?”
• How about “I like eggs but hate tofu” and “I hate eggs but like tofu?”
• Language has lexical and syntactical features
• Different techniques leverage these in different ways
Bag of Words: Structure doesn’t matter
n-gram: Structure matters (but not that much)
Feature Extraction: BACON! BACON! BACON!
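The difference between bag of words and n-grams can be seen on the eggs-and-tofu sentences from this slide. The sketch below is illustrative only: the whitespace tokenizer and the function names are assumptions, and real text pipelines usually do more cleaning.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Count tokens, ignoring their order."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Count runs of n adjacent tokens, which keeps local word order."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)  # identical word counts
same_bigrams = ngrams(a) == ngrams(b)          # bigrams tell the two apart
```

The two sentences contain exactly the same words, so a bag-of-words model cannot separate them, while bigrams such as ("like", "eggs") and ("hate", "eggs") can.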
STEP 4: UNSTRUCTURED MODEL
Similar to Hadoop’s Word Count
Create counts for token/category pairs
Use counts to calculate Information Gain
MR Job 1:
Calculate information gain (IG) for all
tokens.
MR Job 2:
Select tokens with largest IG.
Create structured data for record, tokens:
question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3:
Build a classifier over the newly structured
data (prior slides)
Information Gain
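The three jobs above can be sketched on a single machine to show the arithmetic. Information gain here is the class entropy minus the entropy that remains after splitting documents on whether they contain a token. The toy documents, category names, and function names are assumptions for illustration; in Hadoop the token/category counts would come from a Word Count style job.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    counts = list(counts)
    total = sum(counts)
    if not total:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs):
    """docs: list of (set_of_tokens, category). Returns {token: gain in bits}."""
    n = len(docs)
    class_counts = Counter(cat for _, cat in docs)
    # Token/category pair counts, as in the first MapReduce job.
    pair_counts = Counter((tok, cat) for toks, cat in docs for tok in toks)
    base = entropy(class_counts.values())
    gains = {}
    for token in {tok for tok, _ in pair_counts}:
        with_tok = [pair_counts[(token, cat)] for cat in class_counts]
        without_tok = [class_counts[cat] - pair_counts[(token, cat)]
                       for cat in class_counts]
        n_with = sum(with_tok)
        remaining = ((n_with / n) * entropy(with_tok)
                     + ((n - n_with) / n) * entropy(without_tok))
        gains[token] = base - remaining
    return gains


def top_tokens(gains, k):
    """Second job: keep the k tokens with the largest gain (ties by name)."""
    return sorted(gains, key=lambda t: (-gains[t], t))[:k]


def to_indicator_row(tokens, vocabulary):
    """Second job: one 0/1 column per selected token."""
    return [1 if tok in tokens else 0 for tok in vocabulary]


# Made-up documents for illustration only.
docs = [
    ({"homework", "plz"}, "closed"),
    ({"plz", "urgent"}, "closed"),
    ({"python", "error"}, "open"),
    ({"python", "traceback"}, "open"),
]
gains = information_gain(docs)
vocabulary = top_tokens(gains, 2)
row = to_indicator_row({"python", "error"}, vocabulary)
```

The indicator rows are ordinary structured records, so the classifier from the earlier steps can be trained on them without changes.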
WHERE ARE WE?
We’ve created two models
One structured,
one unstructured.
But they don’t work together.
STEP 5: ENSEMBLE MODEL
Join many models together
By using their output
As input to ensemble model.
Best when models perform differently
Exploit differences with nonlinearities
Like interaction effects.
Ensembling
Mapper 1:
Load multiple models
Score the models per record and output
Reducer 1:
Key: Id of record
Value: List of model outputs
Join model outputs to make new records
MR Job 2:
Build a model over the output data as if it
were raw data.
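The ensembling flow above can be sketched as a single-process simulation. Everything named here is hypothetical: the two base models are toy rules standing in for the structured and unstructured models, the records and labels are made up, and the second-stage learner is a simple lookup table chosen because it captures interactions between model outputs.

```python
from collections import Counter, defaultdict


def score_mapper(records, models):
    """Mapper 1: emit (record_id, (model_name, score)) for every model."""
    for record_id, features in records:
        for name, model in models.items():
            yield record_id, (name, model(features))


def join_reducer(scored, model_names):
    """Reducer 1: group scores by record id in a fixed column order."""
    grouped = defaultdict(dict)
    for record_id, (name, score) in scored:
        grouped[record_id][name] = score
    return {rid: tuple(scores[n] for n in model_names)
            for rid, scores in grouped.items()}


def fit_table_model(rows, labels):
    """MR Job 2 stand-in: majority label per combination of model outputs."""
    votes = defaultdict(Counter)
    for rid, row in rows.items():
        votes[row][labels[rid]] += 1
    return {row: counts.most_common(1)[0][0] for row, counts in votes.items()}


# Hypothetical base models: one uses structured data, one uses text tokens.
models = {
    "structured": lambda f: int(f["reputation"] >= 100),
    "text": lambda f: int("plz" not in f["tokens"]),
}
records = [
    (1, {"reputation": 500, "tokens": {"python"}}),
    (2, {"reputation": 500, "tokens": {"plz"}}),
    (3, {"reputation": 5, "tokens": {"python"}}),
    (4, {"reputation": 5, "tokens": {"plz"}}),
]
labels = {1: "open", 2: "closed", 3: "closed", 4: "closed"}

rows = join_reducer(score_mapper(records, models), list(models))
ensemble = fit_table_model(rows, labels)
```

In this toy data a question stays open only when both base models agree it looks good, an interaction that neither model expresses alone but that the second-stage model picks up from their joined outputs.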
We’ve created two models:
one structured,
one unstructured
and have ensembled them
to create a single, powerful model
and solve a practical business problem.
WHERE ARE WE?
This required simple infrastructure
a blend of analysis and scripting skills
an understanding of BIG data science techniques
but not a team of PhDs or a billion dollars.
HOW DID WE GET HERE?
Questions?
www.thinkbiganalytics.com
@danmallinger