big it data workshop pub
TRANSCRIPT
LutzFinger.com
How to extract significant business value from big data
September 20th 2016
Lutz & Matt
Disclaimer
This presentation is solely our opinion and not necessarily the opinion of our employers Harvard, LinkedIn or Cornell.
Agenda: 9:00 - 17:00
9:00 The right Ask
9:45 Teamwork: Discover an Ask
10:30 Coffee Break
10:45 Data is King
11:15 Decision Tree
13:00 Lunch
14:00 Pitfalls with Data
14:30 Teamwork: Which Data?
15:30 Coffee Break
15:45 Innovation & Technology
16:30 Build A Team
16:45 Privacy & Ethics
Hype About Data
Hyped Data Scientists
image by Mike under Creative Commons
McKinsey study forecast:
10 Times More Managers per Data Savvy Person
?
SCHOOLS COMPANIES KNOWLEDGE SKILLS MEMBERS JOBS
LinkedIn's vision is to create economic opportunity for every member of the global workforce.
Actionable Insights
ASK the right Questions.
MEASURE the right data – even if it is not Big data.
Take Actions and LEARN from them.
?
BIG DATA IS “BULLSHIT”
To Get Data is EASY
To Get The Right Data is HARD
To Get Insights is EASY
To Make Money from Data/Insights is HARD
THE ASK is the hardest part, but there are many use-cases to get started.
The Right Question
Google had the right Question. It is difficult to find.
Fisheye Learning
Data Without Action
300+ million members at LinkedIn
60,000 with a job title that might fit
19,000 who switched after 3 to 8 years
24 who had the same career path
Data by itself is USELESS
Information by itself is often USELESS
Only Action Counts!
Data Reporting
prescriptive, predictive, actionable, data science … the holy grail
How To Work With Data?
Past → Future
What happened? | What is happening? | What is likely to happen?
Reporting, Dashboards | Real-Time Analytics | Predictive Analytics
Forensics & Data Mining | Real-Time Data Mining | Prescriptive Analytics
Why did it happen? | Why is it happening? | What should I do about it?
Ref. Gartner
Easiest - Start With Reporting
LinkedIn’s LMI Tool
We Want Predictions
We Want Monetization
Examples At LinkedIn
People You May Know
Groups You May Like
Ads in Which You May Be Interested
Companies You May Want to Follow
Pulse
Similar Profiles
Many Other Good Ideas
• Banking: Card Fraud Detection
• Banking: Credit Scoring
• Media: Content Recommendation
• Health Care: Fraud Detection
• Medicine: Image Processing
• Medicine: Outlier Detection
• Education: Course Improvement
• Retail: Likelihood to Buy
• Books: Marketing Planning
• Manufacturing: Machine Failure Prediction
• Manufacturing: Optimization
• Insurance: Likelihood & Pricing
• Transportation: Route Planning
• Energy: Grid Utilization
About Innovation
By Alistair Croll
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
What Would You Like To Do With Data?
○ Is it Actionable? “So What?”
○ Is it Reporting or Predictions?
○ Is it Sustaining, Adjunct or Disruptive?
Please Stay REAL!
“Data is the new oil”- World Economic Forum
“DATA IS THE NEW OIL”
Oil Mine the oil
Use the oil
Goal
THE 4 Vs OF “BIG DATA”
Volume: Data at scale (TB, PB …)
Variety: Data in many forms (structured, unstructured ...)
Velocity: Speed (streaming, real time, near time ...)
Veracity: Uncertainty (imprecise, not always up-to-date ...)
DATA
Categorical:
• Ordinal: Monday, Tuesday, Wednesday
• Nominal: Man, Woman
Quantitative:
• Ratio: Kelvin, Height, Weight
• Interval: Celsius, Fahrenheit
Structure:
• Structured
• Unstructured
• Semi-structured / Meta data
Read more: “On the Theory of Scales of Measurement”, S. S. Stevens, 1946
What Has Troubled The Media Industry?
The Media Industry Is One Step Removed From The Customer
Photo by Norimutsu Nogami under the Creative Commons (CC BY 2.0)
They Do Not Know Who Reads What & When
Facebook Knows
* members only, not necessarily ‘active’ members
& Size Matters
Network Size (Proportion by Members*)
* members only, not necessarily ‘active’ members
“Data is the new oil”- World Economic Forum
Photo by William Warby under the Creative Commons (CC BY 2.0)
$3.2 billion
Prediction
Photo by KOMUnews under the Creative Commons (CC BY 2.0)
Boring could be the New Sexy!
Innovation To Get Data
from Marketing Material of Ursa Space Systems
Also Governments Take Part
Public Data is Not Competitive
Look For Data Only You Own
taken from http://blogs.ubc.ca/mdaw15/2013/11/15/ipo-twitter-vs-facebook/
Data Might (Not) Be A Barrier To Entry
Data Is King
but not all data is equal.
The Tale of “Social Media” Data
Source: ‘Ask Measure Learn’ by O’Reilly Media
Structured Data Is Often Better
New York Weather in April 2013
Source: ‘Ask Measure Learn’ by O’Reilly Media
Sometimes, it’s worth it.
Source: Jeffrey Breen
RE @dave_mcgregor: Publicly pledging to never fly @delta again. The worst airline ever. U have lost my patronage forever du to ur incompetence
Completely unimpressed with @continental or @united. Poor communication, goofy reservations systems and all to turn my trip into a mess.
@SouthWestAir I know you don't make the weather. But at least pretend I am not a bother when I ask if the delay will make miss my connection
Pregnant Or Not?
Decision Trees Step by Step
by Maciej Lewandowski under Creative Commons (CC BY-SA 2.0)
Split Apples & Mandarins
What Is The Target Variable?
What Are The Features That Describe The Target?
• Weight: light, medium, heavy - or x gram
• Size: round or not
• Color: green, orange, red
• Surface: flat or porous surface
• …
Which Feature Works Best?
● The variable with the most important information about the target variable.
● Which variable splits the group into subsets that are as homogeneous as possible with respect to the target variable? (pure vs. impure)
Color Red?
Color Orange?
Split on Color Red vs. Split on Color Orange
Which One Is Better?
We Need A Way To Describe Chaos
"Claude Elwood Shannon (1916-2001)" by Source. Licensed under Fair use via Wikipedia
ENTROPY
Entropy is a measure of disorder.
Entropy only tells us how impure one individual subset is.
ENTROPY & PROBABILITY
entropy = -p1 * log2(p1) - p2 * log2(p2) - …
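A minimal sketch of the entropy formula in Python, assuming base-2 logarithms (the base used in the later worked slides). The function name `entropy` and the example probabilities are illustrative assumptions, not part of the slides.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits for a list of class probabilities."""
    # Classes with probability 0 are skipped: they contribute nothing.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

mixed = entropy([0.5, 0.5])  # a 50/50 subset is as impure as two classes can be
pure = entropy([1.0])        # a subset with a single class has no disorder
```

A pure subset gives 0 and an even two-class mix gives 1, which is why the slides read values close to 1 as "very impure".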
● Highest Entropy Reduction
● Highest Information Gain
1st. Entropy Without Split
entropy = -p1 * log2(p1) - p2 * log2(p2)
Apples: 8 out of 15, p(apple) = 8/15
Mandarins: 7 out of 15, p(mandarin) = 7/15
ENTROPY (Without Split):
= -p(apple) * log2(p(apple)) - p(mandarin) * log2(p(mandarin))
= 0.996791632 ≈ 1
very impure
Color Red?
entropy = -p1 * log2(p1) - p2 * log2(p2)
ENTROPY (Split on Red=’no’): = -6/8 * log2(6/8) - 2/8 * log2(2/8) = 0.81
ENTROPY (Split on Red=’yes’): = -6/7 * log2(6/7) - 1/7 * log2(1/7) = 0.59
ENTROPY (After Split on Red):
= 8/15 * ENTROPY (Split on Red=’no’) + 7/15 * ENTROPY (Split on Red=’yes’)
= 0.43 + 0.28 = 0.71
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.71 = 0.29
Color Orange?
ENTROPY (Split on Orange=’yes’): = -6/6 * log2(6/6) = 0
ENTROPY (Split on Orange=’no’): = -8/9 * log2(8/9) - 1/9 * log2(1/9) = 0.50
ENTROPY (After Split on Orange):
= 6/15 * ENTROPY (Split on Orange=’yes’) + 9/15 * ENTROPY (Split on Orange=’no’)
= 0 + 0.30 = 0.30
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.30 = 0.70
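The two splits can be recomputed with a short Python sketch. The class counts are an assumption read off the fractions on the slide (6/7 and 1/7 for red, 6/8 and 2/8 for not red, 6/6 for orange, 8/9 and 1/9 for not orange); the function names are illustrative only.

```python
import math

def entropy(counts):
    """Entropy in bits of a subset given as class counts, e.g. [apples, mandarins]."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after

parent = [8, 7]                  # 8 apples, 7 mandarins
red_split = [[6, 1], [2, 6]]     # red = yes, red = no (assumed counts)
orange_split = [[0, 6], [8, 1]]  # orange = yes, orange = no (assumed counts)

gain_red = information_gain(parent, red_split)
gain_orange = information_gain(parent, orange_split)
best = "orange" if gain_orange > gain_red else "red"
```

With these counts the weighted entropy after the orange split is 9/15 * 0.50, roughly 0.30, so the gain is about 0.70 against about 0.29 for red. Orange is the better first split either way.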
INFORMATION GAIN (IG)
Information Gain measures how much a given feature improves (decreases) entropy over the whole segmentation it creates.
How important is this feature for the prediction?
Decision Tree
Color Orange? ROOT NODE
LEAVES
Decision Tree
Color Orange?
Decision Tree Structure
Which Feature Would Be Better?
Heavy?
Always Start With Highest IG
BigML
Competitors:
● Algorithms.io
● SnapAnalytx
● Wise.io
● Predixion Software
● Google Prediction API
Pregnant Or Not?
Get Source
• Drag & Drop
• Often by Connecting
One Click DataBase
• Sense Check
• Any Outliers / Anything Strange
Split Training & Testing
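The slides show this step in a point-and-click tool. As a rough sketch of the same idea in plain Python: the 80/20 ratio, the fixed seed and the function name `split_train_test` are assumptions for illustration, not settings from the workshop.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows once, then hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(100))
```

The model is trained on `train` only. The held-back `test` rows are used later to check how well it predicts data it has never seen.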
Configure Model
Select The Objective Field - What To Train The Model On?
Done
The right-hand column shows the hover details for this high-confidence node
Highest Information Gain
Now What?
Predicting
Half Pregnant?
CONFUSION MATRIX
(rows: Classifier, columns: Reality)
Classifier \ Reality | Bought | Did Not Buy
Bought | (A) true positive | (B) false positive
Did Not Buy | (C) false negative | (D) true negative
Business Decision: Cut-Off Value
It depends on the Ask
TRUE NEGATIVE RATE (Specificity)
# of true negatives / # of actual negatives (true negatives + false positives)
also: Specificity = 1 - False positive rate
PRECISION
# of true positives / total in the predicted "Bought" class (true positives + false positives)
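A small sketch of how these ratios fall out of the four cells of the confusion matrix. The counts below are made up for illustration, and the function name is an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Ratios derived from the four confusion-matrix cells (A, B, C, D)."""
    return {
        "precision": tp / (tp + fp),            # of all predicted buyers, how many bought
        "recall": tp / (tp + fn),               # of all real buyers, how many were found
        "specificity": tn / (tn + fp),          # of all real non-buyers, how many were found
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

Which ratio matters most depends on the Ask: a costly offer calls for high precision, while a cheap one can tolerate more false positives.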
ROC CURVE
Better Model
Worse Model
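A sketch of how the points of a ROC curve can be produced: move the cut-off value through the model's scores and record the false positive rate and true positive rate at each step. The scores and labels are invented for illustration; `roc_points` is a hypothetical helper, not a library function.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical model scores
labels = [1, 1, 0, 1, 0, 0]              # 1 = bought, 0 = did not buy
points = roc_points(scores, labels)
```

A better model's curve climbs toward a true positive rate of 1 while the false positive rate is still small. A model that guesses at random stays near the diagonal.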
Using The Model
Now How Can I Improve the Quality?
The Tale of Big Data
Overfitting
Tailoring a model to the training data at the expense of generalizing to previously unseen data points. The model becomes perfect at describing noise and spurious correlations.
TRADE OFF
Complexity of a Model & Overfitting Likelihood
The More Nodes - The More Likely To Overfit
The Story of MORE Data
Decision Trees are good at identifying LOCAL patterns, but they often need more data.
by Claudia Perlich et al., “Tree Induction vs. Logistic Regression: A Learning-Curve Analysis”, Journal of Machine Learning Research 4 (2003) 211-255
Correlation vs. Causation
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
○ Do only you have this data?
○ Do you have a positive feedback loop?
○ Is the data sustainable?
○ Who else could get the data?
○ How much data is needed?
How Was Big Data Infrastructure Invented?
Yahoo's Issue
CENTRALIZED SYSTEMS ARE EXPENSIVE
• diminishing returns in power (overhead issue)
• exponential cost to scale
• slow to transport (ETL) the data
Scan 100 TB datasets on a 1000-node cluster:
• Remote Storage @ 10 MB/s = 165 min
• Local Storage @ 200 MB/s = 8 min
MAKE SYSTEMS FAULT TOLERANT
1000 nodes - a machine a day will break
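The scan times can be checked with simple arithmetic. The 165 min and 8 min figures correspond to about 100 GB per node, that is a 100 TB dataset spread evenly over 1000 nodes. This sketch assumes decimal units (1 TB = 1,000,000 MB), that every node reads its own share in parallel, and that the rates are per node; the function name is illustrative.

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes for every node to read its share of the dataset in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # 1 TB = 1,000,000 MB (decimal)
    return mb_per_node / mb_per_second / 60

remote = scan_minutes(dataset_tb=100, nodes=1000, mb_per_second=10)
local = scan_minutes(dataset_tb=100, nodes=1000, mb_per_second=200)
speedup = remote / local
```

Reading from local disks is 20 times faster here, which is the argument for moving the computation to where the data already sits.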
The Vision
CHEAP Systems
• can run on commodity hardware
Computations are done DECENTRALIZED
• ability to ‘dispatch’ a task
• parallelize work-streams
Fault TOLERANT
No matter where and when a failure happens, it is not an issue
Typical Workflow
· Load data into the cluster (HDFS writes)
· Analyze the data (Map Reduce)
· Store results in the cluster (HDFS writes)
· Read the results from the cluster (HDFS reads)
Sample Scenario:
How many times did our customers type the word “Refund” into emails sent to customer service?
Huge file containing all emails sent to customer service (File.txt)
Ref. BradHedlund.com
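The "Refund" scenario can be sketched as a map step and a reduce step in plain Python. This only imitates the idea on one machine and does not use any Hadoop API. The email texts are invented, and on a real cluster each block would be mapped on the node that stores it.

```python
from collections import defaultdict

def map_phase(block):
    """Mapper: emit ("refund", 1) for every occurrence of the word in one block."""
    for word in block.lower().split():
        if word.strip('.,!?;:"\'()') == "refund":
            yield ("refund", 1)

def reduce_phase(pairs):
    """Reducer: add up the counts emitted by all mappers, per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical blocks of the big email file, one per node.
blocks = [
    "I want a refund. Please send the Refund today!",
    "Where is my order?",
    "Refund requested twice, no refund received.",
]
pairs = [pair for block in blocks for pair in map_phase(block)]
result = reduce_phase(pairs)
```

Only the small (key, count) pairs travel between machines, not the emails themselves.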
How To Access HDFS
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
Via The Normal Languages
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
[Diagram: tools running on Map Reduce]
• Map Reduce
• Hive: SQL Like
• Pig / Cascading: Scripting Like
• Giraph: Graph Oriented
• Mahout: ML Engine
Pro & Con
[Diagram: Hadoop Storage (HDFS / HBase / Solr), Map Reduce, Hive, Pig / Cascading, Giraph, Mahout]
Store
ETL: Extract / Transform / Load
DB / Key Value Store
Visualize
Pro: way better than traditional BI
Con: Heavy tech involvement. 12-18 months for a non-tech company to implement a schema
Hadoop 2.0
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce, Spark, Tez
[Diagram: tools on each engine]
• Map Reduce: Hive, Pig / Cascading, Giraph, Mahout
• Spark: Hive, Pig / Cascading, Giraph, Mahout
• Tez: Pig / Cascading, Hive
• Impala / Presto
• H2O / Oryx
SQL Like, Scripting Like, Graph Oriented, ML Engine
Store in DB
Visualize
Why Is It So Hard To Become Data Driven?
Ingredients of Data Products
Ask: The question? The need? The Why?
Measure: The Data? The features? The algorithms?
Team: The right Skills? Collaboration
All of them are necessary - None of them is sufficient!
How To Ingest Ideas
Hack-Days & Incubator
Internal Process
External Competition
Close Collaboration between Business & Data Scientists
“All we do is Data” - Jeff Weiner
Old vs. New
Aspect | Old School | Today / Big Data
Data Amount | Gigabytes & Terabytes | Petabytes & Exabytes
IT Infrastructure | Centralized | Decentralized / Parallelized
Data Types | Structured | Structured & unstructured
Schema | Stable schema | Schema on the fly
When and how is the ASK formulated? | Set ask | Ad-hoc ask
How to build a Data Team
Data Scientist
BI Analyst
Engineer
Product Manager
Communication Skills, Domain Knowledge
There Is NO Data Science Shortage
Source: World Economic Forum - Human Capital Report 2016
There are 9 Million Data Enabled People
In the EU, insurers will no longer be allowed to take the gender of their customers into account for insurance premiums:
● young men's premiums will fall by up to 10%
● young women's premiums will rise by up to 30%
by: BBC News: http://www.bbc.com/news/business-12608777
Not Everything That Is Possible Is Legal
Let me analyze your Social Network Connections. If they are “trustworthy”, you will get credit more easily.
Ethical or Not?
by: BBC News: http://www.bbc.com/news/business-12608777
How About Community Profiling?
Nobel Worthy!
Muhammad Yunus
Photo by University of Salford under Creative Commons CC BY 2.0
Thank You