![Page 1: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/1.jpg)
LutzFinger.com
How to extract significant business value from big data
September 20th 2016
![Page 2: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/2.jpg)
Lutz & Matt
![Page 3: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/3.jpg)
Disclaimer
This presentation is solely our opinion and not necessarily the opinion of our employers Harvard, LinkedIn or Cornell.
![Page 4: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/4.jpg)
Agenda: 9:00 - 17:00
9:00 The right Ask
9:45 Teamwork: Discover an Ask
10:30 Coffee Break
10:45 Data is King
11:15 Decision Tree
13:00 Lunch
14:00 Pitfalls with Data
14:30 Teamwork: Which Data?
15:30 Coffee Break
15:45 Innovation & Technology
16:30 Build A Team
16:45 Privacy & Ethics
![Page 5: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/5.jpg)
Hype About Data
![Page 6: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/6.jpg)
Hyped Data Scientists
image by Mike under Creative Commons
![Page 7: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/7.jpg)
McKinsey Study Forecast:
10 Times More Managers per Data Savvy Person
![Page 8: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/8.jpg)
?
![Page 9: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/9.jpg)
SCHOOLS COMPANIES KNOWLEDGE SKILLS MEMBERS JOBS
LinkedIn's vision is to create economic opportunity for every member of the global workforce.
![Page 10: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/10.jpg)
Actionable Insights
![Page 11: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/11.jpg)
ASK the right Questions.
MEASURE the right data – even if it is not Big data.
Take Actions and LEARN from them.
?
![Page 12: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/12.jpg)
BIG DATA IS “BULLSHIT”
![Page 13: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/13.jpg)
To Get Data is EASY
To Get The Right Data is HARD
To Get Insights is EASY
To Make Money from Data/Insights is HARD
![Page 14: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/14.jpg)
THE ASK is the hardest part, but there are many use cases to get started.
![Page 15: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/15.jpg)
The Right Question
![Page 16: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/16.jpg)
Google had the right Question
is difficult to find
![Page 17: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/17.jpg)
Fisheye Learning
![Page 18: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/18.jpg)
Data Without Action
300+ Million Members at LinkedIn
60,000 with a Job Title that might fit
19,000 who switched after 3 to 8 years
24 who had the same career path
![Page 19: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/19.jpg)
Data by itself is USELESS
Information by itself is often USELESS
Only Action Counts!
Data
Reporting
prescriptive, predictive, actionable, data science … the holy grail
![Page 20: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/20.jpg)
How To Work With Data?

| Past | Present | Future |
|---|---|---|
| What happened? | What is happening? | What is likely to happen? |
| Reporting, Dashboards | Real-Time Analytics | Predictive Analytics |
| Forensics & Data Mining | Real-Time Data Mining | Prescriptive Analytics |
| Why did it happen? | Why is it happening? | What should I do about it? |

Ref. Gartner
![Page 21: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/21.jpg)
Easiest - Start With Reporting
LinkedIn’s LMI Tool
![Page 23: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/23.jpg)
We Want Predictions
![Page 24: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/24.jpg)
We Want Monetization
![Page 25: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/25.jpg)
Examples At LinkedIn
People You May Know
Groups You May Like
Ads in Which You May Be Interested
Companies You May Want to Follow
Pulse
Similar Profiles
![Page 26: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/26.jpg)
Many Other Good Ideas
• Banking: Card Fraud Detection
• Banking: Credit Scoring
• Media: Content Recommendation
• Health Care: Fraud Detection
• Medicine: Image Processing
• Medicine: Outlier Detection
• Education: Course Improvement
• Retail: Likelihood to Buy
• Books: Marketing Planning
• Manufacturing: Machine Failure Prediction
• Manufacturing: Optimization
• Insurance: Likelihood & Pricing
• Transportation: Route Planning
• Energy: Grid Utilization
![Page 27: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/27.jpg)
About Innovation
By Alistair Croll
![Page 28: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/28.jpg)
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
What Would You Like To Do With Data?
○ Is it Actionable? “So What?”
○ Is it Reporting or Predictions?
○ Is it Sustaining, Adjunct or Disruptive?
Please Stay REAL!
![Page 29: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/29.jpg)
Agenda: 9:00 - 17:00
9:00 The right Ask
9:45 Teamwork: Discover an Ask
10:30 Coffee Break
10:45 Data is King
11:15 Decision Tree
13:00 Lunch
14:00 Pitfalls with Data
14:30 Teamwork: Which Data?
15:30 Coffee Break
15:45 Innovation
16:15 Technology
16:45 Build A Team
![Page 30: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/30.jpg)
“Data is the new oil”- World Economic Forum
![Page 31: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/31.jpg)
“DATA IS THE NEW OIL”
Oil Mine the oil
Use the oil
Goal
![Page 32: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/32.jpg)
4 V's OF “BIG DATA”
Volume: Data at scale (TB, PB …)
Variety: Data in many forms (structured, unstructured …)
Velocity: Speed (streaming, real time, near time …)
Veracity: Uncertainty (imprecise, not always up-to-date …)
![Page 33: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/33.jpg)
DATA
Categorical:
• Ordinal: Monday, Tuesday, Wednesday
• Nominal: Man, Woman
Quantitative:
• Ratio: Kelvin, Height, Weight
• Interval: Celsius, Fahrenheit
Structure:
• Structured
• Unstructured
• Semi-structured / Meta data
Read more: “On the Theory of Scales of Measurement”, S. Stevens, 1946
![Page 34: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/34.jpg)
What Has Troubled The Media Industry?
![Page 35: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/35.jpg)
The Media Industry Is One Step Removed From The Customer
Photo by Norimutsu Nogami under the Creative Commons (CC BY 2.0)
They Do Not Know Who Reads What & When
![Page 36: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/36.jpg)
Facebook Knows
* members only, not necessarily ‘active’ members
![Page 37: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/37.jpg)
& Size Matters
Network Size (Proportion by Members*)
* members only, not necessarily ‘active’ members
![Page 38: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/38.jpg)
“Data is the new oil”- World Economic Forum
Photo by William Warby under the Creative Commons (CC BY 2.0)
![Page 39: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/39.jpg)
$3.2 billion
![Page 40: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/40.jpg)
Prediction
Photo by KOMUnews under the Creative Commons (CC BY 2.0)
Boring could be the New Sexy!
![Page 41: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/41.jpg)
Innovation To Get Data
from Marketing Material of Ursa Space Systems
![Page 42: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/42.jpg)
Governments Also Take Part
![Page 43: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/43.jpg)
Public Data is Not Competitive
![Page 44: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/44.jpg)
Look For Data Only You Own
taken from http://blogs.ubc.ca/mdaw15/2013/11/15/ipo-twitter-vs-facebook/
![Page 45: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/45.jpg)
Data Might (Not) Be A Barrier To Entry
![Page 46: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/46.jpg)
Data Might (Not) Be A Barrier To Entry
![Page 47: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/47.jpg)
Data Is King
but not all data is equal.
![Page 48: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/48.jpg)
The Tale of “Social Media” Data
Source: ‘Ask Measure Learn’ by O’Reilly Media
![Page 49: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/49.jpg)
Structured Data Is Often Better
New York Weather in April 2013
Source: ‘Ask Measure Learn’ by O’Reilly Media
![Page 50: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/50.jpg)
Sometimes, it’s worth it.
Source: Jeffrey Breen
RE @dave_mcgregor: Publicly pledging to never fly @delta again. The worst airline ever. U have lost my patronage forever du to ur incompetence
Completely unimpressed with @continental or @united. Poor communication, goofy reservations systems and all to turn my trip into a mess.
@SouthWestAir I know you don't make the weather. But at least pretend I am not a bother when I ask if the delay will make miss my connection
![Page 51: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/51.jpg)
![Page 52: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/52.jpg)
Pregnant Or Not?
![Page 53: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/53.jpg)
Decision Trees Step by Step
by Maciej Lewandowski under Creative Commons (CC BY-SA 2.0)
![Page 54: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/54.jpg)
Split Apples & Mandarins
![Page 55: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/55.jpg)
What Is The Target Variable?
![Page 56: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/56.jpg)
What Are The Features That Describe The Target?
![Page 57: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/57.jpg)
What Are The Features That Describe The Target?
• Weight: light, medium, heavy - or x gram
• Shape: round or not
• Color: green, orange, red
• Surface: flat or porous surface
• …
![Page 58: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/58.jpg)
Which Feature Works Best?
● The variable with the most important information about the target variable.
● Which variable splits the group into subsets that are as homogeneous as possible with respect to the target variable? (pure vs. impure)
![Page 59: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/59.jpg)
Color Red?
Color Orange?
Split on Color Red vs. Split on Color Orange
Which One Is Better?
![Page 60: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/60.jpg)
We Need A Way To Describe Chaos
"Claude Elwood Shannon (1916-2001)" by Source. Licensed under Fair use via Wikipedia
![Page 61: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/61.jpg)
ENTROPY
Entropy is a measure of disorder.
Entropy only tells us how impure one individual subset is.
![Page 62: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/62.jpg)
ENTROPY & PROBABILITY
entropy = -p1 * log2(p1) - p2 * log2(p2) - …
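The entropy formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming base-2 logarithms (as in the worked numbers later in the deck); the function name `entropy` and the sample probabilities are our own choices, not part of the slides.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: the sum of -p * log2(p) over all classes."""
    # Classes with p == 0 are skipped: they contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

pure = entropy([1.0])        # a subset with a single class: no disorder
mixed = entropy([0.5, 0.5])  # two equally likely classes: maximum disorder
```

A pure subset scores 0 and an evenly mixed two-class subset scores 1, which is why a value close to 1 is read as "very impure".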
![Page 63: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/63.jpg)
● Highest Entropy Reduction
● Highest Information Gain
![Page 64: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/64.jpg)
1st. Entropy Without Split
entropy = -p1 * log2(p1) - p2 * log2(p2)
Apples: 8 out of 15, p(apple) = 8/15
Mandarins: 7 out of 15, p(mandarin) = 7/15
ENTROPY (Without Split):
= -p(apple) * log2(p(apple)) - p(mandarin) * log2(p(mandarin))
= 0.996791632 ≈ 1
very impure
![Page 65: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/65.jpg)
entropy = -p1 * log2(p1) - p2 * log2(p2)
Color Red?
ENTROPY (Split on Red=’no’): = -6/8*(log2(6/8)) - 2/8*(log2(2/8)) = 0.81
ENTROPY (Split on Red=’yes’): = -6/7*(log2(6/7)) - 1/7*(log2(1/7)) = 0.59
ENTROPY (After Split on Red):
= 8/15 * ENTROPY (Split on Red=’no’) + 7/15 * ENTROPY (Split on Red=’yes’)
= 0.43 + 0.28 = 0.71
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.71 = 0.29
Color Orange?
ENTROPY (Split on Orange=’yes’): = -6/6*(log2(6/6)) = 0
ENTROPY (Split on Orange=’no’): = -8/9*(log2(8/9)) - 1/9*(log2(1/9)) = 0.50
ENTROPY (After Split on Orange):
= 6/15 * ENTROPY (Split on Orange=’yes’) + 9/15 * ENTROPY (Split on Orange=’no’)
= 0 + 0.30 = 0.30
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.30 = 0.70
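The two candidate splits can be recomputed with a short Python sketch. The subset sizes (8 and 7 items for Red, 9 and 6 for Orange) and their class proportions come from the slide; which class is the apple in each subset is our assumption and does not affect the entropy. The function names are illustrative only. Each subset is weighted by its share of the 15 fruits, so the 9-item Orange=’no’ subset gets weight 9/15 and the weighted entropy after the Orange split is about 0.30.

```python
import math

def entropy(counts):
    """Entropy in bits of a subset described by its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after

parent = [8, 7]  # 8 apples, 7 mandarins

# Counts are [apples, mandarins]; the class assignment is assumed.
ig_red = information_gain(parent, [[2, 6], [6, 1]])     # Red = no, Red = yes
ig_orange = information_gain(parent, [[8, 1], [0, 6]])  # Orange = no, Orange = yes
```

With unrounded values the gain is about 0.29 for Red and about 0.69 for Orange (0.70 when the starting entropy is rounded up to 1), so Orange remains the better first split.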
![Page 66: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/66.jpg)
INFORMATION GAIN (IG)
Information Gain measures how much a given feature improves (decreases) entropy over the whole segmentation it creates.
How important is this feature for the prediction?
![Page 67: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/67.jpg)
Decision Tree
Color Orange? ROOT NODE
LEAVES
![Page 68: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/68.jpg)
Decision Tree
Color Orange?
Decision Tree Structure
![Page 69: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/69.jpg)
Which Feature Would Be Better?
![Page 70: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/70.jpg)
Heavy?
Always Start With Highest IG
![Page 71: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/71.jpg)
BigML
Competitors:
● Algorithms.io
● SnapAnalytx
● Wise.io
● Predixion Software
● Google Prediction API
![Page 72: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/72.jpg)
Pregnant Or Not?
![Page 73: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/73.jpg)
Get Source
• Drag & Drop
• Often by Connecting
![Page 74: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/74.jpg)
One Click DataBase
• Sense Check
• Any Outliers / Anything Strange
![Page 75: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/75.jpg)
Split Training & Testing
![Page 76: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/76.jpg)
Configure Model
Select The Objective Field - What To Train The Model On?
![Page 77: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/77.jpg)
Done
![Page 78: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/78.jpg)
Right hand column displaying scroll over for this high confidence node
![Page 79: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/79.jpg)
Highest Information Gain
![Page 80: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/80.jpg)
Now What?
![Page 81: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/81.jpg)
Predicting
![Page 82: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/82.jpg)
Predicting
![Page 83: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/83.jpg)
Half Pregnant?
![Page 84: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/84.jpg)
CONFUSION MATRIX

| Classifier \ Reality | Bought | Did Not Buy |
|---|---|---|
| Bought | (A) true positive | (B) false positive |
| Did Not Buy | (C) false negative | (D) true negative |
![Page 85: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/85.jpg)
Business Decision: Cut-Off Value
It depends on the Ask
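How the cut-off value shapes the confusion matrix can be sketched as follows. This is an illustration under assumptions: the scores, the purchase outcomes and the two cut-offs are made-up sample values, and `confusion_matrix` is a hypothetical helper, not a function of any tool shown in the deck.

```python
def confusion_matrix(scores, bought, cutoff):
    """Count the four cells (A, B, C, D) for a given cut-off value."""
    cells = {"A_true_positive": 0, "B_false_positive": 0,
             "C_false_negative": 0, "D_true_negative": 0}
    for score, did_buy in zip(scores, bought):
        predicted_buy = score >= cutoff  # the business decision lives here
        if predicted_buy and did_buy:
            cells["A_true_positive"] += 1
        elif predicted_buy:
            cells["B_false_positive"] += 1
        elif did_buy:
            cells["C_false_negative"] += 1
        else:
            cells["D_true_negative"] += 1
    return cells

# Hypothetical model scores and what the customers really did.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
bought = [True, True, False, True, False, False]

strict = confusion_matrix(scores, bought, cutoff=0.7)
lenient = confusion_matrix(scores, bought, cutoff=0.35)
```

The strict cut-off avoids false positives but misses one buyer; the lenient one catches every buyer at the price of one false positive. Which trade-off is right depends on the Ask.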
![Page 86: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/86.jpg)
TRUE NEGATIVE RATE (Specificity)
# of true negatives / # of actual negatives (truth = Did Not Buy)
also: Specificity = 1 - False positive rate

| Classifier \ Truth | Bought | Did Not Buy |
|---|---|---|
| Bought | true positive | false positive |
| Did Not Buy | false negative | true negative |
![Page 87: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/87.jpg)
PRECISION
# of true positives / total in this prediction class (everything the classifier labeled “Bought”)

| Classifier \ Truth | Bought | Did Not Buy |
|---|---|---|
| Bought | true positive | false positive |
| Did Not Buy | false negative | true negative |
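Specificity and precision, as defined on the last two slides, reduce to simple ratios of the confusion-matrix cells. The sketch below uses invented counts purely for illustration; the function names are our own.

```python
def specificity(tn, fp):
    """True negatives divided by all actual negatives (Did Not Buy)."""
    return tn / (tn + fp)

def false_positive_rate(tn, fp):
    """False positives divided by all actual negatives."""
    return fp / (fp + tn)

def precision(tp, fp):
    """True positives divided by everything predicted as 'Bought'."""
    return tp / (tp + fp)

# Hypothetical cell counts, not taken from the slides.
tp, fp, fn, tn = 30, 10, 20, 40

spec = specificity(tn, fp)         # 40 / 50
fpr = false_positive_rate(tn, fp)  # 10 / 50
prec = precision(tp, fp)           # 30 / 40
```

Note that specificity reads down the "Did Not Buy" truth column, while precision reads across the "Bought" classifier row, so the two can move independently when the cut-off changes.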
![Page 88: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/88.jpg)
ROC CURVE
Better Model
Worse Model
![Page 89: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/89.jpg)
Using The Model
![Page 90: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/90.jpg)
Using The Model
![Page 91: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/91.jpg)
Now How Can I Improve the Quality?
![Page 92: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/92.jpg)
![Page 93: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/93.jpg)
The Tale of Big Data
![Page 94: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/94.jpg)
Overfitting
Tailoring a model to the training data at the expense of generalizing to previously unseen data points. The model becomes perfect at describing noise and spurious correlations.
TRADE OFF
Complexity of a Model & Overfitting Likelihood
![Page 95: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/95.jpg)
The More Nodes - The More Likely To Overfit
![Page 96: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/96.jpg)
The Story of MORE Data
Decision Trees are good at identifying LOCAL patterns, but they often need more data.
by Claudia Perlich et al., “Tree Induction vs. Logistic Regression: A Learning-Curve Analysis”, Journal of Machine Learning Research 4 (2003) 211-255
![Page 97: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/97.jpg)
Correlation vs. Causation
![Page 98: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/98.jpg)
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
○ Do only you have this data?
○ Do you have a positive feedback loop?
○ Is the data sustainable?
○ Who else could get the data?
○ How much data is needed?
![Page 99: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/99.jpg)
![Page 100: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/100.jpg)
How Was Big Data Infrastructure Invented?
![Page 101: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/101.jpg)
Issue Of Yahoo
CENTRALIZED SYSTEMS ARE EXPENSIVE
• diminishing returns in power (overhead issue)
• exponential cost to scale
• slow to transport (ETL) the data
Scan 100 TB Datasets on a 1000 node cluster:
• Remote Storage @ 10 MB/s = 165 min
• Local Storage @ 200 MB/s = 8 min
MAKE SYSTEMS FAULT TOLERANT
1000 nodes - a machine a day will break
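The scan times on this slide follow from simple arithmetic, sketched below. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a perfectly even spread of the data, all nodes reading in parallel, and a 100 TB dataset, which is the size that reproduces the 165 min and 8 min figures. `scan_minutes` is a hypothetical helper name.

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes needed when every node scans its share of the data in parallel."""
    bytes_per_node = dataset_tb * 1e12 / nodes
    seconds = bytes_per_node / (mb_per_second * 1e6)
    return seconds / 60

remote = scan_minutes(dataset_tb=100, nodes=1000, mb_per_second=10)
local = scan_minutes(dataset_tb=100, nodes=1000, mb_per_second=200)
```

Each node handles 100 GB, so the remote read takes roughly 167 minutes and the local read roughly 8 minutes: moving the computation to the data is about 20 times faster than moving the data to the computation.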
![Page 102: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/102.jpg)
The Vision
CHEAP Systems
• can run on commodity hardware
DECENTRALIZED Computation
• ability to ‘dispatch’ a task
• parallelize work-streams
Fault TOLERANT
• a failure, no matter where and when, is not an issue
![Page 103: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/103.jpg)
![Page 104: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/104.jpg)
Typical Workflow
· Load data into the cluster (HDFS writes)
· Analyze the data (Map Reduce)
· Store results in the cluster (HDFS writes)
· Read the results from the cluster (HDFS reads)
Sample Scenario:
How many times did our customers type the word “Refund” into emails sent to customer service?
Huge file containing all emails sent to customer service (File.txt)
Ref. BradHedlund.com
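The “Refund” scenario can be imitated on a single machine to show the shape of a map and reduce step. This is a plain-Python sketch, not Hadoop code: the three sample email chunks are invented, and `map_chunk` and `reduce_counts` are hypothetical names. In a real cluster each chunk would be a block of the file stored on a different node.

```python
import re
from functools import reduce

def map_chunk(chunk):
    """Map step: count whole-word, case-insensitive matches of 'refund' in one chunk."""
    return len(re.findall(r"\brefund\b", chunk, flags=re.IGNORECASE))

def reduce_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Invented stand-ins for blocks of the huge email file.
chunks = [
    "Please send my Refund today. I asked for a refund last week.",
    "Where is my order?",
    "Refund, please!",
]

partial_counts = [map_chunk(chunk) for chunk in chunks]  # one result per chunk
total = reduce(reduce_counts, partial_counts, 0)
```

Only the small partial counts travel between the steps, never the email text itself, which is what keeps the approach cheap on a large cluster.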
![Page 105: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/105.jpg)
How To Access HDFS
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
![Page 106: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/106.jpg)
Via The Normal Languages
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
• MapReduce
• Hive (SQL Like)
• Pig / Cascading (Scripting Like)
• Giraph (Graph Oriented)
• Mahout (ML Engine)
![Page 107: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/107.jpg)
Pro & Con
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
• MapReduce
• Hive (SQL Like)
• Pig / Cascading (Scripting Like)
• Giraph (Graph Oriented)
• Mahout (ML Engine)
Store
ETL: Extract / Transform / Load
DB / Key Value Store
Visualize
Pro: way better than traditional BI
Con: Heavy tech involvement. 12-18 months for a non-tech company to implement a schema
![Page 108: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/108.jpg)
Hadoop 2.0
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce / Spark / Tez
• On Map Reduce: MapReduce, Hive, Pig / Cascading, Giraph, Mahout
• On Spark: Spark, Hive, Pig / Cascading, Giraph, Mahout
• On Tez: Tez, Pig / Cascading, Hive
• Impala / Presto
• H2O / Oryx
SQL Like, Scripting Like, Graph Oriented, ML Engine
Store in DB
Visualize
Visualize
![Page 109: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/109.jpg)
Why Is It So Hard To Become Data Driven
![Page 110: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/110.jpg)
Ingredients of Data Products
Ask: The question? The need? The Why?
Measure: The Data? The features? The algorithms?
Team: The right Skills? Collaboration
All of them are necessary - None of them are sufficient!
![Page 111: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/111.jpg)
How To Ingest Ideas
Hack - Days & Incubator
Internal Process
External Competition
Close Collaboration between Business & Data Scientists
“All we do is Data” - Jeff Weiner
![Page 112: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/112.jpg)
![Page 113: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/113.jpg)
![Page 114: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/114.jpg)
![Page 115: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/115.jpg)
Lu
tzFi
nger
.com
Old vs. New
Old School Today / Big data
Data Amount Gigabytes & Terabytes Petabytes & Exabytes
IT Infrastructure Centralized Decentralized / Parallelized
Data Types
Schema
When and How is the ASK formulated?
![Page 116: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/116.jpg)
Lu
tzFi
nger
.com
Old vs. New
Old School Today / Big data
Data Amount Gigabytes & Terabytes Petabytes & Exabytes
IT Infrastructure Centralized Decentralized / Parallelized
Data Types Structured Structured & Unstructured
Schema
When and How is the ASK formulated?
![Page 117: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/117.jpg)
Lu
tzFi
nger
.com
Old vs. New
Old School Today / Big data
Data Amount Gigabytes & Terabytes Petabytes & Exabytes
IT Infrastructure Centralized Decentralized / Parallelized
Data Types Structured Structured & unstructured
Schema Stable schema Schema on the fly
When and How is the ASK formulated?
![Page 118: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/118.jpg)
Old vs. New

| | Old School | Today / Big data |
|---|---|---|
| Data Amount | Gigabytes & Terabytes | Petabytes & Exabytes |
| IT Infrastructure | Centralized | Decentralized / Parallelized |
| Data Types | Structured | Structured & unstructured |
| Schema | Stable schema | Schema on the fly |
| When and How is the ASK formulated? | Set ask | Ad-hoc ask |
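
The "Stable schema" versus "Schema on the fly" row can be illustrated with a minimal sketch. The event records, field names, and helper functions below are invented for illustration and are not from the workshop. The first function fixes the columns before loading, so unknown fields are lost. The second keeps the raw records and picks a field only when an ad-hoc ask arrives.

```python
import json

# Hypothetical raw events; the field names are invented for illustration.
RAW_EVENTS = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "country": "DE"}',
    '{"user": "c", "device": {"os": "ios"}}',
]

# Old school: the schema is fixed before any data is loaded.
STABLE_SCHEMA = ("user", "clicks")


def load_with_stable_schema(lines):
    """Keep only the predefined columns; other fields are dropped at load time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Missing columns become None, extra fields are discarded.
        rows.append(tuple(record.get(col) for col in STABLE_SCHEMA))
    return rows


def query_on_the_fly(lines, field):
    """Keep the raw records and interpret them only when a question is asked."""
    values = []
    for line in lines:
        record = json.loads(line)
        if field in record:
            values.append(record[field])
    return values


stable_rows = load_with_stable_schema(RAW_EVENTS)
countries = query_on_the_fly(RAW_EVENTS, "country")
```

With the stable schema, the `country` field never reaches the table, so a later question about countries cannot be answered without reloading. Reading the raw records on demand keeps that option open, at the cost of parsing the data again for every ask.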
![Page 119: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/119.jpg)
How to build a Data Team
![Page 120: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/120.jpg)
![Page 125: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/125.jpg)
Data Scientist
BI Analyst
Engineer
Product Manager
Communication Skills
Domain Knowledge
![Page 126: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/126.jpg)
There Is NO Data Science Shortage
Source: World Economic Forum - Human Capital Report 2016
![Page 127: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/127.jpg)
There are 9 Million Data Enabled People
![Page 128: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/128.jpg)
Agenda: 9:00 - 17:00
9:00 The right Ask
9:45 Teamwork: Discover an Ask
10:30 Coffee Break
10:45 Data is King
11:15 Decision Tree
13:00 Lunch
14:00 Pitfalls with Data
14:30 Teamwork: Which Data?
15:30 Coffee Break
15:45 Innovation & Technology
16:30 Build A Team
16:45 Privacy & Ethics
![Page 129: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/129.jpg)
Not Everything That Is Possible Is Legal
In the EU, insurers will no longer be allowed to take the gender of their customers into account for insurance premiums:
● young men's premiums will fall by up to 10%
● young women's premiums will rise by up to 30%
Source: BBC News: http://www.bbc.com/news/business-12608777
![Page 130: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/130.jpg)
How About Community Profiling?
Let me analyze your social network connections. If they are “trustworthy”, you will get credit more easily.
Ethical or Not?
Source: BBC News: http://www.bbc.com/news/business-12608777
![Page 131: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/131.jpg)
Nobel Worthy!
Muhammad Yunus
Photo by University of Salford under Creative Commons CC BY 2.0
![Page 132: Big it data workshop pub](https://reader031.vdocuments.mx/reader031/viewer/2022022413/589f4a9a1a28abec418b4ccd/html5/thumbnails/132.jpg)
Thank You