why why why
berlin pydata | @gabegaster | 2015 february
what is data science?
who is a data scientist?
review of literature
what is data science?
who is a data scientist? “a scientist who can code”
• lower barrier to attack new problems
• repeatable analysis
• freedom to think about problems in new ways
what is data science?
using emerging technologies to approach problems scientifically
which were difficult to answer before
computing has progressed
cost of new analysis
1950: years
today: hours
same person thinking about the problem can conduct experiments to answer it
computing has progressed
open-source code
standing on shoulders of giants
reinventing the wheel
what is data science?
using emerging technologies to approach problems scientifically
which were difficult to answer before
knowing what is possible
doing something useful
HOW WHY
what is data science?
using emerging technologies to approach problems scientifically
which were difficult to answer before
knowing what is possible
doing something useful
using new, good, the right tools
asking why
WHY WHY
why why why
what is data science?
science is about asking why
start there
goal: save money
task: find needle in the haystack (without poking yourself)
about patent
not about patent
turn over to plaintiff
don’t turn over to plaintiff
adverse inference
give away trade secrets
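The two ways of "poking yourself" on this slide can be read as the two error cells of a confusion matrix. A minimal sketch, assuming that withholding a document about the patent risks an adverse inference and that turning over a document not about the patent risks giving away trade secrets. The function name, outcome labels for the two correct cells, and the example data are hypothetical, not from the talk.

```python
from collections import Counter

# Assumed reading of the slide: each document is either about the patent or
# not, and is either turned over to the plaintiff or withheld.
OUTCOMES = {
    (True, True): "correctly turned over",
    (True, False): "adverse inference",        # about the patent but withheld
    (False, True): "give away trade secrets",  # not about the patent but turned over
    (False, False): "correctly withheld",
}


def tally_outcomes(about_patent, turned_over):
    """Count how many documents fall into each of the four outcomes."""
    counts = Counter({name: 0 for name in OUTCOMES.values()})
    for relevant, produced in zip(about_patent, turned_over):
        counts[OUTCOMES[(relevant, produced)]] += 1
    return dict(counts)


# Hypothetical labels for six documents (not data from the talk).
about_patent = [True, True, False, False, False, True]
turned_over = [True, False, True, False, False, True]
print(tally_outcomes(about_patent, turned_over))
```

Keeping the two error types separate, rather than folding them into one accuracy number, makes it possible to weigh them differently when the goal is to save money.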
goal: save money
prototype: design for lawyers
Sexier. Less nerdy. Tailored.
design for transparency
http://www.daegis.com/judicial-acceptance-of-tar/
task: classify schizophrenia with MRI
why? improve understanding of disease
how? … outside contest purview
why? outside contest purview
kaggle
getting data & making usable
WHY
AUC
what is AUC? Area Under Curve
what curve? Receiver Operating Characteristic
balances: True Positive Rate, False Positive Rate
why? …
upshot: choice of metric matters a LOT in practice
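To make the definition concrete, here is a plain-Python sketch of the area under the Receiver Operating Characteristic curve: sweep a threshold over the scores, record the (False Positive Rate, True Positive Rate) point at each step, and integrate with the trapezoid rule. The function names and the example labels and scores are made up for illustration. This is not the contest's scoring code.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Sweep the threshold from the highest score down; tied scores move together.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(labels, scores):
    """Area under the ROC curve, by the trapezoid rule."""
    points = roc_curve_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical example: 1 = positive, 0 = negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(labels, scores), 3))  # 0.889: 8 of 9 positive/negative pairs ranked correctly
```

Because AUC depends only on how the scores rank positives against negatives, it can move differently from accuracy at a fixed threshold.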
[plot: Accuracy of Classification (AUC) over the timeline of contest; labeled: random guess, basic SVM]
goal: depends on why
[plot: Accuracy of Classification (AUC) over the timeline of contest; labeled: me]
turned out to place 9th, because of overfitting
very common problem
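The slides do not say how the overfitting arose. One common mechanism in contests, shown here purely as an assumption, is choosing among many submissions by their score on a small public sample. The sketch below uses coin-flip "models" and plain accuracy instead of AUC for brevity. All names and sizes are hypothetical.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_of_random_submissions(n_submissions=200, n_public=50, n_private=2000, seed=0):
    """Pick the random submission that scores best on the small public set."""
    rng = random.Random(seed)
    n_total = n_public + n_private
    labels = [rng.randint(0, 1) for _ in range(n_total)]
    best_public, best_private = 0.0, 0.0
    for _ in range(n_submissions):
        # Each "model" is a coin flip, so its true accuracy is 50%.
        guesses = [rng.randint(0, 1) for _ in range(n_total)]
        public = accuracy(guesses[:n_public], labels[:n_public])
        if public > best_public:
            best_public = public
            best_private = accuracy(guesses[n_public:], labels[n_public:])
    return best_public, best_private


public_score, private_score = best_of_random_submissions()
print(f"public: {public_score:.2f}, private: {private_score:.2f}")
```

The selected submission looks well above chance on the public sample, while its score on the larger held-out set stays near 50%, which is the kind of gap a final ranking exposes.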
[plot: Accuracy of Classification (AUC) over the timeline of contest]
worth it?
Chicago Bike Share System!!
kind of like call-a-bike
Show what I like about bike share!
Think about how bike share has changed geography
Public Transit on the grid + Diagonals = Difficult
2+ buses = FAIL
viz goal:
show how Divvy has changed where people can go
show where people actually go