berlin pydata | @gabegaster | 2015 february what is data science?

Upload: gabriel-gaster

Post on 14-Jul-2015


TRANSCRIPT

berlin pydata | @gabegaster | 2015 february

what is data science?

who is a data scientist?

review of literature

who is a data scientist? “a scientist who can code”

• lower barrier to attack new problems
• repeatable analysis
• freedom to think about problems in new ways

what is data science?

using emerging technologies to approach problems scientifically

which were difficult to answer before

computing has progressed

cost of new analysis

1950: years

today: hours

same person thinking about the problem can conduct experiments to answer it

open-source code

standing on shoulders of giants

reinventing the wheel

what is data science?

using emerging technologies to approach problems scientifically

which were difficult to answer before

knowing what is possible (HOW): using new / good / the right tools

doing something useful (WHY): asking why

why why why

science is about asking why

start there

an anecdote

an example from the real world

goal: save money

task: find needle in the haystack (without poking yourself)

about patent / not about patent

turn over to plaintiff / don’t turn over to plaintiff

about patent, don’t turn over to plaintiff: adverse inference

not about patent, turn over to plaintiff: give away trade secrets

prototype: design for lawyers

Sexier. Less nerdy. Tailored.

design for transparency

http://www.daegis.com/judicial-acceptance-of-tar/

another example: contests

task: classify schizophrenia with MRI

why? improve understanding of disease

how? … outside contest purview

why? outside contest purview

kaggle

getting data & making usable

WHY

[chart: Accuracy of Classification (AUC) over timeline of contest]

what is AUC? Area Under Curve

what curve? Receiver Operating Characteristic

balances: True Positive Rate, False Positive Rate

why? …

upshot: choice of metric matters a LOT in practice
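A minimal sketch of the AUC idea from these slides, in plain Python. It sweeps a decision threshold over the scores, records the (False Positive Rate, True Positive Rate) points of the Receiver Operating Characteristic curve, and integrates the area under it. The labels and scores are made-up toy data, not the contest data, and the function names are only illustrative.

```python
def roc_points(labels, scores):
    """Return the ROC curve as (false positive rate, true positive rate) points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(labels, scores):
    """Area under the ROC curve, using the trapezoidal rule."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy data (assumed for illustration): 2 positives among 10 examples.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0.0] * 10  # never flags anyone
ranker = [0.4, 0.35, 0.3, 0.25, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]  # ranks positives first

for name, scores in [("always_negative", always_negative), ("ranker", ranker)]:
    print(name, "accuracy:", accuracy(labels, scores), "AUC:", auc(labels, scores))
```

On this toy data both score lists have the same accuracy at a 0.5 threshold, yet their AUC differs (0.5 versus 1.0), which is one way to see why the choice of metric matters.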

[chart: Accuracy of Classification (AUC) over timeline of contest, with reference levels marked: random guess, basic SVM]

goal? depends on why

[chart: Accuracy of Classification (AUC) over timeline of contest, with "me" marked alongside random guess and basic SVM]

turned out to place 9th, because of overfitting

very common problem
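One way overfitting shows up in contests is selection on a small public leaderboard. The sketch below is a toy simulation under stated assumptions, not a reconstruction of this contest: every "submission" is a coin flip with no skill, the public split is small, and the private split is large. The sizes, the seed, and the function names are all hypothetical.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def simulate(n_submissions=200, n_public=40, n_private=2000, seed=0):
    """Pick the best of many skill-free submissions by public score.

    Returns (public_score, private_score) of the submission that looked
    best on the public leaderboard.
    """
    rng = random.Random(seed)

    def coin():
        return 1 if rng.random() < 0.5 else 0

    public_labels = [coin() for _ in range(n_public)]
    private_labels = [coin() for _ in range(n_private)]

    best = None
    for _ in range(n_submissions):
        # Each submission guesses at random on both splits.
        public_preds = [coin() for _ in range(n_public)]
        private_preds = [coin() for _ in range(n_private)]
        public_score = accuracy(public_preds, public_labels)
        private_score = accuracy(private_preds, private_labels)
        if best is None or public_score > best[0]:
            best = (public_score, private_score)
    return best


public_score, private_score = simulate()
print("best public score:", public_score)
print("its private score:", private_score)
```

The winner on the small public split looks better than chance only because it was selected from many tries; on the larger private split it falls back toward 0.5.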

worth it?

an example

just for fun

Chicago Bike Share System!

kind of like call-a-bike

Show what I like about bike share!

Think about how bike share has changed geography

a typical trip for me

Bus transit times = a LIE

Chicago is a grid city

Public Transit on the grid + Diagonals = Difficult

2+ buses = FAIL
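A small sketch of why diagonal trips are awkward on a grid. It assumes, purely for illustration, that locations are (east-west, north-south) block coordinates and that each bus line runs straight along a single street; neither assumption comes from the talk, and the coordinates are hypothetical.

```python
import math


def grid_distance(a, b):
    """Distance when travel is restricted to the street grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def straight_line_distance(a, b):
    """Distance along the diagonal, as a bike could roughly approximate."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def needs_transfer(a, b):
    """Assumption: one bus line per street, so a trip that changes both
    coordinates cannot be made on a single line."""
    return a[0] != b[0] and a[1] != b[1]


# Hypothetical origin and destination, three blocks apart in each direction.
home, work = (0, 0), (3, 3)

print("grid distance:", grid_distance(home, work))
print("straight-line distance:", straight_line_distance(home, work))
print("needs a second bus:", needs_transfer(home, work))
```

For a perfectly diagonal trip the grid route is longer than the straight line by a factor of the square root of 2, and under the one-line-per-street assumption it always needs a transfer.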

Adding bikes to public transit = win

viz goal:

show how divvy has changed where people can go

show where people actually go

demo

in conclusion

thanks! @gabegaster