a tour of the data science process, a case study using movie industry data

A tour of the (a? many?) “data science process(es)”

Including a short case study using movie industry data.

Eduardo Ariño de la Rubia, Chief Data Scientist, Domino Data Lab

[email protected]

@earino

Rough Timeline of “Data Science” in My Life

1. 1996 - First account on a supercomputer (MasPar MP-1)

2. 1997 - Fell in love with Genetic Algorithms for job shop scheduling (PVM/MPI)

3. 1999 - Hired my first ML engineer (“I think aNNs may be useful for predicting users’ buying patterns.”)

4. 2003 - Expert / Fuzzy systems for accounting continuing education compliance

5. 2005 - ML (mostly aNNs) and Six Sigma statistical approaches for manufacturing

6. 2007 - Computer vision approaches for pre-press support (first “big” data, PB)

7. 2010 - ML for manufacturing automation (vision/job shop)

8. 2012 - A/B testing for effectiveness of new designs

9. 2013 - NLP for “jurisdictionally aware” obscenity detection

10. 2014 - Classifiers for “at risk” students intervention in Ed. Tech.

11. 2015 - Vendor (!)

What Kind of Data Scientist Are You?

Me. Probably.

@drewconway

@willynguen

Me. Definitely.

Thanks to:

@BecomingDataSci, @StephdeSilva, @josecamoessilva

How Are You Doing Data Science?

CRISP-DM: “de facto standard for developing data mining and knowledge discovery projects”

Between 2006 and 2008 a CRISP-DM 2.0 SIG was formed and there were discussions about updating the CRISP-DM process model. The current status of these efforts is not known. Neither the original crisp-dm.org website cited in the reviews nor the CRISP-DM 2.0 SIG website is still active.

A Case Study: So How Much Would That Movie Bank?

A Brief Disclaimer… I don’t, like… watch a lot of movies

I have paid actual real money to see these movies in theaters. I should probably not be trusted.

The IMDB Movie Dataset is pretty massive...

And not exactly friendly...

Signs You’re Doing Data Science

1. The problem is poorly specified

2. The data is messy and unstructured

3. Unique entities aren’t

Make Your Data Tidy

1. Each variable is a column

2. Each observation is a row

3. Each type of observational unit is a table
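A minimal sketch of what the three rules mean in practice, using `tidyr` from the package list later in this talk. The studios, years, and figures are invented for illustration and are not from the movie dataset; `gross_busd` is a hypothetical column name.

```r
library(tidyr)

# Hypothetical "messy" table: the year variable is spread across column names.
# Studio names and figures are invented for illustration.
messy <- data.frame(
  studio = c("Studio A", "Studio B"),
  `2014` = c(1.2, 0.8),
  `2015` = c(1.5, 0.6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Tidy form: each row is one (studio, year) observation,
# and year and gross are each a single column.
tidy <- pivot_longer(
  messy,
  cols = c(`2014`, `2015`),
  names_to = "year",
  values_to = "gross_busd"
)

# Column names arrive as text; store the year as a number.
tidy$year <- as.integer(tidy$year)

print(tidy)
```

The tidy table has four rows (two studios times two years), so grouping, filtering, and plotting by year no longer require touching column names.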

Tidy Manifesto:

1. Share data structures

2. Compose simple pieces

3. Embrace functional programming

4. Write for humans

Signs You’re Doing Data Science

1. You want to make your data “tidy”

2. Your data surprises you with its wrongness…

Can Anyone Guess What is Wrong???

Wait… what?

Movies seem to be making more and more of their money in worldwide receipts… It seems like the domestic gross is just about covering costs...

Some Movies Make Amazing Multiples!

Interesting Factoids - Ratios

1. The movie with the highest “ratio” that won an Academy Award: Rocky

2. Braveheart had the lowest “ratio” (it made back 2.9 times its budget) of any 1990s movie that won the Academy Award for Best Picture

3. Sports movies and adult movies have the best ratios. Crime movies and Westerns have the worst.

4. Movies that won an Academy Award return, on average, 132% more per dollar invested at the box office than movies that didn’t.
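The slides never define the “ratio”. One plausible reading, assumed here, is worldwide gross divided by production budget. The sketch below reuses the two column names that appear in the `cor()` call in this talk; the rows themselves are invented for illustration.

```r
# Hypothetical rows; titles and dollar amounts are invented for illustration.
movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  `Production Budget` = c(1e6, 50e6, 200e6),
  `Worldwide Gross` = c(20e6, 145e6, 300e6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumed definition: dollars grossed worldwide per dollar of budget.
movies$ratio <- movies$`Worldwide Gross` / movies$`Production Budget`

# Sort so the highest multiple comes first.
movies <- movies[order(-movies$ratio), ]

print(movies[, c("title", "ratio")])
```

Under this definition a ratio of 2.9, like the one quoted for Braveheart, means the movie grossed 2.9 times its budget.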

Prepare to be Disappointed...

> cor(movie_and_genre$`Production Budget`, movie_and_genre$`Worldwide Gross`)
[1] 0.7359111
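One way to read that number, as a rough sketch: for a straight-line fit with a single predictor, the share of variance explained equals the squared Pearson correlation. The variable names below are illustrative; only the value of `r` comes from the slide.

```r
# Pearson correlation between production budget and worldwide gross,
# copied from the console output on the slide.
r <- 0.7359111

# For a one-predictor linear fit, R-squared is the square of r.
r_squared <- r^2

print(round(r_squared, 2))
```

So budget alone would account for a bit over half of the variation in worldwide gross, before any modelling effort at all.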

Analysis on the Analysis

Total Hours Spent - 4

Total Number of Commands - 497

Total Number of Packages Used - 12

RCurl

lubridate

ggplot2

h2o

readr

dplyr

rvest

tidyr

caret

shiny

stringr

purrr

Total Number of Plots Generated - 25

Total Amount of Data Downloaded - 3.4 GB

Total Number of Models Trained - 8

REMEMBER THIS GUY?!

Thanks for Your Time!
