a tour of the data science process, a case study using movie industry data
A tour of the (a? many?) “data science process(es)”
Including a short case study using movie industry data.
Eduardo Ariño de la Rubia
Chief Data Scientist, Domino Data Lab
[email protected] @earino
Rough Timeline of “Data Science” in My Life
1. 1996 - First account on a supercomputer (MasPar MP-1)
2. 1997 - Fell in love with Genetic Algorithms for job shop scheduling (PVM/MPI)
3. 1999 - Hired my first ML engineer (“I think aNNs may be useful for predicting users buying patterns.”)
4. 2003 - Expert / Fuzzy systems for accounting continuing education compliance
5. 2005 - ML (mostly aNNs) and Six Sigma statistical approaches for manufacturing
6. 2007 - Computer vision approaches for pre-press support (first “big” data, PB)
7. 2010 - ML for manufacturing automation (vision/job shop)
8. 2012 - A/B testing for effectiveness of new designs
9. 2013 - NLP for “jurisdictionally aware” obscenity detection
10. 2014 - Classifiers for “at risk” students intervention in Ed. Tech.
11. 2015 - Vendor (!)
What Kind of Data Scientist Are You?
Me. Probably.
@drewconway
@willynguen
Me. Definitely.
Thanks to:
@BecomingDataSci @StephdeSilva @josecamoessilva
How Are You Doing Data Science?
CRISP-DM: “de facto standard for developing data mining and knowledge discovery projects”
Between 2006 and 2008 a CRISP-DM 2.0 SIG was formed and there were discussions about updating the CRISP-DM process model. The current status of these efforts is not known. However, the original crisp-dm.org website cited in the reviews and the CRISP-DM 2.0 SIG website are both no longer active.
© Szilard Pafka
© UC Berkeley, Understanding Science
A Case Study: So How Much Would That Movie Bank?
A Brief Disclaimer… I don’t, like… watch a lot of movies
I have paid actual real money to see these movies in theaters. I should probably not be trusted.
The IMDB Movie Dataset is pretty massive...
And not exactly friendly...
Signs You’re Doing Data Science
1. The problem is poorly specified
2. The data is messy and unstructured
3. Unique entities aren’t
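The third sign can be illustrated in a few lines. This is a minimal sketch in base R, assuming made-up title strings (not taken from the movie data) and a hypothetical helper called normalize_title. It shows how one movie can look like several entities until the key is cleaned up:

```r
# Hypothetical titles: the same movie spelled three ways, plus one other movie.
titles <- c("Rocky", "rocky ", "ROCKY", "Braveheart")

# Hypothetical helper: lower-case, drop punctuation, collapse and trim whitespace.
normalize_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

length(unique(titles))                   # distinct raw strings
length(unique(normalize_title(titles)))  # distinct titles after normalization
```

Real entity resolution usually needs more than string cleanup (release years, remakes, alternate titles), so treat this only as the first step.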
Make Your Data Tidy
1. Each variable is a column
2. Each observation is a row
3. Each type of observational unit is a table.
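As a rough illustration of the three rules, here is a sketch in base R with invented numbers and invented column names (movie, domestic_gross, foreign_gross are assumptions, not the deck's actual columns). The wide table hides the "market" variable inside its column names; the tidy table gives it a column of its own:

```r
# Hypothetical wide table: one row per movie, one gross column per market.
wide <- data.frame(
  movie = c("Movie A", "Movie B"),
  domestic_gross = c(100, 40),
  foreign_gross = c(150, 10),
  stringsAsFactors = FALSE
)

# Tidy form: one row per (movie, market) observation, one column per variable.
tidy <- data.frame(
  movie = rep(wide$movie, times = 2),
  market = rep(c("domestic", "foreign"), each = nrow(wide)),
  gross = c(wide$domestic_gross, wide$foreign_gross),
  stringsAsFactors = FALSE
)

tidy
```

In practice a reshaping function from tidyr (one of the packages listed later in the deck) would likely do this in one call; base R is used here only to keep the sketch dependency-free.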
Tidy Manifesto:
1. Share data structures
2. Compose simple pieces
3. Embrace functional programming
4. Write for humans
Signs You’re Doing Data Science
1. You want to make your data “tidy”
2. Your data surprises you with its wrongness…
Can Anyone Guess What is Wrong???
Wait… what?
Movies seem to be making more and more of their money in worldwide receipts… It seems like the domestic gross is just about covering costs...
Some Movies Make Amazing Multiples!
Interesting Factoids - Ratios
1. The movie with the highest “ratio” that won an Academy Award: Rocky
2. Braveheart had the lowest “ratio” (it made back 2.9 times its budget) of any 1990s movie to win the Academy Award for Best Picture
3. Sports movies and adult movies make back the best ratios. Crime movies and Westerns make back the worst.
4. Movies that won an Academy Award make, on average, 132% more return per invested dollar at the box office than movies that didn’t.
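The “ratio” in these factoids is not defined on the slides. The sketch below assumes it means worldwide gross divided by production budget, and uses invented movies and numbers. The column names `Production Budget` and `Worldwide Gross` match the ones in the talk's cor() call; the data frame name movies and the title column are hypothetical:

```r
# Hypothetical data, in millions of dollars; not the talk's actual dataset.
movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  `Production Budget` = c(10, 50, 200),
  `Worldwide Gross` = c(100, 145, 300),
  check.names = FALSE  # keep the spaces in the column names
)

# Assumed definition: dollars grossed worldwide per dollar of budget.
movies$ratio <- movies$`Worldwide Gross` / movies$`Production Budget`

# Rank movies from best to worst multiple.
movies[order(-movies$ratio), c("title", "ratio")]
```

Under this definition a ratio of 2.9 reads as "made back 2.9 times its budget", which is how the Braveheart factoid is phrased.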
Prepare to be Disappointed...
> cor(movie_and_genre$`Production Budget`, movie_and_genre$`Worldwide Gross`)
[1] 0.7359111
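One way to read that number: cor() returns the Pearson correlation by default, and squaring it gives the share of variance that a straight-line fit on budget alone would explain. A small sketch, where the toy budget and gross vectors are invented purely to show what cor() computes:

```r
# Correlation reported in the talk (Production Budget vs. Worldwide Gross).
r <- 0.7359111

# Share of variance in gross that a linear fit on budget alone would explain.
r_squared <- r^2
round(r_squared, 2)  # 0.54

# Toy numbers (hypothetical): Pearson r is covariance scaled by both spreads.
budget <- c(10, 50, 200, 80)
gross <- c(100, 145, 300, 60)
all.equal(cor(budget, gross), cov(budget, gross) / (sd(budget) * sd(gross)))
```

So budget alone leaves a little under half of the variation in worldwide gross unexplained, at least in a linear sense.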
Analysis on the Analysis
Total Hours Spent - 4
Total Number of Commands - 497
Total Number of Packages Used - 12
RCurl
lubridate
ggplot2
h2o
readr
dplyr
rvest
tidyr
caret
shiny
stringr
purrr
Total Number of Plots Generated - 25
Total Data Downloaded - 3.4 GB
Total Number of Models Trained - 8
REMEMBER THIS GUY?!
Thanks for Your Time!