data acquisition project

20
Understanding Movie Success Piyush Bhargava Jason Helgren Abhishek Singh

Upload: abhishek-singh

Post on 16-Apr-2017

156 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Data Acquisition Project

UnderstandingMovie Success

Piyush BhargavaJason HelgrenAbhishek Singh

Page 2: Data Acquisition Project

The QuestionsHow do we calibrate a movie’s success? What are the indicators?

How do other variables correlate with these indicators?

Other interesting insights?

Page 3: Data Acquisition Project

MethodologyScrape 13,000 movie URLs from Box Office Mojo using Python

Randomly select 5% and download JSON blobs from OMDB

Clean data and check for consistency. Export to PostgreSQL

Page 4: Data Acquisition Project

MethodologyScrape 13,000 movie URLs from Box Office Mojo using Python

Randomly select 5% and download JSON blobs from OMDB

Clean data and check for consistency. Export to PostgreSQL

Problems along the way:● Couldn’t get an API key● Lots of missing values● Not all URLs had JSON files

Page 5: Data Acquisition Project

MethodologyScrape 13,000 movie URLs from Box Office Mojo using Python

Randomly select 5% and download JSON blobs from OMDB

Clean data and check for consistency. Export to PostgreSQL

Problems along the way:● Couldn’t get an API key● Lots of missing values● Not all URLs had JSON files

Final Result: database of about 500 movies

Page 6: Data Acquisition Project

ER Diagram

Page 7: Data Acquisition Project

ER Diagram

Page 8: Data Acquisition Project

Sample Query: User vs Critic RatingSELECT AVG(metascore) AS critic_avg,

STDDEV_POP(metascore) AS critic_std_dev,

AVG(imdb_rating) AS imdb_avg,

STDDEV_POP(imdb_rating) AS imdb_std_dev,

CORR(metascore, imdb_rating) AS correlation

FROM movies;

Page 9: Data Acquisition Project

User vs Critic Rating

Page 10: Data Acquisition Project

Sample query: Summer blockbuster effect?SELECT sum(1) AS count, month, genre FROM

(SELECT lhs.movie, lhs.month, rhs.genre FROM

(SELECT movie, month FROM movies) AS lhs

LEFT JOIN

(SELECT movie, genre from genre) AS rhs

USING(movie)) AS iq WHERE genre

IN ('Action', 'Comedy', 'Drama', 'Romance', 'Adventure', 'Crime')

GROUP BY month, genre ORDER BY month;

Page 11: Data Acquisition Project

Sample Query: Summer Blockbuster Effect?SELECT sum(1) AS count, month, genre FROM

(SELECT lhs.movie, lhs.month, rhs.genre FROM

(SELECT movie, month FROM movies) AS lhs

LEFT JOIN

(SELECT movie, genre FROM genres) AS rhs

USING(movie)) AS iq WHERE genre

IN ('Action', 'Comedy', 'Drama', 'Romance', 'Adventure', 'Crime')

GROUP BY month, genre ORDER BY month;

1

Page 12: Data Acquisition Project

Sample query: Summer blockbuster effect?SELECT sum(1) AS count, month, genre FROM

(SELECT lhs.movie, lhs.month, rhs.genre FROM

(SELECT movie, month FROM movies) AS lhs

LEFT JOIN

(SELECT movie, genre FROM genres) AS rhs

USING(movie)) AS iq WHERE genre

IN ('Action', 'Comedy', 'Drama', 'Romance', 'Adventure', 'Crime')

GROUP BY month, genre ORDER BY month;

1

2

Page 13: Data Acquisition Project

Sample query: Summer blockbuster effect?SELECT sum(1) AS count, month, genre FROM

(SELECT lhs.movie, lhs.month, rhs.genre FROM

(SELECT movie, month FROM movies) AS lhs

LEFT JOIN

(SELECT movie, genre FROM genres) AS rhs

USING(movie)) AS iq WHERE genre

IN ('Action', 'Comedy', 'Drama', 'Romance', 'Adventure', 'Crime')

GROUP BY month, genre ORDER BY month;

1

2

3

Page 14: Data Acquisition Project

Summer Blockbuster Effect

Page 15: Data Acquisition Project

Summer Blockbuster Effect

Page 16: Data Acquisition Project

Most Common Genres and MPAA Rating

Page 17: Data Acquisition Project

Awards/Nominations Versus Votes

*Major Win - Oscar/BAFTA/Golden Globe Wins

ρ = 0.65 ρ = 0.53 Nominations Wins Major Wins

Page 18: Data Acquisition Project

Awards/Nominations vs other variables

*Major Win/Nomination - Oscar/BAFTA/Golden Globe Wins/Nominations

Page 19: Data Acquisition Project

Revenue Vs IMDB Rating● Higher rated movies earn more worldwide

than in US. ● Worldwide movies are potentially influenced

by IMDB rating/number of votes.

Page 20: Data Acquisition Project

Takeaways and Next Steps

● Scrape more data to get a normal distribution!

● Build a model to predict movie success.