New York Metro Data Analysis

Post on 11-Dec-2015


DESCRIPTION

Kevin Hung. Statistical Hypothesis Test and Machine Learning with Linear Regression to analyze and predict New York City Metro Ridership. Udacity Intro to Data Science (ud359) Final Project.

TRANSCRIPT

Analyzing New York Metro

Image credits

"NYC subway-4D" by CountZ at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:NYC_subway-

4D.svg#/media/File:NYC_subway-4D.svg

https://c1.staticflickr.com/3/2188/2363525822_9285c44cd7_b.jpg

*proud member of

Udacity Intro to Data Science Final Project

kevhung11@gmail.com

© Kevin Hung 2015

by Kevin Hung


Question of Interest | How to Model Hourly Ridership Entries?

Image credits

http://i.dailymail.co.uk/i/pix/2012/11/05/article-2227401-15DC2645000005DC-20_634x407.jpg

sec3: Clue #1 | Hourly Schedule

• Do people follow a predictable timetable or itinerary in ridership?

• Peak hours seem intuitive for plausible reasons

sec3: Clue #2 | Work Week

• Whopping 10 Million Difference in our subset!

• Found a Great Feature
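One way such a work-week feature could be derived is sketched below. It assumes each record carries a calendar date; the function name `is_workday` and the sample dates are hypothetical and are not taken from the project.

```python
from datetime import date

def is_workday(day: date) -> int:
    """Return 1 for Monday through Friday, 0 for Saturday or Sunday."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return 1 if day.weekday() < 5 else 0

# Hypothetical sample dates, used only for illustration.
sample_days = [date(2011, 5, 1), date(2011, 5, 2), date(2011, 5, 7)]
workday_flags = [is_workday(d) for d in sample_days]
```

This sketch ignores public holidays, which would need a separate lookup if they matter for ridership.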

sec3: Clue #3 | Is it Raining?

• Do people tend to ride less on rainy days, or does the difference arise simply because our particular dataset has more non-rainy days than rainy ones?

• Next: Let’s test this!
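The concern above can be illustrated with a small sketch: total entries depend on how many observations fall in each group, while the mean per observation does not. The arrays below are hypothetical toy values, not the MTA data.

```python
import numpy as np

# Hypothetical toy data: hourly entries and a rain flag (1 = rainy).
entries = np.array([900, 1100, 1000, 1050, 950, 1000, 980, 1020])
rain = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Totals are dominated by group size: six non-rainy rows vs. two rainy rows.
total_dry = entries[rain == 0].sum()
total_wet = entries[rain == 1].sum()

# Means per observation remove the effect of unequal group sizes.
mean_dry = entries[rain == 0].mean()
mean_wet = entries[rain == 1].mean()
```

In this toy case the totals differ by a factor of three while the means are identical, which is why a formal test on the two distributions is the natural next step.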

sec1 | Statistical Test

Q: Why use a statistical significance test?

A1: Draw valid inferences!

A2: Formal framework to compare & evaluate data

A3: Tell us whether perceived effects in the sample reflect the population as a whole

Single-Sided Mann-Whitney U-Test[1]: Are 2 populations the same?

H0: “The distributions of rainy and non-rainy ridership populations are equal!”

HA: “No! Ridership of one population tends to be bigger than the other”

sec1 | Mann-Whitney U-Test Result

Reject the Null Hypothesis!

H0: “The distributions of rainy and non-rainy ridership populations are equal!”

HA: “No! Ridership of one population tends to be bigger than the other”

Rain may be a good feature…
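A test of this kind might be run roughly as follows, using `scipy.stats.mannwhitneyu`. The samples are synthetic stand-ins, and the direction of the one-sided alternative and the 0.05 significance level are assumptions, since the slides state neither.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic, right-skewed samples standing in for hourly entries.
# These are NOT the MTA figures; the scales and sizes are made up.
rng = np.random.default_rng(0)
non_rainy = rng.exponential(scale=1100.0, size=500)
rainy = rng.exponential(scale=1000.0, size=300)

# One-sided alternative (an assumption): non-rainy entries tend to be
# larger than rainy entries.
u_stat, p_value = mannwhitneyu(non_rainy, rainy, alternative="greater")

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
```

The Mann-Whitney U-test does not assume normally distributed data, which suits skewed quantities such as hourly entry counts.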

sec2 | Building Our Model

We’ll use the Normal Equation to Find our Solution!

Easy as 1-2-3

Design (Data → Features → Matrix)

Target (Ridership Entries as Integer Vector)

Parameters (Solution Vector that Minimizes Squared Error)
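The three steps above can be sketched with NumPy. The feature set here (intercept, hour, weekday flag) and the parameter values are hypothetical and much smaller than the project's actual design matrix; the target is built from known parameters only so the solution can be checked.

```python
import numpy as np

# 1. Design: hypothetical features -> matrix (intercept, hour, weekday flag).
hours = np.tile(np.arange(24.0), 2)
weekday = np.repeat([0.0, 1.0], 24)
X = np.column_stack([np.ones_like(hours), hours, weekday])

# 2. Target: hypothetical ridership vector generated from known parameters.
true_theta = np.array([100.0, 20.0, 300.0])
y = X @ true_theta

# 3. Parameters: normal equation, theta = (X^T X)^(-1) X^T y.
# Solving the linear system avoids forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

When the columns of the design matrix are nearly collinear, `np.linalg.lstsq(X, y, rcond=None)` is usually a more numerically stable way to obtain the same least-squares solution.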

sec2 | Linear Regression: Our Model

Coefficient of Determination

Interpretation

• ~53% of the variation in ridership entries is explained by our model
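For reference, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the sample values below are illustrative only.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Illustrative values, not project results.
actual = [100.0, 200.0, 300.0, 400.0]
perfect = r_squared(actual, actual)        # predictions match exactly
baseline = r_squared(actual, [250.0] * 4)  # always predicting the mean
```

A model that always predicts the mean scores 0 and a perfect fit scores 1, so a value near 0.53 sits roughly halfway between those two reference points.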

sec2 | Model Appropriateness

• Residual plots show that our model often underpredicts ridership for entries of 2000+

• Using Hour, Weekday, UNIT, Rain may not be adequate!

• Suggestions: High Bias Model → Find more Features
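The underprediction pattern can be checked numerically as well as visually by comparing mean residuals below and above a threshold. The sketch below uses made-up values and a deliberately biased stand-in for the model's predictions; the helper name `mean_residual_by_range` is hypothetical.

```python
import numpy as np

def mean_residual_by_range(actual, predicted, threshold):
    """Return mean residual (actual - predicted) below and at/above threshold."""
    residuals = actual - predicted
    high = actual >= threshold
    return residuals[~high].mean(), residuals[high].mean()

# Made-up actual entries and a stand-in model that shrinks toward the middle.
actual = np.array([200.0, 600.0, 1000.0, 1400.0, 2500.0, 4000.0])
predicted = 0.5 * actual + 500.0

low_mean, high_mean = mean_residual_by_range(actual, predicted, threshold=2000)
```

A clearly positive mean residual in the high range means the actual values exceed the predictions there, which is the signature of underprediction described above.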

sec4 | Conclusion

• Mann-Whitney U-Test & Paired Histogram Show Possibility of People Tending to Ride the Metro More on Non-Rainy Days

• Rainy Feature Contributes a 1% Increase in R²

• Need More Features: Incorporate Weather Factors? Foggy? Thunder? Temperature Values?

Image credits

http://pix.avaxnews.com/avaxnews/6a/ab/0001ab6a_medium.jpeg

sec0 | References

• [1] "Mann–Whitney U Test." Wikipedia. Wikimedia Foundation, n.d. Web.

• [2] "CS229 Lecture notes." Andrew Ng. http://cs229.stanford.edu/notes/cs229notes1.pdf

• [3] Frost, Jim. "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" Minitab, 30 May 2013. Web. 15 Sept. 2015.

• [4] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm

• [5] "GraphPad Curve Fitting Guide." GraphPad, n.d. Web. 16 Sept. 2015. <http://www.graphpad.com/guides/prism/6/curvefitting/index.htm?reg_analysischeck_linearreg.htm>

* data sources: <http://web.mta.info/developers/developer-data-terms.html#data>
