New York Metro Data Analysis
Kevin Hung
Statistical Hypothesis Test and Machine Learning with Linear Regression to analyze and predict New York City Metro Ridership
Udacity Intro to Data Science ud359 Final Project
Analyzing New York Metro
Image credits
"NYC subway-4D" by CountZ at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:NYC_subway-4D.svg#/media/File:NYC_subway-4D.svg
https://c1.staticflickr.com/3/2188/2363525822_9285c44cd7_b.jpg
*proud member of
Udacity Intro to Data Science Final Project
© Kevin Hung 2015
by Kevin Hung
Question of Interest | How to Model Hourly Ridership Entries?
Image credits
http://i.dailymail.co.uk/i/pix/2012/11/05/article-2227401-15DC2645000005DC-20_634x407.jpg
sec3: Clue #1 | Hourly Schedule
• Do people follow a predictable timetable or itinerary in ridership?
• Peak hours seem intuitive for plausible reasons
sec3: Clue #2 | Work Week
• Whopping 10 Million Difference in our subset!
• Found a Great Feature
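Both schedule clues can be read off a single entry timestamp. The sketch below is one possible way to derive them; the helper name `time_features` and the example timestamps are made up for illustration and are not taken from the project.

```python
from datetime import datetime


def time_features(timestamp: datetime) -> dict:
    """Derive the two schedule clues from one timestamp (hypothetical helper)."""
    return {
        "hour": timestamp.hour,                      # 0-23, for the hourly schedule clue
        "is_weekday": int(timestamp.weekday() < 5),  # Mon-Fri -> 1, Sat-Sun -> 0
    }


# Made-up timestamps, for illustration only
monday_rush = time_features(datetime(2011, 5, 2, 8, 0))
saturday_night = time_features(datetime(2011, 5, 7, 20, 0))
print(monday_rush, saturday_night)
```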
sec3: Clue #3 | Is it Raining?
• Do people tend to ride less on rainy days, or does it only look that way because our particular dataset has more non-rainy days than rainy ones?
• Next: Let’s test this!
sec1 | Statistical Test
Q: Why use a statistical significance test?
A1: Draw valid inferences!
A2: Formal framework to compare & evaluate data
A3: Tell us whether perceived effects reflect the population as a whole
Single-Sided Mann-Whitney U-Test[1]: Are 2 populations the same?
H0: “The distributions of rainy and non-rainy ridership populations are equal!”
HA: “No! Ridership of one population tends to be bigger than the other”
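To make the test concrete, here is a minimal sketch of the U statistic with a normal-approximation p-value. It is a simplified illustration, not the project's code: it skips tie and continuity corrections, the function name `mann_whitney_u` is hypothetical, and the ridership numbers are invented. In practice a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value via normal approximation).

    Simplified sketch: no tie correction and no continuity correction.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # U counts the pairs in which the value from sample_a is the larger one
    # (ties count as half a pair).
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / std_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_two_sided


# Invented hourly entry counts, for illustration only
rainy = [120, 340, 90, 410, 150]
non_rainy = [200, 510, 380, 620, 450]
u, p = mann_whitney_u(non_rainy, rainy)
print(u, p)
```

The function returns a two-sided p-value; when the observed direction matches the alternative, the one-sided value is half of it. With samples this small the normal approximation is rough, so the numbers only illustrate the mechanics.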
sec1 | Mann-Whitney U-Test Result
⇒ Reject the Null Hypothesis!
H0: “The distributions of rainy and non-rainy ridership populations are equal!”
HA: “No! Ridership of one population tends to be bigger than the other”
Rain may be a good feature…
sec2 | Building Our Model
We’ll use the Normal Equation to Find our Solution!
Easy as 1-2-3:
1. Design (Data → Features → Matrix)
2. Target (Ridership Entries as Integer Vector)
3. Parameters (Solution Vector that Minimizes Squared Error)
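The three steps map onto the normal equation, which solves (XᵀX)θ = Xᵀy for the parameter vector θ. The sketch below uses NumPy on a tiny made-up design matrix (an intercept column plus one feature); the function name and the data are assumptions for illustration, not the project's actual features.

```python
import numpy as np


def normal_equation(design, target):
    """Solve (X^T X) theta = X^T y for theta, the least-squares parameters."""
    X = np.asarray(design, dtype=float)
    y = np.asarray(target, dtype=float)
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)


# 1. Design: made-up matrix with an intercept column and one feature
X = [[1, 0], [1, 1], [1, 2]]
# 2. Target: made-up ridership entries
y = [1, 3, 5]
# 3. Parameters: the solution vector that minimizes squared error
theta = normal_equation(X, y)
print(theta)
```

If the design matrix contains redundant columns (for example, a full set of dummy variables plus an intercept), XᵀX becomes singular and `np.linalg.solve` raises an error; `np.linalg.lstsq` is one alternative in that case.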
sec2 | Linear Regression: Our Model
Coefficient of Determination
Interpretation
• ~53% of the variation in ridership entries is explained by our model
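For reference, the coefficient of determination is R² = 1 − SS_res / SS_tot. A minimal sketch of the calculation follows; the function name `r_squared` and the numbers are invented for illustration and do not reproduce the 53% figure.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented values, for illustration only
actual = [100, 200, 300, 400]
predicted = [150, 150, 350, 350]
r2 = r_squared(actual, predicted)
print(r2)
```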
sec2 | Model Appropriateness
• Residual Plots Show that our model often underpredicts ridership when entries are 2000+
• Using Hour, Weekday, UNIT, and Rain alone may not be adequate!
• Suggestion: High Bias Model → Find more Features
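One rough way to check for this kind of underprediction without a plot is to compare the mean residual (actual minus predicted) below and above the 2000-entry mark. The helper below is a hypothetical sketch with invented numbers, not the analysis behind the slides.

```python
def mean_residual_by_bucket(actual, predicted, threshold=2000):
    """Return (mean residual below threshold, mean residual at or above it).

    Residual = actual - predicted, so a positive mean signals underprediction.
    """
    low, high = [], []
    for a, p in zip(actual, predicted):
        (high if a >= threshold else low).append(a - p)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(low), mean(high)


# Invented values, for illustration only
actual = [100, 500, 1500, 2500, 4000]
predicted = [150, 450, 1400, 1800, 2600]
low_mean, high_mean = mean_residual_by_bucket(actual, predicted)
print(low_mean, high_mean)
```

A large positive mean in the upper bucket alongside a near-zero mean in the lower bucket would be consistent with the pattern described above.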
sec4 | Conclusion
• Mann-Whitney U-Test & Paired Histogram Show Possibility of People Tending to Ride the Metro More on Non-Rainy Days
• Rainy Feature Contributes to 1% Increase in R²
• Need More Features: Incorporate Weather Factors? Foggy? Thunder? Temperature Values?
Image credits
http://pix.avaxnews.com/avaxnews/6a/ab/0001ab6a_medium.jpeg
sec0 | References
• [1] "Mann–Whitney U Test." Wikipedia. Wikimedia Foundation, n.d. Web.
• [2] Ng, Andrew. "CS229 Lecture Notes." http://cs229.stanford.edu/notes/cs229notes1.pdf
• [3] Frost, Jim. "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" Minitab, 30 May 2013. Web. 15 Sept. 2015.
• [4] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
• [5] "GraphPad Curve Fitting Guide." GraphPad, n.d. Web. 16 Sept. 2015. <http://www.graphpad.com/guides/prism/6/curvefitting/index.htm?reg_analysischeck_linearreg.htm>
* Data sources: <http://web.mta.info/developers/developer-data-terms.html#data>