New York Metro Data Analysis
![Page 1: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/1.jpg)
Analyzing New York Metro
Image credits
"NYC subway-4D" by CountZ at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:NYC_subway-4D.svg#/media/File:NYC_subway-4D.svg
https://c1.staticflickr.com/3/2188/2363525822_9285c44cd7_b.jpg
*proud member of
Udacity Intro to Data Science Final Project
© Kevin Hung 2015
by Kevin Hung
![Page 2: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/2.jpg)
Question of Interest | How to Model Hourly Ridership Entries?
Image credits
http://i.dailymail.co.uk/i/pix/2012/11/05/article-2227401-15DC2645000005DC-20_634x407.jpg
![Page 3: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/3.jpg)
sec3: Clue #1 | Hourly Schedule
• Do people follow a predictable timetable or itinerary in ridership?
• Peak hours seem intuitive and plausibly line up with commuting times
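One way to check this clue is to total entries by hour of day and look for the busiest hour. This is a minimal sketch, assuming each record is an `(hour, entries)` pair; the function name and the sample values are made up for illustration and are not taken from the MTA data.

```python
from collections import defaultdict


def entries_by_hour(records):
    """Sum entries for each hour of the day, returned in hour order."""
    totals = defaultdict(int)
    for hour, entries in records:
        totals[hour] += entries
    return dict(sorted(totals.items()))


# Hypothetical (hour, entries) records, for illustration only.
sample = [(8, 1200), (8, 900), (12, 400), (17, 1500), (17, 1100), (3, 20)]
hourly = entries_by_hour(sample)

# The hour with the largest total is a candidate peak hour.
peak_hour = max(hourly, key=hourly.get)
```

If the totals cluster around a few hours, the hour of day is worth keeping as a model feature.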
![Page 4: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/4.jpg)
sec3: Clue #2 | Work Week
• Whopping 10 million difference in entries between weekdays and weekends in our subset!
• Found a Great Feature
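A weekday flag like this can be derived from the calendar date. The sketch below assumes records arrive as `(date, entries)` pairs; the helper names and the sample dates and counts are hypothetical.

```python
from datetime import date


def is_weekday(day):
    """Return 1 for Monday through Friday, 0 for Saturday and Sunday."""
    return 1 if day.weekday() < 5 else 0


def totals_by_weekday_flag(records):
    """Sum entries separately for weekend (key 0) and weekday (key 1) records."""
    totals = {0: 0, 1: 0}
    for day, entries in records:
        totals[is_weekday(day)] += entries
    return totals


# Hypothetical records, for illustration only.
sample = [
    (date(2011, 5, 2), 1000),  # Monday
    (date(2011, 5, 6), 1200),  # Friday
    (date(2011, 5, 7), 300),   # Saturday
    (date(2011, 5, 8), 250),   # Sunday
]
totals = totals_by_weekday_flag(sample)
difference = totals[1] - totals[0]
```

Because there are five weekdays and only two weekend days, comparing per-day averages alongside the raw totals gives a fairer picture.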
![Page 5: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/5.jpg)
sec3: Clue #3 | Is it Raining?
• Do people ride less on rainy days, or does the gap only reflect that our particular dataset has more non-rainy days than rainy ones?
• Next: Let’s test this!
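Before running a formal test, it helps to separate the two explanations by comparing per-record averages as well as totals. A small sketch with made-up numbers, chosen so that the totals differ only because one group has more records:

```python
from statistics import mean


def summarize(entries):
    """Return the record count, total, and mean of a list of hourly entries."""
    return {"count": len(entries), "total": sum(entries), "mean": mean(entries)}


# Hypothetical hourly entries, for illustration only.
rainy = [900, 1100, 1000]
non_rainy = [950, 1050, 1000, 1000, 900, 1100]

rainy_summary = summarize(rainy)
dry_summary = summarize(non_rainy)
```

Here the non-rainy total is twice the rainy total, yet both means are equal, so the gap in totals says nothing about rider behavior.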
![Page 6: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/6.jpg)
sec1 | Statistical Test
Q: Why use a statistical significance test?
A1: Draw valid inferences!
A2: Formal framework to compare & evaluate data
A3: Tells us whether effects perceived in the sample reflect the population as a whole
One-Sided Mann-Whitney U-Test [1]: Are the 2 populations the same?
H0: “The distributions of rainy and non-rainy ridership populations are equal!”
HA: “No! Ridership of one population tends to be bigger than the other”
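The test statistic can be sketched in plain Python to show what is being computed. This is an illustrative implementation, not the one used in the project: it ranks the pooled samples (ties share their average rank), computes U for the first sample, and uses the large-sample normal approximation without tie or continuity correction. The sample values are made up. In practice a library routine such as `scipy.stats.mannwhitneyu` would typically be used.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, z, one-sided p) for H_A: x tends to be larger than y.

    Normal approximation, no tie or continuity correction, so this is
    only reasonable for large samples with few ties.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mu) / sigma
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return u_x, z, p_one_sided


# Hypothetical samples, for illustration only.
u_x, z, p = mann_whitney_u([5, 6, 7, 8], [1, 2, 3, 4])
```

A small one-sided p-value is evidence against H0 in the stated direction; swapping the two arguments tests the opposite direction.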
![Page 7: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/7.jpg)
sec1 | Mann-Whitney U-Test Result
⇒ Reject the Null Hypothesis!
H0: “The distributions of rainy and non-rainy ridership populations are equal!”
HA: “No! Ridership of one population tends to be bigger than the other”
Rain may be a good feature…
![Page 8: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/8.jpg)
sec2 | Building Our Model
We’ll use the Normal Equation to Find our Solution!
Easy as 1, 2, 3:
1. Design (Data → Features → Matrix)
2. Target (Ridership Entries as Integer Vector)
3. Parameters (Solution Vector that Minimizes Squared Error)
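With design matrix X and target vector y, the normal equation gives the least-squares parameters as θ = (XᵀX)⁻¹Xᵀy. The sketch below solves the equivalent linear system (XᵀX)θ = Xᵀy by Gauss-Jordan elimination in plain Python, so it runs without extra libraries; a numerical library's least-squares routine would be the usual choice for real data. The feature columns and values are hypothetical.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta by Gauss-Jordan elimination."""
    n_features = len(X[0])
    # Build the augmented system [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n_features)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(n_features)
    ]
    for col in range(n_features):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n_features), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row.
        for r in range(n_features):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n_features] / A[i][i] for i in range(n_features)]


# Hypothetical design matrix: intercept column, hour, rain flag.
X = [[1, 0, 0], [1, 1, 0], [1, 2, 1], [1, 3, 1], [1, 4, 0]]
# Targets generated as 10 + 3*hour + 5*rain, so the fit should recover these.
y = [10, 13, 21, 24, 22]
theta = normal_equation(X, y)
```

Categorical features such as the turnstile unit would first need to be encoded as numeric columns before they can enter the design matrix.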
![Page 9: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/9.jpg)
sec2 | Linear Regression
Our Model
Coefficient of Determination
Interpretation
• ~53% of the variation in ridership entries is explained by our model
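The coefficient of determination compares the model's squared error with that of always predicting the mean: R² = 1 − SS_res / SS_tot. A minimal sketch with made-up values:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, for illustration only.
actual = [1, 2, 3, 4]
predicted = [1.5, 1.5, 3.5, 3.5]
score = r_squared(actual, predicted)
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than predicting the mean for every row.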
![Page 10: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/10.jpg)
sec2 | Model Appropriateness
• Residual plots show that our model often underpredicts ridership when hourly entries exceed 2000
• Using only Hour, Weekday, UNIT, and Rain as features may not be adequate!
• Suggestions: High Bias Model → Find more Features
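The underprediction pattern can also be checked numerically. A residual is the actual value minus the predicted value, so a positive average residual among high-ridership rows means the model predicts too low there. The helper names and numbers below are hypothetical.

```python
def residuals(actual, predicted):
    """Residual for each row: actual minus predicted."""
    return [a - p for a, p in zip(actual, predicted)]


def mean_residual_above(actual, predicted, threshold):
    """Mean residual over rows whose actual value exceeds the threshold."""
    selected = [a - p for a, p in zip(actual, predicted) if a > threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Hypothetical values, for illustration only.
actual = [500, 1500, 2500, 4000]
predicted = [600, 1400, 2000, 2800]

all_residuals = residuals(actual, predicted)
high_mean = mean_residual_above(actual, predicted, 2000)
```

A clearly positive `high_mean` alongside near-zero residuals elsewhere points to a systematic miss at the high end rather than random noise.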
![Page 11: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/11.jpg)
sec4 | Conclusion
• The Mann-Whitney U-test and paired histogram suggest that people may tend to ride the metro more on non-rainy days
• The rain feature contributes a 1% increase in R²
• Need more features: incorporate weather factors? Fog? Thunder? Temperature values?
Image credits
http://pix.avaxnews.com/avaxnews/6a/ab/0001ab6a_medium.jpeg
![Page 12: New York Metro Data Analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062421/563db9fe550346aa9aa1c516/html5/thumbnails/12.jpg)
sec0 | References
• [1] "Mann–Whitney U Test." Wikipedia. Wikimedia Foundation, n.d. Web.
• [2] "CS229 Lecture Notes." Andrew Ng. http://cs229.stanford.edu/notes/cs229notes1.pdf
• [3] Frost, Jim. "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" Minitab, 30 May 2013. Web. 15 Sept. 2015.
• [4] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
• [5] "GraphPad Curve Fitting Guide." GraphPad Curve Fitting Guide. GraphPad, n.d. Web. 16 Sept. 2015. <http://www.graphpad.com/guides/prism/6/curvefitting/index.htm?reg_analysischeck_linearreg.htm>
* data sources: <http://web.mta.info/developers/developer-data-terms.html#data>