mapmycab presentation
Motivation
• Tool for data scientists and cab dispatchers to analyze (by time of day or day of week):
  • cab occupancy
  • miles travelled
  • pickups and drop-offs
• An app for city dwellers to view real-time cab status for unoccupied cabs in a given area
Data Aggregation
Input fields: CabID, Lat, Long, Occ, Timestamp
Processing: MrJob
Aggregate Metrics (per cab): year, month, day, hour, avocc, pickup, drop off
• Drop-off event: occupancy change from 1 to 0
• Pickup event: occupancy change from 0 to 1
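The event definitions above can be sketched in plain Python. This is a minimal illustration, not the MrJob job from the talk: the function name `hourly_metrics`, the `(unix_timestamp, occ)` input shape, the use of UTC, and the choice to compute `avocc` as the mean of the raw readings in each hour are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_metrics(readings):
    """Aggregate one cab's readings into per-hour metrics.

    readings: iterable of (unix_timestamp, occ) with occ in {0, 1}.
    Returns {(year, month, day, hour): {"avocc", "pickup", "dropoff"}}.
    """
    buckets = defaultdict(lambda: {"occ_sum": 0, "n": 0, "pickup": 0, "dropoff": 0})
    prev_occ = None
    for ts, occ in sorted(readings):  # process readings in time order
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        bucket = buckets[(t.year, t.month, t.day, t.hour)]
        bucket["occ_sum"] += occ
        bucket["n"] += 1
        if prev_occ == 0 and occ == 1:
            bucket["pickup"] += 1   # occupancy change 0 -> 1
        elif prev_occ == 1 and occ == 0:
            bucket["dropoff"] += 1  # occupancy change 1 -> 0
        prev_occ = occ
    return {
        key: {
            "avocc": b["occ_sum"] / b["n"],
            "pickup": b["pickup"],
            "dropoff": b["dropoff"],
        }
        for key, b in buckets.items()
    }
```

Each event is counted in the hour of the later reading, which is one possible convention; the talk does not say which hour an event that straddles a boundary belongs to.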
Computing Trip Durations and Shift Times
• Used a windowing function in Hive to calculate idle times
• The maximum idle time in a day points to a potential shift
• 1 million trips
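The idle-time idea can be illustrated in Python as a rough analogue of a lag-style window function. It is not the Hive query used in the talk: the names `idle_times` and `potential_shift` and the `(pickup_ts, dropoff_ts)` trip tuples are hypothetical, and the input is assumed to hold one cab's trips for one day.

```python
def idle_times(trips):
    """Return (idle_seconds, idle_start_ts) for each gap between consecutive trips.

    trips: list of (pickup_ts, dropoff_ts) for one cab, in seconds.
    """
    ordered = sorted(trips)  # order by pickup time
    gaps = []
    # Pair each trip with the next one, as a LAG/LEAD window function would.
    for (_, prev_dropoff), (next_pickup, _) in zip(ordered, ordered[1:]):
        gaps.append((next_pickup - prev_dropoff, prev_dropoff))
    return gaps


def potential_shift(trips):
    """Return the longest idle gap as (idle_seconds, idle_start_ts), or None."""
    gaps = idle_times(trips)
    return max(gaps) if gaps else None
```

For example, three trips with gaps of 300 s and 28,500 s would flag the 28,500 s gap (about 7.9 hours) as the potential shift break.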
Table columns: tripId, hour, idle (s), idle (h)
[Figure: Occupancy Profile. x-axis: hour (0 to 23); y-axis: occ (%), ticks from 0 to 0.7; annotations: idle/shift time (hours), potential shift time]
Tables
• Hourly data organized by day of week
• Aggregate metrics stored in the same table for fast retrieval
Hourly Aggregates by Day of Week
Row key: y_m_dow (for example, 2008_01_Mon)
Columns c:0, c:1, c:2, c:3, c:4, … c:23 (one per hour): pickups, dropoffs, avg_occ, avg_dist
Column c:Totals: sum(pickups), sum(drop offs), avg(occ), avg(dist)
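The row layout can be sketched as a plain Python dictionary, without an HBase client. The helper names `row_key` and `build_row` are hypothetical, and taking c:Totals averages as unweighted means of the hourly averages is an assumption; the slides do not specify the weighting.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def row_key(ts):
    """Build a y_m_dow row key such as '2008_01_Mon' from a datetime."""
    return f"{ts.year}_{ts.month:02d}_{DAYS[ts.weekday()]}"


def build_row(hourly):
    """Map {hour: metrics} to column qualifiers c:0 .. c:23 plus c:Totals.

    metrics: dict with pickups, dropoffs, avg_occ, avg_dist.
    """
    row = {f"c:{hour}": metrics for hour, metrics in hourly.items()}
    n = len(hourly)
    row["c:Totals"] = {
        "pickups": sum(m["pickups"] for m in hourly.values()),
        "dropoffs": sum(m["dropoffs"] for m in hourly.values()),
        # Assumption: unweighted mean of the hourly averages.
        "avg_occ": sum(m["avg_occ"] for m in hourly.values()) / n,
        "avg_dist": sum(m["avg_dist"] for m in hourly.values()) / n,
    }
    return row
```

Keeping the totals in the same row as the hourly cells means one row read returns both, which matches the "fast retrieval" point above.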
Takeaways
• HBase row-level atomicity can be leveraged for transactional operations
• A keyed producer in Kafka ensures in-order delivery of messages (by key)
• Integrating tools with simple operations first, then adding complexity incrementally, streamlines the development process
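The keyed-producer point can be illustrated with a small simulation that needs no Kafka client. Messages with the same key always hash to the same partition, and a partition preserves append order, so per-key order is kept. The function names are hypothetical, and `zlib.crc32` is only a stand-in for the hash a real Kafka partitioner uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from the key; crc32 is a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Append (key, value) messages to partitions in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions
```

Using the cab ID as the key would, under these assumptions, keep each cab's updates in order within its partition, while giving no ordering guarantee across different cabs.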
About Me
• Previous Life - Senior Energy Analyst (EnerNOC Inc.).
• M.S. Electrical Engineering - North Carolina State University (focus on robotics, control systems and smart grid).
• https://github.com/PreetikaKuls