mapmycab presentation

MapMyCab Preetika Kulshrestha Insight Data Engineering, Feb 2015

Upload: preetika-kulshrestha

Post on 15-Aug-2015


TRANSCRIPT

MapMyCab

Preetika Kulshrestha

Insight Data Engineering, Feb 2015

Motivation

• Tool for data scientists and cab dispatchers to analyze (by time of day or day of week):

  • cab occupancy

  • miles travelled

  • pickups and drop-offs

• An app for city dwellers to view the real-time status of unoccupied cabs in a given area

Demo

Pipeline

Cab Data

Message Broker

Real-Time Streaming

HDFS

HBase UI

MrJob

11 million rows

Data Aggregation

Raw columns: CabID, Lat, Long, Occ, Timestamp

Aggregate Metrics (per cab)

MrJob

Aggregated columns: year, month, day, hour, avocc, pickup, drop off

• Drop-off event: occupancy changes from 1 to 0

• Pickup event: occupancy changes from 0 to 1
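The per-cab hourly aggregation and the two event definitions above can be sketched in plain Python. This is a minimal illustration, not the MrJob job from the talk: the function name `hourly_metrics`, the tuple layout, and the assumptions that `Timestamp` is Unix seconds (interpreted as UTC) and `Occ` is 0 or 1 are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_metrics(records):
    """Aggregate (cab_id, lat, lon, occ, unix_ts) tuples per cab and hour.

    Returns {(cab_id, year, month, day, hour): {"avocc", "pickup", "dropoff"}}.
    Assumes occ is 0 or 1 and unix_ts is in seconds (hypothetical layout).
    """
    # Group readings by cab and sort by time so transitions are well defined.
    by_cab = defaultdict(list)
    for cab_id, _lat, _lon, occ, ts in records:
        by_cab[cab_id].append((ts, occ))

    sums = defaultdict(lambda: {"n": 0, "occ": 0, "pickup": 0, "dropoff": 0})
    for cab_id, readings in by_cab.items():
        readings.sort()
        prev_occ = None
        for ts, occ in readings:
            t = datetime.fromtimestamp(ts, tz=timezone.utc)
            bucket = sums[(cab_id, t.year, t.month, t.day, t.hour)]
            bucket["n"] += 1
            bucket["occ"] += occ
            if prev_occ == 0 and occ == 1:
                bucket["pickup"] += 1    # occupancy changed from 0 to 1
            elif prev_occ == 1 and occ == 0:
                bucket["dropoff"] += 1   # occupancy changed from 1 to 0
            prev_occ = occ

    return {
        key: {"avocc": b["occ"] / b["n"], "pickup": b["pickup"], "dropoff": b["dropoff"]}
        for key, b in sums.items()
    }
```

An event is counted in the hour of the reading at which the new occupancy value is first seen; a real job might attribute it differently.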

Computing Trip Durations and Shift Times

• Used a windowing function in Hive to calculate idle times

• Maximum idle time in a day points to a potential shift

• 1 million trips
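The idle-time idea above can be sketched in Python rather than Hive: pair each trip with the previous one (the role a LAG-style window function would play) and keep the longest gap per day. The function names and the `(start_ts, end_ts)` trip layout in Unix seconds are assumptions for illustration, not the query used in the talk.

```python
from datetime import datetime, timezone


def idle_gaps(trips):
    """trips: list of (start_ts, end_ts) for one cab, in Unix seconds.

    Returns [(idle_start_ts, idle_seconds)], one entry per gap between
    consecutive trips.
    """
    ordered = sorted(trips)
    gaps = []
    # Pair each trip with the one before it, similar in spirit to
    # LAG(end_ts) OVER (ORDER BY start_ts) in a SQL window function.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gaps.append((prev_end, next_start - prev_end))
    return gaps


def longest_idle_per_day(trips):
    """Returns {date: (hour_idle_began, idle_seconds, idle_hours)}.

    Each gap is assigned to the (UTC) day on which it starts; the longest
    gap of a day is the candidate shift change.
    """
    best = {}
    for idle_start, seconds in idle_gaps(trips):
        t = datetime.fromtimestamp(idle_start, tz=timezone.utc)
        day = t.date()
        if day not in best or seconds > best[day][1]:
            best[day] = (t.hour, seconds, seconds / 3600)
    return best
```

The returned tuple mirrors the hour, idle (s), and idle (h) columns shown on the slide.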

[Plot: idle/shift time (hours)]

Columns: tripId, hour, idle (s), idle (h)

Occupancy Profile

[Plot: occ (%) by hour of day (0 to 23), with the potential shift time marked]

Tables

• Hourly data organized by day of week

• Aggregate metrics stored in the same table for fast retrieval

Hourly Aggregates by Day of Week

Row key: y_m_dow (day of week within a year and month, e.g. 2008_01_Mon)

Columns c:0, c:1, c:2, c:3, c:4, …, c:23: attributes for hour 0 through hour 23 (pickups, dropoffs, avg_occ, avg_dist)

Column c:Totals: sum(pickups), sum(drop offs), avg(occ), avg(dist)
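As a rough sketch of the table layout above, the row key and column qualifier for a reading can be derived from its timestamp. A nested dict stands in for the HBase table here; the helper names, the UTC interpretation of timestamps, and the dict-shaped cell value are assumptions, since the slides do not show how cells are encoded.

```python
from datetime import datetime, timezone

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def row_key(ts):
    """Build a y_m_dow row key such as 2008_01_Mon from a Unix timestamp."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{t.year}_{t.month:02d}_{DAYS[t.weekday()]}"


def column_for(ts):
    """Column family 'c' with the hour of day as qualifier, e.g. c:5."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"c:{t.hour}"


def put_hourly(table, ts, metrics):
    """Store one hour's metrics in a nested dict that mimics row -> column -> value."""
    table.setdefault(row_key(ts), {})[column_for(ts)] = metrics
```

Because every hour of a given day of week lives in one row, a whole daily profile, including the totals column, can be fetched with a single row lookup.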

Takeaways

• HBase row-level atomicity can be leveraged for transactional operations

• A keyed producer in Kafka assures in-order delivery of messages (by key)

• Starting with simple operations for tool integration, then adding complexity incrementally, streamlines the development process
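The keyed-producer point can be illustrated without a Kafka client: if the partition is chosen from a hash of the key, every message for one cab lands in the same partition, and a partition is an append-only log, so per-key order is preserved. The sketch below is a toy model under that assumption. `zlib.crc32` is a stand-in hash (Kafka's default partitioner uses a different hash function), and `partition_for` and `send_all` are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from the key. crc32 is a stand-in hash for illustration."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def send_all(messages, num_partitions):
    """Append (key, value) messages to per-partition logs in arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Usage: interleaved updates from two cabs keep their per-cab order.
msgs = [("cab1", 1), ("cab2", 1), ("cab1", 2), ("cab2", 2), ("cab1", 3)]
logs = send_all(msgs, num_partitions=4)
cab1_values = [v for k, v in logs[partition_for("cab1", 4)] if k == "cab1"]
```

Ordering across different keys is not guaranteed in this model, which matches the "by key" qualifier on the slide.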

About Me

• Previous life: Senior Energy Analyst (EnerNOC Inc.)

• M.S. Electrical Engineering, North Carolina State University (focus on robotics, control systems and smart grid)

• https://github.com/PreetikaKuls


Batch Views
