
Page 1: DS504/CS586: Big Data Analyticsusers.wpi.edu/~yli15/courses/DS504Fall18/Slides/BDA-2-Cleaning.pdf · DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua

Welcome to

DS504/CS586: Big Data Analytics

Data Pre-processing and Cleaning

Prof. Yanhua Li

Time: 6:00pm – 8:50pm R

Location: FL 311

Fall 2018

Page 2:

Merging CS586 and DS504

Graded one review

Examples of Reviews/Critiques

Random selection.

Page 3:

Project 1 Proposal Due Tomorrow in 2 pages

to the Canvas discussion forum

updates for the following weeks

Page 4:

Week 3 (9/06 R), Starting date

Week 4 (9/14 F), Proposal (to discussion board)

Week 5 (9/20 R), Methodology (to discussion board)

Week 6 (9/27 R), Results (to discussion board)

Week 7 (10/4 R), Conclusion, intro, and abstract

Week 8 (10/9 T), Final Report (roughly 8 pages) at 11:59pm EST & Self and Cross-evaluation due at 11:59pm EST

Week 8 (10/11 R), In-class Presentation (in teams)

Page 5:

Urban Computing

Tackle the Big challenges in Big cities using Big data!

[Figure: Cities OS, People, The Environment (Win, Win, Win)]

• Urban Sensing & Data Acquisition: Participatory Sensing, Crowd Sensing, Mobile Sensing
  – Data: Traffic, Road Networks, POIs, Air Quality, Human mobility, Meteorology, Social Media, Energy
• Urban Data Management: Spatio-temporal index, streaming, trajectory, and graph data management, ...
• Urban Data Analytics: Data Mining, Machine Learning, Visualization
• Service Providing: Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Urban Computing: Concepts, Methodologies, and Applications. Zheng, Y., et al. ACM Transactions on Intelligent Systems and Technology.

Page 6:

Ocean Biodiversity Informatics, Hamburg

The Data Equation

Oceans of Data

Praia de Forte, Brazil

Page 7:

Data Quality Dimensions

• Accuracy
  – Errors in data
  – Example: "Jhn" vs. "John"
• Currency
  – Lack of updated data
  – Example: Residence (Permanent) Address: outdated vs. up to date
• Consistency
  – Discrepancies in the data
  – Example: ZIP Code and City should be consistent
• Completeness
  – Lack of data
  – Partial knowledge of the records in a table
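The four dimensions can be read as record-level checks. The following is a minimal sketch, not part of the lecture: the record fields, the `KNOWN_NAMES` and `ZIP_TO_CITY` lookups, and the one-year staleness threshold are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical reference data, assumed for illustration only.
KNOWN_NAMES = {"John", "Mary"}
ZIP_TO_CITY = {"01609": "Worcester", "02139": "Cambridge"}
MAX_AGE_DAYS = 365  # assumed threshold for "current" data


def quality_issues(record, today):
    """Return the list of data quality dimensions a record violates."""
    issues = []
    # Completeness: every expected field must be present and non-empty.
    required = ("name", "zip", "city", "updated")
    if any(not record.get(field) for field in required):
        issues.append("completeness")
        return issues  # the remaining checks need all fields
    # Accuracy: the value should match a known valid value.
    if record["name"] not in KNOWN_NAMES:
        issues.append("accuracy")
    # Currency: the record should have been updated recently.
    if (today - record["updated"]).days > MAX_AGE_DAYS:
        issues.append("currency")
    # Consistency: ZIP code and city must agree with each other.
    if ZIP_TO_CITY.get(record["zip"]) != record["city"]:
        issues.append("consistency")
    return issues
```

A record with the name "Jhn" would be flagged under accuracy, mirroring the slide's example, while a record missing any field is reported as incomplete before the other checks run.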

Page 8:

Ocean Biodiversity Informatics, Hamburg

Geographic outliers - GIS

Country, State, named district, etc.

Gazetteer of Brazilian localities

Page 9:

What do we mean by "Data Quality"?

An essential or distinguishing characteristic necessary for data to be fit for use.

SDTS 02/92

The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data.

Chrisman, 1991

Page 10:

Loss of data quality

Loss of data quality can occur at many stages:

• At the time of collection
• During digitisation
• During documentation
• During storage and archiving
• During analysis and manipulation
• At time of presentation
• And through the use to which they are put

"Don't underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor." (Redman 2001)

Page 11:

Data Cleaning

• Data cleaning tasks
  – Accuracy: Smooth out noisy data
  – Currency: Update the records
  – Consistency: Correct inconsistent data
  – Completeness: Fill in missing values
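Two of these tasks are easy to illustrate on a numeric series. The sketch below is one possible choice and not the lecture's method: it fills missing values with the mean of the observed values and smooths noise with a centered moving median. The function names and the window size are assumptions.

```python
def fill_missing(values):
    """Completeness: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def smooth(values, window=3):
    """Accuracy: damp noise with a centered moving median.

    At the edges the window is truncated; for an even-sized window the
    upper of the two middle values is used.
    """
    half = window // 2
    out = []
    for i in range(len(values)):
        neighborhood = sorted(values[max(0, i - half): i + half + 1])
        out.append(neighborhood[len(neighborhood) // 2])
    return out
```

A median is used rather than a mean so that a single spike does not leak into its neighbors; other imputation and smoothing rules would fit the same two tasks.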

Page 12:

Map matching

Page 13:

Map-matching

• Problem: (Sampled data)
  – GPS trajectory = a sequence of GPS locations with time stamps
  – Map a GPS trajectory onto a road network
  – a sequence of GPS points → a sequence of road segments

Page 14:

Spatial Data

[Figure: road network with edges e1, e2, e3, e4; edge e3 is annotated with its endpoints e3.start and e3.end]

Page 15:

Map-Matching

• Why it is important
  – A fundamental step in many transportation applications
    • Navigation and driving
    • Traffic analysis
    • Taxi dispatching and recommendations
  – Examples:
    • Find the vehicles passing Institute Road
    • Calculate the average travel time from WPI campus to MIT campus
    • When will Bus 3 arrive at the stop Highland St & North Ashland St

Page 16:

Map-Matching

• Simple solution for high-sampling-rate data
  – Weighted distance

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
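The slide names a weighted distance without defining the weights, so the sketch below substitutes the simplest geometric variant: each GPS point is matched to the road segment with the smallest point-to-segment distance. The `Segment` class, the function names, and the use of planar (x, y) coordinates are assumptions; real latitude/longitude data would first need a map projection.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A straight road segment in planar (x, y) coordinates, e.g. meters."""
    seg_id: str
    start: tuple
    end: tuple


def point_to_segment_distance(p, seg):
    """Shortest Euclidean distance from point p to the segment."""
    (px, py), (ax, ay), (bx, by) = p, seg.start, seg.end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a single point
        return math.hypot(px - ax, py - ay)
    # Projection parameter, clamped so the foot stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_nearest(trajectory, segments):
    """Map each GPS point to the id of its closest road segment."""
    return [
        min(segments, key=lambda s: point_to_segment_distance(p, s)).seg_id
        for p in trajectory
    ]
```

Because every point is matched independently, this only behaves well when samples are dense; with sparse samples, nearby parallel roads become ambiguous.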

Page 17:

Map-Matching (for low sampling rate)

• Why difficult

[Figure: ambiguous matches, marked "?", in three cases: (a) Parallel roads, (b) Overpass, (c) Spur]

Page 18:

Map-Matching

• According to the additional information used
  – Geometric
  – Topological
  – Probabilistic
  – Advanced techniques
• According to the range of sampling points
  – Local/incremental
  – Global

Yu Zheng. Trajectory Data Mining: An Overview. ACM Transactions on Intelligent Systems and Technology, 6, 3, 2015.

Page 19:

Map-matching

• Insights
  – Consider both local and global information
  – Incorporate both spatial and temporal features

[Figure: example GPS points pa, pb, pc]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 20:

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 21:

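Map-matching

• Solution (incorporating spatial information)
  – (Observation Probability) Model local possibility

$$N(c_i^j) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i^j - \mu)^2}{2\sigma^2}}$$

  – (Transmission Probability) Considering context (global)

$$V(c_{i-1}^t \to c_i^s) = \frac{d_{i-1 \to i}}{w_{(i-1,t) \to (i,s)}}$$

  – Spatial analysis function

$$F_s(c_{i-1}^t \to c_i^s) = V(c_{i-1}^t \to c_i^s) \cdot N(c_i^s)$$

As defined in the cited paper, $c_i^j$ is the $j$-th candidate point of GPS point $p_i$, $x_i^j$ is the distance from $p_i$ to $c_i^j$, $d_{i-1 \to i}$ is the Euclidean distance between $p_{i-1}$ and $p_i$, and $w_{(i-1,t) \to (i,s)}$ is the length of the shortest path from $c_{i-1}^t$ to $c_i^s$.

[Figure: candidate points of a GPS sample on nearby road segments, shown together with the neighboring samples]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009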
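The observation probability, transmission probability, and spatial analysis function on this slide translate almost directly into code. The sketch below is illustrative only: the function names are hypothetical, the zero mean and 20 m standard deviation are assumed parameter values, and the distances are assumed to be precomputed in meters.

```python
import math

SIGMA = 20.0  # assumed standard deviation of GPS error, in meters
MU = 0.0      # assumed zero-mean error


def observation_probability(dist_to_candidate, mu=MU, sigma=SIGMA):
    """N(c): Gaussian density of the point-to-candidate distance."""
    z = (dist_to_candidate - mu) ** 2 / (2 * sigma ** 2)
    return math.exp(-z) / (math.sqrt(2 * math.pi) * sigma)


def transmission_probability(straight_dist, shortest_path_len):
    """V: straight-line distance between samples over shortest-path length."""
    if shortest_path_len == 0:
        return 1.0  # assumption: both candidates coincide
    return straight_dist / shortest_path_len


def spatial_score(dist_to_candidate, straight_dist, shortest_path_len):
    """F_s = V * N for one candidate-to-candidate transition."""
    return (transmission_probability(straight_dist, shortest_path_len)
            * observation_probability(dist_to_candidate))
```

A candidate far from the GPS point gets a small N, and a transition that requires a long detour on the road network gets a small V, so either one lowers the spatial score.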

Page 22:

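Map-matching

• Solution (Cosine Similarity)
• Temporal analysis function (Considering temporal information)
• Shortest path is used.

[Figure: consecutive GPS points P_{i-1} and P_i close to both a highway and a service road]

$$F_t(c_{i-1}^t \to c_i^s) = \frac{\sum_{u=1}^{k} \left(e'_u.v \times \bar{v}_{(i-1,t) \to (i,s)}\right)}{\sqrt{\sum_{u=1}^{k} (e'_u.v)^2} \times \sqrt{\sum_{u=1}^{k} \bar{v}_{(i-1,t) \to (i,s)}^2}}$$

$$\bar{v}_{(i-1,t) \to (i,s)} = \frac{\sum_{u=1}^{k} l_u}{\Delta t_{i-1 \to i}}$$

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009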
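The temporal analysis function is a cosine similarity between the speed constraints of the road segments on the shortest path and the average speed over that path. The sketch below follows the slide's formula term by term; the function name and argument layout are assumptions, with lengths, speeds, and time expected in mutually consistent units.

```python
import math


def temporal_score(segment_lengths, speed_limits, delta_t):
    """F_t: cosine similarity between speed limits and the average speed.

    segment_lengths: lengths l_u of the k segments on the shortest path
    speed_limits:    speed constraints e'_u.v of the same k segments
    delta_t:         time elapsed between the two GPS samples
    """
    if delta_t <= 0:
        raise ValueError("delta_t must be positive")
    if len(segment_lengths) != len(speed_limits):
        raise ValueError("one speed limit is required per segment")
    avg_speed = sum(segment_lengths) / delta_t
    k = len(speed_limits)
    numerator = sum(v * avg_speed for v in speed_limits)
    denominator = (math.sqrt(sum(v * v for v in speed_limits))
                   * math.sqrt(k * avg_speed ** 2))
    # A zero denominator (no movement or no speed data) scores 0.
    return numerator / denominator if denominator else 0.0
```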

Page 23:

Map-matching

• Aggregating
  – Spatial and temporal information
  – Local and global information
• Dynamic programming
• Spatio-temporal function

$$F(c_{i-1}^t \to c_i^s) = F_s(c_{i-1}^t \to c_i^s) \cdot F_t(c_{i-1}^t \to c_i^s), \quad 2 \le i \le n$$

[Figure: candidate graph with one column per GPS point (P1's candidates, P2's candidates, ..., Pn's candidates); edges link candidates of consecutive points]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 24:

Map-matching

• Path Selection
• Dynamic programming

$$P = \arg\max_{P_c} F(P_c)$$

$$F(P_c) = N(c_1^{s_1}) + \sum_{i=2}^{n} F(c_{i-1}^{s_{i-1}} \to c_i^{s_i})$$

[Figure: candidate graph with one column per GPS point (P1's candidates, P2's candidates, ..., Pn's candidates); edges link candidates of consecutive points]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
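Path selection picks one candidate per GPS point so that the summed score is maximal, which dynamic programming solves one point at a time. The sketch below is a minimal version under stated assumptions: `best_path` and its argument layout are hypothetical, and all scores N and F are taken as already computed.

```python
def best_path(first_scores, transition_scores):
    """Return (best_total, chosen candidate index per GPS point).

    first_scores[s]            = N(c_1^s) for each candidate s of point 1
    transition_scores[i][t][s] = F(c^t -> c^s) from candidate t of point
                                 i+1 to candidate s of point i+2
                                 (points 1-based, list index i 0-based)
    """
    best = list(first_scores)  # best score of any path ending at each candidate
    back = []                  # back-pointers, one list per transition step
    for step in transition_scores:
        new_best, pointers = [], []
        for s in range(len(step[0])):
            # Best predecessor t for candidate s of the next point.
            t_star = max(range(len(best)), key=lambda t: best[t] + step[t][s])
            new_best.append(best[t_star] + step[t_star][s])
            pointers.append(t_star)
        best = new_best
        back.append(pointers)
    # Trace back from the best final candidate.
    s = max(range(len(best)), key=best.__getitem__)
    total, path = best[s], [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    return total, path[::-1]
```

The search keeps only the best score per candidate at each step, so the cost grows with the number of candidate pairs between consecutive points rather than with the number of complete paths. A locally best first candidate can still lose to one that leads to better transitions later.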

Page 25:

Map-matching Example

• Path Selection

Page 26:

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 27:

Localized ST-Matching Strategy

• Path Selection

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 28:

Evaluations

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

$$A_N = \frac{\#\,\text{Correctly Matched Road Seg}}{\#\,\text{all road segments}}$$

$$A_L = \frac{\sum \text{Length Matched Road Seg}}{\text{Length of the trajectory}}$$
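Both accuracy measures can be computed from a matched route and a ground-truth route. The sketch below rests on simplifying assumptions that the slide does not state: routes are treated as sets of segment ids, only correctly matched segments count toward the numerators, and the trajectory length is taken as the total length of the ground-truth segments. The function names and the `lengths` lookup are hypothetical.

```python
def accuracy_by_number(matched, truth):
    """A_N: correctly matched segments / all segments of the true route."""
    truth_set = set(truth)
    return len(set(matched) & truth_set) / len(truth_set)


def accuracy_by_length(matched, truth, lengths):
    """A_L: length of correctly matched segments / length of the true route."""
    truth_set = set(truth)
    correct = set(matched) & truth_set
    return sum(lengths[s] for s in correct) / sum(lengths[s] for s in truth_set)
```

The two measures can disagree: missing one long segment costs more under A_L than under A_N.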

Page 29:

Homework assignment

Page 30:

Project 1 Example

USPS

https://ribbs.usps.gov/intelligentmail_package/documents/tech_guides/PUB199IMPBImpGuide.pdf

Project 1 Proposals

Page 31:

Next Class: Data Management

• Do assigned readings before class
• Be prepared, read and review required readings on your own in advance!
• Do literature survey: find and read related papers if any
• Bring your questions to the class and look for answers during the class.
• Submit reviews/critiques
• In Canvas before class
• Bring 2 hardcopies to the class
• Hand in one copy, and keep one copy with you.

Review Writing:

http://users.wpi.edu/~yli15/courses/DS504Fall18/Critiques.html

• Attend in-class discussions
• Please ask and answer questions in (and out of) class!
• Let's try to make the class interactive and fun!