ds504/cs586: big data analytics -...

33
DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Welcome to Time: 6:00pm –8:50pm R Location: AK232 Fall 2016

Upload: nguyennhi

Post on 11-Mar-2018

224 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

DS504/CS586: Big Data Analytics Data Management

Prof. Yanhua Li

Welcome to

Time: 6:00pm –8:50pm R Location: AK232

Fall 2016

Page 2: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

First Grading for Assignment 2

v  Summarizing the problem and solutions

v  Critiques/comments/ideas

Page 3: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Cities OSPeople

The Environment

Win

Win

Win

Urban Computing

Tackle the Big

challenges in Big cities

using Big data!

Urban Sensing & Data AcquisitionParticipatory Sensing, Crowd Sensing, Mobile Sensing

Traffic Road Networks

POIsAir Quality

Human mobility

Meteorology

Social Media

Energy

Urban Data ManagementSpatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data AnalyticsData Mining, Machine Learning, Visualization

Service ProvidingImprove urban planning, Ease Traffic Congestion, Save Energy, Reduce

Air Pollution, ...

Urban Computing: concepts, methodologies, and applications. Zheng, Y., et al. ACM transactions on Intelligent Systems and Technology.

Page 4: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

2D-Spatial Queries

K Nearest Neighbour (KNN) Queries

Region (Range) Query

Given a point or an object, find the nearest object that satisfies given conditions

Ask for objects that lie partially or fully inside a

specified region.

Page 5: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Trajectory Data Management v  Range queries

E.g. Retrieve the trajectories of vehicles passing a given rectangular region R between 2pm-4pm in the past month

•  KNN queries q1

q2 Tr1

Tr2

Tr3

Tr1

Tr2

Tr3

Tr1

Tr2

Tr3qt

B) KNN Point QueryA) Range Query C) KNN Trajectory Query

R

p2

q1

q2 Tr1

Tr2

Tr3

Tr1

Tr2

Tr3

Tr1

Tr2

Tr3qt

B) KNN Point QueryA) Range Query C) KNN Trajectory Query

R

p2

E.g. Retrieve the trajectories of people with the minimum aggregated distance to a set of query points Publications: [1][2] for a single point query, [3] for multiple query points

E.g. Retrieve the trajectories of people with the minimum aggregated distance to a query trajectory Publications: Chen et al, SIGMOD05; Vlachos et al, ICDE02; Yi et al, ICDE98.

q1

q2 Tr1

Tr2

Tr3

Tr1

Tr2

Tr3

Tr1

Tr2

Tr3qt

B) KNN Point QueryA) Range Query C) KNN Trajectory Query

R

p2

[3] Zaiben Chen, et al. Searching Trajectories by Locations: An Efficiency Study, SIGMOD 2010

[1] E. Frentzos, et al. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 2007 [2] D. Pfoser, et al. Novel approaches in query processing for moving object trajectories. VLDB, 2000.

Page 6: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Spatial/Temporal Indexing Structures

v  Temporal Indexing (1-D data) § List index § B-tree

v  Space Partition-Based Indexing Structures (2-D data) § Grid-based § Quad-tree

Page 7: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Full B-Tree Structure

Page 8: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

B-Tree Index

v  B-tree is the most commonly used data structures for indexing.

v  It is fully dynamic, that is it can grow and shrink.

Page 9: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Three Types B-Tree Nodes

v  Root node - contains node pointers to branch nodes.

v  Branch node - contains pointers to leaf nodes or other branch nodes.

v  Leaf node - contains index items

Page 10: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Spatial/Temporal Indexing Structures

v  Temporal Indexing (1-D data) § B-tree

v  Space Partition-Based Indexing Structures (2-D data) § Grid-based § Quad-tree

Page 11: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Grid-based Spatial Indexing

g1 p1 p3

g2 p4

g1 g2p1

p3p4

v  Indexing §  Partition the space into disjoint and uniform grids §  Build an index between each grid and the points in the grid

Page 12: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Grid-based Spatial Indexing

v  Range Query §  Find the girds intersecting the range query §  Retrieve the points from the grids and identify the points in the range

p1 p3p2

p4

p2

p3

p1

p4g1

g2

g4

g3

Page 13: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Grid-based Spatial Indexing

v  Nearest neighbor query §  Euclidian distance §  Road network distance is quite different

p1

p1p2 p1p2

The nearest object is within the grid

The nearest object is outside the grid

Fast approximation

Page 14: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Grid-based Spatial Indexing

v  Advantages §  Easy to implement and understand §  Very efficient for processing range and nearest queries

v  Disadvantages §  Index size could be big §  Difficult to deal with unbalanced data

§  Think about what we discussed last time on the POI sampling and estimation.

Page 15: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

15

Quad-Tree •  Indexing

–  Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.

–  Each non-leaf node divides its region into four equal sized quadrants –  Leaf nodes have between zero and some fixed maximum number of

points (set to 1 in example).

0 1

23

00

0203

30 31

3233

3000

0 1 2 3

Page 16: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Quad-Tree•  Range query

0 1

23

00

0203

30 31

3233

0 1 2 3

20 23

Page 17: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Quad-Tree•  Nearest Neighbour Query (hard)

0 1

23

00

0203

30 31

3233

0 1 2 3

20 23

Page 18: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Sampling Big Trajectory Data

Page 19: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Big Trajectory Data in Urban Networks

Taxi GPS Trajectory Mobile User Trajectory •  Urban roving sensors deliver big trajectory data.

•  Reveal moving patterns and urban issues.

How to manage the big trajectory data to enable efficient query processing.

Challenge

Page 20: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Trajectory Aggregate Query

•  A trajectory aggregate query •  Retrieves statistics of distinct trajectories passing a

user-specified spatio-temporal region; •  Examples,

•  # of taxi trips with average speed of more than 5 miles per hour in New York City in 2014;

•  # of mobile users with iPhone in Hong Kong in 2013.

Page 21: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Exhaustive Search

•  ri: a sequence of GPS points in (TID, Lat, Lng, Time) •  q: a trajectory aggregate query with Nq Trajectories •  Spatio-temporal indexing: B-tree, Quad-tree, etc,

Page 22: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Challenges with Big Trajectory Data

•  Long responding time for large trajectory dataset •  In 2013, Shenzhen, China;

•  Query: # of iPhone users and taxi/bus trips

(System: A cluster of 3 machines with 24 Intel X5670 2.93GHz processors, 94GB memory.)

Mobility Data 788.6TB 6million usersTaxi GPS 1.58 TB 22,083 taxisBus GPS 1.34 TB 8,427 buses

iPhone Users 0.8 million Taxi GPS 302 million tripsBus GPS 1.38 billion trips

12 minutes to get the exact answers!

Page 23: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Key Challenges on Exact Answer

•  A trajectory ri may traverse multiple index leaf nodes •  Cannot pre-compute and store the results on indices •  Summing up two answers leads to over-counting

r1 r2

2 1

Page 24: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Motivation & Problem Definition

How to sample B index leaf nodes to estimate # of trajectories in q with a guaranteed error bound?

q covers n index leaf nodes

Page 25: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Random Index Sampling B Sampled index leaf nodes Trajectory list

r1, r2 r3, r5 r6, r7 r9, r10 …

kq1, kq

2 kq

3, kq5

kq6, kq

7 kq

9, kq10

Occurrence time

Inverted index

r1

r2

r3 …

Lng

Lat Time

Index leaf node list

… query q

ST-indexed data Data Indexing Structure

Sampling and Estimation

r2 r6 r9 …

r5 r1 r6 r7

r3

Index leaf node list

Index leaf node list

Page 26: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Random Index Sampling •  Stage 1: Sampling Stage:

•  Uniformly at random sample B index leaf nodes with replacement

•  Stage 2: Estimation Stage: (Unbiased Estimator)

•  Convergence analysis:

when , .

is the maximum number of trajectories in an index leaf node.

Page 27: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Evaluation v  Dataset: 3TB real human mobility data in a large city

in eastern China

v  Baseline Algorithm § Exhaustive search

v  Evaluation metric § Relative error & Responding time

Statistics Value City Size 400 square miles

City Population Size three million people

Duration eight days at the end of 2010

Number of trajectories 109,914 3G users

# of spatio-temporal points 400 million (407, 040, 083)

Page 28: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Evaluation Results

Relative error

0.1 0.2 0.4 0.8 1.6 −0.3

−0.2

−0.1

0

0.1

0.2

0.3

B/n(%)

Rel

ativ

eer

ror

rati

o

n=7kn=13kn=23kGround Truth

0.1 0.2 0.4 0.8 1.6 0

5

10

15

20

B/n(%)Q

uery

proc

essi

ngti

me

(s)

n=7k (ES: 112s)n=13k (ES: 115s)n=23k (ES: 117s)

Processing time

Up to 2% relative error 5 times reduction

5

2%

Page 29: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Concurrent Random Index Sampling •  Practical Issue:

•  A large number of concurrent aggregate queries

•  Idea of Concurrent Random Index Sampling (CRIS): •  Sampling Reuse •  Stratified Sampling Technique

Page 30: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Concurrent Random Index Sampling

Unbiased Estimators:

Page 31: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Summary

v  Approximate query processing

§ Single trajectory aggregate query •  via Random Index Sampling (RIS)

§ Concurrent trajectory aggregate queries •  via Concurrent Random Index Sampling (CRIS)

Page 32: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Any Comments & Critiques?

Page 33: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-3-Management.… · DS504/CS586: Big Data Analytics Data Management Prof. Yanhua

Weka

v  6 weeks

v  Each week, 10+5 minutes for a short summary of what you learned. §  As part of the oral evaluation of the class; §  One or two team members to present, and the whole group got

the same score

v  At the last week, turn in your e-certificate §  As part of the written evaluation of the class