Generating Wide-Area Content-Based Publish/Subscribe Workloads

Albert Yu, Pankaj K. Agarwal, Jun Yang
Duke University


Page 1: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Generating Wide-Area Content-Based Publish/Subscribe Workloads

Albert Yu, Pankaj K. Agarwal, Jun Yang
Duke University

Page 2: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Overview

Publish/Subscribe systems
Data extraction
Workload generation
Conclusion and future work

Page 3: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Publish/Subscribe

[Figure: publishers, brokers, and subscribers; the brokers form a broker network]

Page 4: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Two tasks

Subscription processing
  Match and process each published event against a large set of subscriptions.

Notification dissemination
  Notify the interested subscribers over a network.

Page 5: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Event and network spaces

Event space
  An event is a point.
  A subscription defines a region (e.g., a rectangle).

Network space
  A network location is a point.
  The distance between two network locations approximates the latency between them.

[Figure: events e1 to e4 as points and subscriptions S1 and S2 as regions in the event space]
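A minimal sketch of the two spaces, assuming rectangular subscriptions and Euclidean network coordinates. The `Rect` class, the `latency` helper, and all coordinates are hypothetical and chosen only for illustration; they are not taken from the slide's figure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    lo: tuple  # lower corner, one value per event-space dimension
    hi: tuple  # upper corner

    def contains(self, point):
        """True if the event point lies inside the rectangle (bounds inclusive)."""
        return all(l <= x <= h for l, x, h in zip(self.lo, point, self.hi))


def latency(a, b):
    """Approximate latency as the Euclidean distance between two network points."""
    return math.dist(a, b)


# Hypothetical event points and subscription rectangles.
events = {"e1": (1, 4), "e2": (2, 2), "e3": (5, 3), "e4": (6, 1)}
subs = {"S1": Rect((0, 1), (3, 5)), "S2": Rect((4, 0), (7, 2))}

# For each event, list the subscriptions whose region contains it.
matches = {
    name: sorted(s for s, rect in subs.items() if rect.contains(point))
    for name, point in events.items()
}
print(matches)
```

With these made-up coordinates, e3 falls outside both rectangles, so no subscriber would be notified for it.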

Page 6: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Lack of publicly available, realistic workloads

Privacy concerns and commercial interests
Lack of widely deployed systems supporting powerful content-based subscriptions

Page 7: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Goal

Collect the limited statistics that are available to the public.

Generate a workload consistent with these statistics.

Generate other workloads according to user-defined deviations.

Page 8: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Workload components

A set of subscriptions, each of which corresponds to:
  A rectangular region of interest in the event space
  A point in the network space

An event distribution over the event space

A set of brokers (optional), each of which corresponds to:
  A point in the network space
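The components listed above could be held in a container like the following. This is only a sketch: the type names, the field names, and the choice to represent the event distribution as a sampling function are assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Point = Tuple[float, ...]


@dataclass
class Subscription:
    lo: Point        # lower corner of the rectangular region of interest
    hi: Point        # upper corner of the region
    location: Point  # subscriber's point in the network space


@dataclass
class Workload:
    subscriptions: List[Subscription]
    # Event distribution, represented here as a function that draws one event.
    sample_event: Callable[[random.Random], Point]
    # Optional broker locations in the network space.
    brokers: List[Point] = field(default_factory=list)


# Hypothetical example: one subscription, uniform events over the unit square.
workload = Workload(
    subscriptions=[Subscription((0.0, 0.0), (0.5, 0.5), (3.0, 4.0))],
    sample_event=lambda rng: (rng.random(), rng.random()),
)
event = workload.sample_event(random.Random(1))
```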

Page 9: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Motivation: Broker-subscriber assignment

How to assign subscribers to brokers?

Letting a broker handle subscribers that are far away
  Violates delivery latency requirements.
  Increases communication costs.

Clustering subscribers with similar interests
  Potentially minimizes network traffic.

Balancing semantic similarity and network proximity in dissemination network design is a hard optimization problem.

The optimal tradeoff depends on how many events match shared versus disjoint interests.

Page 10: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Motivation: Broker-subscriber assignment

Take into account
  Subscription interest
  Subscription location
  Event distribution

Exploring the correlation between event and network spaces provides more optimization opportunities.

Page 11: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Related work

Characterizing pub/sub systems
  Properties of RSS feeds [Liu et al. ’05]
  Stock popularity in NYSE [Tock et al. ’05]

Simple synthesized workloads
  Event space
    Uniform and Gaussian distributions [Baldoni et al. ’07]
    Zipf distribution [Bianchi et al. ’07]
  Network space
    Subscribers are located uniformly or randomly in the network [Baldoni et al. ’07, Papaemmanouil and Cetintemel ’05]

Page 12: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Two work phases of our generator

Data extraction
Workload generation

Page 13: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Data extraction

[Figure: data extraction produces summary statistics of subscriber interests, locations, and events, which feed the data generator]

Page 14: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Data extraction (Cont’d)

Event space

For each cell:
• Subscription count
• Event count
• Distribution of subscribers over the network

Page 15: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Data extraction (Cont’d)

Data from Google Groups
Data from PlanetLab
Our approach can be applied to other data sources that offer similar types of summary information.

Page 16: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Google Groups

Google defines hierarchies over topics and regions.
Tag each group with three attributes.
  Ex: Asian languages -> Eastern Asian languages -> Korean

Page 17: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Google Groups (Cont’d)

Treat topic and language as dimensions of the event space.

Each interest
  Pair of topic and language.

[Figure: topic hierarchy (t1; t2, t3; t4 to t7) and language hierarchy (l1; l2, l3; l4 to l7). The leaves t4 to t7 and l4 to l7 index the two axes of the event space. Example: Interest = (t7, l4)]
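One way to read the figure is that the leaves of each hierarchy are laid out in order along an axis, so that every node covers a contiguous interval of leaf positions. The sketch below assumes that layout. The hierarchy dictionaries mirror the node names in the figure, while `leaf_ranges` and `interest_to_rect` are hypothetical helpers, not part of the authors' tool.

```python
# Hierarchies mirroring the figure: each parent maps to its ordered children.
TOPICS = {"t1": ["t2", "t3"], "t2": ["t4", "t5"], "t3": ["t6", "t7"]}
LANGS = {"l1": ["l2", "l3"], "l2": ["l4", "l5"], "l3": ["l6", "l7"]}


def leaf_ranges(children, root):
    """Map every node to the inclusive (first, last) leaf index of its subtree."""
    ranges = {}
    next_index = 0

    def visit(node):
        nonlocal next_index
        kids = children.get(node, [])
        if not kids:
            # A leaf occupies exactly one position on the axis.
            ranges[node] = (next_index, next_index)
            next_index += 1
        else:
            for kid in kids:
                visit(kid)
            # An internal node spans from its first child's start to its last child's end.
            ranges[node] = (ranges[kids[0]][0], ranges[kids[-1]][1])

    visit(root)
    return ranges


topic_range = leaf_ranges(TOPICS, "t1")
lang_range = leaf_ranges(LANGS, "l1")


def interest_to_rect(topic, lang):
    """Return the (topic interval, language interval) covered by an interest."""
    return topic_range[topic], lang_range[lang]


print(interest_to_rect("t7", "l4"))  # a single cell
print(interest_to_rect("t2", "l1"))  # a coarser interest covering a block of cells
```

Under this layout a leaf-level interest such as (t7, l4) is one cell, and an interest at a coarser level becomes a rectangle of cells.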

Page 18: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Google Groups (Cont’d)

Collect a statistical summary for each interest:
◦ Number of messages per month posted to groups associated with that interest.
◦ Number of members in each group associated with that interest.

Page 19: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Google Groups (Cont’d)


Divide all Google groups associated with the same interest by their geographic regions.

Count #members within each geographic region.

Rough indication of the distribution of subscribers over the network.
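The per-region counting described on this slide can be sketched as a simple aggregation. The record layout and the member counts below are hypothetical; only the interest and region names are borrowed from elsewhere in the slides.

```python
from collections import defaultdict

# Hypothetical records: (interest, geographic region, member count) per group.
groups = [
    (("soccer", "Korean"), "Asia", 120),
    (("soccer", "Korean"), "Asia", 30),
    (("soccer", "Korean"), "US", 50),
    (("consulting", "English"), "US", 400),
    (("consulting", "English"), "Europe", 100),
]


def members_by_region(records):
    """Sum member counts per interest and per region."""
    counts = defaultdict(lambda: defaultdict(int))
    for interest, region, members in records:
        counts[interest][region] += members
    return {interest: dict(regions) for interest, regions in counts.items()}


def region_distribution(region_counts):
    """Normalize per-region counts into fractions that sum to 1."""
    total = sum(region_counts.values())
    return {region: n / total for region, n in region_counts.items()}


counts = members_by_region(groups)
dist = region_distribution(counts[("soccer", "Korean")])
print(dist)
```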

Page 20: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Google Groups (Cont’d)

Event space

For each cell (interest):
• Subscription count
• Event count
• Distribution of subscribers by geographic region

Page 21: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Network Location

Data from Google Groups gives us a rough distribution of subscriptions by geographic region.

Still need actual network locations.

PlanetLab nodes
  IP address
  Embed inter-node latencies in a low-dimensional Euclidean space [Dabek et al’04, Ledlie et al’02, Ng et al’02]

[Figure: Google Groups and PlanetLab nodes, linked through geographic regions and coordinates]
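The slides do not say how individual subscribers are mapped onto PlanetLab nodes. As one plausible reading, the sketch below assumes each subscriber takes the embedded coordinates of a node chosen uniformly at random within its geographic region. The node coordinates, the `place_subscribers` helper, and the uniform choice are all assumptions.

```python
import random

# Hypothetical embedded coordinates of PlanetLab nodes, grouped by region.
nodes_by_region = {
    "Asia": [(10.0, 2.0), (11.5, 3.0)],
    "US": [(-8.0, 1.0), (-7.0, -2.5), (-9.5, 0.5)],
    "Europe": [(0.5, 6.0)],
}


def place_subscribers(region_counts, nodes, rng):
    """Give each subscriber the coordinates of a random node in its region."""
    placed = []
    for region, count in region_counts.items():
        for _ in range(count):
            placed.append((region, rng.choice(nodes[region])))
    return placed


rng = random.Random(0)  # fixed seed so the run is repeatable
placed = place_subscribers({"Asia": 3, "US": 1}, nodes_by_region, rng)
for region, coords in placed:
    print(region, coords)
```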

Page 22: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Popularities of interests

Removing the top 24 interests reduces the number of members from 8.1 million to 4.3 million.
The top three are (business services, English), (small business, English), and (consulting, English).

Page 23: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Distribution of interests in event space for different geographic regions

[Figure: three panels (Asia, US, Europe) over the event space; labels in each panel: Simplified Chinese, English, super-interest]

Page 24: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Two work phases of our generator

Data extraction
Workload generation

Page 25: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Workload generation

[Figure: data extraction feeds workload generation]

Input
  Summary statistics of subscriber interests, locations, and events

Parameters
  Skewness parameter
  Interest generalization parameter
  Range perturbation parameter
  Workload size parameter

Output
  A set of range subscriptions
  A set of events

Page 26: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Workload generation

Interest diffusion
Interest generalization
Categorical-to-range subscription conversion
Workloads of different sizes

Page 27: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Workload generation

Interest diffusion
Interest generalization
Categorical-to-range subscription conversion
Workloads of different sizes

[Figure: event-space grid of subscription counts per cell, shown with the original values and the values after diffusion]

Page 28: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Workload generation

Interest diffusion
Interest generalization
Categorical-to-range subscription conversion
Workloads of different sizes

[Figure: topic hierarchy and language hierarchy]

Page 29: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Workload generation

Interest diffusion
Interest generalization
Categorical-to-range subscription conversion
Workloads of different sizes

Example interest: (soccer, Korean)

Page 30: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Workload generation

Interest diffusion
Interest generalization
Categorical-to-range subscription conversion
Workloads of different sizes

Page 31: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Interest diffusion

Popularity of an interest = number of subscriptions in its subtree.

Siblings of an interest are “related.”
Reduce the popularity variance among the siblings.

[Figure: topic hierarchy (t1; t2, t3; t4 to t7)]
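The popularity definition can be sketched as a recursive subtree sum, with the sibling variance computed over the children of one parent. The hierarchy follows the node names in the figure; the per-node counts are hypothetical.

```python
# Topic hierarchy from the figure: each parent maps to its children.
children = {"t1": ["t2", "t3"], "t2": ["t4", "t5"], "t3": ["t6", "t7"]}

# Hypothetical number of subscriptions attached directly to each node.
direct = {"t4": 40, "t5": 10, "t6": 30, "t7": 20}


def popularity(node):
    """Number of subscriptions in the subtree rooted at node."""
    return direct.get(node, 0) + sum(popularity(k) for k in children.get(node, []))


def sibling_variance(parent):
    """Population variance of popularity among the children of parent."""
    pops = [popularity(k) for k in children[parent]]
    mean = sum(pops) / len(pops)
    return sum((x - mean) ** 2 for x in pops) / len(pops)


print(popularity("t1"), popularity("t2"), popularity("t3"))
print(sibling_variance("t2"))
```

With these counts, t4 and t5 differ sharply in popularity even though their parents t2 and t3 are equally popular; diffusion targets that kind of sibling imbalance.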

Page 32: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Interest diffusion

Goal
  Given a user-specified value p, reduce all popularity variances by a factor of p for all levels of granularity.

Under the following constraints
  Total subscription count remains constant.
  Popularity of an interest = sum of child popularities.

[Figure: topic hierarchy (t1; t2, t3; t4 to t7)]

Page 33: Generating Wide-Area Content-Based Publish/Subscribe Workloads

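[Figure: topic hierarchy (t1; t2, t3; t4 to t7) and language hierarchy (l1; l2, l3; l4 to l7)]

Subscription count at the coarse level:

        l2     l3
t2     500    230
t3      70    400

Subscription counts at the fine level (t4 to t7 by l4 to l7), grouped by parent cell:
Cell 500: 100, 300, 90, 10
Cell 230: 200, 10, 10, 10
Cell 70: 10, 10, 30, 20
Cell 400: 50, 100, 200, 50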

Page 34: Generating Wide-Area Content-Based Publish/Subscribe Workloads

[Figure: fine-level and coarse-level subscription-count grids; coarse-level counts are 500, 230, 70, 400]

Mean: (500 + 230 + 70 + 400) / 4 = 300

Variance: [(500 - 300)^2 + (230 - 300)^2 + (70 - 300)^2 + (400 - 300)^2] / 4 = 26950

Page 35: Generating Wide-Area Content-Based Publish/Subscribe Workloads

[Figure: fine-level and coarse-level subscription-count grids, annotated with means and variances]

Coarse level (500, 230, 70, 400): Mean: 300, Variance: 26950
Fine-level cells under 500: Mean: 125, Variance: 11425
Fine-level cells under 230: Mean: 57.5, Variance: 6768.75
Fine-level cells under 70: Mean: 17.5, Variance: 68.75
Fine-level cells under 400: Mean: 100, Variance: 3750

Goal: Given a user-specified value p, reduce all popularity variances by a factor of p.

Page 36: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Goal: Given a user-specified value p, reduce all popularity variances by a factor of p.

[Figure: 2 x 2 grid over (t2, t3) and (l2, l3); each cell i = 1..4 has an old count Ci and a new count Ci*; the old mean is C and the new mean is C*]
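The formula on this slide did not survive extraction. One update rule that is consistent with the stated goal and constraints, and with the numeric example on the slides that follow, is given here as a reconstruction rather than as the authors' exact formulation. It moves each count to the new mean and scales its deviation from the old mean by the square root of p:

```latex
\[
C_i^{*} = C^{*} + \sqrt{p}\,\bigl(C_i - C\bigr), \qquad i = 1, \dots, 4
\]
\[
\frac{1}{4}\sum_{i=1}^{4}\bigl(C_i^{*} - C^{*}\bigr)^{2}
  = p \cdot \frac{1}{4}\sum_{i=1}^{4}\bigl(C_i - C\bigr)^{2},
\qquad
\sum_{i=1}^{4} C_i^{*} = 4\,C^{*}
\]
```

The second line holds because the deviations from the old mean sum to zero. At the coarsest level C* = C, so the total count is unchanged; at finer levels C* is the parent's new count divided by its number of children, so the children still sum to the parent.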

Page 37: Generating Wide-Area Content-Based Publish/Subscribe Workloads

[Figure: fine-level and coarse-level subscription-count grids, annotated with the same means and variances as on slide 35]

Proceed top-down from the coarsest level of granularity to the finest level of granularity.

Page 38: Generating Wide-Area Content-Based Publish/Subscribe Workloads

p = 0.81

Coarse level: counts 500, 230, 70, 400 become 480, 237, 93, 390.
  Mean stays at 300; variance goes from 26950 to 21829.5.

Fine level: the means of the four groups of cells go from 125, 57.5, 17.5, 100 to 120, 59.25, 23.25, 97.5.

Page 39: Generating Wide-Area Content-Based Publish/Subscribe Workloads

p = 0.81

Coarse level: counts 500, 230, 70, 400 become 480, 237, 93, 390 (variance 26950 -> 21829.5).

Fine level, grouped by parent cell:
Under 500 -> 480: 100, 300, 90, 10 become 97.5, 277.5, 88.5, 16.5 (mean 125 -> 120, variance 11425 -> 9254.25)
Under 230 -> 237: 200, 10, 10, 10 become 187.5, 16.5, 16.5, 16.5 (mean 57.5 -> 59.25, variance 6768.75 -> 5482.69)
Under 70 -> 93: 10, 10, 30, 20 become 16.5, 16.5, 34.5, 25.5 (mean 17.5 -> 23.25, variance 68.75 -> 55.6875)
Under 400 -> 390: 50, 100, 200, 50 become 52.5, 97.5, 187.5, 52.5 (mean 100 -> 97.5, variance 3750 -> 3037.5)
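The worked example can be reproduced with a short sketch. It assumes the update rule implied by the slide's numbers: each count moves to the new sibling mean, and its deviation from the old mean is scaled by sqrt(p). The function names and the grouping of fine-level cells under their parents are illustrative choices, not the authors' code.

```python
import math


def diffuse(old_counts, new_total, p):
    """Rescale sibling counts so that their variance shrinks by a factor of p.

    old_counts: counts of the sibling cells under one parent.
    new_total:  the parent's already-diffused count; the new counts sum to it.
    """
    n = len(old_counts)
    old_mean = sum(old_counts) / n
    new_mean = new_total / n
    scale = math.sqrt(p)  # variance scales with the square of the deviations
    return [new_mean + scale * (c - old_mean) for c in old_counts]


def variance(xs):
    """Population variance, as used on the slides."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


p = 0.81
coarse_old = [500, 230, 70, 400]
# Fine-level cells grouped under their coarse-level parent, in the same order.
fine_old = [
    [100, 300, 90, 10],
    [200, 10, 10, 10],
    [10, 10, 30, 20],
    [50, 100, 200, 50],
]

# Top-down: diffuse the coarse level first (total unchanged), then each group
# of children against its parent's new count.
coarse_new = diffuse(coarse_old, sum(coarse_old), p)
fine_new = [diffuse(cells, parent, p) for cells, parent in zip(fine_old, coarse_new)]

print([round(c, 4) for c in coarse_new])
for cells in fine_new:
    print([round(c, 4) for c in cells], round(variance(cells), 4))
```

Nothing in this sketch prevents a count from going negative when a parent's new count is small relative to the spread of its children; the slides do not say how that case is handled.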

Page 40: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Along the language dimension

[Figure: two plots of popularity along the language dimension, labeled "Before diffusion" and "After diffusion"]

Page 41: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Along the topic dimension

[Figure: two plots of popularity along the topic dimension, labeled "Before diffusion" and "After diffusion"]

Page 42: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Conclusion and Future work

Make the best out of the limited amount of publicly available information to generate realistic workloads.

Make deviations easy for users to understand and control.

Extensions
  Changes to event distributions and subscriptions over time.
  Subscriptions beyond multi-dimensional range predicates.
  Statistical models.

Page 43: Generating Wide-Area Content-Based Publish/Subscribe Workloads

Thank you
