Generating Wide-Area Content-Based Publish/Subscribe Workloads

Albert Yu, Pankaj K. Agarwal, Jun Yang
Duke University


Overview

• Publish/Subscribe systems
• Data extraction
• Workload generation
• Conclusion and future work

Publish/Subscribe

[Figure: publishers, brokers, and subscribers; the brokers form a broker network.]

Two tasks

• Subscription processing: match and process each published event against a large set of subscriptions.
• Notification dissemination: notify the interested subscribers over a network.

Event and network spaces

• Event space
  ◦ An event is a point.
  ◦ A subscription defines a region (e.g., a rectangle).
• Network space
  ◦ A network location is a point.
  ◦ The distance between two network locations approximates the latency between them.

[Figure: events e1, e2, e3, e4 and subscriptions S1, S2 in the event space.]
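A minimal sketch of the two spaces, assuming rectangular subscriptions and Euclidean network coordinates. The names `Rect`, `matching_subscriptions`, and `estimated_latency`, and all coordinates below, are illustrative and not taken from the slides.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in the event space (bounds inclusive)."""
    lo: tuple
    hi: tuple

    def contains(self, point):
        return all(l <= x <= h for l, x, h in zip(self.lo, point, self.hi))


def matching_subscriptions(event, subscriptions):
    """Return the names of all subscriptions whose region contains the event."""
    return [name for name, rect in subscriptions.items() if rect.contains(event)]


def estimated_latency(a, b):
    """Euclidean distance between two network-space points, used as a latency proxy."""
    return dist(a, b)


# Made-up coordinates for two subscriptions and four events.
subscriptions = {"S1": Rect((0, 0), (4, 4)), "S2": Rect((3, 3), (8, 6))}
events = {"e1": (1, 1), "e2": (3.5, 3.5), "e3": (7, 5), "e4": (9, 9)}
matches = {name: matching_subscriptions(p, subscriptions) for name, p in events.items()}
```

In this toy setting an event may match one subscription, several, or none, which is what makes subscription processing and dissemination separate problems.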

Lack of publicly available, realistic workloads

• Privacy concerns and commercial interests
• Lack of widely deployed systems supporting powerful content-based subscriptions

Goal

• Collect the limited statistics that are publicly available.
• Generate a workload consistent with these statistics.
• Generate other workloads according to user-defined deviations.

Workload components

• A set of subscriptions, each of which corresponds to:
  ◦ A rectangular region of interest in the event space
  ◦ A point in the network space
• An event distribution over the event space
• A set of brokers (optional)
  ◦ A point in the network space
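The components listed above could be held in a structure like the following. This is only a sketch: the class and field names are hypothetical, and representing the event distribution as per-cell weights is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    region_lo: tuple   # lower corner of the rectangle in the event space
    region_hi: tuple   # upper corner of the rectangle in the event space
    location: tuple    # point in the network space


@dataclass
class Workload:
    subscriptions: list
    event_weights: dict                          # event-space cell -> relative event frequency
    brokers: list = field(default_factory=list)  # optional points in the network space

    def event_distribution(self):
        """Normalize the per-cell weights into a probability distribution."""
        total = sum(self.event_weights.values())
        return {cell: w / total for cell, w in self.event_weights.items()}


workload = Workload(
    subscriptions=[Subscription((0, 0), (2, 1), (10.0, 4.0))],
    event_weights={(0, 0): 30, (1, 0): 10},
)
```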

Motivation: Broker-subscriber assignment

How should subscribers be assigned to brokers?

• Balancing semantic similarity and network proximity in dissemination network design is a hard optimization problem.
• The optimal tradeoff depends on the amounts of events matching shared versus disjoint interests.
• Letting a broker handle subscribers that are far away can:
  ◦ Violate delivery latency requirements.
  ◦ Increase communication costs.
• Clustering subscribers with similar interests can potentially minimize network traffic.

Motivation: Broker-subscriber assignment (Cont’d)

• Take into account:
  ◦ Subscription interest
  ◦ Subscription location
  ◦ Event distribution
• Exploring the correlation between event and network spaces provides more optimization opportunities.

Related work

• Characterizing pub/sub systems
  ◦ Properties of RSS feeds [Liu et al. ’05]
  ◦ Stock popularity in NYSE [Tock et al. ’05]
• Simple synthesized workloads
  ◦ Event space: uniform and Gaussian distributions [Baldoni et al. ’07], Zipf distribution [Bianchi et al. ’07]
  ◦ Network space: subscribers are located uniformly or randomly in the network [Baldoni et al. ’07, Papaemmanouil and Cetintemel ’05]

Two work phases of our generator

• Data extraction
• Workload generation

Data extraction

[Figure: data extraction produces summary statistics of subscriber interests, locations, and events, which are passed to the data generator.]

Data extraction (Cont’d)

For each cell of the event space:

• Subscription count
• Event count
• Distribution of subscribers over the network

Data extraction (Cont’d)

• Data from Google Groups
• Data from PlanetLab
• Our approach can be applied to other data sources that offer similar types of summary information.

Google Groups

• Google defines hierarchies over topics and regions.
• Each group is tagged with three attributes.
  ◦ Example of a hierarchy path: Asian languages -> Eastern Asian languages -> Korean
• Topic and language are treated as dimensions of the event space.
• Each interest is a pair of a topic and a language.

Google Groups (Cont’d)

[Figure: language hierarchy (l1; l2, l3; l4 to l7) and topic hierarchy (t1; t2, t3; t4 to t7). The leaves l4 to l7 and t4 to t7 form the two axes of the event space. Example: interest = (t7, l4).]
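As a sketch of how the two hierarchies span the event space, the snippet below enumerates the leaf-level cells. The parent maps are hypothetical: the slides name the nodes but not which leaf sits under which parent, so a balanced binary layout is assumed.

```python
from itertools import product

# Assumed hierarchies, stored as child -> parent (layout is a guess).
TOPIC_PARENT = {"t2": "t1", "t3": "t1", "t4": "t2", "t5": "t2", "t6": "t3", "t7": "t3"}
LANG_PARENT = {"l2": "l1", "l3": "l1", "l4": "l2", "l5": "l2", "l6": "l3", "l7": "l3"}


def leaves(parent):
    """Nodes that never appear as a parent are the leaves."""
    return sorted(set(parent) - set(parent.values()))


def path_to_root(node, parent):
    """Walk from a node up to the root of its hierarchy."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


# Each cell of the event space is a (topic leaf, language leaf) pair.
event_space = list(product(leaves(TOPIC_PARENT), leaves(LANG_PARENT)))
interest = ("t7", "l4")
```

Coarser interests can be obtained by replacing either coordinate with one of its ancestors from `path_to_root`.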


Collect a statistical summary for each interest:
◦ Number of messages per month posted to groups associated with that interest.
◦ Number of members in each group associated with that interest.


• Divide all Google groups associated with the same interest by their geographic regions.
• Count the number of members within each geographic region.
• This gives a rough indication of the distribution of subscribers over the network.
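The per-interest summary described above might be computed as follows. The record layout and the sample numbers are invented for illustration. Using member counts as subscription counts and monthly message counts as event counts is an assumption about how the statistics map onto the workload.

```python
from collections import defaultdict

# Hypothetical group records: (topic, language, region, members, messages_per_month)
groups = [
    ("soccer", "Korean", "Asia", 120, 40),
    ("soccer", "Korean", "US", 30, 5),
    ("soccer", "English", "Europe", 200, 90),
]


def summarize(groups):
    """Aggregate group records into per-interest summary statistics."""
    stats = defaultdict(
        lambda: {"subscriptions": 0, "events": 0, "by_region": defaultdict(int)}
    )
    for topic, language, region, members, messages in groups:
        cell = stats[(topic, language)]
        cell["subscriptions"] += members
        cell["events"] += messages
        cell["by_region"][region] += members
    return stats


stats = summarize(groups)
korean = stats[("soccer", "Korean")]
# Fraction of the interest's members that fall in each geographic region.
region_share = {r: n / korean["subscriptions"] for r, n in korean["by_region"].items()}
```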


For each cell (interest) of the event space:

• Subscription count
• Event count
• Distribution of subscribers by geographic region

Network Location

• Data from Google Groups gives us a rough distribution of subscriptions by geographic region.
• We still need actual network locations.
• PlanetLab nodes
  ◦ IP addresses
  ◦ Embed inter-node latencies in a low-dimensional Euclidean space [Dabek et al. ’04, Ledlie et al. ’02, Ng et al. ’02]

[Figure: mapping from Google Groups and PlanetLab nodes to geographic regions and coordinates.]
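One way the regional distribution and the embedded node coordinates could be combined is sketched below. The slides do not say how a subscriber is placed within a region, so the sampling rule here (pick a region in proportion to its member count, then pick a node in that region uniformly) is an assumption, and the node coordinates are made up.

```python
import random

# Hypothetical embedded coordinates of PlanetLab nodes, grouped by region.
NODES_BY_REGION = {
    "Asia": [(120.0, 35.0), (118.5, 31.2)],
    "US": [(-75.0, 40.0), (-122.0, 37.5)],
}


def sample_locations(region_counts, n, rng):
    """Draw n network locations: choose a region weighted by member count,
    then choose one of that region's nodes uniformly at random."""
    regions = list(region_counts)
    weights = [region_counts[r] for r in regions]
    picks = rng.choices(regions, weights=weights, k=n)
    return [rng.choice(NODES_BY_REGION[r]) for r in picks]


locations = sample_locations({"Asia": 120, "US": 30}, n=5, rng=random.Random(0))
```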

Popularities of interests

• Removing the top 24 interests reduces the number of members from 8.1 million to 4.3 million.
• The top three interests are (business services, English), (small business, English), and (consulting, English).

[Figure: popularities of interests, with the super-interest labeled.]

Distribution of interests in event space for different geographic regions

[Figure: three panels (Asia, US, Europe) showing the distribution of interests in the event space. Each panel labels Simplified Chinese and English; the super-interest is also labeled.]

Two work phases of our generator

• Data extraction
• Workload generation

Workload generation

[Figure: data extraction produces summary statistics of subscriber interests, locations, and events. Workload generation takes these statistics and outputs a set of range subscriptions and a set of events.]

Parameters:

• Skewness parameter
• Interest generalization parameter
• Range perturbation parameter
• Workload size parameter

Workload generation

• Interest diffusion
• Interest generalization
• Categorical-to-range subscription conversion
• Workloads of different sizes

Workload generation

[Figure: event space grid with per-cell subscription counts, shown with the original values and the values after interest diffusion.]

Workload generation

[Figure: topic hierarchy and language hierarchy.]

Workload generation

[Figure: example interest (soccer, Korean).]


Interest diffusion

• Popularity of an interest = number of subscriptions in its subtree.
• Siblings of an interest are “related.”
• Reduce the popularity variance among the siblings.

[Figure: topic hierarchy with root t1, internal nodes t2 and t3, and leaves t4 to t7.]
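The popularity definition above can be sketched as a recursive sum over the hierarchy. The tree layout (t4 and t5 under t2, t6 and t7 under t3) and the leaf counts are assumptions for illustration.

```python
# Assumed topic hierarchy, stored as parent -> children.
CHILDREN = {"t1": ["t2", "t3"], "t2": ["t4", "t5"], "t3": ["t6", "t7"]}


def popularity(node, leaf_counts):
    """Number of subscriptions in the subtree rooted at `node`."""
    kids = CHILDREN.get(node)
    if not kids:
        return leaf_counts.get(node, 0)
    return sum(popularity(k, leaf_counts) for k in kids)


# Illustrative leaf-level subscription counts.
leaf_counts = {"t4": 500, "t5": 230, "t6": 70, "t7": 400}
sibling_popularities = [popularity(t, leaf_counts) for t in CHILDREN["t1"]]
```

Diffusion then operates on lists such as `sibling_popularities`, one list per group of siblings.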

Interest diffusion

Goal

• Given a user-specified value p, reduce all popularity variances by a factor of p at all levels of granularity.

Under the following constraints:

• The total subscription count remains constant.
• The popularity of an interest equals the sum of its children’s popularities.

[Figure: topic hierarchy (t1; t2, t3; t4 to t7), shown twice.]

[Figure: language hierarchy (l1; l2, l3; l4 to l7), topic hierarchy (t1; t2, t3; t4 to t7), and subscription counts on the 4 x 4 leaf-level grid (t4 to t7 by l4 to l7).]

Subscription counts at the coarse level:

        l2     l3
t2     500    230
t3      70    400

Leaf-level counts under each coarse cell:

• 500 = 100 + 300 + 90 + 10
• 230 = 200 + 10 + 10 + 10
• 70 = 10 + 10 + 30 + 20
• 400 = 50 + 100 + 200 + 50

Mean: (500 + 230 + 70 + 400) / 4 = 300

Variance: [(500 - 300)^2 + (230 - 300)^2 + (70 - 300)^2 + (400 - 300)^2] / 4 = 26950
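The mean and variance above divide by the number of cells, so the variance is a population variance. A quick check using Python's standard library:

```python
from statistics import fmean, pvariance

coarse_counts = [500, 230, 70, 400]

mean = fmean(coarse_counts)          # arithmetic mean of the four coarse cells
variance = pvariance(coarse_counts)  # population variance (divides by n, not n - 1)
```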

The same statistics are computed at every level of granularity:

• Coarse level (500, 230, 70, 400): mean 300, variance 26950.
• Children of the coarse cell with count 230 (200, 10, 10, 10): mean 57.5, variance 6768.75.



Goal: given a user-specified value p, reduce all popularity variances by a factor of p.

[Figure: 2 x 2 coarse grid (t2, t3 by l2, l3). Cell i has old count Ci and new count Ci*; the old mean is C and the new mean is C*.]
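The slides do not spell out the update rule. One rule that is consistent with the numbers in the worked example (for instance 500 becoming 480 when p = 0.81 and both means are 300) scales each deviation from the mean by the square root of p. Under that assumption, with n sibling cells, the new counts average to the new mean C* and their variance is p times the old variance:

```latex
\begin{align*}
C_i^* &= C^* + \sqrt{p}\,(C_i - C), \qquad i = 1, \dots, n \\
\frac{1}{n}\sum_{i=1}^{n} C_i^*
  &= C^* + \frac{\sqrt{p}}{n}\sum_{i=1}^{n}(C_i - C) = C^* \\
\frac{1}{n}\sum_{i=1}^{n}\bigl(C_i^* - C^*\bigr)^2
  &= \frac{p}{n}\sum_{i=1}^{n}(C_i - C)^2
\end{align*}
```

The second line uses the fact that deviations from the mean sum to zero. At the top level the total count is fixed, so C* = C; at lower levels C* is the parent's new count divided by n.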

Proceed top-down from the coarsest level of granularity to the finest level of granularity.

Example: the children of the coarse cell with count 400 are 50, 100, 200, 50 (mean 100, variance 3750).

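With p = 0.81, the coarse-level counts change as follows:

Old count    New count    New count / 4 (new mean of its four children)
500          480          120
230          237          59.25
70           93           23.25
400          390          97.5

The coarse-level variance drops from 26950 to 21829.5. The children of the cell with old count 400 still have mean 100 and variance 3750 at this point.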

The same step is then applied to the children of each coarse cell (p = 0.81):

Coarse cell (old -> new)    Old children         New children                 New variance
500 -> 480                  100, 300, 90, 10     97.5, 277.5, 88.5, 16.5      9254.25
230 -> 237                  200, 10, 10, 10      187.5, 16.5, 16.5, 16.5      5482.69
70 -> 93                    10, 10, 30, 20       16.5, 16.5, 34.5, 25.5       55.6875
400 -> 390                  50, 100, 200, 50     52.5, 97.5, 187.5, 52.5      3037.5
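A minimal sketch of top-down diffusion on the example counts. It assumes the update rule inferred from the example numbers (deviations from the sibling mean are scaled by the square root of p, and siblings sum to their parent's new count); the function names are hypothetical.

```python
from math import sqrt


def diffuse(old, new_total, p):
    """Rescale sibling counts: deviations from the mean shrink by sqrt(p),
    and the new counts sum to new_total (the parent's new popularity)."""
    n = len(old)
    old_mean = sum(old) / n
    new_mean = new_total / n
    return [new_mean + sqrt(p) * (c - old_mean) for c in old]


def diffuse_top_down(coarse, children, p):
    """coarse: old counts of the sibling cells at the coarse level.
    children[i]: old counts of the cells under coarse cell i."""
    # At the top level the total subscription count stays constant.
    new_coarse = diffuse(coarse, sum(coarse), p)
    # Each group of children must sum to its parent's new count.
    new_children = [
        diffuse(kids, parent, p) for kids, parent in zip(children, new_coarse)
    ]
    return new_coarse, new_children


coarse = [500, 230, 70, 400]
children = [
    [100, 300, 90, 10],
    [200, 10, 10, 10],
    [10, 10, 30, 20],
    [50, 100, 200, 50],
]
new_coarse, new_children = diffuse_top_down(coarse, children, p=0.81)
```

Up to floating-point rounding, this yields the coarse counts 480, 237, 93, 390 and the leaf values shown in the example, while the total stays at 1200.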

Along the language dimension

[Figure: two panels, before diffusion and after diffusion.]

Along the topic dimension

[Figure: two panels, before diffusion and after diffusion.]

Conclusion and Future work

• Make the best of the limited amount of publicly available information to generate realistic workloads.
• Make deviations easy for users to understand and control.
• Extensions
  ◦ Changes to event distributions and subscriptions over time.
  ◦ Subscriptions beyond multi-dimensional range predicates.
  ◦ Statistical models.

Thank you
