
Efficient Computation of Personal Aggregate Queries on Blogs

Ka Cheung Sia (1), Junghoo Cho (1), Yun Chi (2), Belle L. Tseng (3)

(1) University of California, Los Angeles

(2) NEC Labs America

(3) Yahoo! Inc.

ACM SIGKDD 2008


Motivation

User-generated content in the blogosphere and Web 2.0 services contains rich information about recent events

Aggregation of individual user opinions to show current popular trends


Motivation

Global aggregation
  Recent news is picked up automatically
  “Dark Knight” in the week of July 18
  “Olympics”-related items in the week of August 8

Potential drawbacks
  What if I am not interested in sports at all?
  Groups of bloggers collaborated to promote advertisement videos

Personal aggregation
  Users selectively aggregate from different sources
  Efficient strategy to handle a large number of users and sources


From Global to Personal Aggregation

Figure: bloggers linked to the items (phrases) they mention. Items: Dark Knight, Olympics, Michael Phelps, SIGKDD, Las Vegas.

Example blog posts:
  "Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!"
  "Um.. it will be good if there is a free show of Dark Knight in SIGKDD"
  "Michael Phelps performance in Olympics is awesome..."
  "Finished watching Michael Phelps in Olympics, got to attend SIGKDD now..."


Matrix formulation

Endorsement matrix (E)
  E(bj, ok): how much blogger bj endorses object ok
  Objects can be phrases or URLs

E      o1  o2  o3
b1      3   2   0
b2      0   3   0
b3      1   0   1
b4      1   2   3
Total   5   7   4

Trust matrix (T)
  T(ui, bj): how much user ui trusts blogger bj, e.g. whether the user reads the blog or how often he reads it

T    b1   b2   b3   b4
u1   0.8  0.8  0    0
u2   0.2  0.2  0.6  0.6
u3   0    0    0.5  0.5


Personal aggregation

The PersonalizedEndorsement score is the sum of endorsement scores weighted by a user's trust vector

Tables: Endorsement (blog_id, item, score), Trust (user_id, blog_id, score)

Personal Aggregate Query as SQL (Q1):

  SELECT e.item, SUM(t.score * e.score) AS score
  FROM Endorsement e, Trust t
  WHERE e.blog_id = t.blog_id AND t.user_id = <user id>
  GROUP BY e.item
  ORDER BY score DESC LIMIT 20
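As a minimal sketch, Q1 can be tried on the toy E and T matrices from the Matrix formulation slide using Python's built-in sqlite3 module. The in-memory database, the string ids, and the helper name `personal_top_k` are assumptions for illustration (the experiments in this talk used MySQL). The item column is read from Endorsement, since Trust has no item column.

```python
import sqlite3

# Toy endorsement matrix E (blogs b1..b4 x items o1..o3) from the slides.
E = {
    "b1": {"o1": 3, "o2": 2, "o3": 0},
    "b2": {"o1": 0, "o2": 3, "o3": 0},
    "b3": {"o1": 1, "o2": 0, "o3": 1},
    "b4": {"o1": 1, "o2": 2, "o3": 3},
}
# Toy trust matrix T (users u1..u3 x blogs b1..b4) from the slides.
T = {
    "u1": {"b1": 0.8, "b2": 0.8, "b3": 0.0, "b4": 0.0},
    "u2": {"b1": 0.2, "b2": 0.2, "b3": 0.6, "b4": 0.6},
    "u3": {"b1": 0.0, "b2": 0.0, "b3": 0.5, "b4": 0.5},
}

Q1 = """
SELECT e.item, SUM(t.score * e.score) AS score
FROM Endorsement e, Trust t
WHERE e.blog_id = t.blog_id AND t.user_id = ?
GROUP BY e.item
ORDER BY score DESC
LIMIT 20
"""

def personal_top_k(user_id):
    """Run Q1 for one user on an in-memory copy of the toy data (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL)")
    conn.execute("CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL)")
    # Only non-zero entries are stored, as in a sparse relational encoding.
    conn.executemany(
        "INSERT INTO Endorsement VALUES (?, ?, ?)",
        [(b, o, s) for b, row in E.items() for o, s in row.items() if s],
    )
    conn.executemany(
        "INSERT INTO Trust VALUES (?, ?, ?)",
        [(u, b, s) for u, row in T.items() for b, s in row.items() if s],
    )
    rows = conn.execute(Q1, (user_id,)).fetchall()
    conn.close()
    # Round to hide floating-point noise in the weighted sums.
    return [(item, round(score, 6)) for item, score in rows]

print(personal_top_k("u2"))
```

Items that no trusted blog endorses simply do not appear in the result, which is why a user's list can be shorter than the number of items.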

TE   o1   o2   o3
u1   2.4  4.0  0.0
u2   1.8  2.2  2.4
u3   1.0  1.0  2.0


Two baseline approaches

OTF (on-the-fly)
  Maintain two tables, compute the weighted sum for each personal aggregate query on the fly
  High query cost

VIEW
  Pre-compute the results of every user and store them as views
  High update cost


Best of both worlds

Identify “template” users: typical users interested in sports / politics / technology / ...

Results of template users are pre-computed

Results of individual users are combined from partially computed results


Trust matrix decomposition

Trust matrix reflects users' interests

Decompose T into two sub-matrices W and H (T ≈ WH)
  Non-negative Matrix Factorization (NMF)
  W: <individual users : template users> relationship
  H: <template users : blogs> relationship

User 2's trust vector is expressed as a linear combination of the trust vectors of template users 1 and 2
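A minimal sketch of the idea on the toy trust matrix from the Matrix formulation slide. The factors W and H below are hand-picked for illustration (an assumption, not the output of an NMF solver), with two hypothetical template users; the helper `matmul` is likewise illustrative. This toy T happens to have rank 2, so WH reproduces it exactly, whereas on real data NMF yields only an approximation.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Trust matrix T from the slides: rows u1..u3, columns b1..b4.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]

# Hand-picked non-negative factors (assumed for illustration only).
# H: template users x blogs. Template 1 reads b1 and b2, template 2 reads b3 and b4.
H = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
# W: individual users x template users.
W = [[0.8, 0.0],
     [0.2, 0.6],
     [0.0, 0.5]]

WH = matmul(W, H)
# Largest entry-wise difference between the product WH and the original T.
max_error = max(abs(x - y) for r1, r2 in zip(WH, T) for x, y in zip(r1, r2))
print(WH[1], max_error)
```

With these factors, the second row of W says that user u2's trust vector is 0.2 times template 1 plus 0.6 times template 2.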


Reconstruction of results

PersonalizedEndorsement scores of template users are precomputed; results of individual users are computed on request

(HE) is maintained as sorted lists for all template users

W * (HE) is the personal aggregation result
  Computed using the Threshold Algorithm

Figure labels: the sorted lists are (HE); the top-K list is the weighted linear combination W * (HE)
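The reconstruction step can be sketched as follows. This is a simplified, in-memory rendering of the Threshold Algorithm, not the authors' implementation: the function name `threshold_top_k`, the dict-based random access, and the two hypothetical template users (whose HE rows, (3, 5, 0) and (2, 2, 4), come from summing the toy E rows of b1+b2 and b3+b4) are all assumptions for illustration.

```python
import heapq

def threshold_top_k(sorted_lists, weights, k):
    """Top-k items by weighted sum of per-list scores (Threshold Algorithm sketch).

    sorted_lists: one list per template user of (item, score), sorted by
                  score in descending order; scores are assumed non-negative.
    weights:      non-negative weight per list (one row of W).
    """
    # Random-access index: item -> score for each list (missing items score 0).
    lookup = [dict(lst) for lst in sorted_lists]
    seen = {}  # item -> fully aggregated score
    max_depth = max(len(lst) for lst in sorted_lists)
    for depth in range(max_depth):
        last = []  # score at the current sorted-access position of each list
        for lst in sorted_lists:
            if depth < len(lst):
                item, score = lst[depth]
                last.append(score)
                if item not in seen:
                    seen[item] = sum(w * d.get(item, 0.0)
                                     for w, d in zip(weights, lookup))
            else:
                last.append(0.0)  # exhausted list: unseen items score 0 here
        # No unseen item can score higher than this threshold.
        threshold = sum(w * s for w, s in zip(weights, last))
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # early termination
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Sorted lists of HE for two hypothetical template users.
he_lists = [
    [("o2", 5.0), ("o1", 3.0), ("o3", 0.0)],  # template 1 (b1 + b2)
    [("o3", 4.0), ("o1", 2.0), ("o2", 2.0)],  # template 2 (b3 + b4)
]
print(threshold_top_k(he_lists, [0.2, 0.6], 1))
```

The early-termination test relies on the aggregate being monotone. A weighted sum is monotone only when the weights are non-negative, which is one reason a non-negative factorization fits well here.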


Partition of trust matrix

Decomposition is useful when the matrix is dense

Real-life data is skewed

Hybrid method: uses decomposition only when it is effective

Figure: subscription matrix with axes "users with more subscriptions" and "blogs with more subscribers", partitioned into regions labeled 1. OTF, 2. VIEW, 3. NMF
  Region of users with >30 subscriptions and feeds with >30 subscribers: 10k feeds, 24k users, ~1M subscription pairs
  Other label in the figure: 2.7M subscription pairs
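One plausible reading of the partition, sketched under stated assumptions: count subscriptions per user and subscribers per feed, then treat the pairs whose user and feed both exceed a threshold as the dense block that is worth decomposing. The function names, the single-pass thresholding, and the toy data are hypothetical. The slides do not spell out how the remaining regions are assigned to OTF and VIEW, so the sketch only separates the dense block from the rest.

```python
from collections import Counter

def dense_region(pairs, min_count=30):
    """Return (users, feeds) having more than min_count subscriptions/subscribers.

    pairs: list of (user_id, feed_id) subscription pairs.
    """
    per_user = Counter(u for u, _ in pairs)
    per_feed = Counter(f for _, f in pairs)
    users = {u for u, c in per_user.items() if c > min_count}
    feeds = {f for f, c in per_feed.items() if c > min_count}
    return users, feeds

def split_pairs(pairs, min_count=30):
    """Split subscription pairs into the dense block and everything else."""
    users, feeds = dense_region(pairs, min_count)
    dense = [(u, f) for u, f in pairs if u in users and f in feeds]
    rest = [(u, f) for u, f in pairs if not (u in users and f in feeds)]
    return dense, rest

# Toy data with a lowered threshold so the split is visible.
toy = [("u1", "f1"), ("u1", "f2"), ("u2", "f1"), ("u2", "f2"), ("u3", "f3")]
dense, rest = split_pairs(toy, min_count=1)
print(len(dense), len(rest))
```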


Experiments

Bloglines.com: online RSS reader

Trust matrix T (1-0 version): subscription profile
  91,366 users
  487,694 RSS feeds

Endorsement matrix E: blog-keyword occurrence
  Feed content collected between Nov 2006 and Jul 2007
  Keywords filtered by nouns and high tf-idf values in entries

Platform
  Python implementation of the proposed scheme
  MySQL server on Linux with data on RAID disk


How different is personalization?

Week of 2007 Jan 7 to 2007 Jan 13; major event: iPhone announced

personal aggregation results differ from global aggregation

Top keywords, 2007-01-07 to 2007-01-13:

Global       User 90439  User 90550  User 91017
sales        cattle      brazil      yorker
iphone       beef        iguazu      iraq
apple        iphone      reuters     bush
manager      chicago     search      president
iraq         iraq        vegas       views
management   bush        argentina   avenue
development  apple       kibbutz     dept
software     companies   video       troops
business     prices      cathartik   saddam
phone        quarter     google      iran


How different is personalization?

Overlap comparison of global aggregation and personal aggregation
  L_G: global top 20 items
  L_i: individual top 20 items of user i

Personal aggregation results also differ among users

Figure: overlap degree with the global aggregation result (L_G ∩ L_i) and pair-wise overlap among users (L_i ∩ L_j)


Approximation accuracy

Dense region of subscription matrix
  >30 subscribers: 10,152 feeds
  >30 subscriptions: 24,340 users

L2 norm comparison
  Sparsity of W (23%), H (13%)
  NMF approximation is close to SVD, with the advantage of sparseness

Rank  SVD    NMF
80    848.5  856.9
90    841.6  850.1
100   835.1  844.6
110   829.0  837.9
120   823.2  833.0
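To make "close to SVD" concrete, the short calculation below computes how far the NMF figure sits above the SVD figure at each rank. It assumes the table reports the L2 norm of the approximation residual (lower is better); the helper name `relative_gap` is hypothetical.

```python
# (rank, SVD, NMF) values as reported on the slide.
errors = [
    (80, 848.5, 856.9),
    (90, 841.6, 850.1),
    (100, 835.1, 844.6),
    (110, 829.0, 837.9),
    (120, 823.2, 833.0),
]

def relative_gap(svd, nmf):
    """How much larger the NMF value is than the SVD value, as a fraction."""
    return (nmf - svd) / svd

gaps = {rank: relative_gap(svd, nmf) for rank, svd, nmf in errors}
for rank, gap in gaps.items():
    print(f"rank {rank}: NMF is {gap:.2%} above SVD")
```

On these numbers the gap stays between roughly 1.0% and 1.2% across all listed ranks.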


Approximation accuracy

How many items in the top 20 list are recovered by the NMF approximation?
  T_i: top 20 items of user i computed by OTF
  A_i: top 20 items of user i computed by NMF
  70% approximation accuracy, and more accurate for higher-ranked items

Figure: |A_i ∩ T_i| / |T_i| and its correlation with rank
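The accuracy measure |A_i ∩ T_i| / |T_i| can be sketched as below. The function names and the five-item toy lists are hypothetical; the slides use top 20 lists.

```python
def topk_accuracy(approx, exact):
    """Fraction of the exact top-k list that also appears in the approximate list."""
    exact_set = set(exact)
    return len(exact_set & set(approx)) / len(exact_set)

def hits_by_rank(approx, exact):
    """For each rank position of the exact list, whether that item was recovered."""
    approx_set = set(approx)
    return [item in approx_set for item in exact]

# Hypothetical toy lists: T_i from exact computation, A_i from the approximation.
T_i = ["a", "b", "c", "d", "e"]
A_i = ["a", "c", "b", "x", "y"]

print(topk_accuracy(A_i, T_i))
print(hits_by_rank(A_i, T_i))
```

Averaging `hits_by_rank` over many users, position by position, is one way to check whether higher-ranked items are recovered more often.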


Efficiency of proposed method

Update cost
  OTF (222K) < NMF (3.2M) < VIEW (23.6M)

Query response time
  Averaged over the 1000 users with the highest number of subscriptions
  OTF: execute SQL query Q1 on the MySQL server
  NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server to load the NMF template users' tables

Average query response time reduced by 75%; outliers with significant delay eliminated

Method  avg    std    max     min
OTF     2.05s  3.60s  84.42s  0.037s
NMF     0.46s  0.53s  2.84s   0.007s
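As a quick arithmetic check of the reported timings, the snippet below computes the relative reduction in average and worst-case response time. The helper name `reduction` is hypothetical; the numbers are those on the slide.

```python
# Query response times in seconds, as reported on the slide.
otf = {"avg": 2.05, "std": 3.60, "max": 84.42, "min": 0.037}
nmf = {"avg": 0.46, "std": 0.53, "max": 2.84, "min": 0.007}

def reduction(before, after):
    """Relative reduction when going from `before` to `after`."""
    return 1.0 - after / before

avg_reduction = reduction(otf["avg"], nmf["avg"])
max_reduction = reduction(otf["max"], nmf["max"])
print(f"average: {avg_reduction:.1%}, worst case: {max_reduction:.1%}")
```

The average comes out slightly above the roughly 75% quoted, and the worst case shrinks far more, which matches the remark about eliminated outliers.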


Conclusion and future work

Deliver tailored results to users by personal aggregation

Proposed a model for personal aggregate queries

Optimization by NMF & Threshold Algorithm

A real-life dataset study shows that query response time can be reduced significantly with acceptable approximation accuracy

Future work
  Handle updates to the trust matrix
  Parallelism
  Better phrase extraction (e.g. opinion orientation)


Thank you!

Q and A


Threshold algorithm

Proposed by Fagin et al. [2001]

Efficient computation of top-K items from multiple lists with a monotone aggregate function

Figure labels: users, user groups, blogs


Illustration of matrix partition

Figure: toy subscription matrix. Axes: feeds with more subscribers (labels: 2 subscribers, 9 subscribers) and users with more subscriptions (labels: 2 subscriptions, 8 subscriptions).
