Post on 19-Dec-2015
Efficient Computation of Personal Aggregate Queries on Blogs
Ka Cheung Sia (1), Junghoo Cho (1), Yun Chi (2), Belle L. Tseng (3)
(1) University of California, Los Angeles
(2) NEC Labs America
(3) Yahoo! Inc.
ACM SIGKDD 2008
2
Motivation
User-generated content in the blogosphere and Web 2.0 services contains rich information about recent events
Aggregating individual user opinions shows current popular trends
3
Motivation
Global aggregation
Recent news is picked up automatically
“Dark Knight” in the week of July 18
“Olympics”-related items in the week of August 8
Potential drawbacks
What if I am not interested in sports at all?
Groups of bloggers may collaborate to promote advertisement videos
Personal aggregation
Users selectively aggregate from different sources
An efficient strategy is needed to handle a large number of users and sources
4
From Global to Personal Aggregation
[Figure: bloggers linked to the items (phrases) they mention: Dark Knight, Olympics, Michael Phelps, SIGKDD, Las Vegas]
Example posts:
"Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!"
"Um.. it will be good if there is a free show of Dark Knight in SIGKDD"
"Michael Phelps performance in Olympics is awesome..."
"Finished watching Michael Phelps in Olympics, got to attend SIGKDD now..."
5
Matrix formulation
Endorsement matrix (E)
E(bj, ok): how much blogger bj endorses object ok
An object can be a phrase or a URL

E       o1   o2   o3
b1       3    2    0
b2       0    3    0
b3       1    0    1
b4       1    2    3
Total    5    7    4

Trust matrix (T)
T(ui, bj): how much user ui trusts blogger bj, e.g. whether the user reads the blog or how often he reads it

T      b1    b2    b3    b4
u1    0.8   0.8   0     0
u2    0.2   0.2   0.6   0.6
u3    0     0     0.5   0.5
6
Personal aggregation
The PersonalizedEndorsement score is the sum of endorsement scores weighted by the user's trust vector
Tables: Endorsement (blog_id, item, score), Trust (user_id, blog_id, score)
Personal Aggregate Query as SQL (Q1):
SELECT e.item, SUM(t.score * e.score) AS score
FROM Endorsement e, Trust t
WHERE e.blog_id = t.blog_id AND t.user_id = <user id>
GROUP BY e.item
ORDER BY score DESC LIMIT 20

TE     o1    o2    o3
u1    2.4   4.0   0.0
u2    1.8   2.2   2.4
u3    1.0   1.0   2.0
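
As a rough check of Q1 against the toy matrices, the sketch below loads the non-zero entries of E and T into an in-memory SQLite database and runs the query per user. SQLite is only a stand-in assumed here for convenience (the experiments use MySQL), and the helper name personal_aggregate is hypothetical.

```python
import sqlite3

# Toy data from the slides: non-zero entries of E(b, o) and T(u, b).
E = {"b1": {"o1": 3, "o2": 2}, "b2": {"o2": 3},
     "b3": {"o1": 1, "o3": 1}, "b4": {"o1": 1, "o2": 2, "o3": 3}}
T = {"u1": {"b1": 0.8, "b2": 0.8},
     "u2": {"b1": 0.2, "b2": 0.2, "b3": 0.6, "b4": 0.6},
     "u3": {"b3": 0.5, "b4": 0.5}}

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of MySQL
conn.execute("CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL)")
conn.execute("CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL)")
conn.executemany("INSERT INTO Endorsement VALUES (?, ?, ?)",
                 [(b, o, s) for b, row in E.items() for o, s in row.items()])
conn.executemany("INSERT INTO Trust VALUES (?, ?, ?)",
                 [(u, b, s) for u, row in T.items() for b, s in row.items()])

Q1 = """
SELECT e.item, SUM(t.score * e.score) AS score
FROM Endorsement e, Trust t
WHERE e.blog_id = t.blog_id AND t.user_id = ?
GROUP BY e.item
ORDER BY score DESC LIMIT 20
"""

def personal_aggregate(user_id):
    """Run Q1 for one user and return [(item, score), ...], best first."""
    return conn.execute(Q1, (user_id,)).fetchall()

for user in ("u1", "u2", "u3"):
    print(user, [(item, round(score, 2)) for item, score in personal_aggregate(user)])
```

Because zero entries are not stored as rows, an item that none of a user's trusted blogs endorse (o3 for u1) simply does not appear in that user's result.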
7
Two baseline approaches
OTF (on-the-fly)
Maintain the two tables and compute the weighted sum for each personal aggregate query on the fly
High query cost
VIEW
Pre-compute the results for every user and store them as views
High update cost
[Figure: OTF vs. VIEW]
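
The trade-off between the two baselines can be sketched in a few lines. The classes below are a hypothetical in-memory illustration, not the authors' implementation: each update method returns the number of stored cells it writes, which is only a crude proxy for update cost.

```python
from collections import defaultdict

class OTF:
    """On-the-fly: cheap updates, every query recomputes the weighted sum."""
    def __init__(self, trust):
        self.trust = trust                                       # user -> {blog: weight}
        self.endorse = defaultdict(lambda: defaultdict(float))   # blog -> {item: score}

    def update(self, blog, item, delta):
        self.endorse[blog][item] += delta                        # one write per update
        return 1

    def query(self, user, k=20):
        scores = defaultdict(float)
        for blog, weight in self.trust[user].items():            # scan all trusted blogs
            for item, score in self.endorse[blog].items():
                scores[item] += weight * score
        return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

class VIEW:
    """Materialized views: cheap queries, every update fans out to all readers."""
    def __init__(self, trust):
        self.readers = defaultdict(dict)                         # blog -> {user: weight}
        for user, row in trust.items():
            for blog, weight in row.items():
                self.readers[blog][user] = weight
        self.view = defaultdict(lambda: defaultdict(float))      # user -> {item: score}

    def update(self, blog, item, delta):
        for user, weight in self.readers[blog].items():          # one write per reader
            self.view[user][item] += weight * delta
        return len(self.readers[blog])

    def query(self, user, k=20):
        return sorted(self.view[user].items(), key=lambda kv: -kv[1])[:k]

# Toy data from the matrix formulation slide.
trust = {"u1": {"b1": 0.8, "b2": 0.8},
         "u2": {"b1": 0.2, "b2": 0.2, "b3": 0.6, "b4": 0.6},
         "u3": {"b3": 0.5, "b4": 0.5}}
updates = [("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3), ("b3", "o1", 1),
           ("b3", "o3", 1), ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3)]

otf, view = OTF(trust), VIEW(trust)
writes_otf = sum(otf.update(*u) for u in updates)
writes_view = sum(view.update(*u) for u in updates)
print("writes:", writes_otf, writes_view)
print(otf.query("u2"), view.query("u2"))
```

Even on this toy input VIEW performs twice as many writes as OTF, and the gap grows with the number of readers per blog, while OTF pays instead at query time by scanning every trusted blog.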
8
Best of both worlds
Identify “template” users: typical users interested in sports / politics / technology / ...
Results of template users are pre-computed
Results of individual users are combined from partially computed results
9
Trust matrix decomposition
The trust matrix reflects users' interests
Decompose T into two sub-matrices W and H using Non-negative Matrix Factorization (NMF)
W: <individual users : template users> relationship
H: <template users : blogs> relationship
User 2's trust vector is expressed as a linear combination of the trust vectors of template users 1 and 2
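
For the toy trust matrix, a two-template decomposition can be written down by hand. In the sketch below, W and H are assumed values chosen for illustration (they are not produced by an NMF solver and are not taken from the slide's figure). The toy T happens to have rank 2, so the factorization is exact; on real data NMF only gives an approximation.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def close(A, B, tol=1e-9):
    """Element-wise comparison that tolerates floating-point round-off."""
    return all(abs(x - y) <= tol for ra, rb in zip(A, B) for x, y in zip(ra, rb))

# Toy matrices from the slides (T: users x blogs, E: blogs x objects).
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]
E = [[3, 2, 0],
     [0, 3, 0],
     [1, 0, 1],
     [1, 2, 3]]

# Assumed templates: template 1 reads b1 and b2, template 2 reads b3 and b4.
H = [[0.8, 0.8, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
# Each row of W mixes the two templates; u2 = 0.25 * template 1 + 1.2 * template 2.
W = [[1.0, 0.0],
     [0.25, 1.2],
     [0.0, 1.0]]

HE = matmul(H, E)           # precomputed once for the template users
personal = matmul(W, HE)    # combined per request for individual users

print(close(matmul(W, H), T))             # W * H reproduces T
print(close(personal, matmul(T, E)))      # W * (HE) reproduces TE
print([round(x, 2) for x in personal[1]]) # u2's personalized scores
```

The point of the regrouping is that HE has one row per template user rather than one per individual user, so far fewer rows need to be kept up to date when endorsements change.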
10
Reconstruction of results
PersonalizedEndorsement scores of template users are precomputed; results of individual users are computed on request
(HE) is maintained as sorted lists for all template users
W * (HE) is the personal aggregation result, computed using the Threshold Algorithm
Top-K list: the rows of (HE) are sorted lists, and W * (HE) is a weighted linear combination of them
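
A minimal sketch of a Threshold Algorithm pass over the sorted (HE) lists follows. It assumes non-negative weights and scores (so the weighted sum is monotone), treats items missing from a list as score 0, and uses in-memory dictionaries for random access. The function name and data layout are hypothetical; the original implementation loads template tables from MySQL.

```python
import heapq

def threshold_top_k(weights, sorted_lists, k):
    """Top-k items by sum_j weights[j] * score_j(item), in the style of Fagin's TA.

    sorted_lists[j] is [(item, score), ...] in descending score order for
    template user j. Returns (top_k, depth), where depth is the number of
    rows read from each list before stopping.
    """
    lookup = [dict(lst) for lst in sorted_lists]     # random access by item
    seen = {}                                        # item -> full aggregate score
    depth = 0
    max_depth = max((len(lst) for lst in sorted_lists), default=0)
    while depth < max_depth:
        # Sorted access: read one more entry from every list.
        for lst in sorted_lists:
            if depth < len(lst):
                item = lst[depth][0]
                if item not in seen:
                    seen[item] = sum(w * d.get(item, 0.0)
                                     for w, d in zip(weights, lookup))
        # Threshold: the best score any still-unseen item could reach.
        threshold = sum(w * (lst[depth][1] if depth < len(lst) else 0.0)
                        for w, lst in zip(weights, sorted_lists))
        depth += 1
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth                        # early stop
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth

# Sorted (HE) lists for two assumed template users, and u2's row of W.
he_lists = [[("o2", 4.0), ("o1", 2.4), ("o3", 0.0)],
            [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)]]
w_u2 = [0.25, 1.2]

top1, rows_read = threshold_top_k(w_u2, he_lists, k=1)
print(top1, rows_read)
```

On this three-item example the early stop saves only one row per list; the benefit comes from long lists, where the top-K is usually settled after reading a small prefix.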
11
Partition of trust matrix
Decomposition is useful when the matrix is dense
Real-life data is skewed
Hybrid method: use decomposition only when it is effective
[Figure: trust matrix with users ordered by number of subscriptions and blogs ordered by number of subscribers, partitioned into regions labeled 1. OTF, 2. VIEW, 3. NMF. The dense region (users with >30 subscriptions, feeds with >30 subscribers) covers 10k feeds, 24k users and ~1M subscription pairs; another label reads 2.7M subscription pairs]
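
Selecting the dense block can be sketched as a simple degree filter over subscription pairs. The function below is a hypothetical illustration: it only separates the dense region from the rest, using the >30 thresholds from the slide as the default, and does not say how the remaining regions are assigned to the other methods.

```python
from collections import Counter

def dense_region(pairs, min_degree=30):
    """Split (user, feed) subscription pairs into a dense block and the rest.

    A pair counts as dense when its user has more than min_degree
    subscriptions and its feed has more than min_degree subscribers.
    """
    subs_per_user = Counter(user for user, _ in pairs)
    subs_per_feed = Counter(feed for _, feed in pairs)
    dense, sparse = [], []
    for user, feed in pairs:
        heavy = subs_per_user[user] > min_degree and subs_per_feed[feed] > min_degree
        (dense if heavy else sparse).append((user, feed))
    return dense, sparse

# Tiny synthetic example with a lowered threshold so the split is visible.
pairs = [(u, f) for u in ("A", "B", "C") for f in ("f1", "f2", "f3")]
pairs.append(("D", "f1"))            # a light user with a single subscription
dense, sparse = dense_region(pairs, min_degree=2)
print(len(dense), sparse)
```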
12
Experiments
Bloglines.com: online RSS reader
Trust matrix T (1-0 version): subscription profile
91,366 users
487,694 RSS feeds
Endorsement matrix E : blog - keywords occurrence
Feed content collected between Nov 2006 and Jul 2007
Keywords filtered by nouns and high tf-idf values in entries
Platform
Python implementation of proposed scheme
MySQL server on Linux with data on a RAID disk
13
How different is personalization?
Week of 2007 Jan 7 – 2007 Jan 13; major event: iPhone released
Personal aggregation results differ from global aggregation

Global        User 90439   User 90550   User 91017
sales         cattle       brazil       yorker
iphone        beef         iguazu       iraq
apple         iphone       reuters      bush
manager       chicago      search       president
iraq          iraq         vegas        views
management    bush         argentina    avenue
development   apple        kibbutz      dept
software      companies    video        troops
business      prices       cathartik    saddam
phone         quarter      google       iran
14
How different is personalization?
Overlap comparison of global aggregation and personal aggregation
LG: global top 20 items
Li: individual top 20 items of user i
Personal aggregation results also differ among users
[Figures: overlap degree with the global aggregation result, |LG ∩ Li|; pair-wise overlap among users, |Li ∩ Lj|]
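
The two overlap measures reduce to set intersections over top-K lists. The sketch below uses short five-item lists made up for illustration (they are not the actual top-20 lists from the experiment), and the function names are hypothetical.

```python
def overlap(list_a, list_b):
    """Number of items two top-K lists share, ignoring rank order."""
    return len(set(list_a) & set(list_b))

def mean_pairwise_overlap(lists):
    """Average |Li ∩ Lj| over all unordered pairs of users."""
    pairs = [(i, j) for i in range(len(lists)) for j in range(i + 1, len(lists))]
    return sum(overlap(lists[i], lists[j]) for i, j in pairs) / len(pairs)

# Illustrative 5-item lists only.
L_G = ["sales", "iphone", "apple", "manager", "iraq"]
users = {
    "90439": ["cattle", "beef", "iphone", "chicago", "iraq"],
    "90550": ["brazil", "iguazu", "reuters", "search", "vegas"],
    "91017": ["yorker", "iraq", "bush", "president", "views"],
}

for uid, L_i in users.items():
    print(uid, overlap(L_G, L_i))                 # |LG ∩ Li|
print(mean_pairwise_overlap(list(users.values())))  # mean |Li ∩ Lj|
```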
15
Approximation accuracy
Dense region of the subscription matrix
>30 subscribers: 10,152 feeds
>30 subscriptions: 24,340 users
L2 norm comparison
Sparsity of W (23%), H (13%)
NMF approximation is close to SVD, with a sparseness advantage

Rank    SVD      NMF
80      848.5    856.9
90      841.6    850.1
100     835.1    844.6
110     829.0    837.9
120     823.2    833.0
16
Approximation accuracy
How many items in the top 20 list are approximated by NMF?
Ti: top 20 items of user i computed by OTF
Ai: top 20 items of user i computed by NMF
About 70% of items are approximated, with higher accuracy for higher-ranked items
[Figures: approximation accuracy |Ai ∩ Ti| / |Ti|; correlation with rank]
17
Efficiency of proposed method
Update cost: OTF (222K) < NMF (3.2M) < VIEW (23.6M)
Query response time: averaged over the 1000 users with the highest number of subscriptions
OTF: execute SQL query Q1 on the MySQL server
NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server to load the NMF template users' tables
Average query response time reduced by 75%; outliers with significant delay eliminated

Method   avg      std      max       min
OTF      2.05s    3.60s    84.42s    0.037s
NMF      0.46s    0.53s    2.84s     0.007s
18
Conclusion and future work
Deliver tailored results to users by personal aggregation
Proposed a model for personal aggregate queries
Optimization by NMF and the Threshold Algorithm
A real-life dataset study shows that query response time can be reduced significantly with acceptable approximation accuracy
Future work: handle updates when the trust matrix changes; parallelism; better phrase extraction (e.g. opinion orientation)
19
Thank you!
Q and A
20
Threshold algorithm
Proposed by Fagin et al. [2001]
Efficient computation of top-K items from multiple lists with a monotone aggregate function
[Figure labels: users, user groups, blogs]
21
Illustration of matrix partition
[Figure: feeds ordered by number of subscribers and users ordered by number of subscriptions; example labels: 2 subscriptions, 8 subscriptions, 2 subscribers, 9 subscribers]