hybrid filtering. computational journalism week 6
Jonathan Stray, Columbia University, Fall 2015
Syllabus at http://www.compjournalism.com/?p=133
![Page 1: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/1.jpg)
Frontiers of Computational Journalism
Columbia Journalism School
Week 6: Hybrid Filtering
October 16, 2015
![Page 2: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/2.jpg)
Filtering Comments
Thousands of comments, what are the “good” ones?
![Page 3: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/3.jpg)
Comment voting
Problem: putting comments with most votes at top doesn’t work. Why?
![Page 4: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/4.jpg)
Reddit Comment Ranking (old)
Up – down votes plus time decay
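A minimal sketch of "net votes plus time decay". The slide gives no formula, so the log scaling of the net vote count and the 45000-second decay constant are illustrative assumptions, and `hot_score` is a hypothetical name.

```python
import math

def hot_score(ups, downs, age_seconds, decay_seconds=45000.0):
    """Net votes (log-scaled) minus a penalty that grows with age.

    decay_seconds is an assumed constant: each additional decay_seconds
    of age costs as much as a tenfold drop in net votes.
    """
    net = ups - downs
    sign = (net > 0) - (net < 0)                 # +1, 0 or -1
    magnitude = math.log10(max(abs(net), 1))     # 10 votes -> 1, 100 votes -> 2
    return sign * magnitude - age_seconds / decay_seconds

# A fresh comment with 10 net votes ties with a 12.5-hour-old one with 100.
print(hot_score(10, 0, age_seconds=0))
print(hot_score(100, 0, age_seconds=45000))
```

With a score like this, early comments collect votes first and stay near the top, which is one reason raw vote counts are a poor ranking signal.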
![Page 5: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/5.jpg)
Reddit Comment Ranking (new)
Hypothetically, suppose all users voted on the comment, and v out of N up-voted. Then we could sort by proportion p = v/N of upvotes.
N = 16, v = 11, p = 11/16 = 0.6875
![Page 6: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/6.jpg)
Reddit Comment Ranking
Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n
n = 3, v’ = 1, p’ = 1/3 ≈ 0.333
![Page 7: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/7.jpg)
Reddit Comment Ranking
With too few votes, sampling error can rank comments in the wrong order.
p’ = 0.333, p = 0.6875
p’ = 0.75, p = 0.1875
![Page 8: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/8.jpg)
Random error in sampling
If we observe a proportion p’ of upvotes from n random users, what is the distribution of the true proportion p?
Distribution of p’ when p=0.5
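The distribution of p’ can be sketched numerically. This example assumes each of the n voters upvotes independently with probability p (a binomial model, which ignores the finite population N); the function name and n = 10 are illustrative choices.

```python
from math import comb

def observed_proportion_distribution(n, p):
    """Map each possible observed proportion k/n to its probability."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

dist = observed_proportion_distribution(n=10, p=0.5)
for p_observed, probability in dist.items():
    print(f"p' = {p_observed:.1f}  probability = {probability:.4f}")
```

Even when the true p is 0.5, a sample of ten voters lands on exactly 0.5 only about a quarter of the time, so small samples give noisy rankings.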
![Page 9: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/9.jpg)
Confidence interval
Given observed p’, the interval that the true p has probability α of lying inside.
![Page 10: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/10.jpg)
Rank comments by lower bound of confidence interval
p’ = observed proportion of upvotes
n = how many people voted
zα = how certain we want to be before we assume that p’ is “close” to the true p
Analytic solution for confidence interval, known as “Wilson score”
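The lower bound of the Wilson score interval can be computed directly from p’, n and zα. The sketch below uses the standard Wilson formula; the function name, the default z = 1.96 (roughly 95% confidence) and the zero-vote convention are assumptions made for illustration.

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for an upvote proportion."""
    if n == 0:
        return 0.0                      # no votes: assume the worst (our convention)
    p_obs = upvotes / n                 # p' on the slide
    z2 = z * z
    centre = p_obs + z2 / (2 * n)
    margin = z * math.sqrt(p_obs * (1 - p_obs) / n + z2 / (4 * n * n))
    return max(0.0, (centre - margin) / (1 + z2 / n))

# Same observed proportion, very different certainty.
print(wilson_lower_bound(1, 3))
print(wilson_lower_bound(100, 300))
```

Sorting by this value pushes comments with few votes down until enough votes accumulate to narrow the interval.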
![Page 11: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/11.jpg)
User-item matrix
Stores “rating” of each user for each item. Could also be binary variable that says whether user clicked, liked, starred, shared, purchased...
![Page 12: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/12.jpg)
User-item matrix
• No content analysis. We know nothing about what is “in” each item.
• Typically very sparse – a user hasn’t watched even 1% of all movies.
• Filtering problem is guessing “unknown” entry in matrix. High guessed values are things user would want to see.
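One way to hold a sparse user-item matrix is a dictionary of dictionaries that stores only the known ratings. The users, movies and ratings below are invented for illustration.

```python
# Only known ratings are stored; everything else is an "unknown" entry.
ratings = {
    "alice": {"Alien": 5, "Brazil": 3},
    "bob": {"Alien": 4, "Casablanca": 2},
    "carol": {"Brazil": 4, "Casablanca": 5},
}

items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

def unknown_entries(ratings, items):
    """List the (user, item) cells that a filter would have to guess."""
    return [
        (user, item)
        for user in sorted(ratings)
        for item in items
        if item not in ratings[user]
    ]

def sparsity(ratings, items):
    """Fraction of matrix cells with no rating."""
    total_cells = len(ratings) * len(items)
    return len(unknown_entries(ratings, items)) / total_cells

print(unknown_entries(ratings, items))
print(sparsity(ratings, items))
```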
![Page 13: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/13.jpg)
Filtering process
![Page 14: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/14.jpg)
How to guess unknown rating?
Basic idea: suggest “similar” items. Similar items are rated in a similar way by many different users. Remember, “rating” could be a click, a like, a purchase.
o “Users who bought A also bought B...”
o “Users who clicked A also clicked B...”
o “Users who shared A also shared B...”
![Page 15: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/15.jpg)
Similar items
![Page 16: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/16.jpg)
Item similarity
Cosine similarity!
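A minimal sketch of cosine similarity between items, treating each item as a vector of ratings with one entry per user. Filling unrated cells with 0 is a simplifying assumption, and the rating vectors are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # an unrated item is similar to nothing
    return dot / (norm_a * norm_b)

# One entry per user; 0 means "not rated" (assumption).
item_a = [5, 4, 0, 1]
item_b = [4, 5, 0, 2]
item_c = [0, 1, 5, 4]

print(cosine_similarity(item_a, item_b))   # rated alike by the same users
print(cosine_similarity(item_a, item_c))   # liked by different users
```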
![Page 17: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/17.jpg)
Other distance measures
“Adjusted cosine similarity”
Subtracts average rating for each user, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”)
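A sketch of adjusted cosine similarity under one common definition: subtract each user's mean rating, then take the cosine over users who rated both items. The two users below are invented to show a harsh rater and a generous rater with the same relative preferences.

```python
import math

def adjusted_cosine(ratings, item_a, item_b):
    """Cosine similarity of two items after removing each user's mean rating."""
    numerator = sum_sq_a = sum_sq_b = 0.0
    for user_ratings in ratings.values():
        if item_a in user_ratings and item_b in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            dev_a = user_ratings[item_a] - mean
            dev_b = user_ratings[item_b] - mean
            numerator += dev_a * dev_b
            sum_sq_a += dev_a * dev_a
            sum_sq_b += dev_b * dev_b
    if sum_sq_a == 0 or sum_sq_b == 0:
        return 0.0
    return numerator / math.sqrt(sum_sq_a * sum_sq_b)

ratings = {
    "most_movies_suck": {"A": 2, "B": 2, "C": 1},
    "most_movies_are_great": {"A": 5, "B": 5, "C": 4},
}

print(adjusted_cosine(ratings, "A", "B"))   # both users rate A and B above their own mean
print(adjusted_cosine(ratings, "A", "C"))   # both users rate C below their own mean
```

After mean-centering, the two users agree exactly even though their raw scores differ by three points.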
![Page 18: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/18.jpg)
Generating a recommendation
Weighted average of item ratings by their similarity.
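The weighted average can be sketched as follows: the predicted rating for an unseen item is the user's known ratings, each weighted by how similar that item is to the target. Normalising by the sum of absolute similarities is a common choice assumed here; the similarity values are invented.

```python
def predict_rating(user_ratings, similarity_to_target):
    """Similarity-weighted average of the ratings a user has already given.

    user_ratings: {item: rating} for items the user has rated.
    similarity_to_target: {item: similarity between that item and the target}.
    Returns None when no rated item has a known, non-zero similarity.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for item, rating in user_ratings.items():
        similarity = similarity_to_target.get(item, 0.0)
        weighted_sum += similarity * rating
        weight_total += abs(similarity)
    if weight_total == 0:
        return None
    return weighted_sum / weight_total

# The target item closely resembles A (rated 5) and barely resembles B (rated 2).
print(predict_rating({"A": 5, "B": 2}, {"A": 0.9, "B": 0.1}))
```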
![Page 19: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/19.jpg)
Matrix factorization recommender
![Page 20: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/20.jpg)
Matrix factorization recommender
![Page 21: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/21.jpg)
Matrix factorization plate model
[Plate diagram. r: user rating of item; u: topics for user; v: topics for item; λu: variation in user topics; λv: variation in item topics. Plates: i users, j items.]
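One common reading of this model is regularised least squares: choose user vectors u and item vectors v so that u·v matches each known rating r, with λu and λv penalising large vectors. The sketch below fits that objective by stochastic gradient descent; the learning rate, number of factors, epoch count and toy ratings are all assumptions.

```python
import math
import random

def factorize(ratings, n_users, n_items, k=2, lam_u=0.05, lam_v=0.05,
              lr=0.02, epochs=500, seed=0):
    """Fit U (users x k) and V (items x k) so that U[i] . V[j] approximates r."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - sum(a * b for a, b in zip(U[i], V[j]))
            for f in range(k):
                u_f, v_f = U[i][f], V[j][f]
                # Gradient step on squared error plus the λ penalties.
                U[i][f] += lr * (err * v_f - lam_u * u_f)
                V[j][f] += lr * (err * u_f - lam_v * v_f)
    return U, V

def predict(U, V, i, j):
    """Guessed rating of item j by user i, known or unknown."""
    return sum(a * b for a, b in zip(U[i], V[j]))

def rmse(ratings, U, V):
    """Root mean squared error over the known ratings."""
    return math.sqrt(sum((r - predict(U, V, i, j)) ** 2 for i, j, r in ratings) / len(ratings))

# (user index, item index, rating): 4 users, 3 items, 4 cells unknown.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 1, 5), (2, 2, 2), (3, 0, 1), (3, 2, 5)]

U, V = factorize(ratings, n_users=4, n_items=3)
print("training RMSE:", rmse(ratings, U, V))
print("guess for user 0, item 2:", predict(U, V, 0, 2))
```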
![Page 22: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/22.jpg)
Combining collaborative filtering and topic modeling
![Page 23: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/23.jpg)
Content modeling - LDA
[Plate diagram. Labels: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: K topics, D docs, N words in doc.]
![Page 24: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/24.jpg)
Collaborative Topic Modeling
[Plate diagram. Labels: K topics, topic for word, word in doc, topics in doc (content), topic concentration, weight of user selections, variation in per-user topics, topics for user, user rating of doc, topics in doc (collaborative).]
![Page 25: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/25.jpg)
content only
content + social
![Page 26: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/26.jpg)
Different Filtering Systems
Content: Newsblaster analyzes the topics in the documents. No concept of users.
Social: What I see on Twitter is determined by who I follow. Reddit comments are filtered using votes as input. Amazon: "people who bought X also bought Y". No content analysis.
Hybrid: Recommend based on both content and user behavior.
![Page 27: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/27.jpg)
Item Content: text analysis, topic modeling, clustering...
My Data: who I follow, what I’ve read/liked
Other Users’ Data: social network structure, other users’ likes
![Page 28: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/28.jpg)
How to evaluate/optimize?
![Page 29: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/29.jpg)
How to evaluate/optimize?
• Netflix: try to predict the rating that the user gives a movie after watching it.
• Amazon: sell more stuff.
• Google web search: human raters A/B test every change
![Page 30: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/30.jpg)
How to evaluate/optimize?
• Does the user understand how the filter works?
• Can they configure it as desired?
• Can they correctly predict what they will and won't see?
![Page 31: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/31.jpg)
How to evaluate/optimize?
• Can it be gamed? Spam, "user-generated censorship," etc.
![Page 32: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/32.jpg)
"During the 2012 election, the ~2000 members of an anti-Ron Paul subreddit discovered that anything they posted, anywhere on reddit, was being rapidly, repeatedly downvoted. They created a diagnostic subreddit and began posting otherwise meaningless text to verify this otherwise odd behavior."
![Page 33: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/33.jpg)
Filter design problem
Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S,U,{P},{B}) in [0...1], the relevance of story S to user U
![Page 34: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/34.jpg)
Filter design problem, restated
When should a user see a story? Aspects to this question:
normative: personal (what I want) and societal (emergent group effects)
UI: how do I tell the computer what I want?
technical: constrained by algorithmic possibility
economic: cheap enough to deploy widely
![Page 35: Hybrid Filtering. Computational Journalism week 6](https://reader031.vdocuments.mx/reader031/viewer/2022032306/563db85c550346aa9a92f9f7/html5/thumbnails/35.jpg)
How to evaluate/optimize?
Does it improve the user's life?