Frontiers of Computational Journalism
Columbia Journalism School
Week 6: Hybrid Filtering
October 16, 2015
Filtering Comments
Thousands of comments, what are the “good” ones?
Comment voting
Problem: putting comments with most votes at top doesn’t work. Why?
Reddit Comment Ranking (old)
Upvotes minus downvotes, plus time decay
Reddit Comment Ranking (new)
Hypothetically, suppose all users voted on the comment, and v out of N up-voted. Then we could sort by proportion p = v/N of upvotes.
N = 16, v = 11, p = 11/16 = 0.6875
Reddit Comment Ranking
Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n
n = 3, v’ = 1, p’ = 1/3 ≈ 0.333
Reddit Comment Ranking
Limited sampling can rank votes wrong when we don’t have enough data.
Comment 1: p’ = 0.333, but true p = 0.6875
Comment 2: p’ = 0.75, but true p = 0.1875
Random error in sampling
If we observe a proportion p’ of upvotes among n random users, what is the distribution of the true proportion p?
Distribution of p’ when p=0.5
Confidence interval
Given observed p’, an interval within which the true p lies with probability α.
Rank comments by lower bound of confidence interval
p’ = observed proportion of upvotes
n = how many people voted
zα = how certain we want to be before we assume that p’ is “close” to the true p
Analytic solution for the confidence interval, known as the “Wilson score interval”
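A minimal sketch of ranking by the lower bound of the Wilson score interval (the function name and toy numbers are my own, not Reddit's actual code):

```python
import math

def wilson_lower_bound(v, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote
    proportion p, given v upvotes out of n total votes.
    z = 1.96 corresponds to ~95% confidence."""
    if n == 0:
        return 0.0
    p = v / n                      # observed proportion p'
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - spread) / denom

# The under-sampled comment from the slides ranks cautiously low:
print(wilson_lower_bound(1, 3))    # ≈ 0.06
print(wilson_lower_bound(11, 16))  # ≈ 0.44
```

Sorting comments by this lower bound keeps a comment with few votes from outranking one with many, even if its observed proportion is higher.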
User-item matrix
Stores “rating” of each user for each item. Could also be binary variable that says whether user clicked, liked, starred, shared, purchased...
User-item matrix
• No content analysis. We know nothing about what is “in” each item.
• Typically very sparse – a user hasn’t watched even 1% of all movies.
• The filtering problem is guessing the “unknown” entries in the matrix. High guessed values are things the user would want to see.
Filtering process
How to guess unknown rating?
Basic idea: suggest “similar” items. Similar items are rated in a similar way by many different users. Remember, “rating” could be a click, a like, a purchase.
o “Users who bought A also bought B...”
o “Users who clicked A also clicked B...”
o “Users who shared A also shared B...”
Similar items
Item similarity
Cosine similarity!
Other distance measures
“Adjusted cosine similarity” subtracts the average rating for each user, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”).
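A sketch of adjusted cosine similarity (toy data and function name are my own; the convention that 0 means "unrated" is an assumption):

```python
import numpy as np

# Toy user-item rating matrix: rows = users, cols = items, 0 = unrated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

def adjusted_cosine(R, i, j):
    """Adjusted cosine similarity between item columns i and j:
    subtract each user's mean rating (over the items they rated),
    then take cosine over users who rated both items."""
    both = (R[:, i] > 0) & (R[:, j] > 0)
    if not both.any():
        return 0.0
    # each user's mean rating over the items they actually rated
    user_means = np.array([row[row > 0].mean() if (row > 0).any() else 0.0
                           for row in R])
    a = R[both, i] - user_means[both]
    b = R[both, j] - user_means[both]
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

print(adjusted_cosine(R, 0, 3))   # negative: enthusiasts for one dislike the other
```

On this toy matrix, items 0 and 3 come out strongly negatively correlated once each user's general enthusiasm is removed.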
Generating a recommendation
Weighted average of the user’s ratings on other items, weighted by item similarity.
Matrix factorization recommender
Matrix factorization plate model
[Plate diagram: for each of i users, a topic vector u (“topics for user”, with variation λu); for each of j items, a topic vector v (“topics for item”, with variation λv); r = user rating of item.]
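The factorization in the plate model can be sketched with plain stochastic gradient descent: learn a topic vector u for each user and v for each item so that r ≈ u·v, with regularization λ on both. The toy data, rank k = 2, and hyperparameters below are my own assumptions, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item matrix; 0 means "unknown rating".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k = 2                                          # number of latent "topics"
U = 0.1 * rng.standard_normal((n_users, k))    # topics for each user (u)
V = 0.1 * rng.standard_normal((n_items, k))    # topics for each item (v)
lam, lr = 0.02, 0.01                           # regularization (λ) and step size

# SGD on the squared error of the *observed* entries only.
for epoch in range(2000):
    for i in range(n_users):
        for j in range(n_items):
            if R[i, j] > 0:
                err = R[i, j] - U[i] @ V[j]
                u_old = U[i].copy()
                U[i] += lr * (err * V[j] - lam * U[i])
                V[j] += lr * (err * u_old - lam * V[j])

# Every entry of U @ V.T is now a guessed rating, including the unknowns.
R_hat = U @ V.T
```

The payoff is that the formerly empty cells of the matrix get filled in: high values in R_hat at unrated positions are the recommendations.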
Combining collaborative filtering and topic modeling
Content modeling – LDA
[Plate diagram: D docs, N words in each doc, K topics; topics in doc, a topic for each word, the observed word in doc, words in topics; topic concentration parameter and word concentration parameter.]
Collaborative Topic Modeling
[Plate diagram: the LDA content model (K topics, topics in doc, topic for word, word in doc, topic concentration) extended with a collaborative part: weight of user selections, variation in per-user topics, topics for user, and the observed user rating of each doc.]
[Figure: example recommendations using content only vs. content + social data.]
Different Filtering Systems
Content: Newsblaster analyzes the topics in the documents. No concept of users.
Social: what I see on Twitter is determined by who I follow; Reddit comments are filtered by votes; Amazon’s “people who bought X also bought Y” does no content analysis.
Hybrid: recommend based on both content and user behavior.
Item Content: text analysis, topic modeling, clustering...
My Data: who I follow, what I’ve read/liked
Other Users’ Data: social network structure, other users’ likes
How to evaluate/optimize?
• Netflix: try to predict the rating that the user gives a movie after watching it.
• Amazon: sell more stuff.
• Google web search: human raters; A/B tests of every change.

• Does the user understand how the filter works?
• Can they configure it as desired?
• Can they correctly predict what they will and won’t see?
How to evaluate/optimize?
• Can it be gamed? Spam, "user-generated censorship," etc.
How to evaluate/optimize?
“During the 2012 election, the ~2000 members of an anti-Ron Paul subreddit discovered that anything they posted, anywhere on reddit, was being rapidly, repeatedly downvoted. They created a diagnostic subreddit and began posting otherwise meaningless text to verify this otherwise odd behavior.”
Filter design problem
Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S, U, {P}, {B}) in [0...1] = relevance of story S to user U
Filter design problem, restated
When should a user see a story? Aspects to this question:
• normative: personal (what I want) vs. societal (emergent group effects)
• UI: how do I tell the computer what I want?
• technical: constrained by algorithmic possibility
• economic: cheap enough to deploy widely
How to evaluate/optimize?
Does it improve the user's life?