recommender systems, part 1 - introduction to approaches and algorithms

Recommender Systems

Ramzi Alqrainy Search Guru

[email protected]

- 1 -

Recommender Systems

♣ Applica;on areas

- 3 -

In the Social Web

- 4 -

Even more …

♣ Personalized search

♣ "Computa;onal adver;sing"

- 5 -

About the speaker

♣ Ramzi Alqrainy -‐  Ramzi is one of the well recognized experts within Ar8ficial Intelligence and

Informa8on Retrieval fields in Middle East. Ac8ve researcher and technology blogger

with the focus on informa8on retrieval.

- 6 -

Agenda ♣ What are recommender systems for? -‐ Introduc8on

♣ How do they work (Part I) ? -‐ Collabora8ve Filtering

♣ How to measure their success? -‐ Evalua8on techniques

♣ How do they work (Part II) ? -‐ Content-‐based Filtering

-‐ Knowledge-‐Based Recommenda8ons -‐ Hybridiza8on Strategies

♣ Advanced topics -‐ Explana8ons

-‐ Human decision making

- 7 -

Why using Recommender Systems? ♣ Value for the customer -‐ Find things that are interes8ng

-‐ Narrow down the set of choices -‐ Help me explore the space of op8ons -‐ Discover new things -‐ Entertainment -‐

…

♣ Value for the provider -‐ Addi8onal and probably unique personalized service for the customer

-‐ Increase trust and customer loyalty -‐ Increase sales, click trough rates, conversion etc. -‐ Opportuni8es for promo8on, persuasion -‐ Obtain more knowledge about customers -‐

…

- 9 -

Real-‐world check

♣ Myths from industry -‐ Amazon.com generates X percent of their sales through the recommenda8on

lists (30 < X < 70) -‐ NeVlix (DVD rental and movie streaming) generates X percent of their sales through the recommenda8on lists (30 < X < 70)

♣ There must be some value in it -‐ See recommenda8on of groups, jobs or people on LinkedIn

-‐ Friend recommenda8on and ad personaliza8on on Facebook -‐ Song recommenda8on at last.fm -‐ News recommenda8on at Forbes.com (plus 37% CTR)

♣ Academia

-‐ A few studies exist that show the effect ♣ increased sales, changes in sales behavior

- 10 -

Problem domain

♣ Recommenda;on systems (RS) help to match users with items -‐ Ease informa8on overload

-‐ Sales assistance (guidance, advisory, persuasion,…)

RS are so)ware agents that elicit the interests and preferences of individual consumers […] and make recommenda<ons accordingly. They have the poten<al to support and improve the quality of the decisions consumers make while searching for and selec<ng products online. » [Xiao & Benbasat, MISQ, 2007]

♣ Different system designs / paradigms -‐ Based on availability of exploitable data

-‐ Implicit and explicit user feedback -‐ Domain characteris8cs

- 11 -

Recommender systems

♣ RS seen as a func;on [AT05]

♣ Given:

-‐ User model (e.g. ra8ngs, preferences, demographics, situa8onal context) -‐ Items (with or without descrip8on of item characteris8cs)

♣ Find:

-‐ Relevance score. Used for ranking.

♣ Finally:

-‐ Recommend items that are assumed to be relevant

♣ But:

-‐ Remember that relevance might be context-‐dependent -‐ Characteris8cs of the list itself might be important (diversity)

- 12 -

Paradigms of recommender systems

Recommender systems reduce informa;on overload by es;ma;ng relevance

- 13 -


Personalized recommenda;ons

- 14 -


Collabora;ve: "Tell me what's popular among my peers"

- 15 -


Content-‐based: "Show me more of the same what I've liked"

- 16 -


Knowledge-‐based: "Tell me what fits based on my needs"

- 17 -


Hybrid: combina;ons of various inputs and/or composi;on of different mechanism

- 18 -

Recommender systems: basic techniques

Pros

Cons Collabora8ve

No knowledge-‐

Requires some form of ra8ng engineering effort,

feedback, cold start for new users serendipity of results,

and new items learns market segments

Content-‐based

No community required,

Content descrip8ons necessary, comparison between

cold start for new users, no items possible

surprises

Knowledge-‐based

Determinis8c recommenda8ons, assured quality, no cold-‐ start, can resemble sales dialogue

Knowledge engineering effort to bootstrap, basically sta8c, does not react to short-‐term trends

- 19 -

- 20 -

Collabora;ve Filtering (CF)

♣ The most prominent approach to generate recommenda;ons -‐ used by large, commercial e-‐commerce sites

-‐ well-‐understood, various algorithms and varia8ons exist -‐ applicable in many domains (book, movies, DVDs, ..)

♣ Approach

-‐ use the "wisdom of the crowd" to recommend items

♣ Basic assump;on and idea

-‐ Users give ra8ngs to catalog items (implicitly or explicitly) -‐ Customers who had similar tastes in the past, will have similar tastes in the future

- 21 -

User-‐based nearest-‐neighbor collabora;ve filtering (1)

♣ The basic technique: -‐ Given an "ac8ve user" (Alice) and an item I not yet seen by Alice

-‐ The goal is to es<mate Alice's ra<ng for this item, e.g., by ♣ find a set of users (peers) who liked the same items as Alice in the past and

who have rated item I ♣ use, e.g. the average of their ra8ngs to predict, if Alice will like item I ♣ do this for all items Alice has not seen and recommend the best-‐rated

Item1

Item2

Item3

Item4

Item5 Alice

5

3

4

4

? User1

3

1

2

3

3

User2

4

3

4

3

5

User3

3

3

1

5

4

User4

1

5

5

2

1 - 24 -

User-‐based nearest-‐neighbor collabora;ve filtering (2)

♣ Some first ques;ons -‐ How do we measure similarity?

-‐ How many neighbors should we consider? -‐ How do we generate a predic8on from the neighbors' ra8ngs?

Item1

Item2

Item3

Item4

Item5

Alice

5

3

4

4

? User1

3

1

2

3

3

User2

4

3

4

3

5

User3

3

3

1

5

4

User4

1

5

5

2

1

- 25 -

Measuring user similarity

♣ A popular similarity measure in user-‐based CF: Pearson correla;on

a, b : users ra,p

: ra8ng of user a for item p P

: set of items, rated both by a and b Possible similarity values between -‐1 and 1;

ͿԨԩԪԫԬԭԮԯՠֈ֍֎ࢽࢼࢻࢺࢹࢸࢷࢶࢴࢳࢲࢱࢰࢯࢮࢭࢡࡪࡩࡨࡧࡦࡥࡤࡣࡢࡡࡠ߿߾ׯॸঀৼ৽ͿԨԩԪԫԬԭԮԯՠֈ֍֎ࢽࢼࢻࢺࢹࢸࢷࢶࢴࢳࢲࢱࢰࢯࢮࢭࢡࡪࡩࡨࡧࡦࡥࡤࡣࡢࡡࡠ߿߾ׯॸঀৼ৽, ͿԨԩԪԫԬԭԮԯՠֈ֍֎ࢽࢼࢻࢺࢹࢸࢷࢶࢴࢳࢲࢱࢰࢯࢮࢭࢡࡪࡩࡨࡧࡦࡥࡤࡣࡢࡡࡠ߿߾ׯॸঀৼ৽ͿԨԩԪԫԬԭԮԯՠֈ֍֎ࢽࢼࢻࢺࢹࢸࢷࢶࢴࢳࢲࢱࢰࢯࢮࢭࢡࡪࡩࡨࡧࡦࡥࡤࡣࡢࡡࡠ߿߾ׯॸঀৼ৽

= user's average ra8ngs

Item1

Alice 5 User1 3

Item2 Item3 3 4 1 2

Item4 Item5 4 ? 3 3

sim = 0,85 sim = 0,70 sim = -‐0,79 User2

4

3

4

3

5

User3

3

3

1

5

4

User4

1

5

5

2

1

- 26 -

Pearson correla;on

♣ Takes differences in ra;ng behavior into account

6

5

4

Ratings 3

2

1

0 Item1 Item2

Alice

User1

User4

Item3 Item4

♣ Works well in usual domains, compared with alterna;ve measures -‐ such as cosine similarity

- 27 -

Making predic;ons

♣ A common predic8on func8on:

♣ Calculate, whether the neighbors' ra8ngs for the unseen item i are higher or lower than their average

♣ Combine the ra8ng differences -‐ use the similarity as a weight ♣ Add/subtract the neighbors' bias from the ac8ve user's average and use

this as a predic8on

- 28 -

Making recommenda;ons

♣ Making predic;ons is typically not the ul;mate goal

♣ Usual approach (in academia)

-‐ Rank items based on their predicted ra8ngs

♣ However

-‐ This might lead to the inclusion of (only) niche items -‐ In prac;ce also: Take item popularity into account

♣ Approaches

-‐ "Learning to rank" ♣ Op8mize according to a given rank evalua8on metric (see later)

- 29 -

Improving the metrics / predic;on func;on

♣ Not all neighbor ra;ngs might be equally "valuable" -‐ Agreement on commonly liked items is not so informa8ve as agreement on

controversial items -‐ Possible solu;on: Give more weight to items that have a higher variance

♣ Value of number of co-‐rated items

-‐ Use "significance weigh8ng", by e.g., linearly reducing the weight when the number of co-‐rated items is low

♣ Case amplifica;on -‐ Intui8on: Give more weight to "very similar" neighbors, i.e., where the

similarity value is close to 1.

♣ Neighborhood selec;on -‐ Use similarity threshold or fixed number of neighbors

- 30 -

Memory-‐based and model-‐based approaches

♣ User-‐based CF is said to be "memory-‐based" -‐ the ra8ng matrix is directly used to find neighbors / make predic8ons

-‐ does not scale for most real-‐world scenarios -‐ large e-‐commerce sites have tens of millions of customers and millions of items

♣ Model-‐based approaches -‐ based on an offline pre-‐processing or "model-‐learning" phase

-‐ at run-‐8me, only the learned model is used to make predic8ons -‐ models are updated / re-‐trained periodically -‐ large variety of techniques used -‐ model-‐building and upda8ng can be computa8onally expensive

- 31 -

Item-‐based collabora;ve filtering

♣ Basic idea: -‐ Use the similarity between items (and not users) to make predic8ons

♣ Example:

-‐ Look for items that are similar to Item5 -‐ Take Alice's ra8ngs for these items to predict the ra8ng for Item5

Item1

Item2

Item3

Item4

Item5

Alice

5

3

4

4

? User1

3

1

2

3

3

User2

4

3

4

3

5

User3

3

3

1

5

4

User4

1

5

5

2

1

- 33 -

The cosine similarity measure

♣ Produces beber results in item-‐to-‐item filtering -‐ for some datasets, no consistent picture in literature

♣ Ra;ngs are seen as vector in n-‐dimensional space ♣ Similarity is calculated based on the angle between the vectors

♣ Adjusted cosine similarity -‐ take average user ra8ngs into account, transform the original ra8ngs

- 34 -

Pre-‐processing for item-‐based filtering

♣ Item-‐based filtering does not solve the scalability problem itself

♣ Pre-‐processing approach by Amazon.com (in 2003)

-‐ Calculate all pair-‐wise item similari8es in advance -‐ The neighborhood to be used at run-‐8me is typically rather small, because only items are taken into account which the user has rated

-‐ Item similari8es are supposed to be more stable than user similari8es

♣ Memory requirements

-‐ Up to N2 pair-‐wise similari8es to be memorized (N = number of items) in theory

-‐ In prac8ce, this is significantly lower (items with no co-‐ra8ngs) -‐ Further reduc8ons possible ♣ Minimum threshold for co-‐ra8ngs (items, which are rated at least by n users)

♣ Limit the size of the neighborhood (might affect recommenda8on accuracy)

- 35 -

More on ra;ngs

♣ Pure CF-‐based systems only rely on the ra;ng matrix

♣ Explicit ra;ngs

-‐ Most commonly used (1 to 5, 1 to 7 Likert response scales) -‐ Research topics ♣ "Op8mal" granularity of scale; indica8on that 10-‐point scale is berer accepted in

movie domain ♣ Mul8dimensional ra8ngs (mul8ple ra8ngs per movie) -‐ Challenge

♣ Users not always willing to rate many items; sparse ra8ng matrices ♣ How to s8mulate users to rate more items? ♣ Implicit ra;ngs

-‐ clicks, page views, 8me spent on some page, demo downloads … -‐ Can be used in addi8on to explicit ones; ques8on of correctness of interpreta8on

- 36 -

Data sparsity problems

♣ Cold start problem -‐ How to recommend new items? What to recommend to new users?

♣ Straigheorward approaches

-‐ Ask/force users to rate a set of items -‐ Use another method (e.g., content-‐based, demographic or simply non-‐ personalized) in the ini8al phase

♣ Alterna;ves -‐ Use berer algorithms (beyond nearest-‐neighbor approaches)

-‐ Example: ♣ In nearest-‐neighbor approaches, the set of sufficiently similar neighbors might

be to small to make good predic8ons ♣ Assume "transi8vity" of neighborhoods

- 37 -

Example algorithms for sparse datasets

♣ Recursive CF -‐ Assume there is a very close neighbor n of u who however has not rated the

target item i yet. -‐ Idea: ♣ Apply CF-‐method recursively and predict a ra8ng for item i for the neighbor

♣ Use this predicted ra8ng instead of the ra8ng of a more distant direct neighbor

Item1

Item2

Item3

Item4

Item5

Alice

5

3

4

4

? sim = 0,85

User1

3

1

2

3

?

User2 4 3

4 3 5 Predict

User3 3 3 User4 1 5

1 5 4 ra8ng for User1

5 2 1 - 38 -

Graph-‐based methods

♣ "Spreading ac;va;on" (sketch) -‐ Idea: Use paths of lengths > 3

to recommend items -‐ Length 3: Recommend Item3 to User1 -‐ Length 5: Item1 also recommendable

- 39 -

More model-‐based approaches

♣ Plethora of different techniques proposed in the last years, e.g., -‐ Matrix factoriza8on techniques, sta8s8cs

♣ singular value decomposi8on, principal component analysis -‐ Associa8on rule mining

♣ compare: shopping basket analysis -‐ Probabilis8c models

♣ clustering models, Bayesian networks, probabilis8c Latent Seman8c Analysis -‐ Various other machine learning approaches

♣ Costs of pre-‐processing -‐ Usually not discussed

-‐ Incremental updates possible?

- 40 -

Matrix factoriza;on

• SVD:

Uk Dim1

Alice 0.47

M

k

Dim2 -‐0.30

T =U×Σ×V

k k k

T

Vk Dim1

-‐0.44 -‐0.57 0.06 0.38 0.57 Bob -‐0.44 0.23

Dim2 0.58 -‐0.66

0.26 0.18 -‐0.36 Mary 0.70 -‐0.06

Sue 0.31 0.93

Σ Dim1 Dim2

k

• Predic;on:

r

ui =

r +U (Alice)×Σ u k

T ×V (EPL)

k k

Dim1 5.63 0 Dim2 0 3.23

= 3 + 0.84 = 3.84

- 43 -

Associa;on rule mining

♣ Commonly used for shopping behavior analysis -‐ aims at detec8on of rules such as

"If a customer purchases baby-‐food then he also buys diapers in 70% of the cases"

♣ Associa;on rule mining algorithms -‐ can detect rules of the form X => Y (e.g., baby-‐food => diapers) from a set of

sales transac8ons D = {t1, t2, … tn} -‐ measure of quality: support, confidence

- 44 -

Probabilis;c methods

♣ Basic idea (simplis;c version for illustra;on): -‐ given the user/item ra8ng matrix

-‐ determine the probability that user Alice will like an item i -‐ base the recommenda8on on such these probabili8es

♣ Calcula;on of ra;ng probabili;es based on Bayes Theorem -‐ How probable is ra8ng value "1" for Item5 given Alice's previous ra8ngs?

-‐ Corresponds to condi8onal probability P(Item5=1 | X), where ♣ X = Alice's previous ra8ngs = (Item1 =1, Item2=3, Item3= … )

-‐ Can be es8mated based on Bayes' Theorem

♣ Usually more sophis;cated methods used

-‐ Clustering -‐ pLSA …

- 45 -

Summarizing recent methods

♣ Recommenda;on is concerned with learning from noisy observa;ons (x, y), where

f(x)=ŷ

2

has to be determined such that is minimal.

∑ ˆ

( ˆ -y)

♣ A variety of different learning strategies have been applied trying to es;mate f(x)

-‐ Non parametric neighborhood models -‐ MF models, SVMs, Neural Networks, Bayesian Networks,…

- 48 -

Collabora;ve Filtering Issues

♣ Pros: -‐ well-‐understood, works well in some domains, no knowledge engineering required

♣ Cons: -‐ requires user community, sparsity problems, no integra8on of other knowledge sources,

no explana8on of results

♣ What is the best CF method? -‐ In which situa8on and which domain? Inconsistent findings; always the same domains

and data sets; differences between methods are o|en very small (1/100)

♣ How to evaluate the predic;on quality? -‐ MAE / RMSE: What does an MAE of 0.7 actually mean?

-‐ Serendipity: Not yet fully understood

♣ What about mul;-‐dimensional ra;ngs?

- 49 -

- 50 -

Recommender Systems in e-‐Commerce

♣ One Recommender Systems research ques;on -‐ What should be in that list?

- 51 -


♣ Another ques;on both in research and prac;ce -‐ How do we know that these are good

recommenda8ons?

- 52 -


♣ This might lead to … -‐ What is a good recommenda8on?

-‐ What is a good recommenda8on strategy? -‐ What is a good recommenda8on strategy for my

business?

These have been in stock for quite a while now …

- 53 -

What is a good recommenda;on? What are the measures in prac;ce? ♣ Total sales numbers ♣ Promo;on of certain items ♣

…

♣ Click-‐through-‐rates ♣ Interac;vity on plaeorm ♣

…

♣ Customer return rates ♣ Customer sa;sfac;on and loyalty

- 54 -

You have Ques8ons and we have Answers

recommender systems, part 1 - introduction to approaches and algorithms

Technology