
  • 8/7/2019 Survey Recomender System Algorithm

    1/33

    Survey of Recommendation

    Systems and Algorithms

    Term Paper for

    EE 380L: DATA MINING

    Spring 2000

    By

    Yuan Qu

    Xiaoyun Yang

    Tianping Huang


    May 5, 2000

    Table of Contents

    I. Introduction

    II. Recommendation Systems

    III. Algorithms

    IV. Discussion

    V. Reference


    I. Introduction

    In daily life, we often make choices by relying on recommendations from other people,
    whether by word of mouth, recommendation letters, movie and book reviews printed in
    newspapers, or general surveys. In this information age, an enormous volume of news is
    published on the Internet each day. This creates a clear demand for automated methods
    that locate and retrieve information according to each user's individual interests.
    The growing number of people accessing the Internet also provides new possibilities
    for organizing and recommending information.

    Recommendation systems can assist and augment this natural social process.

    These systems can recommend items you are likely to want based on what you wanted in the past.

    The main purpose of the recommendation systems is to provide tools for people to

    leverage the information hunting and gathering activities of other people or groups of

    people. Recommendation systems have been an important application area and the focus

    of considerable recent academic and commercial interest.

    Recommendation systems are basically divided into two categories: content-based
    filtering and collaborative filtering (or social filtering). In a content-based
    filtering system, each user is assumed to operate independently. As a result, document
    representations in content-based filtering systems can exploit only information that
    can be derived from document contents. In a collaborative filtering system, the
    representation of a document is based on evaluations of that document made by its
    prior readers. The premise is that communities of shared interest can be identified
    automatically by exchanging this sort of information. In practice, a collaborative
    filtering system provides a basis for selecting information items regardless of
    whether their content can be represented in a way that is useful for selection. This
    paper focuses on collaborative filtering.

    Collaborative filtering was presented by the developers of the first recommendation
    system, Tapestry, in 1992 [Goldberg, et al. 1992]. Within several years, the concept
    had already been applied in dozens of publicly available systems, several proprietary
    systems, and even some commercially available systems. In 1996, dozens of researchers
    from academia and business gathered at UC-Berkeley to share their ideas and
    experiences about these emerging filtering methods [Collaborative Filtering workshop,
    1996]. They presented the vision and definition of collaborative filtering and
    described some applications of the technique. Today, a growing number of published
    articles demonstrate applications of collaborative filtering methods.

    This paper first surveys the recommendation systems available on the Internet. It then
    describes the characteristics of each system. Finally, it introduces the algorithms of
    some well-known recommendation systems in detail.

    II. Recommendation Systems

    There are a lot of recommendation systems on web sites. According to the

    purposes of their application, the recommendation systems can be classified into three

    categories [Resnick, 1997], shown in Figure 1.


    Figure 1. The recommendation systems categories

    The systems in the first category are used for recommending movies, music, videos, or
    other services. In this category, the database is relatively stable; like a population
    database, it may not change for years. Typical systems include EachMovie, Firefly, and
    Morse. The second category is used for news or articles in a newsgroup. The users in
    the newsgroup generally have similar goals or interests. The database is also
    relatively stable, although it may be updated every few weeks or sooner. Representative
    systems are Tapestry, GroupLens, and Lotus Notes. The last category is for web page
    recommendation. The information in this category is dynamic; that is, pages can be
    added to or deleted from the system at any time. At the same time, the users may have
    different tastes. Phoaks, GAB, and Fab are the most useful systems of this kind.

    A brief introduction to each recommendation system follows:

    [Figure 1 diagram: recommendation systems branch into three groups. Movies: EachMovie,
    Morse, Firefly, ... News or articles: Tapestry, GroupLens, Lotus Notes, ... Web pages:
    Phoaks, GAB, Fab, ...]


    Do-I-Care

    When a user revisits a favorite Web page, the Do-I-Care system [Turnbull, 1998;
    Collaborative Filtering workshop, 1996] alerts the user if the page has changed. The
    system uses a model-based algorithm built on Bayesian classifier technology. After some
    users have trained the model many times, other users can get good predictions.

    According to the report from Mark Ackerman (U. of California-Irvine) [Collaborative
    Filtering workshop, 1996], the accuracy of Do-I-Care can reach 70-90%. The system
    reportedly reaches 100% accuracy when tracking airline fare sales.

    Fab

    In a collaborative filtering system, when a new item or new user enters the system,
    the system has no basis for calculating the similarity between users, and it cannot
    consider the new item until some users have rated or recommended it. This is called
    the cold-start problem. Content-based filtering does not suffer from it. To mitigate
    the problem, the Fab recommendation system [Turnbull, 1998] combines collaborative and
    content-based filtering.

    The Fab system is a web-based recommendation service that incorporates both
    collaborative and content-based filtering methods. User profiles are constructed as a
    collection of keywords contained in the documents that each user rates highly.
    Documents are presented for rating when either the content of the document matches
    previous documents that were rated highly, or neighboring users rate a document highly.
    Every time a favorable or unfavorable rating is received, the profile of the user is
    updated to reflect the new rating.

    Collection agents are sent out over the web to look for documents with specific
    content, each agent using a different set of keywords. Retrieved documents are passed
    to a central server, where a selection agent matched to each user's profile scours the
    documents for interesting material. Relevant documents are then presented to the user
    for rating. This rating dynamically affects the selection agent's behavior and changes
    the user's profile. The rating also affects the collection agent that retrieved the
    document. Unpopular collection agents are removed and replaced with more successful
    ones over time.

    The Fab system combines the best features of both content-based and

    collaborative filtering methods and also manages to keep the system dynamically updated

    to the current users' tastes. One potential shortcoming is Fab's reliance on explicit user

    feedback.

    Firefly

    The system [Turnbull, 1997 and 1998] provides recommendations based on similarities
    between users. Initially, it was used for music and movie recommendation. It has since
    been extended to other media, such as newsgroups, books, and web pages.

    The system uses user profiles as input and applies a constrained Pearson algorithm to
    make predictions between users. The basic idea of the algorithm is: a) the system
    maintains a user profile, which records likes and dislikes of specific items; b) the
    system compares the similarities of users and decides which group of users the active
    user belongs to; and c) the system gives a recommendation based on the profiles of
    similar users.

    GAB

    GAB [Wittenburg, et al., 1998] stands for group asynchronous browsing. The GAB system
    collects and merges users' bookmark and hotlist files and then serves the merged files
    to users. This means the system can reach users' bookmarks and extract information from
    them, which raises privacy concerns. To address the privacy problem, the system
    provides a mechanism that lets each user mark his/her bookmarks as private or public.

    The system uses a multi-tree data structure for the bookmarks. To keep users from
    getting lost in hyperspace and to increase the connectivity of the merged subject-tree
    database, the system defines sibling and cousin relations. Items A and B are siblings
    if they belong to the same specific subject; they are cousins if they belong to the
    same broad subject but not the same specific subject. The system has also been applied
    to monitoring changes in web page content.
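
    The sibling and cousin relations can be illustrated with a small Python sketch. The
    Bookmark record and its broad/specific fields are hypothetical names chosen for this
    illustration; the paper does not describe GAB's actual multi-tree representation.

```python
from typing import NamedTuple


class Bookmark(NamedTuple):
    broad: str     # broad subject, e.g. a top-level folder (hypothetical field)
    specific: str  # specific subject under the broad one (hypothetical field)
    url: str


def are_siblings(a: Bookmark, b: Bookmark) -> bool:
    """Siblings share the same specific subject."""
    return a.broad == b.broad and a.specific == b.specific


def are_cousins(a: Bookmark, b: Bookmark) -> bool:
    """Cousins share the broad subject but not the specific subject."""
    return a.broad == b.broad and a.specific != b.specific


jazz_1 = Bookmark("music", "jazz", "http://jazz-one.example")
jazz_2 = Bookmark("music", "jazz", "http://jazz-two.example")
opera = Bookmark("music", "opera", "http://opera.example")
chess = Bookmark("games", "chess", "http://chess.example")
```

    With this data, jazz_1 and jazz_2 are siblings, jazz_1 and opera are cousins, and
    chess is related to neither.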

    Grassroots

    The Grassroots system [Turnbull, 1998] is described as "A System Providing A Uniform
    Framework for Communicating, Structuring, Sharing Information, and Organizing People."

    The system provides a special Web page interface for accessing all of the information
    it works with. In practice, Grassroots also lets participants continue using other
    mechanisms and takes as much advantage of them as possible. The main engine of the
    Grassroots system is a Web server and proxy server setup that can be used with any Web
    browser.

    GroupLens

    Resnick [Resnick, et al. 1994] presented the GroupLens system, which is built on a
    simple premise: "the heuristic that people who agreed in the past will probably agree
    again". This system also uses the Pearson algorithm to provide predictions. In its
    early stage, the system used explicit votes (on a 1 to 5 scale, where 1 stands for
    "dislike it" and 5 for "like it"). The updated version also uses implicit methods to
    get feedback from the user, such as monitoring reading time. The most notable
    characteristics of the system are its openness and scalability.

    Openness means that the system gives other researchers access to create clients that
    work with the system's servers, or even to change those servers if they have
    improvements. Scalability means that as the number of users increases, the system can
    still provide accurate predictions, although the database size and the calculation
    time become very large.


    Letizia & Let's Browse

    Let's Browse and its predecessor, Letizia [Lieberman, 1996; Pryor, 1998], are web
    agents that assist a user during his/her browsing experience. By monitoring a user's
    behavior, such as browsing time on a web page, the Letizia system learns the user's
    interests and provides recommendations. Let's Browse, an improvement on Letizia,
    provides recommendations by using a group's profiles instead of a single profile. If
    multiple users are reading the same page at the same time, Let's Browse can determine
    which users are near the monitor and use their profiles to recommend sites for the
    entire group.

    Lotus Notes

    Lotus Notes [Turnbull, 1998] is a system that is used as a foundation for
    collaborative filtering techniques. The system serves newsgroups. All Notes users
    should have similar goals or information interests because they are working in the
    same group.

    Lotus Notes provides a feature that lets people annotate documents. After annotating,
    the user can send or distribute these links or comments to others. To protect users'
    privacy, the system uses an agent to represent each individual. These agents extract
    significant phrases from the documents that the user reads and then exchange the
    learning results anonymously.


    Mosaic

    The Mosaic system [Turnbull, 1997] was the first Web tool that facilitated
    collaboration. As in the Pointers recommendation system, Mosaic users can publish and
    distribute bookmarks and add comments to web pages. This simple feature enabled users
    to actively share information with others.

    PHOAKS

    Terveen [Terveen et al., 1997] first introduced the PHOAKS (People Helping One Another
    Know Stuff) system, which recommends URLs that are likely to interest users. The system
    automatically recognizes web resource references in newsgroup messages, attempts to
    classify them, and introduces them to other users. That is, the system scans the
    newsgroup messages and extracts the most important URLs in these messages. After
    sorting the links, the system recommends these URLs to users. The system uses implicit
    feedback and also considers role specialization.

    Pointers

    This system [Maltz, 1995] is implemented inside the Lotus Notes environment. If one
    person is an expert in an area, other users in the group will want to see his/her
    recommendations. The system therefore provides a mechanism that lets the information
    mediators in a workgroup easily distribute references to, and commentary on, the
    documents they find. This mechanism is realized with pointers. A pointer consists of a
    URL link, contextual information, and optional comments by the sender. The system is
    very easy to use but is not anonymous.

    Siteseer

    Siteseer [Turnbull, 1997] is a collaborative system using web browser bookmarks

    to find neighbors and recommend sites. Users with significant overlap in bookmark

    listings are determined to be close to one another, allowing previously unvisited sites to

    be recommended to one another.
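
    The bookmark-overlap idea can be sketched in Python as follows. The overlap measure
    (shared bookmarks divided by all bookmarks of the two users) and the min_overlap
    threshold are assumptions made for this illustration; the source does not say how
    Siteseer measures overlap or picks neighbors.

```python
def bookmark_overlap(a: set, b: set) -> float:
    """Fraction of the two users' combined bookmarks that both users share."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_sites(user, bookmarks, min_overlap=0.3):
    """Rank sites that neighbors bookmarked but `user` has not.

    min_overlap is a hypothetical cutoff for treating another user as a neighbor.
    """
    mine = bookmarks[user]
    scores = {}
    for other, theirs in bookmarks.items():
        if other == user:
            continue
        overlap = bookmark_overlap(mine, theirs)
        if overlap < min_overlap:
            continue  # not close enough to count as a neighbor
        for site in theirs - mine:
            scores[site] = scores.get(site, 0.0) + overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))


bookmarks = {
    "ann": {"a.example", "b.example", "c.example"},
    "bob": {"a.example", "b.example", "d.example"},
    "cat": {"x.example", "y.example"},
}
suggested = recommend_sites("ann", bookmarks)
```

    Here bob shares two of four distinct sites with ann and counts as a neighbor, while
    cat shares none, so only bob's unvisited site is suggested.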

    Tapestry

    Tapestry [Goldberg, 1992] is the first collaborative recommendation system. It uses
    free-text annotations or explicit "like it" or "hate it" annotations. The system is
    used within a newsgroup, so it is not easy for the group to explore new areas.

    Yahoo!

    Turnbull [Turnbull, 1998] considered Yahoo! a recommendation system that realizes
    collaborative filtering manually. An expert updates the Yahoo! index as quickly as
    possible; that is, every site is examined by a person when it is added. The system also
    allows web users to submit pages. Because of its openness, the form of the Yahoo! index
    has become very popular and has become a classification standard.

    WebWatcher

    The WebWatcher system [Joachims, 1996] acts like a tour guide in a museum. It provides
    interactive communication between the server and users and makes recommendations. A
    user who enters the system can ask a question by typing in his/her interest, and the
    system then recommends related web sites. This is not the same as a keyword-based
    search engine: WebWatcher uses the user's profile and other users' previous tours,
    calculates the similarities between users, and predicts the user's interests. The
    system also uses users' experience to reinforce its learning.

    III. Algorithms on Collaborative Filtering

    Today, recommendation systems are used in many fields. Virtually every topic of
    potential interest to users is covered by a special-purpose recommendation system: Web
    pages, news stories, emails, movies, music videos, books, CDs, restaurants, and many
    more. These systems predict a user's interests and preferences based on all users'
    profiles, using information retrieval techniques. The underlying techniques used in
    today's recommendation systems fall into two distinct categories: content-based
    filtering and collaborative filtering. Content-based filtering uses the actual content
    features of items, while collaborative filtering predicts a new user's preferences
    from other users' ratings, assuming that like-minded people tend to make similar
    choices. Here, we concentrate on the algorithms used in collaborative filtering.

    Collaborative filtering (recommender) systems predict additional topics or products
    that a new user might like, based on a database of user preferences. Many collaborative
    filtering algorithms have been proposed. Breese et al. [1998] classified them into two
    categories: memory-based algorithms and model-based algorithms. Following their
    classification, we collect and classify the collaborative filtering algorithms
    available so far.

    Memory-based Algorithms

    These algorithms are called memory-based because they operate over the entire user
    database to make predictions. Basically, they all try to find the similarity or
    correlation between the new (active) user and the other users in the database. Each
    user's preferences are represented by his/her votes (explicit or implicit) on products,
    which could be anything related to the user's interests. The new user has an average
    vote over the products he/she has rated. The predicted vote of the new user on another
    product is then this average plus a weighted sum of the other users' votes. The weights
    are determined by the similarity between the new user and the other users: the more
    similar two users are, the larger the weight and the greater the contribution to the
    sum. Let $I_i$ be the set of items on which user $i$ has voted and $v_{i,j}$ the vote
    of user $i$ on product $j$. The average vote of user $i$ is then:

    $$\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$$

    The predicted vote of the new (active) user $a$ on product $j$ is:

    $$p_{a,j} = \bar{v}_a + k \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$$

    where $k$ is a normalizing factor and $w(a,i)$ is the weight that user $i$
    contributes to the active user.
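
    A minimal Python sketch of the average vote and the predicted vote follows. Two
    choices here are assumptions, not taken from the text: the normalizing factor is set to
    one over the sum of the absolute weights, and the sum runs only over users who have
    voted on the target product. The weight function is passed in, so any of the weight
    definitions can be plugged in; the fixed weights in the example are made up.

```python
def mean_vote(votes: dict) -> float:
    """Average vote of one user over the items he/she has rated."""
    return sum(votes.values()) / len(votes)


def predict_vote(active, item, ratings, weight):
    """Predicted vote: active user's mean plus a normalized weighted sum of deviations."""
    v_a = mean_vote(ratings[active])
    total, norm = 0.0, 0.0
    for user, votes in ratings.items():
        if user == active or item not in votes:
            continue  # assumption: skip users with no vote on this item
        w = weight(active, user)
        total += w * (votes[item] - mean_vote(votes))
        norm += abs(w)
    if norm == 0:
        return v_a  # no usable neighbors: fall back to the user's own mean
    return v_a + total / norm  # assumption: k = 1 / sum of |w(a, i)|


ratings = {
    "a": {"x": 4, "y": 2},
    "b": {"x": 5, "y": 3, "z": 4},
    "c": {"x": 1, "y": 5, "z": 2},
}
fixed_weights = {"b": 1.0, "c": -0.5}  # made-up weights, for illustration only
prediction = predict_vote("a", "z", ratings, lambda a, i: fixed_weights[i])
```

    User b's vote on z equals b's own mean, so it contributes nothing; user c voted below
    his/her mean but has a negative weight, so the prediction ends up slightly above user
    a's mean of 3.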

    The weights are calculated by comparing a set of common products, which the

    active user and all other users in the database have rated. Here we collected three major

    methods to define the weights.

    Mean Squared Differences:

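    This method defines the weight as the inverse of the mean squared difference between
    the two users' votes:

    $$w(a,i) = \frac{1}{\overline{(V_a - V_i)^2}}$$

    where $V_a$ and $V_i$ are the votes of users $a$ and $i$, and the bar denotes the mean
    over the products both users have rated.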

    Pearson Correlation:

    $$w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}$$

    Vector Similarity:

    This method defines the weight as the cosine of the angle between the active user's
    vote vector and the other user's vote vector:

    $$w(a,i) = \sum_j \frac{v_{a,j}}{\sqrt{\sum_{k \in I_a} v_{a,k}^2}} \cdot \frac{v_{i,j}}{\sqrt{\sum_{k \in I_i} v_{i,k}^2}}$$
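
    The three weight definitions can be compared side by side in a short Python sketch.
    It assumes that each user's votes are stored as a dictionary from item to vote, that
    sums over $j$ run over the items both users have rated, and that user means and vector
    norms are taken over each user's own rated set. The function names are illustrative.

```python
import math


def common_items(va: dict, vi: dict) -> list:
    return sorted(set(va) & set(vi))


def msd_weight(va: dict, vi: dict) -> float:
    """Inverse of the mean squared difference over commonly rated items."""
    items = common_items(va, vi)
    if not items:
        return 0.0
    msd = sum((va[j] - vi[j]) ** 2 for j in items) / len(items)
    # Identical votes give an unbounded weight; a practical system would cap it.
    return math.inf if msd == 0 else 1.0 / msd


def pearson_weight(va: dict, vi: dict) -> float:
    """Pearson correlation of the two users' deviations from their mean votes."""
    items = common_items(va, vi)
    mean_a = sum(va.values()) / len(va)
    mean_i = sum(vi.values()) / len(vi)
    num = sum((va[j] - mean_a) * (vi[j] - mean_i) for j in items)
    den = math.sqrt(sum((va[j] - mean_a) ** 2 for j in items)
                    * sum((vi[j] - mean_i) ** 2 for j in items))
    return num / den if den else 0.0


def vector_weight(va: dict, vi: dict) -> float:
    """Cosine of the angle between the two vote vectors."""
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_i = math.sqrt(sum(v * v for v in vi.values()))
    if norm_a == 0 or norm_i == 0:
        return 0.0
    return sum(va[j] * vi[j] for j in common_items(va, vi)) / (norm_a * norm_i)


user_a = {"x": 5, "y": 1, "z": 3}
user_b = {"x": 4, "y": 2, "z": 3}
w_msd = msd_weight(user_a, user_b)
w_pearson = pearson_weight(user_a, user_b)
w_vector = vector_weight(user_a, user_b)
```

    For these two users, whose votes move in the same direction around the same mean, the
    Pearson weight is exactly 1, while the vector similarity is slightly below 1 because
    the vote magnitudes differ.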

    Improvement on Memory-based Algorithms

    In order to improve the performance of standard memory-based algorithms,

    several modifications are proposed.

    Default Voting:

    book1 book2 book3 book4 book5 book6

    user 1 5 1

    user 2 3 1 5

    user 3 3 5 4

    user 4 4 2 ?


    We usually deal with very sparse databases, in which there are many products that a
    given user did not vote on (explicitly or implicitly). Memory-based algorithms use only
    the entries at the intersection of two users' voted sets. In the example above, to
    calculate the weight that user 1 contributes, we can use only the ratings for book1.
    To deal with this problem, default votes are introduced. In most cases, a neutral or
    somewhat negative preference is assigned to the unobserved products, so that the union
    of the voted sets can be used in the weight calculation instead of the intersection.
    This method does not necessarily improve the performance of memory-based algorithms,
    however, because an unobserved product is not necessarily a less interesting one.
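
    As a hedged illustration of default voting, the Python sketch below fills in a
    default vote for every product that only one of the two users has rated and then
    computes a Pearson correlation over the union. The default value of 2.0 (somewhat
    negative on a 1 to 5 scale) and the sample votes are assumptions made for this example.

```python
import math


def pearson(xs: list, ys: list) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def default_vote_weight(va: dict, vi: dict, default: float = 2.0) -> float:
    """Correlation over the union of rated items, with default votes for the gaps."""
    items = sorted(set(va) | set(vi))
    xs = [va.get(j, default) for j in items]
    ys = [vi.get(j, default) for j in items]
    return pearson(xs, ys)


reader_a = {"book1": 4, "book2": 2}
reader_b = {"book1": 5, "book3": 1}
weight_with_defaults = default_vote_weight(reader_a, reader_b)
```

    The two readers share only book1, so a correlation over the intersection would rest
    on a single point. With default votes, three products enter the calculation.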

    Inverse User Frequency:

    The idea of inverse user frequency is that universally liked products are not as
    useful as less common products in capturing the similarity between users. The weight
    is therefore modified by introducing a factor $f_j$, defined as:

    $$f_j = \log \frac{n}{n_j}$$

    where $n$ is the total number of users and $n_j$ is the number of users who have voted
    for product $j$. The resulting correlation weight is:

    $$w(a,i) = \frac{\sum_j f_j \sum_j f_j v_{a,j} v_{i,j} - \left(\sum_j f_j v_{a,j}\right)\left(\sum_j f_j v_{i,j}\right)}{\sqrt{UV}}$$

    where

    $$U = \sum_j f_j \left( \sum_j f_j v_{a,j}^2 - \left( \sum_j f_j v_{a,j} \right)^2 \right)$$

    $$V = \sum_j f_j \left( \sum_j f_j v_{i,j}^2 - \left( \sum_j f_j v_{i,j} \right)^2 \right)$$
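
    The frequency factor itself is easy to compute. The Python sketch below covers only
    $f_j = \log(n / n_j)$, not the full weighted correlation, and uses made-up ratings.

```python
import math


def inverse_user_frequency(ratings: dict) -> dict:
    """Return f_j = log(n / n_j) for every item that has at least one vote."""
    n = len(ratings)
    counts = {}
    for votes in ratings.values():
        for item in votes:
            counts[item] = counts.get(item, 0) + 1
    return {item: math.log(n / n_j) for item, n_j in counts.items()}


sample_ratings = {
    "u1": {"hit": 5, "niche": 4},
    "u2": {"hit": 4},
    "u3": {"hit": 5},
    "u4": {"hit": 3},
}
frequency = inverse_user_frequency(sample_ratings)
```

    An item that every user has voted on gets a factor of zero and therefore no longer
    influences the similarity, while an item with a single voter gets the largest factor.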

    Case Amplification:

    Case amplification emphasizes the contribution of the most similar users to the

    prediction by amplifying the weights close to 1. The new weights are calculated as

    below:

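    $$w'_{a,i} = \begin{cases} w_{a,i}^{\rho} & \text{if } w_{a,i} \ge 0 \\ -(-w_{a,i})^{\rho} & \text{if } w_{a,i} < 0 \end{cases}$$

    where $\rho$ is the amplification exponent.
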

    categories. In the same example, shown below, the original 4 by 6 matrix is reduced to
    a 4 by 3 matrix, so users have more votes in common.

    book1 book2 book3 book4 book5 book6

    user 1 5 1

    user 2 3 1 5

    user 3 3 5 4

    user 4 4 2 ?

    category1 category2 category3

    The new votes of users on categories are calculated as:

    $$v_{i,c} = \overline{v_{i,j}}, \quad j \in c$$

    That is, each entry of the new matrix is the average of a given user's votes over the
    products in a category.
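
    A small Python sketch of this averaging step is given below. The mapping of two books
    to each category and the sample votes are assumptions made for illustration only.

```python
def category_votes(votes: dict, category_of: dict) -> dict:
    """Average one user's product votes within each category."""
    sums, counts = {}, {}
    for item, vote in votes.items():
        c = category_of[item]
        sums[c] = sums.get(c, 0) + vote
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


# Hypothetical grouping: two books per category.
category_of = {
    "book1": "category1", "book2": "category1",
    "book3": "category2", "book4": "category2",
    "book5": "category3", "book6": "category3",
}
reader_votes = {"book1": 4, "book2": 2, "book5": 5}
reduced = category_votes(reader_votes, category_of)
```

    Categories in which the user has no votes stay missing, but the reduced matrix has
    fewer columns, so two users are more likely to have an entry in common.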

    The categories may be predefined or unknown. To deal with unknown categories, the EM
    algorithm can be used.

    This method can be combined with all the other algorithms (including the model-based
    algorithms). We present it here because the original author uses it along with the
    correlation algorithm.

    Model-based Algorithms

    Model-based algorithms first generate a descriptive model by compiling the users'
    preferences; recommendations are then predicted by appealing to the model. From a
    probabilistic perspective, collaborative filtering can be viewed as calculating the
    expected value of a vote, given the user's profile or previous votes. Assuming that
    votes take integer values from 0 to $m$:

    $$p_{a,j} = E(v_{a,j}) = \sum_{i=0}^{m} \Pr(v_{a,j} = i \mid v_{a,k}, k \in I_a)\, i$$

    Cluster Models:

    Based on the idea that there are certain groups or types of users that capture a
    common set of preferences and tastes, Breese et al. proposed a cluster method in which
    like-minded users are classified into the same group. Given a user's class membership,
    the user's votes are assumed to be independent, so the joint probability of class and
    votes can be calculated by the naive Bayes formulation:

    $$\Pr(C = c, v_1, \ldots, v_n) = \Pr(C = c) \prod_{i=1}^{n} \Pr(v_i \mid C = c)$$

    Once we know the probability of observing an individual of a given class with a given
    set of votes, the expectation of a future vote can be calculated easily. Since the
    classes and the number of classes are unknown, the EM algorithm is used to find the
    model structure with maximum likelihood.
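
    To make the naive Bayes formulation concrete, the Python sketch below computes the
    class posterior and the expected vote for a model whose parameters are already known.
    The two classes, the binary votes, and all probabilities are made-up values; fitting
    them with EM is not shown.

```python
def class_posterior(observed: dict, prior: dict, cond: dict) -> dict:
    """Pr(C = c | observed votes), from the joint Pr(C = c) * prod Pr(v_i | C = c)."""
    joint = {}
    for c, p_c in prior.items():
        p = p_c
        for item, vote in observed.items():
            p *= cond[c][item][vote]
        joint[c] = p
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}


def expected_vote(item, observed: dict, prior: dict, cond: dict) -> float:
    """Expected vote on `item`, averaging each class's expectation by its posterior."""
    post = class_posterior(observed, prior, cond)
    return sum(
        post[c] * sum(vote * p for vote, p in cond[c][item].items())
        for c in post
    )


# Hypothetical model: cond[class][item][vote] = Pr(vote | class).
prior = {"fun": 0.5, "serious": 0.5}
cond = {
    "fun": {"m1": {0: 0.2, 1: 0.8}, "m2": {0: 0.1, 1: 0.9}},
    "serious": {"m1": {0: 0.7, 1: 0.3}, "m2": {0: 0.6, 1: 0.4}},
}
posterior = class_posterior({"m1": 1}, prior, cond)
expected_m2 = expected_vote("m2", {"m1": 1}, prior, cond)
```

    A positive vote on m1 shifts the posterior toward the "fun" class, which in turn
    raises the expected vote on m2.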


    Ungar [Ungar, et al., 1998] proposed a new clustering method. Unlike the standard
    cluster models, it assumes that people come from classes (e.g., intellectual or fun)
    and that products also come from classes. Here is an example from their paper:

    Batman Rambo Andre Hiver Whispers Star Wars

    Lyle y y

    Ellen y y y

    Jason y y

    Fred y y

    Dean y y y

    In this movie database example, people can be classified as intellectual or fun, and
    movies belong to three categories: action, foreign, and classic. A "y" in the table
    means that the person likes the corresponding movie. For each pair of person class and
    movie class, the probability that there is a "y" in the table is:

                     action   foreign   classic
    intellectual     0/6      5/9       2/3
    fun              3/4      0/6       2/2

    Based on the observation above, they establish a model that contains three sets of
    parameters: $P_k$ (the probability that a random person is in class $k$), $P_l$ (the
    probability that a random movie is in class $l$), and $P_{kl}$ (the probability that a
    person in class $k$ is linked to a movie in class $l$).

    Here, the class assignments are unknown. They tried repeated clustering and Gibbs
    sampling. In the repeated clustering method, people are first clustered based on
    movies, and movies based on people; on the second and later passes, people are
    clustered based on movie clusters, and movies based on people clusters. For the
    clustering itself, they use k-means clustering instead of the EM algorithm because of
    the constraint that a person is always in the same class and a movie is always in the
    same class. They claimed that the Gibbs sampling method outperforms repeated
    clustering.

    Bayesian Network Models:

    An alternative model formulation for probabilistic collaborative filtering is a
    Bayesian belief network with a node corresponding to each product in the database.
    Missing data can be represented by a "no vote" value. After a learning algorithm is
    applied to train the belief network, each item in the resulting network has a set of
    parent items that are the best predictors of its votes. A decision tree can be used to
    represent each conditional probability table.

    Neural Network Models:

    As with the Bayesian network models, collaborative filtering can be seen as a
    classification task. Based on a set of ratings from users for products, we can induce
    a model for each user that allows us to classify unseen products into two or more
    classes. Missing data can be indicated by a "no vote" state. Here is an example given
    in the paper by Billsus and Pazzani [Billsus, D. and Pazzani, M., 1998].

    I1 I2 I3 I4 I5

    U1 4 3

    U2 1 2

    U3 3 4 2 4

    U4 4 2 1 ?

    Here, Ui is the ith user and Ii is the ith item. Users rate the items from 1 to 4,
    with 4 being the highest rating. Since in the end they recommend only the items the
    active user would like, they transform the rating matrix by replacing each rating
    greater than 2 with 1 and every other rating with 0. To represent the "no vote" value,
    they further split every user's ratings into two sets (like and dislike).

                  E1     E2        E3
    U1 like       1      0         1
    U1 dislike    0      0         0
    U2 like       0      0         0
    U2 dislike    0      1         0
    U3 like       1      1         0
    U3 dislike    0      0         1
    Class         like   dislike   dislike

    Here, U4's ratings for I1, I2, and I3 are the class labels. After converting a data
    set of user ratings for items into this format, we can apply virtually any supervised
    learning algorithm.
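
    The conversion can be sketched in Python as follows. The ratings dictionary covers
    only items I1 to I3, and the placement of each rating is inferred from the converted
    table above (a missing entry means "no vote"); treat it as an assumption rather than
    the paper's exact data. The function name and the threshold parameter are illustrative.

```python
def to_examples(ratings: dict, target: str, threshold: int = 2) -> list:
    """Turn a rating matrix into Boolean training examples for the target user.

    Each item rated by the target becomes one example. Every other user
    contributes two features: "likes it" and "dislikes it". Both are 0 when
    that user has not voted on the item.
    """
    others = sorted(u for u in ratings if u != target)
    examples = []
    for item, rating in sorted(ratings[target].items()):
        features = []
        for u in others:
            r = ratings[u].get(item)  # None means "no vote"
            features.append(1 if r is not None and r > threshold else 0)   # like
            features.append(1 if r is not None and r <= threshold else 0)  # dislike
        label = "like" if rating > threshold else "dislike"
        examples.append((item, features, label))
    return examples


ratings_matrix = {
    "U1": {"I1": 4, "I3": 3},
    "U2": {"I2": 1},
    "U3": {"I1": 3, "I2": 4, "I3": 2},
    "U4": {"I1": 4, "I2": 2, "I3": 1},
}
examples = to_examples(ratings_matrix, target="U4")
```

    Each feature list is ordered as U1 like, U1 dislike, U2 like, U2 dislike, U3 like,
    U3 dislike, so the three examples correspond to columns E1, E2, and E3 of the table.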

    Other Algorithms

    A hybrid memory- and model-based approach:

    Pennock and Horvitz [Pennock, David M. and Horvitz, Eric 1999] proposed a CF method
    called personality diagnosis (PD), which can be seen as a hybrid between memory- and
    model-based approaches. All data is maintained throughout the process, new data can be
    added incrementally, and predictions have meaningful probabilistic semantics.

    In this algorithm, each user's preferences are interpreted as a manifestation of
    his/her underlying personality type. Because a user's votes are affected by
    environmental factors, such as previous users' votes or the user's current mood, they
    assumed that all users report their ratings with Gaussian noise. If we define user
    $i$'s personality type as a vector of true ratings $V_i^{true}$, then user $i$'s actual
    rating is drawn from an independent normal distribution:

    $$\Pr(v_{i,j} = x \mid v_{i,j}^{true} = y) = k\, e^{-(x-y)^2 / 2\sigma^2}$$

    where $k$ is a normalizing constant and $\sigma$ is a free parameter.

    They further assumed that the distribution of voting vectors in the database is
    representative of the distribution in the target population of users. So we have:

    $$\Pr(V_a^{true} = V_i) = \frac{1}{n}$$

    where $n$ is the total number of users in the database. The probability that the
    active user has the same personality type as any other user can then be calculated by
    applying Bayes' rule:

    $$\Pr(V_a^{true} = V_i \mid v_{a,1} = x_1, \ldots, v_{a,m} = x_m) \propto \Pr(v_{a,1} = x_1 \mid v_{a,1}^{true} = v_{i,1}) \cdots \Pr(v_{a,m} = x_m \mid v_{a,m}^{true} = v_{i,m}) \cdot \Pr(V_a^{true} = V_i)$$

    The probability distribution of the active user's vote on an unseen product $j$ is
    then:

    $$\Pr(v_{a,j} = x_j \mid v_{a,1} = x_1, \ldots, v_{a,m} = x_m) = \sum_{i=1}^{n} \Pr(v_{a,j} = x_j \mid V_a^{true} = V_i) \cdot \Pr(V_a^{true} = V_i \mid v_{a,1} = x_1, \ldots, v_{a,m} = x_m)$$
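
    A compact Python sketch of personality diagnosis is shown below. It makes several
    assumptions that are not stated in the text: items that the other user has not rated
    are skipped in the likelihood, the uniform prior runs over the users who have rated the
    target item, and the final distribution is normalized over the rating scale. The data
    and the value of sigma are made up.

```python
import math


def pd_predict(active, item, ratings, scale, sigma=1.0):
    """Return a probability distribution over `scale` for the active user's vote."""

    def noise(x, y):
        # Unnormalized Gaussian likelihood of reporting x when the true rating is y.
        return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

    others = {u: v for u, v in ratings.items() if u != active and item in v}
    if not others:
        return {x: 1.0 / len(scale) for x in scale}  # nothing to learn from
    scores = {x: 0.0 for x in scale}
    for votes in others.values():
        # Pr(active user has this user's personality | observed votes), up to a constant.
        same_type = 1.0 / len(others)
        for j, x in ratings[active].items():
            if j in votes:  # assumption: skip items the other user has not rated
                same_type *= noise(x, votes[j])
        for x in scale:
            scores[x] += noise(x, votes[item]) * same_type
    total = sum(scores.values())
    return {x: s / total for x, s in scores.items()}


pd_ratings = {
    "a": {"m1": 4, "m2": 1},
    "b": {"m1": 4, "m2": 1, "m3": 4},
    "c": {"m1": 1, "m2": 4, "m3": 1},
}
distribution = pd_predict("a", "m3", pd_ratings, scale=[1, 2, 3, 4])
most_likely = max(distribution, key=distribution.get)
```

    User a's votes match user b's exactly and contradict user c's, so almost all of the
    probability mass follows b's vote on m3.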

    Improvements:

    We have now seen the memory-based and model-based collaborative filtering methods.
    Both have advantages and drawbacks. Memory-based methods are simple and easy to
    implement, but they can be time- and space-consuming. At the least, memory-based
    methods have difficulty handling the two problems below:

    1) Missing data: To find the similarity between users, the difference (distance)
    between users has to be computed. If data are missing, either only the products that
    all users being compared voted on are used, or a default vote is assigned to the
    missing data. The first approach has problems with sparse databases. In the second,
    giving average or somewhat negative votes to the missing data may obscure the
    similarity between users.

    2) Memory-based methods cannot handle the situation in which two users are very
    similar but have not rated the same set of products. For example,

    product1 product2 product3 product4 product5 product6

    user1 1 0 1 1 1

    user2 0 1 1 1 1

    user3 1 ?

    User1 and user2 are very similar in this example. However, when we use memory-based
    methods to predict user3's preference for product6, only user1's votes can be
    used, because user3 has no vote in common with user2.
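
    The limitation in (2) can be made concrete with a small sketch. The vote layout below is only one arrangement that is consistent with the description in the text (which cells are empty is an assumption), and `usable_neighbours` is a hypothetical helper written for this illustration.

```python
def usable_neighbours(votes, active, target):
    """Users who voted on `target` and share at least one voted product with `active`."""
    active_items = set(votes[active])
    return sorted(
        user for user, items in votes.items()
        if user != active and target in items and active_items & set(items)
    )


# Assumed layout of the votes; a missing vote is simply absent from the dict.
votes = {
    "user1": {"product1": 1, "product2": 0, "product3": 1, "product4": 1, "product6": 1},
    "user2": {"product2": 0, "product3": 1, "product4": 1, "product5": 1, "product6": 1},
    "user3": {"product1": 1},
}
neighbours = usable_neighbours(votes, "user3", "product6")
```

    Under this layout only user1 qualifies as a neighbour of user3, even though user2 agrees with user1 on every product they both voted on.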


    Among model-based methods, clustering can partly handle missing data
    by grouping products into fewer categories; the vote for a category is the average
    of the available votes for the products in that category. But clustering methods may
    over-generalize and hurt performance. Bayesian network or neural network models can
    handle missing data and problem (2) above reasonably well. But for
    large databases containing many users, we end up with thousands of features while
    the amount of training data is very limited, so those models become impractical.
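
    The category-averaging step used by clustering methods can be sketched as follows. The product-to-category mapping is taken as given here (how the clusters are found is outside this sketch), and the names `category_votes` and `category_of` are hypothetical.

```python
def category_votes(user_votes, category_of):
    """Average one user's available product votes within each category."""
    sums, counts = {}, {}
    for product, vote in user_votes.items():
        cat = category_of[product]
        sums[cat] = sums.get(cat, 0.0) + vote
        counts[cat] = counts.get(cat, 0) + 1
    # Categories in which the user has no vote at all stay missing.
    return {cat: sums[cat] / counts[cat] for cat in sums}


# Hypothetical data: p3 is unrated, but category c1 still receives a vote.
category_of = {"p1": "c1", "p2": "c1", "p3": "c1", "p4": "c2"}
reduced = category_votes({"p1": 4, "p2": 2, "p4": 5}, category_of)
```

    The user has no vote for p3, yet the reduced representation still has a value for its category, which is how clustering softens the missing-data problem.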

    Recently, a promising algorithm has been proposed. The idea is that users rate
    products based on latent features of the products. All products in the database share a set
    of common features, and users rate products highly because they rate those features highly.
    So by factoring people's ratings into features using linear algebra, we can predict how
    users will react to products they have not seen before, based on their preferences for
    these features. Singular Value Decomposition (SVD) allows us to break a data set down
    into these components and analyze the principal components of the data. We will see
    below how SVD can be used to capture the hidden features and to reduce the
    dimension of the database.

    Singular Value Decomposition:

    The user rating vectors can be represented by an m × n matrix A, with m users and
    n products,

    \[ A = [a_{i,j}] \]


    where a_{i,j} is the rating of user i for product j. Through singular value
    decomposition, A can be factored into U S V^T, where U and V are orthogonal matrices
    and S is zero everywhere except on its diagonal, whose entries are the
    singular values of A. U represents the response of each user to certain features. V
    represents the amount of each feature present in each product. S is a matrix
    describing the importance of each feature in the overall determination of the rating. Here is an
    example given by Pryor [Pryor, H. Michael, 1998] in his report. Suppose the rating matrix
    A is,

    \[
    A = \begin{bmatrix}
    5 & 4 & 2 & 6 \\
    3 & 7 & 5 & 2 \\
    6 & 4 & 1 & 4
    \end{bmatrix}
    \]

    The SVD of A would be:

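    \[
    U = \begin{bmatrix}
    0.6000 & 0.4124 & 0.6855 \\
    0.5811 & -0.8136 & -0.0192 \\
    0.5498 & 0.4099 & -0.7278
    \end{bmatrix}
    \]

    \[
    S = \begin{bmatrix}
    14.4890 & 0.0000 & 0.0000 & 0.0000 \\
    0.0000 & 4.9324 & 0.0000 & 0.0000 \\
    0.0000 & 0.0000 & 1.6550 & 0.0000
    \end{bmatrix}
    \]

    \[
    V = \begin{bmatrix}
    0.5551 & 0.4218 & -0.6023 & 0.3889 \\
    0.5982 & -0.4878 & -0.1835 & -0.6088 \\
    0.3213 & -0.5744 & 0.3306 & 0.6764 \\
    0.4805 & 0.5041 & 0.7031 & -0.1437
    \end{bmatrix}
    \]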
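
    As a sanity check, the factorization can be multiplied back together in plain Python. This is only a sketch: the signs of the entries of U and V below are chosen so that the columns are orthonormal and the product reproduces A. Each pair of matching columns of U and V is determined only up to a common sign, so this particular choice is an assumption, and `reconstruct` is a hypothetical helper.

```python
A = [[5, 4, 2, 6],
     [3, 7, 5, 2],
     [6, 4, 1, 4]]
U = [[0.6000, 0.4124, 0.6855],
     [0.5811, -0.8136, -0.0192],
     [0.5498, 0.4099, -0.7278]]
singular_values = [14.4890, 4.9324, 1.6550]  # diagonal entries of S
V = [[0.5551, 0.4218, -0.6023, 0.3889],
     [0.5982, -0.4878, -0.1835, -0.6088],
     [0.3213, -0.5744, 0.3306, 0.6764],
     [0.4805, 0.5041, 0.7031, -0.1437]]


def reconstruct(U, singular_values, V):
    """Compute U S V^T, where S is zero except for the singular values on its diagonal."""
    rows, cols = len(U), len(V)
    return [[sum(U[i][k] * s * V[j][k] for k, s in enumerate(singular_values))
             for j in range(cols)]
            for i in range(rows)]


A_rebuilt = reconstruct(U, singular_values, V)
# Largest deviation from the original ratings; small because entries are rounded to 4 decimals.
max_error = max(abs(A_rebuilt[i][j] - A[i][j])
                for i in range(len(A)) for j in range(len(A[0])))
```

    The fourth column of V never enters the sum, because S has only three non-zero singular values.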


    The feature corresponding to the singular value 14.4890 in S is the most important
    feature. The dimension of S can therefore be reduced by keeping only the most important
    features, in this case only the one represented by 14.4890. A new rating matrix can then be
    generated by converting the original rating matrix into the feature space. Since V is
    orthogonal,

    \[ A V = U S \]

    and the new rating matrix M is

    \[ M = U S' \]

    where S' keeps only the selected singular values and U is restricted to the matching columns.
    In this case, S' = [14.4890]. Once we have the new rating matrix M in the feature space,
    we can apply memory-based or model-based methods to it. It
    has been shown that exploiting latent structure in matrices of user ratings can lead to
    improved predictive performance.
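
    Because A V = U S, the reduced matrix can also be obtained by multiplying A with the kept columns of V. The sketch below does this for the single most important feature of the example; `project` is a hypothetical helper, and the vector `v1` is the column of V paired with the singular value 14.4890.

```python
A = [[5, 4, 2, 6],
     [3, 7, 5, 2],
     [6, 4, 1, 4]]
# First right singular vector of A (the column of V paired with 14.4890).
v1 = [0.5551, 0.5982, 0.3213, 0.4805]


def project(A, V_k):
    """Return M = A V_k, where V_k lists the kept columns of V as vectors."""
    return [[sum(a * v for a, v in zip(row, col)) for col in V_k] for row in A]


M = project(A, [v1])  # 3 users x 1 feature
```

    Each user is now described by one number instead of four, and user-to-user similarity can be computed on M rather than on the sparse original matrix.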

    Current recommender systems use both Content-Based Filtering (CBF) and
    Collaborative Filtering (CF) methods. CBF filters information by
    matching its content against users' interests, so it can filter information that
    has not been evaluated by other people. For this reason CBF and CF are combined in
    recommender systems: CBF can deal with products that have not yet been rated, while CF
    recommends new products based on other users' previous votes.

    IV. Discussion


    As discussed above, future recommendation systems should have the
    following features:

    1) Solve the cold-start problem.

    General collaborative recommendation systems suffer from this problem: the
    system has no basis for recommending a new item to users or for providing accurate
    predictions for a new user. Since content-based filtering is based on the features of the
    item, it has no such cold-start problem. The Fab system integrates content-based
    filtering and collaborative filtering. Building on this integration, Condliff et
    al. [1998] propose a Bayesian methodology for recommendation systems. Their proposal
    uses Bayesian theory to give a good prediction by fully incorporating all of the available
    data, such as user ratings, user features, and item features. Claypool et
    al. [1999] also provide an approach to the cold-start problem. Their system is based on
    a weighted average of the content-based filtering prediction and the collaborative filtering
    prediction.
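
    The weighted-average idea can be sketched as follows. The fixed weight, the fallback to the content-based prediction when no collaborative prediction exists, and the name `hybrid_prediction` are assumptions made for this illustration; they are not details taken from Claypool et al.

```python
def hybrid_prediction(content_pred, collab_pred, content_weight=0.5):
    """Weighted average of a content-based and a collaborative prediction.

    collab_pred may be None when no collaborative prediction is available
    (cold start); the content-based prediction is then used alone.
    """
    if not 0.0 <= content_weight <= 1.0:
        raise ValueError("content_weight must lie in [0, 1]")
    if collab_pred is None:
        return content_pred
    return content_weight * content_pred + (1.0 - content_weight) * collab_pred


new_item = hybrid_prediction(4.0, None)                        # nobody has voted yet
rated_item = hybrid_prediction(4.0, 2.0, content_weight=0.25)  # votes exist
```

    A system could shift `content_weight` toward the collaborative side as an item collects more votes, so that the content-based prediction dominates only while the item is new.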

    2) Easy for users to participate or vote

    Generally speaking, people do not like to provide recommendations, although they
    like to receive them. Since the system depends on users' votes
    to calculate the similarities between users, it is very important to get enough data from the
    users. The system should therefore provide a very easy interface for a user to vote or provide
    annotations. Although explicit annotations or votes feed the calculation directly, implicit
    feedback from users does more to reduce the sparsity of the matrices used
    for the similarity calculation. Implicit methods include monitoring users' behavior and


    monitoring users' browsing time on a page: the longer a person stays, the more
    interest the person shows. The system can also use compensation methods. For
    example, a user who wants further recommendations must first vote on what he or she has read.

    3) Privacy

    Privacy becomes an issue when a system collects information about its users, so
    important social issues arise on an individual scale as well. In collaborative filtering,
    users share their document annotations. On the one hand, people do not like to reveal their
    identity; on the other hand, people like to see who made the annotations. For
    example, if an annotation is provided by an expert in the area, people in the group will be
    more inclined to read it. The system should provide a mechanism that allows users
    to adopt a pseudonym, and it should offer different levels of privacy protection.

    4) Algorithm

    A good algorithm should have the following features:

    1. handling missing data

    2. handling sparse data

    3. cost-efficiency

    V. References

    Ariyoshi, Yusuke, 1999. Improvement of Combination Information Filtering Method
    Based on Reliabilities. http://www-ai.cs.uni-dortmund.de/EVENTS/IJCAI99-MLIF/papers.html

    Billsus, D. and Pazzani, M., 1998. Learning Collaborative Filters. Proceedings of
    ICML98, 46-53. Morgan Kaufmann Eds.


    Breese, J., Heckerman, D. and Kadie, C., 1998. Empirical Analysis of Predictive Algorithms
    for Collaborative Filtering. Proceedings of the Fourteenth Conference on
    Uncertainty in Artificial Intelligence, Madison, WI.

    Claypool, Mark; Gokhale, Anuja; Miranda, Tim et al., 1999. Combining Content-Based
    and Collaborative Filters in an Online Newspaper.
    http://www.cs.wpi.edu/~claypool/papers/content-collab/

    Collaborative Filtering Workshop, 1996. Berkeley, CA. Webpage:
    http://www.sims.berkeley.edu/resources/collab/collab-report.htr.

    Condliff, Michelle Keim; Lewis, David D.; Madigan, David and Posse, Christian, 1998.
    Bayesian Mixed-Effects Models for Recommender Systems.
    http://www.cs.umbc.edu/~ian/sigir99-rec/

    Goldberg, D., Nichols, D., Oki, B. M. and Terry, D., 1992. Using Collaborative Filtering
    to Weave an Information Tapestry. Commun. ACM 35, 12.

    Joachims, Thorsten; Freitag, Dayne and Mitchell, Tom, 1996. WebWatcher: A Tour
    Guide for the World Wide Web.
    http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-6/web-agent/www/project-home.html

    Lieberman, H., 1996. Letizia: An Agent That Assists Web Browsing. MIT Media Lab.

    Maltz, David and Ehrlich, Kate, 1995. Pointing the Way: Active Collaborative Filtering.
    http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/ke_bdy.htm

    Oard, Douglas W. and Marchionini, Gary, 1996. A Conceptual Framework for Text
    Filtering. http://www.ee.umd.edu/medlab/filter/papers/filter/filter.html

    Pennock, David M. and Horvitz, Eric, 1999. Collaborative Filtering by Personality


    Diagnosis: A Hybrid Memory- and Model-Based Approach.

    http://www.research.microsoft.com/~horvitz/cfpd.htm

    Pryor, H. Michael, 1998. The Effects of Singular Value Decomposition on Collaborative
    Filtering. Computer Science Technical Report PCS-TR98-338, Dartmouth College.

    Resnick, Paul and Varian, Hal R., 1997. Recommender Systems. Communications
    of the ACM, vol. 40, no. 3, March 1997.

    Resnick, Paul; Iacovou, Neophytos et al., 1994. GroupLens: An Open Architecture
    for Collaborative Filtering of Netnews. Proceedings of the ACM 1994
    Conference on Computer Supported Cooperative Work, Chapel Hill, NC, pages
    175-186.

    Shardanand, Upendra and Maes, Pattie, 1995. Social Information Filtering: Algorithms
    for Automating Word of Mouth.
    http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/us_bdy.htm

    Terveen, Loren G.; Hill, William C. et al., 1998. Building Task-Specific Interfaces
    to High Volume Conversational Data.
    http://www.acm.org/sigchi/chi97/proceedings/paper/lgt.htm

    Turnbull, Don, 1998. Augmenting Information Seeking on the World Wide Web Using
    Collaborative Filtering Techniques.
    http://donturn.fis.utoronto.ca/research/augmentis.htn

    Turnbull, Don, 1997. KMDI Final Summary: Collaborative Filtering.
    http://donturn.fis.utoronto.ca/research/kmdi-cf.html

    Ungar, Lyle H. and Foster, Dean P., 1998. A Formal Statistical Approach to


    Collaborative Filtering. In AAAI Workshop on Recommendation System.
    http://www.cis.upenn.edu/~ungar/papers.html

    Wittenburg, Kent; Das, Duco; Hill, Will and Stead, Larry, 1998. Group Asynchronous
    Browsing on the World Wide Web.
    http://www.w3.org/Conferences/WWW4/Papers/98/
