
Intelligent Automation & Soft Computing

ISSN: 1079-8587 (Print) 2326-005X (Online) Journal homepage: http://www.tandfonline.com/loi/tasj20

RLCF: A Collaborative Filtering Approach Based on Reinforcement Learning With Sequential Ratings

Jungkyu Lee, Byonghwa Oh, Jihoon Yang & Unsang Park

To cite this article: Jungkyu Lee, Byonghwa Oh, Jihoon Yang & Unsang Park (2016): RLCF: A Collaborative Filtering Approach Based on Reinforcement Learning With Sequential Ratings, Intelligent Automation & Soft Computing, DOI: 10.1080/10798587.2016.1231510

To link to this article: http://dx.doi.org/10.1080/10798587.2016.1231510

Published online: 30 Sep 2016.


RLCF: A Collaborative Filtering Approach Based on Reinforcement Learning With Sequential Ratings

Jungkyu Lee^a, Byonghwa Oh^b, Jihoon Yang^b and Unsang Park^b

^a Search Quality Team, Daum Kakao Corp., Seoul, Korea; ^b Department of Computer Science and Engineering, Sogang University, Seoul, Korea

ABSTRACT: We present a novel approach to collaborative filtering, RLCF, that considers the dynamics of user ratings. RLCF is based on reinforcement learning applied to the sequence of ratings. First, we formalize the collaborative filtering problem as a Markov Decision Process. Then, we learn the connection between temporal sequences of user ratings using Q-learning. Experiments demonstrate the feasibility of our approach and a tight relationship between past and current ratings. We also suggest ensemble learning with RLCF and demonstrate its improved performance.

© 2016 TSI® Press

KEYWORDS: Recommender systems; Markov decision process; Q-learning; ensemble learning

CONTACT Jihoon Yang [email protected]

1. Introduction

Recommender systems provide users with personalized suggestions for products or services. They can be combined with various technologies (e.g. e-commerce systems (Zhong, Zhang, Wang, & Shu, 2014)) and make appropriate suggestions in different spatiotemporal situations. Collaborative Filtering (CF) is one of the most active areas in recommender systems research since it avoids using information about the content and instead makes use of historical data (e.g. user ratings). Typically, the data used in CF can be represented as a large sparse matrix of (user, item, rating) entries, and the goal is to predict the missing entries, i.e. users' ratings for items they have not yet considered.

Many researchers have developed CF algorithms based on a variety of methods (or models) such as Restricted Boltzmann Machines (RBMs) (Salakhutdinov, Mnih, & Hinton, 2007), the mean of K-Nearest Neighbors (i.e. Movie Means (MM)) (Bell & Koren, 2007), Singular Value Decomposition (SVD) (Paterek, 2007), and so forth. The RBM is a probabilistic model based on the energy function of a bipartite graph of hidden and visible nodes, wherein the user ratings are represented; the relationship (i.e. the connections) between the sets of hidden and visible nodes can be learned efficiently by contrastive divergence (Hinton, 2002) and then applied to predict the unknown ratings. K-Nearest Neighbors (KNN) is a simple algorithm (Altman, 1992) that locates the most similar instances (users or items in CF); the main idea of KNN-based CF is to estimate the unknown ratings from the ratings of similar users or items. SVD is a matrix factorization method that finds the latent features or characteristics of a matrix (Stewart, 1993); in CF, SVD is used to discover the relationship between users and items for rating prediction (Paterek, 2007). Beyond these representative CF methods, there is also an approach that mitigates the data sparsity problem by introducing an enhanced similarity among users and determines the ratings probabilistically (Zou, Wang, Wei, Li, & Yang, 2014). (See Jannach, Zanker, Felfernig, and Friedrich (2011) for a thorough survey on CF and recommender systems.)
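As a concrete reference point for the simplest of these baselines (MM, which is also used later as a base predictor), here is a minimal movie-means sketch. It is only an illustration, not the authors' code; the array layout and the global-mean fallback for unrated movies are our own assumptions.

```python
import numpy as np

def movie_means(ratings, mask):
    """ratings: (N_users, M_movies) matrix; mask: 1 where a rating is observed.
    Returns one prediction per movie: its observed mean, or the global mean
    as a fallback when a movie has no ratings (our assumption)."""
    counts = mask.sum(axis=0)
    global_mean = ratings[mask.astype(bool)].mean()
    sums = (ratings * mask).sum(axis=0)
    return np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)

# Toy usage: 3 users x 2 movies, with one missing rating for movie 2.
R = np.array([[4.0, 0.0], [5.0, 2.0], [3.0, 4.0]])
M = np.array([[1, 0], [1, 1], [1, 1]])
print(movie_means(R, M))   # -> [4. 3.]
```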

This paper presents a new CF algorithm based on reinforcement learning, RLCF. Reinforcement learning has been successful in a variety of applications (Busoniu, Babuska, Schutter, & Ernst, 2010; Kaelbling, Littman, & Moore, 1996; Szepesvari, 2010; Wang, Zhang, & Liu, 2009; Xu, Zuo, & Huang, 2014). The motivation for applying reinforcement learning to CF comes from the contrast effect: the enhancement or diminishment of a perception and related performance as a result of immediately previous or simultaneous exposure to a stimulus of smaller or greater value in the same dimension (Thurstone, 1927). The contrast effect can take place in CF. For instance, a user can evaluate the same movie differently depending on what he or she watched previously. Thus, the prediction of a user's rating for an unseen movie can be improved by taking into account the sequence of movies he or she has seen. Against this background, we formulate the contrast effect in CF as a Markov Decision Process (MDP) and apply one of the most popular reinforcement learning algorithms, Q-learning (Sutton & Barto, 1998; Watkins & Dayan, 1992), to it. RLCF thus attempts to improve the performance of CF with the help of reinforcement learning. This is different from methods that directly construct recommender systems on reinforcement learning ideas (e.g. the n-armed bandit problem and online feedback) (Liebman, Saar-Tsechansky, & Stone, 2015; Wang, Wang, Hsu, & Wang, 2014). The performance (in terms of recommendation accuracy) of RLCF depends on the base CF algorithm used in reinforcement learning. Furthermore, an ensemble approach is also proposed to maximize the performance.

This paper is organized as follows: Section 2 introduces the problem definition and formalizes the CF problem as an MDP. Section 3 describes RLCF and its ensemble approach. Section 4 presents the experimental setup and results. Section 5 concludes with a summary and some future research directions.


2. Preliminaries

2.1 Problem Definition

Suppose the user-movie data-set consists of ratings (of 1–5 stars) for N users and M movies. Let X ∈ R^{N×M} be the matrix of the observed ratings, and let A ∈ R^{N×M} be the matrix holding the real (correct) ratings for the (user, movie, rating) entries not contained in X. The performance of a CF algorithm can be measured by the error between the predictions of the algorithm and A. Let I ∈ {0, 1}^{N×M} be an indicator such that:

(1)  I_{ij} = \begin{cases} 1, & \text{if movie } j \text{ is rated by user } i \text{ in } A \\ 0, & \text{if the rating is missing in } A \end{cases}

Let P ∈ R^{N×M} be the matrix of predictions made by the CF algorithm. The goal is to estimate the missing entries of X such that the Root Mean Squared Error (RMSE) is minimal. RMSE is defined as follows:

(2)  \mathrm{RMSE}(P, A) = \sqrt{ \frac{ \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} (A_{ij} - P_{ij})^2 }{ \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} } }

2.2 Problem Formulation as MDP

MDP provides a mathematical framework for sequential decision problems (Bellman, 1957; Puterman, 1990), and has been used for a wide range of optimization problems (Sutton & Barto, 1998). An MDP consists of a tuple of 5 elements (S, A, {P_{sa}}, γ, R), where:

• State S: The set of states S is defined as the ratings of movies. If the ratings are integers from 1 to K, the size of the state space |S| is K. s_t^{(i)} ∈ S denotes the rating of the t-th movie that user i watched.

• Action A: As defined above, s_t^{(i)} and s_{t+1}^{(i)} denote the ratings of the t-th and (t+1)-th movies that user i watched, respectively. State s_t^{(i)} transitions to s_{t+1}^{(i)} by taking action a_t^{(i)} ∈ A. We represent this process as follows:

(3)  s_t^{(i)} \xrightarrow{a_t^{(i)}} s_{t+1}^{(i)}

Note that |A| = |S|.

• Transition probability P_{sa}: We assume that the MDP for CF is deterministic. That is, if a user takes action a_t^{(i)} at state s_t^{(i)}, the transition to the next state s_{t+1}^{(i)} is not random. Since |A| = |S| and the MDP is deterministic, we can see that a_t^{(i)} = s_{t+1}^{(i)} in our problem.

• Discount factor γ: 0 < γ < 1 is called the discount factor, whose role will be described later.

• Reward r: When user i takes action a_t^{(i)} at state s_t^{(i)}, user i receives a reward or penalty signal defined as follows:

(4)  r(s_t^{(i)}, a_t^{(i)}) = s_{t+2}^{(i)} - \mathrm{predictor}(i, t)

where predictor(i, t) is the prediction that a base CF algorithm estimates for the t-th movie that user i watched, and s_{t+2}^{(i)} is the second next state after time t. For instance, the means (or averages) of the ratings, or the ratings predicted by the SVD algorithm, can serve as the base predictor. s_{t+2}^{(i)} can be known from the data-set. (A detailed description of constructing the training data is included in Section 4.3, and the reason why we use the second next state s_{t+2}^{(i)} instead of the next state s_{t+1}^{(i)} is explained in Section 4.4.)
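To make Eq. (4) concrete, the sketch below (an illustration, not the authors' code) computes the reward sequence for a single user, assuming a hypothetical `base_predictor(t)` callable that stands in for the MM/SVD/SVD++ estimate of the t-th movie the user rated.

```python
import numpy as np

def rewards_for_user(ratings, base_predictor):
    """Compute r(s_t, a_t) = s_{t+2} - predictor(i, t) for one user.

    ratings: the user's ratings in chronological order, e.g. [2, 5, 2, 4].
    base_predictor: callable t -> base CF estimate for the t-th watched movie
                    (hypothetical stand-in for MM / SVD / SVD++ output).
    Rewards are only defined while s_{t+2} exists, i.e. for t = 0 .. T-3.
    """
    T = len(ratings)
    return np.array([ratings[t + 2] - base_predictor(t) for t in range(T - 2)])

# Example: a user who rated four movies, with a toy predictor that always
# guesses 3.5 stars (an assumption made only for this illustration).
print(rewards_for_user([2, 5, 2, 4], lambda t: 3.5))   # -> [-1.5  0.5]
```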

A policy is a function π : S → A, which maps states to actions. Given an initial state s_0 and a policy π, the dynamics of our MDP proceed as follows:

(5)  s_0 \xrightarrow{\pi(s_0)} s_1 \xrightarrow{\pi(s_1)} s_2 \xrightarrow{\pi(s_2)} s_3 \xrightarrow{\pi(s_3)} \cdots

We also define the value function V^π(s_0) as follows:

(6)  V^{\pi}(s_0) = r(s_0, \pi(s_0)) + \gamma r(s_1, \pi(s_1)) + \gamma^2 r(s_2, \pi(s_2)) + \cdots
                  = r(s_0, \pi(s_0)) + \gamma [ r(s_1, \pi(s_1)) + \gamma r(s_2, \pi(s_2)) + \cdots ]
                  = r(s_0, \pi(s_0)) + \gamma V^{\pi}(s_1)

V^π(s) is simply the sum of discounted rewards obtained upon starting in state s and taking actions according to π. The optimal value function V*(s) is:

(7)  V^{*}(s) = \max_{\pi} V^{\pi}(s)

In other words, this is the best sum of discounted rewards that can be obtained by executing any policy. Now, we can define the optimal policy π* : S → A as follows:

(8)  \pi^{*}(s) = \arg\max_{\pi} V^{\pi}(s)

Let us define the action-value function Q(s, a) as follows:

(9)  Q(s, a) = r(s, a) + \gamma V^{*}(\delta(s, a))

where δ(s, a) denotes the state resulting from applying action a to state s. Q(s, a) is interpreted as the estimated total sum of rewards that the user in state s obtains by taking the action a. We can redefine the optimal value function as follows:

(10)  V^{*}(s) = \max_{\pi} V^{\pi}(s)                                                  (from Eq. (7))
               = \max_{\pi} [ r(s, \pi(s)) + \gamma V^{\pi}(\delta(s, \pi(s))) ]        (from Eq. (6))
               = \max_{a'} [ r(s, a') + \gamma V^{*}(\delta(s, a')) ]
               = \max_{a'} Q(s, a')                                                     (from Eq. (9))

Then, using Eq. (10), we rewrite Eq. (9) as:

(11)  Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')

Eq. (11) provides the basis for the update rule of the Q-learning algorithm (Sutton & Barto, 1998).

2.3 Generating Episodes

As mentioned earlier, X ∈ R^{N×M} consists of (user, movie, rating) entries. If we use CF data with (user, movie, rating, order) entries, we can generate one episode per user as follows:

s_1^{(1)} \xrightarrow{a_1^{(1)}} s_2^{(1)} \xrightarrow{a_2^{(1)}} s_3^{(1)} \xrightarrow{a_3^{(1)}} \cdots \xrightarrow{a_{T_1-1}^{(1)}} s_{T_1}^{(1)}
s_1^{(2)} \xrightarrow{a_1^{(2)}} s_2^{(2)} \xrightarrow{a_2^{(2)}} s_3^{(2)} \xrightarrow{a_3^{(2)}} \cdots \xrightarrow{a_{T_2-1}^{(2)}} s_{T_2}^{(2)}
  ⋮
s_1^{(N)} \xrightarrow{a_1^{(N)}} s_2^{(N)} \xrightarrow{a_2^{(N)}} s_3^{(N)} \xrightarrow{a_3^{(N)}} \cdots \xrightarrow{a_{T_N-1}^{(N)}} s_{T_N}^{(N)}


Here s_j^{(i)} is the rating that user i gave to the j-th movie he or she watched, and T_i is the length of the i-th episode. For instance, if the (user, movie, rating, order) entries of the user-movie data are as in Table 1 (the entry (1, 1, 2, 1st) means that user 1 watched movie 1 first and gave it 2 stars), the episodes are generated as follows:

2 → 5 → 2
1 → 1 → 3
5 → 3 → 5
1 → 4
2 → 4 → 5 → 5

3. RLCF

RLCF consists of two phases: training and prediction. The former computes the Q-table of the MDP, and the latter predicts the rating for a user-movie pair using both the output of a base CF algorithm and the values in the Q-table.

3.1 Training

Based on the MDP formulation and preliminaries, we learn the Q function using Q-learning, whose update rule can be summarized as (see Sutton and Barto (1998) and Watkins and Dayan (1992) for a detailed derivation):

(12)  Q(s, a) = Q(s, a) + \alpha [ r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a') - Q(s, a) ]

where Q(s, a) is the old estimate of the sum of rewards, r(s, a) + γ max_{a'} Q(δ(s, a), a') is the new estimate of the sum of rewards after a step forward, and α is the learning rate. Since the number of possible actions in state s is |S|, Q(s, a) is an |S| × |S| table. The episodes generated as described in Section 2.3 are used as the training data for Q-learning. The training algorithm is as follows:

Algorithm 1. Training Algorithm in RLCF
1: Input:
   α: learning rate
   γ: discount factor
   T_i: the number of movies user i has rated
2: Output:
   Q(s, a): the estimated sum of rewards
3: Initialize Q(s, a) = 0 for all s ∈ S, a ∈ A;
4: Convert X ∈ R^{N×M} to a training episode set as in Section 2.3;
5: for each user i = 1 : N do
6:   for each rated movie j = 1 : T_i do
7:     Calculate the reward: r(s_j^{(i)}, a_j^{(i)}) = s_{j+2}^{(i)} − predictor(i, j);
8:     Update the Q-function: Q(s, a) = Q(s, a) + α[r(s, a) + γ max_{a'} Q(δ(s, a), a') − Q(s, a)];
9:   end for
10: end for

Algorithm 1 has a polynomial time complexity of O(NM), since both the episode generation (line 4) and Q-learning (lines 5–10) take O(NM) for N users and M movies.
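Below is a minimal sketch of the training phase under our reading of Algorithm 1: episodes are each user's chronologically ordered ratings, states and actions are rating values, and the Q-table is updated with the standard Q-learning rule. The integer rating scale, the `base_predictions` dictionary, and the hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

RATING_VALUES = [1, 2, 3, 4, 5]          # assumed integer star scale; |S| = |A| = 5
IDX = {r: i for i, r in enumerate(RATING_VALUES)}

def train_q_table(episodes, base_predictions, alpha=1e-5, gamma=0.5):
    """episodes: {user: [r_1, r_2, ...]} ratings in watch order.
    base_predictions: {user: [p_1, p_2, ...]} base-CF estimates, same order.
    Returns the |S| x |S| Q-table of Algorithm 1."""
    n = len(RATING_VALUES)
    Q = np.zeros((n, n))
    for user, ratings in episodes.items():
        preds = base_predictions[user]
        # The reward needs s_{t+2}, so the last two steps of each episode are skipped.
        for t in range(len(ratings) - 2):
            s, a = IDX[ratings[t]], IDX[ratings[t + 1]]    # deterministic: a_t = s_{t+1}
            r = ratings[t + 2] - preds[t]                  # Eq. (4)
            s_next = a                                     # delta(s, a) = a in this MDP
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # Eq. (12)
    return Q

# Toy usage with the episodes of Table 1 and a flat 3.5-star base predictor (assumption).
episodes = {1: [2, 5, 2], 2: [1, 1, 3], 3: [5, 3, 5], 4: [1, 4], 5: [2, 4, 5, 5]}
base = {u: [3.5] * len(r) for u, r in episodes.items()}
Q = train_q_table(episodes, base)
```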

3.2 Prediction

The prediction p_j^{(i)} of the entry x_{ij} ∈ X, i.e. the rating that user i gives for movie j, can be calculated by:

(13)  p_j^{(i)} = \mathrm{predictor}(i, j) + Q(s_{j-2}^{(i)}, a_{j-2}^{(i)})

Regarding Eq. (4), the reason why we used the second next state s_{t+2}^{(i)}, and not the next state s_{t+1}^{(i)}, in the reward function r(s_t^{(i)}, a_t^{(i)}) is related to Eq. (6). If we used the next state s_{t+1}^{(i)}, the prediction p_j^{(i)} would have to be defined as:

(14)  p_j^{(i)} = \mathrm{predictor}(i, j) + Q(s_{j-1}^{(i)}, a_{j-1}^{(i)})

However, it is impossible to know Q(s_{j-1}^{(i)}, a_{j-1}^{(i)}) = Q(s_{j-1}^{(i)}, s_j^{(i)}) when we try to predict p_j^{(i)}, because the argument s_j^{(i)} is exactly the rating we want to predict. Therefore, we use the most recent value from the Q-table that can be constructed from the information available up to the second previous time step.
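The prediction step of Eq. (13) then simply adds the learned Q-value for the transition observed two steps back to the base prediction. The sketch below is self-contained but only illustrative; the rating scale and the fallback when fewer than two past ratings exist are our assumptions.

```python
import numpy as np

RATING_VALUES = [1, 2, 3, 4, 5]            # same assumed star scale as the training sketch
IDX = {r: i for i, r in enumerate(RATING_VALUES)}

def predict_rating(Q, prev_ratings, base_prediction):
    """Eq. (13): p_j = predictor(i, j) + Q(s_{j-2}, a_{j-2}), where a_{j-2} = s_{j-1}.

    prev_ratings: the user's past ratings in watch order.
    Falls back to the base prediction when fewer than two past ratings exist
    (our assumption; the paper does not discuss this corner case)."""
    if len(prev_ratings) < 2:
        return base_prediction
    s, a = IDX[prev_ratings[-2]], IDX[prev_ratings[-1]]
    return base_prediction + Q[s, a]

# Example with an untrained (all-zero) Q-table, which reduces to the base prediction.
Q = np.zeros((5, 5))
print(predict_rating(Q, [2, 4, 5, 5], base_prediction=3.5))   # -> 3.5
```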

4. Experiment

4.1 Data-set

We adopted the MovieLens data-set, which contains 10,000,054 ratings applied to 10,681 movies by 71,567 users of the online movie recommender service (Harper & Konstan, 2015). Users who rated at least 20 movies were chosen, and their ratings were sorted in chronological order. The most recent 20% of the ratings are used for testing and the remaining 80% for training.
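A sketch of this preprocessing under our assumptions about the raw file layout (the MovieLens 10M `ratings.dat` with `user::movie::rating::timestamp` rows) and a per-user hold-out split; the paper itself does not describe the exact split code.

```python
import pandas as pd

# Assumed path and format of the MovieLens 10M dump; adjust to the actual file.
cols = ["user", "movie", "rating", "timestamp"]
df = pd.read_csv("ratings.dat", sep="::", engine="python", names=cols)

# Keep users with at least 20 ratings and order each user's history chronologically.
df = df[df.groupby("user")["user"].transform("size") >= 20]
df = df.sort_values(["user", "timestamp"])

# Hold out each user's most recent 20% of ratings for testing (our reading of Section 4.1).
rank = df.groupby("user").cumcount()
size = df.groupby("user")["user"].transform("size")
test = df[rank >= 0.8 * size]
train = df[rank < 0.8 * size]
```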

4.2 Base Predictors

First, we chose the simple movie means algorithm (MM) (Bell & Koren, 2007), which outputs the mean of each movie's ratings, as a base predictor. In addition, we adopted SVD (Paterek, 2007) and SVD++ (Koren, 2008), the latter being an improvement over standard SVD that incorporates implicit feedback. We briefly describe the SVD model in this section. Recall that the user-movie data-set consists of the N users' ratings for M movies. SVD finds a low-rank matrix Y = U'V that minimizes the sum-squared distance to X ∈ R^{N×M} defined in Section 2. That is,

(15)  \min_{U, V} \sum_{x_{ij} \in X} \left( x_{ij} - \sum_{k=1}^{K} U_{ik} V'_{kj} \right)^{2} + \lambda \left( \lVert U_{i\cdot} \rVert^{2} + \lVert V_{j\cdot} \rVert^{2} \right)

The regularization term λ(‖U_{i·}‖² + ‖V_{j·}‖²) restricts the domains of U and V in order to prevent over-fitting, so that the resulting model has good generalization capability. The minimization is performed by stochastic gradient descent. SVD++ is an improvement of SVD; see the reference for a detailed description of the algorithm.
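For reference, here is a compact SGD sketch of the regularized factorization in Eq. (15). The learning rate, regularization weight, and rank are illustrative, and SVD++'s implicit-feedback terms are omitted; this is not the authors' implementation.

```python
import numpy as np

def svd_sgd(triples, n_users, n_movies, k=10, lr=0.01, lam=0.05, epochs=20, seed=0):
    """Fit X ~= U V' on observed (user, movie, rating) triples by minimizing Eq. (15)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_movies, k))
    for _ in range(epochs):
        for i, j, x in triples:
            err = x - U[i] @ V[j]                     # residual on the observed entry
            U[i] += lr * (err * V[j] - lam * U[i])    # gradient step with L2 shrinkage
            V[j] += lr * (err * U[i] - lam * V[j])
    return U, V

# Toy usage: three observed ratings in a 2-user x 3-movie matrix.
U, V = svd_sgd([(0, 0, 4.0), (0, 2, 1.0), (1, 1, 5.0)], n_users=2, n_movies=3)
prediction = U[0] @ V[1]   # estimate of the missing (user 0, movie 1) entry
```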

4.3 Ensemble

As mentioned in Section 1, we also considered combinations of weak predictors to produce a stronger one. The technique we used to generate the ensemble is a linear combination of the individual predictors. In other words, given two predictors p_1 and p_2, we obtain a family of predictors p_ensemble = w_1 p_1 + w_2 p_2, and use linear regression to find the weights w_1, w_2 minimizing the RMSE of p_ensemble. Therefore, we are guaranteed to obtain a p_ensemble that is at least as good as p_1 or p_2.
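A sketch of this blending step: stack the individual predictors' outputs and solve the least-squares problem for the mixing weights, here via `numpy.linalg.lstsq`. The paper states only that linear regression is used, so the hold-out protocol and the absence of an intercept are our assumptions.

```python
import numpy as np

def blend_weights(preds, targets):
    """preds: (n_samples, n_predictors) matrix of individual predictions.
    targets: (n_samples,) true ratings.
    Returns the weights w minimizing the squared error of preds @ w."""
    w, *_ = np.linalg.lstsq(preds, targets, rcond=None)
    return w

# Toy usage: two predictors blended on five held-out ratings.
p1 = np.array([3.9, 2.1, 4.8, 3.0, 1.5])
p2 = np.array([4.1, 2.4, 4.5, 3.2, 1.9])
y  = np.array([4.0, 2.0, 5.0, 3.0, 2.0])
w = blend_weights(np.column_stack([p1, p2]), y)
ensemble = np.column_stack([p1, p2]) @ w
rmse = np.sqrt(np.mean((ensemble - y) ** 2))
```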

Table 1. Sample (user, movie, rating, order) Data.

         Movie 1    Movie 2    Movie 3    Movie 4    Movie 5
User 1   (2, 1st)   missing    (5, 2nd)   missing    (2, 3rd)
User 2   (1, 1st)   (1, 2nd)   missing    (3, 3rd)   missing
User 3   (5, 1st)   missing    missing    (3, 2nd)   (5, 3rd)
User 4   (1, 1st)   missing    missing    missing    (4, 2nd)
User 5   (2, 1st)   (4, 2nd)   (5, 3rd)   (5, 4th)   missing


4.4 Results

The state set S is defined over ratings in [0.5, 5.0] with 0.5 intervals, making |S| = 10, so Q(s, a) is a 10 × 10 table. The resulting Q(s, a) of RLCF is depicted in Figure 1. Figures 1(a) and (b) represent the numerical values and the 3D surface of Q(s, a), respectively, when the predictor in Eq. (4) is MM (Case 1). Figures 1(c) and (d) represent those when the predictor is SVD (Case 2). Note that the height difference between the surfaces in Case 1 is greater than that in Case 2, which shows the effectiveness of reinforcement learning with MM. It is also shown that the state transition from s_{t-2} to s_{t-1} does influence s_t, which verifies the main idea of RLCF. For example, as shown in Figure 1(a), Q(5, 5) indicates that users tend to raise their rating by about 0.6 points when they gave 5 points to the (t-2)-th and (t-1)-th movies they watched.

Figure 1. Learned RLCF.

Next, we compared the RMSE of RLCF combined with different base predictors against that of the pure base predictors. We trained the SVD and SVD++ predictors with 10 latent features. There are two parameters in RLCF to be tuned: the learning rate α and the discount factor γ of Q-learning. Table 2 displays the experimental results for each predictor with the parameter settings, which were determined experimentally. As expected, SVD++ produced the best performance while MM produced the worst. RLCF improved the performance regardless of the base predictor: MM, SVD, and SVD++ with RLCF reduced the RMSE by 0.0272, 0.0044, and 0.0046, respectively, compared with their base predictors. This verifies our idea of incorporating the contrast effect via reinforcement learning.

Table 2. Performance of RLCF.

Algorithm                   Base     RLCF     Improvement   γ      α
MM (Bell & Koren, 2007)     0.9381   0.9109   0.0272        0.5    0.000003
SVD (Paterek, 2007)         0.8014   0.7970   0.0044        0.5    0.000006
SVD++ (Koren, 2008)         0.8000   0.7954   0.0046        0.65   0.000006

Finally, we evaluated ensemble predictors. Table 3 shows the RMSE for each individual predictor and for diverse ensemble predictors. We adopted MM in RLCF, generating MMRLCF, since it produced the biggest performance gain as shown above.

Table 3. Performance of Ensembles.

Ensemble ID     Algorithm   RMSE
#1              SVD++       0.8000
#2              SVD         0.8014
#3              MM          0.9381
#4              MMRLCF      0.9109
#1+#2           Ensemble    0.7970
#1+#2+#3        Ensemble    0.7967
#1+#2+#3+#4     Ensemble    0.7907

Figure 2 plots the reduction of RMSE as each predictor is incrementally added, starting with SVD++. We can see that the performance of MMRLCF by itself is not better than that of any other predictor; however, when it was added to the ensemble, it reduced the RMSE by as much as 0.006 (from 0.7967 to 0.7907). Our ensemble method of combining base predictors with RLCF lends itself to capturing the hidden relationships among user ratings and thus produces improved performance over simple CF algorithms.

Figure 2. Ensemble effect.

5. Conclusion

We presented a new reinforcement learning based algorithm, RLCF, for CF. We first formalized CF as an MDP to model the effect that the sequence of a user's previous ratings has on the current rating. Then, the effect is learned and recorded in a Q-table via Q-learning. The experimental results show that the order of past ratings affects the current ratings significantly, and is thus useful for eliciting more accurate predictions. Furthermore, we also proposed an ensemble approach to combine pure predictors with RLCF and verified its performance.

There are some interesting directions for future work. When we formalize CF as an MDP, states can be defined by other factors besides ratings. For example, the genre, the season, and diverse tags can be defined as states in the MDP. If the movie genre is used, RLCF can discover the effect on ratings for action movies when a user watches an action movie after having watched a comedy. In addition, different CF algorithms (e.g. Restricted Boltzmann Machines (Salakhutdinov et al., 2007), Latent Factor Transition (Zhang, Wang, Yu, Sun, & Lim, 2014)), possibly combined with new learning paradigms (e.g. deep learning (Bengio, 2009; Hinton, Osindero, & Teh, 2006; Salakhutdinov & Hinton, 2009)) and ensemble methods, can be considered. Finally, additional datasets can be used to evaluate the performance of RLCF if temporal information is available.

Acknowledgements

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No. 2012M3C4A7033348) and by the ICT R&D program of MSIP/IITP (R0126-16-1112, Development of Media Application Framework based on Multi-modality which enables Personal Media Reconstruction).

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes on Contributors

Jungkyu Lee is a researcher at Daum Kakao Corp. He received an M.S. degree in Computer Science and Engineering from Sogang University in 2010, and has been a member of the Data Mining Laboratory of Sogang University since 2009. His current research interests focus on machine learning, computational intelligence, and data mining. In particular, he is interested in recommender systems, reinforcement learning, and link analysis.

Byonghwa Oh is a researcher in the Algorithm Lab at Hyundai Card Corp. He received M.S. and Ph.D. degrees in Computer Science and Engineering from Sogang University, Korea, in 2009 and 2016, respectively. His research interests focus on machine learning, computational intelligence, and data mining. In particular, he is interested in recommender systems, semi-supervised learning, and reinforcement learning.

Jihoon Yang is a professor of Computer Science and Engineering at Sogang University. His research interests include machine learning, data mining and knowledge discovery, artificial intelligence, pattern recognition, evolutionary computation, and bioinformatics. He holds a B.S. in Computer Science from Sogang University, and M.S. and Ph.D. degrees in Computer Science from Iowa State University.

Unsang Park received B.S. and M.S. degrees from the Department of Materials Engineering, Hanyang University, South Korea, in 1998 and 2000, respectively. He received M.S. and Ph.D. degrees from the Department of Computer Science and Engineering, Michigan State University, in 2004 and 2009, respectively. Since 2012, he has been an assistant professor in the Department of Computer Science and Engineering at Sogang University. His research interests include pattern recognition, image processing, computer vision, and machine learning.

References

Altman, N. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. American Statistician, 46, 175–185.
Bell, R., & Koren, Y. (2007). Improved neighborhood-based collaborative filtering. KDD Cup and Workshop at the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 7–14.
Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6, 679–684.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1–127.
Busoniu, L., Babuska, R., Schutter, B., & Ernst, D. (2010). Reinforcement learning and dynamic programming using function approximators. Boca Raton: CRC Press.
Harper, F.M., & Konstan, J.A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5, Article 19, 1–19. doi: 10.1145/2827872
Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.
Hinton, G., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2011). Recommender systems: An introduction. New York, NY: Cambridge University Press.
Kaelbling, L., Littman, M., & Moore, A. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Koren, Y. (2008). Factorization meets the neighborhood: A multifaceted collaborative filtering model. Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Liebman, E., Saar-Tsechansky, M., & Stone, P. (2015). DJ-MC: A reinforcement learning agent for music playlist recommendation. Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems, 591–599.
Paterek, A. (2007). Improving regularized singular value decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop.
Puterman, M.L. (1990). Markov decision processes. Handbooks in Operations Research and Management Science, 2, 331–433.
Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 448–455.
Salakhutdinov, R., Mnih, A., & Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. Proceedings of the 24th International Conference on Machine Learning, 791–798.
Stewart, G. (1993). On the early history of the singular value decomposition. SIAM Review, 35, 551–566.
Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
Szepesvari, C. (2010). Algorithms for reinforcement learning. San Rafael, CA: Morgan and Claypool Publishers.
Thurstone, L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.
Wang, F., Zhang, H., & Liu, D. (2009). Adaptive dynamic programming: An introduction. IEEE Computational Intelligence Magazine, 4, 39–47.
Wang, X., Wang, Y., Hsu, D., & Wang, Y. (2014). Exploration in interactive personalized music recommendation: A reinforcement learning approach. ACM Transactions on Multimedia Computing, Communications, and Applications, 11, 1–22.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
Xu, X., Zuo, L., & Huang, Z. (2014). Reinforcement learning algorithms with function approximation: Recent advances and applications. Information Sciences, 261, 1–31.
Zhang, C., Wang, K., Yu, H., Sun, J., & Lim, E. (2014). Latent factor transition for dynamic collaborative filtering. Proceedings of the SIAM International Conference on Data Mining, 452–460.
Zhong, H., Zhang, S., Wang, Y., & Shu, Y. (2014). Study on directed trust graph based recommendation for e-commerce system. International Journal of Computers Communications & Control, 9, 510–523.
Zou, T., Wang, Y., Wei, X., Li, Z., & Yang, G. (2014). An effective collaborative filtering via enhanced similarity and probability interval prediction. Intelligent Automation and Soft Computing, 20, 555–566.