
Personalized Key Frame Recommendation

Xu Chen* (Tsinghua University, [email protected])
Yongfeng Zhang* (University of Massachusetts Amherst, [email protected])
Qingyao Ai (University of Massachusetts Amherst, [email protected])
Hongteng Xu (Georgia Institute of Technology, [email protected])
Junchi Yan (East China Normal University; IBM Research -- China, [email protected])
Zheng Qin† (Tsinghua University, [email protected])

ABSTRACT
Key frames play a very important role in many video applications, such as on-line movie preview and video information retrieval. Although a number of key frame selection methods have been proposed in the past, existing technologies mainly focus on how to precisely summarize the video content, but seldom take user preferences into consideration. However, in real scenarios, people may cast diverse interests on the content even of the same video, and thus may be attracted by quite different key frames, which makes the selection of key frames an inherently personalized process. In this paper, we propose and investigate the problem of personalized key frame recommendation to bridge the above gap. To do so, we make use of video images and user time-synchronized comments to design a novel key frame recommender that can simultaneously model visual and textual features in a unified framework. By personalizing on each user's previously reviewed frames and posted comments, we are able to encode different user interests in a unified multi-modal space, and can thus select key frames in a personalized manner, which, to the best of our knowledge, is the first attempt in the research field of video content analysis. Experimental results show that our method performs better than its competitors on various measures.

KEYWORDS
Key Frame; Personalization; Recommender Systems; Collaborative Filtering; Video Content Analysis

ACM Reference format:
Xu Chen*, Yongfeng Zhang*, Qingyao Ai, Hongteng Xu, Junchi Yan, and Zheng Qin†. 2017. Personalized Key Frame Recommendation. In Proceedings of SIGIR '17, Shinjuku, Tokyo, Japan, August 07-11, 2017, 10 pages.
DOI: http://dx.doi.org/10.1145/3077136.3080776

* Both authors contributed equally to this work.
† Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGIR '17, Shinjuku, Tokyo, Japan
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 978-1-4503-5022-8/17/08...$15.00
DOI: http://dx.doi.org/10.1145/3077136.3080776

1 INTRODUCTION
All along, key frames have been of fundamental importance for video-based applications. For example, users of on-line movie websites can readily preview a video with the help of the exhibited key frames even without watching the whole content, and video search engines can efficiently return results by matching queries against the key frames of the candidate items. Although key frame extraction methods [10, 22, 58] have been widely investigated in the past, the proposed technologies mostly aim to summarize the video content, but seldom consider user preferences on the extracted key frames. However, in many applications, different people may be interested in different parts of the video content, and thus may be attracted by quite different key frames. Without the guidance of tailored key frames, users may easily miss their potentially favorite videos. Therefore, in this paper, we would like to ask "whether it is possible to design an effective model to select and recommend personalized key frames according to users' different tastes."

In real scenarios, the main challenge in answering the above question is the lack of users' personalized interaction information that reveals their "frame-level" viewing preferences. Fortunately, the emergence of video-sharing websites such as Niconico1, BiliBili2, and AcFun3 may shed light on this problem: there, users are allowed to express opinions directly on the frames of interest through time-synchronized comments (or TSCs, first introduced in [49], see Figure 1) in a real-time manner. Intuitively, the user behavior of commenting on a frame can be regarded as implicit feedback reflecting frame-level preference, while the image features of the reviewed frame and the text features in the posted time-synchronized comment can further help to model user-specific (or finer-grained) preferences from different perspectives. For example, in Figure 1, user A expresses her preference on a frame with the time-synchronized comment "… I like his overcoat, it looks cool and also must be very comfortable with good quality". On one hand, from the content of the time-synchronized comment we can tell that the user expresses a positive sentiment on this frame, and that the particular aspect that attracts her attention is the quality of the clothes. On the other hand, from the frame image, we can further infer the visual features of her interests, such as clothing texture, which are usually difficult to describe with text.

1 http://www.nicovideo.jp
2 http://www.bilibili.com
3 http://www.acfun.cn



By leveraging all the historical implicit feedback as well as the features (image and text) of users' interest, we can collaboratively match a target user with her potentially favorite frames.

Based on the above motivation and intuition, we describe and analyze a novel Key Frame Recommender that models user time-synchronized Comments and the key frame Images simultaneously (called KFRCI). The main building block of our proposed method is to integrate the power of model-based collaborative filtering and a long short-term memory network. The carefully designed collaborative filtering component aims to capture user preferences based on image features, while the modified long short-term memory network component aims to model user time-synchronized comments to excavate personalized opinions toward different frames. Furthermore, by integrating these two components, we build a unified framework that can encode user preferences in a multi-modal space so as to facilitate comprehensive user profiling and accurate key frame recommendation.

Compared with previous works, the main contributions of our paper are as follows:

• We propose and investigate the problem of personalized key frame recommendation; to the best of our knowledge, this is the first work to select video key frames based on users' personalized preferences.

• To better address the above novel problem, we present a novel neural architecture that combines collaborative filtering and a long short-term memory network to jointly model user time-synchronized comments and video key frame images.

• We conduct extensive experiments to demonstrate the effectiveness of our proposed models, and we also present detailed analysis of the parameters as well as the effects of different information sources (image and text) in our framework.

In the rest of the paper, we first introduce the related work in Section 2, and then formally define our problem in Section 3. Our framework is illustrated in Section 4. In Section 5, we verify the effectiveness of our method with experimental results. Conclusions and outlooks of this work are presented in Section 6.

2 RELATED WORK

2.1 Review-based Recommendation

Recommender systems are a well-studied field with many effective models proposed [5, 7, 17, 19, 35, 39]. Recently, for better capturing user preferences, user reviews have attracted more and more research interest [3, 42, 43], and many review-based models have been proposed to improve the recommendation performance [2, 4, 9, 11, 29, 31, 37, 41, 45, 46, 51, 56] or enhance the interpretability [16, 38, 50, 57].

According to the review processing methods, these models can generally be classified into two categories. On one hand, some methods leverage the review text on the document or review level, taking every piece of user review as a whole for global analysis. Specifically, [31, 41] link the latent factors in rating data with the topics in the textual reviews to generate more accurate recommendations, and [9, 51] propose to leverage probabilistic graphical methods to include more flexible prior knowledge for review modeling.

Figure 1: A simple example of TSCs. Different users may express real-time opinions directly upon their interested frames. The comments are manually translated into English by the authors.

To better capture the local semantic information in user reviews, [56] combines traditional matrix factorization technology with word2vec [34] for more precise review modeling and recommendation.

On the other hand, some approaches try to leverage textual reviews on the feature or aspect level, first extracting product features and user sentiments from user reviews, and then representing the unstructured free-text reviews as structured feature-opinion pairs to facilitate finer-grained user preference modeling. Particularly, [57] uses multi-matrix factorization to generate explainable recommendations based on the extracted product features. [4] further captures users' favorite product features in a learning-to-rank manner.

Compared with the aforementioned methods, we leverage a long short-term memory network in our model to capture the word sequential properties mirrored in the time-synchronized comments, which have mostly been ignored by previous works.

2.2 Image-based Recommendation

Recently, there is a trend to incorporate visual features into the research of personalized recommendation [8, 15, 32]. Specifically, [15] infuses image features into a traditional ranking-based method to improve the performance of Top-N recommendation. [8] combines the product images and item descriptions together to make dynamic predictions. [32] leverages visual features to discover different relationships between items.

Generally, these methods aim to take advantage of image features to boost the performance of traditional item recommendation, such as product recommendation in E-commerce. Instead, we aim at a very different task of key frame (which is itself an image) recommendation, where the image is not side information but the target to process. Besides, previous works did not capture user preference from the time-synchronized comments, which is another main difference when compared with our model.

2.3 Time-synchronized Comments

Time-synchronized comments (TSCs) were first introduced in [49] for automatic video shot tagging, where the authors proposed a novel method to extract time-synchronized video tags by automatically exploiting crowd-sourced comments.



[52] further leverages TSCs to extract highlight shots for a video with a frequency-based method. However, the extracted highlight shots are static and cannot provide tailored key frames for different users, which is the inherent difference from the personalized key frame recommendation task targeted in our work.

2.4 Key Frame Extraction

A research direction similar to our work is key frame extraction (or video summarization), which has attracted much research interest, and many models have been proposed in the past. Early works [13, 28] usually extract visual features of frames first, and then cluster frames accordingly to generate key frames. To improve the performance, other types of information beyond visual features are considered in recent work, including viewer attention [30, 53, 54], audio signals [23], subtitles [27], etc. Moreover, semantic information has also been exploited to summarize videos, including special events [47, 48], key people and objects [24, 26], and story-lines [25]. Compared with these models, our work makes a further step forward by distinguishing user preferences on the extracted key frames.

3 PRELIMINARIES AND PROBLEM DEFINITION

3.1 Dataset Inspection

The data used in our work is crawled from a well-known video-sharing website, Bilibili4. We obtained the time-synchronized comments of the movie category up to December 10th, 2015. To better understand this dataset, we conducted preliminary statistical analyses, which are presented in Table 1 and Figure 2.

Table 1: Overall statistics of the time-synchronized comment (TSC) dataset.

Total number of users (#users): 1,133,750
Total number of movies (#movies): 7,166
Total number of TSCs (#TSCs): 11,842,166
Ave. TSCs per movie: 1,652.5
Ave. users per movie: 465.9
Ave. TSCs per user: 10.4
Max/Min number of TSCs for a movie: 8,028 / 101
Max/Min number of users for a movie: 3,370 / 1
Max/Min number of TSCs for a user: 68,236 / 1

As can be seen in Figure 2, the quantity of active users is relatively small, and most users only sent a small number of TSCs, which conforms to the "long tail" theory that is frequently observed in user behavior analysis. Similar results can also be found between the number of users and movies.

3.2 Problem Formalization

Suppose there are $N$ users $u = \{u_1, u_2, ..., u_N\}$ and $M$ videos $v = \{v_1, v_2, ..., v_M\}$. We first segment each movie $v_i \in v$ into $L$ shots $l_{v_i} = \{l^1_{v_i}, l^2_{v_i}, ..., l^L_{v_i}\}$, and then generate key frames correspondingly by leveraging existing methods such as [33, 36].

4 http://www.bilibili.com

Figure 2: On the left is the relation between the number of comments and users, while on the right is the relation between the number of users and movies.

The key frame in shot $l^j_{v_i}$ is defined as $k^j_{v_i}$, and $K$ represents the whole set of key frames among all the videos.

Suppose the visual feature of key frame $k$ is defined as $vsl_k$, and let user $u$'s time-synchronized comment on key frame $k$ be $tsc_{uk}$, with sentiment polarity $pol_{uk}$, which is determined by the Stanford sentiment analysis toolkit5. The word list in $tsc_{uk}$ is defined as $w_{tsc_{uk}} = \{w^0_{tsc_{uk}}, w^1_{tsc_{uk}}, ..., w^{s_{uk}-1}_{tsc_{uk}}\}$, where $s_{uk}$ is the length of the comment.

Given all the visual features $VSL = \{vsl_k \mid k \in K\}$, users' historical time-synchronized comments $W = \{w_{tsc_{uk}} \mid u \in u, k \in K\}$, and their corresponding sentiments $POL = \{pol_{uk} \mid u \in u, k \in K\}$, for a target user $u$ and one of her unseen movies $v_i \in v$ with pre-selected key frames $\{k^1_{v_i}, k^2_{v_i}, ..., k^L_{v_i}\}$, our task is to find a function $g(\cdot)$ that re-ranks these key frames according to $u$'s interest, that is, $g(\{k^1_{v_i}, k^2_{v_i}, ..., k^L_{v_i}\} \mid u, W, POL, VSL) = \{k^{o_1}_{v_i}, k^{o_2}_{v_i}, ..., k^{o_L}_{v_i}\}$, where $\{o_1, o_2, ..., o_L\}$ is an ordering of $\{1, 2, ..., L\}$. The top $n$ key frames among the final results are then recommended (shown) to user $u$. For a clearer presentation, we list the notations used throughout the paper in Table 2.

4 KEY FRAME RECOMMENDER

We first propose an improved collaborative filtering method to capture users' frame-level preferences by making use of image visual features. Then, to model the textual features mirrored in time-synchronized comments, we modify the long short-term memory network by infusing per-user personalization information. Lastly, we design a unified framework to jointly model frame images as well as time-synchronized comments.

For clarity and completeness, we first re-describe the widely used matrix factorization (MF) model as a neural network. Formally, let $p_u$ and $q_k$ represent the latent factors of user $u$ and item $k$; then the likeness (or score) of $u$ to $k$ can be predicted as $y_{uk} = p_u^T q_k$. In the context of a neural network (see Figure 3), the user/item IDs in one-hot format can be seen as inputs fed into the architecture; the embedding layer then projects these sparse representations into denser vectors, which can be regarded as the latent factors in matrix factorization models. At last, the final result $y_{uk}$ is computed as the inner product between $p_u$ and $q_k$.
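To make this view of MF as a neural network concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' implementation); the class and variable names (`MF`, `n_users`, `n_frames`, `dim`) are ours.

```python
import torch
import torch.nn as nn

class MF(nn.Module):
    """Matrix factorization viewed as a neural network:
    one-hot IDs -> embedding lookup -> inner product."""
    def __init__(self, n_users, n_frames, dim):
        super().__init__()
        self.p = nn.Embedding(n_users, dim)   # user latent factors p_u
        self.q = nn.Embedding(n_frames, dim)  # item/frame latent factors q_k

    def forward(self, u, k):
        # y_uk = p_u^T q_k, computed as an element-wise product followed by a sum
        return (self.p(u) * self.q(k)).sum(dim=-1)

model = MF(n_users=1000, n_frames=5000, dim=100)
score = model(torch.tensor([3]), torch.tensor([42]))  # predicted likeness y_uk
```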

5 https://nlp.stanford.edu/sentiment/



Table 2: Notations and descriptions.

$u$: The set of $N$ users $\{u_1, u_2, ..., u_N\}$.
$v$: The set of $M$ movies $\{v_1, v_2, ..., v_M\}$.
$l_{v_i}$: The set of $L$ shots $\{l^1_{v_i}, l^2_{v_i}, ..., l^L_{v_i}\}$ for movie $v_i$.
$k$, $k^j_{v_i}$, $K$: An arbitrary key frame, the key frame of shot $l^j_{v_i}$, and the set of all key frames.
$vsl_k$, $VSL$: The preprocessed visual feature of frame $k$, and the set of all visual features.
$tsc_{uk}$, $pol_{uk}$: $u$'s time-synchronized comment on key frame $k$, and the polarity of $tsc_{uk}$.
$w_{tsc_{uk}}$, $W$: The word list $\{w^0_{tsc_{uk}}, w^1_{tsc_{uk}}, ..., w^{s_{uk}-1}_{tsc_{uk}}\}$ in $tsc_{uk}$, and the set of all time-synchronized comments.
$p_u$, $q_k$: Latent factors of user $u$ and frame $k$.
$O^+$, $O^-$: The sets of positive and sampled negative feedbacks.
$K_{neg}$: Number of negative instances.
$N_{word}$: Size of the word vocabulary.
$h_t$: The hidden state of the LSTM at iteration $t$.
$D$: The number of non-linear layers.
$e^{pre_0}_{uk}, e^{pre_1}_{uk}, ..., e^{pre_D}_{uk}$: Preference embeddings of $u$ on $k$.
$W^{image}$, $W^i$, $w^{output}$: The weighting matrix that maps $vsl_k$ into a $d$-dimensional vector, the coefficient matrix used to weight $e^{pre_{i-1}}_{uk}$, and a vector that maps $e^{pre_D}_{uk}$ into a scalar.
$MERGE(\cdot)$, $LOGISTIC(\cdot)$, $LSTM(\cdot)$: The merge function, logistic function, and LSTM network.
$y^{image}_{uk}$, $y^{TSC}_{uk}$, $y^{integrated}_{uk}$: User $u$'s predicted likeness score for frame $k$ when using image, TSC, and integrated information.

4.1 Image-based Model

To capture user preferences from frame images, we fuse the visual features into the above framework. Specifically, our principled design is shown in Figure 4. Suppose the dimension of the user/frame latent factors (embeddings) is $d$; then each visual feature is first mapped into a $d$-dimensional vector, which is merged with the frame latent factor to generate a new embedding (blue circle in the figure). Lastly, the user latent factors together with the newly generated embedding are fed into the inner product layer to compute the final prediction.

Image-enhanced key frame representation. Again, let $p_u$ and $q_k$ be the latent factors of user $u$ and key frame $k$, respectively. We first use the Caffe deep learning framework [21] to generate visual features from the original frame images.

Figure 3: Matrix Factorization as a Neural Network.

Specifically, we adopted the Caffe reference model with 5 convolutional layers followed by 3 fully-connected layers, which has been pre-trained on 1.2 million ImageNet (ILSVRC2010) images. For frame $k$, we use the output of FC7, namely the second fully-connected layer, as the final visual feature $vsl_k$, which is a vector of length 4096.
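For concreteness, a sketch of the standard Caffe Python workflow for extracting FC7 activations is shown below; the model/file paths and the 227x227 input size are placeholders taken from the usual CaffeNet deployment example, and the paper does not specify the exact preprocessing the authors used.

```python
import caffe
import numpy as np

caffe.set_mode_cpu()
# Placeholder paths for the Caffe reference (CaffeNet) model files.
net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel', caffe.TEST)

# Standard CaffeNet preprocessing: HWC->CHW, mean subtraction, [0,255] scale, RGB->BGR.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', np.load('ilsvrc_2012_mean.npy').mean(1).mean(1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

img = caffe.io.load_image('frame.jpg')          # one key frame image
net.blobs['data'].reshape(1, 3, 227, 227)
net.blobs['data'].data[...] = transformer.preprocess('data', img)
net.forward()
vsl_k = net.blobs['fc7'].data[0].copy()         # 4096-dim FC7 feature vsl_k
```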

Let $W^{image} \in \mathbb{R}^{d \times 4096}$ be the weighting matrix that maps $vsl_k$ into a $d$-dimensional vector; then the new key frame representation can be derived as:

$$q^*_k = MERGE(q_k, W^{image} \cdot vsl_k) \qquad (1)$$

where $MERGE: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a function that merges two $d$-dimensional vectors into one. The particular choice of $MERGE$ in our model is a simple element-wise multiplication, i.e.,

$$MERGE\big((a_1, a_2, ..., a_d), (b_1, b_2, ..., b_d)\big) = (a_1 b_1, a_2 b_2, ..., a_d b_d) \qquad (2)$$

However, it is not necessarily restricted to this function, and many other choices can be used in practice according to the specific application scenario.
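As a small illustration of Equation (2) and of the remark that other merge functions are possible, here is a sketch of two choices; the concatenate-and-project variant is purely our example and is not used in the paper.

```python
import torch
import torch.nn as nn

def merge_elementwise(a, b):
    """Equation (2): element-wise (Hadamard) product of two d-dimensional vectors."""
    return a * b

class MergeConcat(nn.Module):
    """An alternative MERGE (our illustration): concatenate and project back to d dims."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(2 * d, d)

    def forward(self, a, b):
        return self.proj(torch.cat([a, b], dim=-1))
```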

Sentiment-based user preference modeling. Intuitively, users would comment on their favorite frames with positive sentiment. Therefore, we first determine the polarity of each time-synchronized comment with the Stanford sentiment analysis toolkit6, and then, for simplicity, we set the polarity of $tsc_{uk}$ (i.e., $pol_{uk}$) as 1 if the result is very positive, positive, or neutral, and 0 otherwise.
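A minimal sketch of this binarization step is given below; the label strings are assumptions about how the five-class output of the sentiment toolkit is reported.

```python
def binarize_polarity(label):
    """Map a five-class sentiment label to pol_uk: 1 for very positive,
    positive, or neutral; 0 for negative or very negative (assumed label names)."""
    return 1 if label.strip().lower() in {"very positive", "positive", "neutral"} else 0
```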

In our framework, we take the prediction of a user's likeness to a frame as a binary classification problem, where 1 means a user likes a frame, and 0 otherwise; the likeness of user $u$ to frame $k$, i.e., $y^{image}_{uk} \in [0, 1]$, can be predicted as:

$$y^{image}_{uk} = LOGISTIC(p_u \cdot q^*_k) \qquad (3)$$

where $LOGISTIC(x) = \frac{1}{1 + e^{-x}}$ is the logistic function, and "$\cdot$" denotes the inner product. Then we use the binary cross-entropy as our loss function to model user preference, whose superiority has been explored and demonstrated in [18], and the final objective function to be maximized is:

6 https://nlp.stanford.edu/sentiment/



Figure 4: The framework of the image-based model. The preprocessed image feature is merged with the frame latent factors to derive a new embedding, which is then multiplied by the user latent factors to generate the prediction.

$$L_1 = \log \prod_{(u,k)} (y^{image}_{uk})^{y_{uk}} (1 - y^{image}_{uk})^{1 - y_{uk}} = \log \prod_{(u,k) \in O^+} y^{image}_{uk} \prod_{(u,k) \in O^-} (1 - y^{image}_{uk}) = \sum_{(u,k) \in O^+} \log y^{image}_{uk} + \sum_{(u,k) \in O^-} \log (1 - y^{image}_{uk}) \qquad (4)$$

where $y_{uk}$ is the ground truth, which is 1 if $u$ has commented on $k$ with $pol_{uk} = 1$, and 0 otherwise. $O^+$ is the set of positive feedbacks, i.e., $O^+ = \{(u,k) \mid u \text{ has commented on } k, pol_{uk} = 1\}$, while $O^-$ is the set of sampled negative feedbacks, namely $O^- \subseteq O^{neg} \cup O^*$, where $O^{neg} = \{(u,k) \mid u \text{ has commented on } k, pol_{uk} = 0\}$ and $O^* = \{(u,k) \mid u \text{ has not commented on } k\}$. In the training phase, for each positive feedback $(u,k)$, we uniformly sample $K_{neg}$ negative instances, and the parameters can be learned via stochastic gradient descent (SGD).
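Putting Equations (1), (3) and (4) together, the following PyTorch sketch (our simplified illustration, not the released code) shows the image-based model and one training step; for brevity, negatives are sampled uniformly over all frames rather than from $O^{neg} \cup O^*$.

```python
import torch
import torch.nn as nn

class KFRI(nn.Module):
    """Sketch of the image-based recommender (Section 4.1), assuming
    4096-dim pre-extracted visual features and d-dim latent factors."""
    def __init__(self, n_users, n_frames, d):
        super().__init__()
        self.p = nn.Embedding(n_users, d)              # user latent factors p_u
        self.q = nn.Embedding(n_frames, d)             # frame latent factors q_k
        self.w_image = nn.Linear(4096, d, bias=False)  # W^image

    def forward(self, u, k, vsl):
        q_star = self.q(k) * self.w_image(vsl)              # Eq. (1), element-wise MERGE
        return torch.sigmoid((self.p(u) * q_star).sum(-1))  # Eq. (3)

def train_step(model, opt, users, pos_frames, frame_feats, n_frames, k_neg=5):
    """One SGD step on (the negation of) Eq. (4): positives plus K_neg sampled negatives."""
    neg_users = users.repeat_interleave(k_neg)
    neg_frames = torch.randint(0, n_frames, (len(neg_users),))
    y_pos = model(users, pos_frames, frame_feats[pos_frames])
    y_neg = model(neg_users, neg_frames, frame_feats[neg_frames])
    loss = -(torch.log(y_pos + 1e-10).sum() + torch.log(1 - y_neg + 1e-10).sum())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```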

4.2 Text-based Model

Existing review-based recommendation methods mostly consider the words in a comment as independent elements, and they usually ignore the word sequential information, which is nevertheless very important for understanding the semantics of a comment. In Figure 1, for example, user D wrote the review "A tall man and a short woman"; if we leave out the word sequential information, it would be computationally identical to "A tall woman and a short man", which obviously expresses a completely opposite meaning.

To capture the word sequential information, we make use of the long short-term memory (LSTM) [20] network, which has been successfully applied to a number of sequence modeling tasks such as machine translation [1], image captioning [44], and video classification [55].

Preference-aware LSTM. Intuitively, the content of a time-synchronized comment on a frame is influenced by both the user preference and the frame itself.

Figure 5: The framework of the text-based model. The preference embedding $e^{pre_0}_{uk}$ is fed as an extra input to the LSTM at each step. The likeness score of user $u$ to frame $k$ can be simply predicted by applying the logistic function to the inner product between $p_u$ and $q_k$.

As a result, in our model the word generation process of the LSTM should be influenced by both the user and the frame latent factors. So in our framework, as shown in Figure 5, we first merge the user and frame latent factors into a preference embedding using element-wise vector multiplication, and then feed it as an extra input to the LSTM at each step.

An alternative strategy is to only use the preference embedding as the "zero state" of the LSTM. However, we have empirically verified that this approach leads to unfavorable performance for the task of personalized key frame recommendation.

Formally, suppose the time-synchronized comment of user $u$ on frame $k$ is $tsc_{uk}$ with words $w_{tsc_{uk}} = \{w^0_{tsc_{uk}}, w^1_{tsc_{uk}}, ..., w^{s_{uk}-1}_{tsc_{uk}}\}$, where $s_{uk}$ is the length of the comment, and the size of the word vocabulary is defined as $N_{word}$. We formalize our architecture into an encoder-decoder framework similar to [6, 40].

More specifically, the user embedding and the frame embedding are first encoded into a joint preference embedding $e^{pre_0}_{uk} = p_u \odot q_k$, where $\odot$ is element-wise multiplication. Then, given $e^{pre_0}_{uk}$ and all the previously predicted words, the decoder predicts each word at iteration step $t$ by a conditional distribution:

$$h_1 = LSTM(w^0_{tsc_{uk}}, e^{pre_0}_{uk}) \qquad (5)$$

$$h_t = LSTM(h_{t-1}, w^{t-1}_{tsc_{uk}}, e^{pre_0}_{uk}) \quad t \in \{2, 3, ..., s_{uk}\} \qquad (6)$$

$$p(w^t_{tsc_{uk}} \mid e^{pre_0}_{uk}, w^{0:t-1}_{tsc_{uk}}) = SOFTMAX(h_t) \qquad (7)$$

where $SOFTMAX(\cdot)$ is an $N_{word}$-way softmax, $h_t$ is the hidden state of the LSTM at iteration $t$, $w^{0:t-1}_{tsc_{uk}} = \{w^{t-1}_{tsc_{uk}}, w^{t-2}_{tsc_{uk}}, ..., w^0_{tsc_{uk}}\}$ is the set of all previous words before iteration $t$, and $LSTM(\cdot)$ is the long short-term memory (LSTM) network. At last, by simultaneously predicting users' likeness and time-synchronized comments, our final objective function to be maximized is:

$$L_2 = \sum_{(u,k) \in O^+ \cup O^-} \sum_{t=1}^{s_{uk}-1} \log p(w^t_{tsc_{uk}} \mid e^{pre_0}_{uk}, w^{0:t-1}_{tsc_{uk}}) + \sum_{(u,k) \in O^+} \log y^{TSC}_{uk} + \sum_{(u,k) \in O^-} \log (1 - y^{TSC}_{uk}) \qquad (8)$$



Figure 6: A model that directly combines the methods proposed in Sections 4.1 and 4.2. The output of the linear element-wise multiplication layer is directly used as an input to the LSTM and to generate $y^{integrated}_{uk}$.

where $y^{TSC}_{uk} = LOGISTIC(p_u \cdot q_k)$ is the prediction of user $u$'s likeness score on frame $k$.
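The sketch below shows one straightforward way to realize the "extra input" of Equations (5)-(7) in PyTorch: the preference embedding is concatenated to the word embedding at every step. The paper does not pin down exactly how the extra input enters the LSTM cell, so this particular wiring is an assumption of ours.

```python
import torch
import torch.nn as nn

class PrefAwareDecoder(nn.Module):
    """Sketch of the preference-aware LSTM decoder (Section 4.2)."""
    def __init__(self, n_words, d, hidden):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d)
        self.lstm = nn.LSTM(input_size=2 * d, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_words)          # feeds the N_word-way softmax

    def forward(self, words, e_pre):
        # words: (batch, T) token ids; e_pre: (batch, d), the element-wise product of p_u and q_k
        w = self.word_emb(words)                       # (batch, T, d)
        e = e_pre.unsqueeze(1).expand(-1, w.size(1), -1)
        h, _ = self.lstm(torch.cat([w, e], dim=-1))    # extra input at every step
        return self.out(h)                             # per-step logits for Eq. (7)
```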

4.3 Integrated Recommender

In this subsection, we propose to jointly model frame images and time-synchronized comments in a unified framework.

Deep feature adapting. Intuitively, we may directly combine the above two models for user preference learning, as shown in Figure 6. However, as image and text features come from quite different and heterogeneous information sources, the linear element-wise multiplication layer (see Figure 6) can be extremely biased when directly adapting such different information. To overcome this weakness, we stack several fully connected layers on top of the element-wise multiplication layer to capture the non-linear relationship among different features.

Formally, suppose the output of the element-wise multiplication layer is $e^{pre_0}_{uk}$, and there are in total $D$ non-linear layers. Then the output of each non-linear layer and the final output can be derived as:

$$e^{pre_0}_{uk} = p_u \odot q^*_k \qquad (9)$$

$$e^{pre_i}_{uk} = g_{nl}(W^i \cdot e^{pre_{i-1}}_{uk}) \quad i \in \{1, 2, ..., D\} \qquad (10)$$

$$y^{integrated}_{uk} = LOGISTIC(w^{output} \cdot e^{pre_D}_{uk}) \qquad (11)$$

where $g_{nl}$ is the activation function, for which we select the Rectifier (ReLU) in our model because (1) it is practically more reasonable from a biological perspective [12], and (2) it can usually prevent deep models from over-fitting. $e^{pre_i}_{uk}$ is the output of the $i$-th non-linear layer, $W^i$ is the coefficient matrix used to weight $e^{pre_{i-1}}_{uk}$, and $w^{output}$ is a vector that maps $e^{pre_D}_{uk}$ into a scalar so that the logistic function can be applied. Note that $w^{output}$ is a parameter that needs to be learned by the model.
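The following sketch (our illustration) spells out Equations (9)-(11) with the tower-structured layer sizes {40, 20, 10} mentioned later in Section 5.1; it also returns every intermediate $e^{pre_i}_{uk}$ so that any of them can serve as the LSTM extra input discussed below.

```python
import torch
import torch.nn as nn

class DeepAdapter(nn.Module):
    """Sketch of the D stacked non-linear layers of Eqs. (9)-(11)."""
    def __init__(self, d, layer_dims=(40, 20, 10)):
        super().__init__()
        dims = [d] + list(layer_dims)
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(layer_dims))])
        self.w_output = nn.Linear(layer_dims[-1], 1)   # maps e_uk^{pre_D} to a scalar

    def forward(self, e0):
        # e0: element-wise product of p_u and q_k*, i.e. Eq. (9)
        e, hidden = e0, [e0]
        for layer in self.layers:
            e = torch.relu(layer(e))                   # Eq. (10) with ReLU
            hidden.append(e)
        y = torch.sigmoid(self.w_output(e)).squeeze(-1)  # Eq. (11)
        return y, hidden                               # hidden[i] = e_uk^{pre_i}
```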

So far, we have described the key components (image modeling, text modeling, and the $D$ non-linear layers) of our final framework, and we further fuse them together (see Figure 7).

Figure 7: Our integrated recommender. $D$ fully connected layers are introduced to capture the non-linear relationship between image and text features. Either the output of the linear element-wise multiplication layer or the output of a non-linear layer can be used to initialize the LSTM. $y^{integrated}_{uk}$ is generated from the last non-linear layer.

Careful readers might have found that, except for the original preference embedding $e^{pre_0}_{uk}$, any one of $\{e^{pre_1}_{uk}, e^{pre_2}_{uk}, ..., e^{pre_D}_{uk}\}$ can also be used as the extra input of the LSTM at each step. Suppose we use $e^{pre_{init}}_{uk}$ as the extra input, where $init \in \{0, 1, 2, ..., D\}$ is a pre-defined constant. Our final framework can be learned by maximizing the following objective function:

$$L_3 = \alpha \sum_{(u,k) \in O^+ \cup O^-} \sum_{t=1}^{s_{uk}-1} \log p(w^t_{tsc_{uk}} \mid e^{pre_{init}}_{uk}, w^{0:t-1}_{tsc_{uk}}) + (1 - \alpha) \left( \sum_{(u,k) \in O^+} \log y^{integrated}_{uk} + \sum_{(u,k) \in O^-} \log (1 - y^{integrated}_{uk}) \right) \qquad (12)$$

where $\alpha$ is a weighting parameter that balances the effects of the different optimization objectives. Once the model has been learned, for a user $u$ and a key frame $k$ with visual feature $vsl_k$, we can readily predict the likeness score of $u$ to $k$ by Equation (11), according to which we can further recommend to $u$ the key frames that she is most likely interested in.
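As a compact illustration (not the authors' code), the α-weighted objective of Equation (12) can be written as the following loss to minimize, where the language-model term uses the per-step word logits from the decoder and the likeness term uses $y^{integrated}_{uk}$:

```python
import torch.nn.functional as F

def joint_loss(word_logits, target_words, y_pred, y_true, alpha=0.5):
    """Negated Eq. (12): alpha-weighted sum of the comment-modeling term and the
    binary likeness term. Shapes: word_logits (N, T, N_word), target_words (N, T),
    y_pred and y_true (N,)."""
    lm_loss = F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)),
        target_words.reshape(-1))                      # mean of -log p(w_t | ...)
    cf_loss = F.binary_cross_entropy(y_pred, y_true)   # mean negative log-likelihood of likeness
    return alpha * lm_loss + (1 - alpha) * cf_loss
```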

5 EXPERIMENTS

In this section, we evaluate our proposed models, focusing on the following three key research questions:

RQ 1: What is the performance of our final framework for the task of personalized key frame recommendation?
RQ 2: What are the effects of different types of information for personalized key frame recommendation?
RQ 3: Can the stacked non-linear layers promote the performance of personalized key frame recommendation?

We begin by introducing the experimental setup, and then report and analyze the experimental results to answer these research questions.



5.1 Experimental Setup

Dataset preprocessing. The raw comments are first pre-processed (word segmentation and stop-word filtering) with the open-source Chinese natural language processing toolbox Jieba7. After that, we conduct more detailed pre-processing according to the special inherent characteristics of time-synchronized comments, in two aspects: on one hand, we manually remove the meaningless reviews, e.g., the ones at the beginning of the movies that are generally not relevant to the movie content; on the other hand, we map slang expressions with the same meaning (e.g., "2333…", namely several 3's following a 2, which means "happiness" in online language) into a unified word (e.g., "wonderful"8) for more accurate modeling.

In our crawled dataset, the time stamp is recorded when a user sent an edited comment; however, the actually favored frame should be the one corresponding to the time when he/she began to type the comment, rather than the frame at which the comment was posted. As a result, we revise the time stamp by subtracting the typing time, estimated from the length of the comment and a person's general typing speed (approximately 40 words/minute). We pre-segment each movie into 1,000 shots, and use the first frame as a shot's key frame. Because the frames in a shot are always very similar, all the commenting behaviors in a shot are regarded as reviewing its key frame, and we do not deliberately distinguish a shot from its key frame in the rest of the paper. To avoid the "cold-start" problem, we remove users with fewer than 100 time-synchronized comments, and finally sample a smaller dataset containing 40 users' 29,137 comments (20,312 positive ($pol_{uk} = 1$) and 8,825 negative ($pol_{uk} = 0$)) on 11,000 key frames.

Baselines. To demonstrate the effectiveness of our models, we adopt the following methods as baselines for performance comparison:

• MostPopular: This is a non-personalized static method utilizing user reviews; for each user it simply selects the most popular positively commented key frames as the final results.

• PMF: The Probabilistic Matrix Factorization method proposed in [35], which is a frequently used state-of-the-art approach for rating-based optimization and prediction. We set the score of user $u$ to key frame $k$ as 1 if $u$ commented on $k$ with $pol_{uk} = 1$, and 0 otherwise.

• BPR: This is a well-known ranking-based method [39] for modeling user implicit feedback; the preference pairs are constructed between the positively commented key frames and the other ones. In our experiments, we randomly sample one negative instance for each positive feedback.

• HFT: This is a state-of-the-art method for rating prediction with textual reviews [31]. As rating information is absent in our dataset, we set the ratings of each user's positively commented key frames as 1, and 0 otherwise.

• VBPR: This is a state-of-the-art visual-based recommendation method [15]. Similar to [15], the image features are pre-generated from the original key frame pictures using the Caffe deep learning framework [21].

• KFRI: This is a Key Frame Recommender based only on Image features, as proposed in Section 4.1 with $L_1$ as its objective function.

• KFRC: This is a Key Frame Recommender based only on Comments, as proposed in Section 4.2 with $L_2$ as its objective function.

7 https://github.com/fxsjy/jieba/tree/jieba3k
8 Manually translated into English by the authors.

Evaluation method. If a user comments on a key frame with positive sentiment ($pol_{uk} = 1$), then this frame is regarded as one that attracts her, so the empirical experiments are conducted by comparing the predicted key frames with the true positive ones ($pol_{uk} = 1$). 30% of each user's positive key frames ($pol_{uk} = 1$) are selected as the test dataset, while the others are used for training. We adopt the F1-score and the normalized discounted cumulative gain (NDCG) to evaluate the performance of the baselines and our proposed models.

Parameter settings. The hyper-parameters in our frameworks are tuned by conducting 5-fold cross validation, while the model parameters are first randomly initialized according to a uniform distribution in the range of (0, 1), and then updated by stochastic gradient descent (SGD). The learning rate of SGD is determined by grid search in the range of {1, 0.1, 0.01, 0.001, 0.0001}. We set the number of non-linear layers as D = 3, and to learn more abstract features, their dimensions are empirically set as {40, 20, 10} to form a tower structure [14].

We evaluate different numbers of latent factors K in the range of {50, 100, 150, 200, 250, 300} for both user and frame vector representations. The number of negative samples $K_{neg}$ is empirically set as 5, while the weighting parameter $\alpha$ is set as 0.5 to make the different optimization parts contribute equally to the final results. For better performance, we use grid search to determine the batch size in the range of {64, 128, 256, 512, 1024}. When implementing the baselines, 5-fold cross validation and grid search are used to determine the parameters. Our experiments are conducted by predicting the Top-5, 10, and 20 favorite key frames respectively. All the models are run 10 times, and we report the average as well as the bound values as the final results for clear illustration.
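For reference, the sketch below shows a standard way to compute F1@n and NDCG@n with binary relevance over a user's held-out positive key frames; the paper does not spell out its exact gain/discount variant, so this is one common formulation.

```python
import numpy as np

def ndcg_at_n(ranked_frames, positives, n):
    """NDCG@n with binary relevance: gain 1 if a recommended frame is positive."""
    gains = [1.0 if f in positives else 0.0 for f in ranked_frames[:n]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(positives), n)))
    return dcg / ideal if ideal > 0 else 0.0

def f1_at_n(ranked_frames, positives, n):
    """F1@n from precision and recall over the top-n recommended key frames."""
    hits = len(set(ranked_frames[:n]) & set(positives))
    if hits == 0:
        return 0.0
    precision, recall = hits / n, hits / len(positives)
    return 2 * precision * recall / (precision + recall)
```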

5.2 Performance of Our Models (RQ1)

Different models (except MostPopular) may reach their best performance at different numbers of latent factors, so for each model, we implement it by setting the dimension as 50, 100, 150, 200, 250, and 300 respectively, and the best result is finally reported. From Figure 8, we can see that KFRCI achieves the best performance on both F1 and NDCG when recommending different numbers of key frames. It can on average enhance the performance by about 7.8% and 6.7% on F1-score and NDCG respectively when compared with VBPR, which performs best among the baseline methods.



Figure 8: Comparison between our method and the baseline methods. The values on the bars indicate the average performance of the corresponding models.

Paired t-tests on the results also verify that the improvements are statistically significant at the 0.01 level. Among the baselines, PMF, as expected, performs better than MostPopular due to the consideration of diverse personalities. By directly optimizing the ranking objective, BPR shows better effectiveness compared with PMF on both F1 and NDCG, which is consistent with the observations in [39]. By introducing textual or visual features, HFT/VBPR on average outperforms BPR by about 3.5%/5.4% on F1 and 8.6%/6.1% on NDCG respectively.

5.3 Effects of Different Information (RQ2)

To capture more comprehensive user preferences, frame images and time-synchronized comments are combined in our final framework for joint modeling. However, different information sources may play different roles, so in this section we study the influence of diverse information on our task of personalized key frame recommendation. To begin with, we compare our final framework KFRCI with KFRI and KFRC, which only use image or text information in their modeling processes. For fair comparison and to avoid disturbance from the deep architecture, we do not use any non-linear layers in KFRCI. The other parameters follow the above settings. From the results shown in Figure 10, we can see:

1. All the models reach their best performance when the dimension falls in the range of [100, 150], while additional dimensions do not help promote the performance. The reason may be that too many latent factors can lead to the over-fitting problem, which would weaken our models' generalization ability on the test dataset.

2. KFRI performs slightly better than KFRC in most cases, which tells us that, in our dataset, image features may be more important than time-synchronized comments for the task of personalized key frame recommendation. This may be because, although time-synchronized comments are helpful, they are too diverse and may include too much noise for capturing users' inherent preferences.

3. It is highly encouraging that, although we did not use any non-linear layer, KFRCI exhibits higher performance compared with both KFRC and KFRI across all the dimensions. This observation demonstrates that the integration of visual and textual features can indeed help capture user preferences more accurately, which is in line with our intuition in Section 1.

Weighting parameter α. In this section, we study how the performance of KFRCI changes as the weighting parameter α increases from 0.1 to 0.9. In this experiment, the number of user/frame latent factors is fixed as 100, while the other parameters follow the settings in Section 5.1. We predict the Top-20 user favorite key frames, and the results are shown in Figure 9, from which we can see: the performance (F1@20) of KFRCI continues to rise until α reaches around 0.3; then, after remaining approximately stable in the range of [0.3, 0.5], it begins to drop rapidly as α increases further. This observation indicates that we should strike a suitable balance between the two components in our final framework. Besides, similar results can be observed on NDCG@20.

Figure 9: The influence of the weighting parameter α. For clear comparison, we also list the performances of the other models, although they don't change with α. MostPopular is not listed herein because its performance is much lower.

5.4 Promotion of the Deep Architecture (RQ3)

In this section, we would like to test whether deep architectures are helpful to our task. To do so, we evaluate the performance of our final model KFRCI on F1 and NDCG by changing the number of non-linear layers. Note that when there is no non-linear layer, we are actually evaluating the straightforward model shown in Figure 6. In this experiment, the dimensions of the non-linear layers from the first layer to the output layer are set as {40, 20, 10, 5}, and the number of user/frame latent factors is fixed as 100. The output non-linear layer is used to link the LSTM. All the other parameters follow the above settings.

The results are shown in Table 3, from which we can see that our model reaches its best performance when there are two or three non-linear layers, and introducing more non-linear layers does not bring positive effects. These observations indicate that a deep model may be helpful for personalized key frame recommendation; however, only a relatively small number of non-linear layers is required to capture the complex relationship among heterogeneous features.

"Extra input" of LSTM. When there are multiple non-linear layers, an obvious question is which output should be selected as the "extra input" of the LSTM. So we further evaluate our model by using different layers' outputs $e^{pre_{init}}_{uk}$ as the LSTM "extra input".



Figure 10: Evaluating the performance of our models when using different features. The dimension of the latent factors ranges from 50 to 300.

Table 3: The effect of the deep architecture.

number of layers | 0     | 1     | 2     | 3     | 4
F1@5             | 0.012 | 0.014 | 0.015 | 0.014 | 0.013
NDCG@5           | 0.060 | 0.064 | 0.065 | 0.063 | 0.062
F1@10            | 0.071 | 0.073 | 0.074 | 0.075 | 0.072
NDCG@10          | 0.092 | 0.094 | 0.095 | 0.096 | 0.091
F1@20            | 0.103 | 0.104 | 0.106 | 0.104 | 0.103
NDCG@20          | 0.120 | 0.123 | 0.124 | 0.123 | 0.121

Note that $init = 0$ means directly linking the output of the element-wise multiplication layer to the LSTM. In this experiment, we use 3 non-linear layers with dimensions {40, 20, 10}; the other parameters follow the above settings.

From the results on F1@20 and NDCG@20 shown in Table 4, we see that $init = 1, 2$ or $3$ leads to improved performance compared with $init = 0$, which shows that introducing non-linear operations is important for better adapting the user/frame latent factors to the underlying motivations for generating time-synchronized comments.

Table 4: The effects of different LSTM "extra inputs".

init    | 0     | 1     | 2     | 3
F1@20   | 0.103 | 0.104 | 0.105 | 0.104
NDCG@20 | 0.121 | 0.122 | 0.124 | 0.123

6 CONCLUSIONS AND OUTLOOK

In this paper, we propose the problem of personalized key frame recommendation for the first time. To do so, we propose to leverage the rich time-synchronized comment information on video-sharing websites, and further design a novel framework that combines model-based collaborative filtering and a long short-term memory network to model user-commented key frames and time-synchronized comments simultaneously. Comprehensive evaluation verified the effectiveness of our framework.

This is a first step towards our goal of personalized key frame recommendation, and there is much room for further improvement. For example, other information (e.g., audio features) can be included to capture more comprehensive user preferences, which may also bring more inspiring insights into the inherent nature of user preference patterns on video key frames. Beyond personalized key frame recommendation, our work also points to promising future directions in personalized video summarization, personalized image captioning, and personalized story telling based on images or videos.

ACKNOWLEDGMENT

This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF IIS-1160894, NSFC 61602176 and NSFC 61672231. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

REFERENCES
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.



[2] Y. Bao, H. Fang, and J. Zhang. Topicmf: Simultaneously exploiting ratings and reviews for recommendation. In AAAI, pages 2–8, 2014.
[3] Y. Cao, J. Li, X. Guo, S. Bai, H. Ji, and J. Tang. Name list only? Target entity disambiguation in short texts. In EMNLP, pages 654–664, 2015.
[4] X. Chen, Z. Qin, Y. Zhang, and T. Xu. Learning to rank features for recommendation over multiple categories. In Proceedings of the 39th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2016.
[5] X. Chen, P. Wang, Z. Qin, and Y. Zhang. Hlbpr: A hybrid local bayesian personal ranking method. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 21–22. International World Wide Web Conferences Steering Committee, 2016.
[6] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[7] P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.
[8] Q. Cui, S. Wu, Q. Liu, and L. Wang. A visual and textual recurrent neural network for sequential prediction. arXiv preprint arXiv:1611.06668, 2016.
[9] Q. Diao, M. Qiu, C.-Y. Wu, A. J. Smola, J. Jiang, and C. Wang. Jointly modeling aspects, ratings and sentiments for movie recommendation (jmars). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 193–202. ACM, 2014.
[10] N. Ejaz, T. B. Tariq, and S. W. Baik. Adaptive key frame extraction for video summarization using an aggregation mechanism. Journal of Visual Communication and Image Representation, 23(7):1031–1040, 2012.
[11] G. Ganu, N. Elhadad, and A. Marian. Beyond the stars: Improving rating predictions using review text content. In WebDB, volume 9, pages 1–6. Citeseer, 2009.
[12] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275, 2011.
[13] Y. Gong and X. Liu. Video summarization using singular value decomposition. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 174–180. IEEE, 2000.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[15] R. He and J. McAuley. Vbpr: Visual bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1510.01784, 2015.
[16] X. He, T. Chen, M.-Y. Kan, and X. Chen. Trirank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1661–1670. ACM, 2015.
[17] X. He, M. Gao, M.-Y. Kan, and D. Wang. Birank: Towards ranking on bipartite graphs. IEEE Transactions on Knowledge and Data Engineering, 29(1):57–71, 2017.
[18] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference Companion on World Wide Web (WWW), pages 173–182. International World Wide Web Conferences Steering Committee, 2017.
[19] X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 549–558. ACM, 2016.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[22] R. M. Jiang, A. H. Sadka, and D. Crookes. Advances in video summarization and skimming. In Recent Advances in Multimedia Signal Processing and Communications, pages 27–50. Springer, 2009.
[23] W. Jiang, C. Cotton, and A. C. Loui. Automatic consumer video summarization by audio and visual analysis. In ICME, pages 1–6, 2011.
[24] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan. Large-scale video summarization using web-image priors. In CVPR, pages 2698–2705, 2013.
[25] G. Kim, L. Sigal, and E. P. Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR, 2014.
[26] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, pages 1346–1353, 2012.
[27] L. Li, K. Zhou, G.-R. Xue, H. Zha, and Y. Yu. Video summarization via transferrable structured learning. In WWW, pages 287–296. ACM, 2011.
[28] Y. Li, T. Zhang, and D. Tretter. An overview of video abstraction techniques. Technical Report HPL-2001-191, HP Laboratory, 2001.
[29] H. Liu, J. He, T. Wang, W. Song, and X. Du. Combining user preferences and user opinions for accurate recommendation. Electronic Commerce Research and Applications, 12(1):14–23, 2013.

[30] Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang. A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia, 7(5):907–919, 2005.
[31] J. McAuley and J. Leskovec. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172. ACM, 2013.

[32] J. McAuley, C. Targett, Q. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM, 2015.
[33] E. Mendi and C. Bayrak. Shot boundary detection and key frame extraction using salient region detection and structural similarity. In Proceedings of the 48th Annual Southeast Regional Conference, pages 66–67, 2010.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[35] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2007.
[36] A. Nagasaka and Y. Tanaka. Automatic video indexing and full-video search for object appearances. Journal of Information Processing, 15(2):316, 1992.
[37] J. Peng, Y. Zhai, and J. Qiu. Learning latent factor from review text and rating for recommendation. In 2015 7th International Conference on Modelling, Identification and Control (ICMIC), pages 1–6. IEEE, 2015.
[38] Z. Ren, S. Liang, P. Li, S. Wang, and M. de Rijke. Social collaborative viewpoint regression with explainable recommendations. Under submission, page 10, 2016.
[39] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.
[40] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[41] Y. Tan, M. Zhang, Y. Liu, and S. Ma. Rating-boosted latent topics: Understanding users and items with ratings and reviews. In IJCAI. AAAI, 2016.
[42] D. Tang, B. Qin, and T. Liu. Learning semantic representations of users and products for document level sentiment classification. In ACL, pages 1014–1023, 2015.
[43] D. Tang, B. Qin, T. Liu, and Y. Yang. User modeling with neural network for review rating prediction. In IJCAI, pages 1340–1346, 2015.

[44] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[45] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In SIGKDD, pages 1235–1244, 2015.
[46] H. Wang, S. Xingjian, and D.-Y. Yeung. Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks. In NIPS, pages 415–423, 2016.
[47] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua. Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia, 14(4):975–985, 2012.
[48] Z. Wang, M. Kumar, J. Luo, and B. Li. Sequence-kernel based sparse representation for amateur video summarization. In Proceedings of the 2011 Joint ACM Workshop on Modeling and Representing Events, pages 31–36, 2011.
[49] B. Wu, E. Zhong, B. Tan, A. Horner, and Q. Yang. Crowdsourced time-sync video tagging using temporal and personalized topic modeling. In SIGKDD, pages 721–730, 2014.
[50] C.-Y. Wu, A. Beutel, A. Ahmed, and A. J. Smola. Explaining reviews and ratings with paco: Poisson additive co-clustering. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 127–128. International World Wide Web Conferences Steering Committee, 2016.
[51] Y. Wu and M. Ester. Flame: A probabilistic model combining aspect based opinion mining and collaborative filtering. In WSDM, pages 199–208. ACM, 2015.
[52] Y. Xian, J. Li, C. Zhang, and Z. Liao. Video highlight shot extraction with time-sync comment. In Workshop on Hot Topics in Planet-scale Mobile Computing and Online Social Networking, pages 31–36, 2015.
[53] H. Xu, Y. Zhen, and H. Zha. Trailer generation via a point process-based visual attractiveness model. In IJCAI, pages 2198–2204, 2015.
[54] J. You, G. Liu, L. Sun, and H. Li. A multiple visual models based perceptive analysis framework for multilevel video summarization. IEEE Transactions on Circuits and Systems for Video Technology, 17(3):273–285, 2007.
[55] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
[56] W. Zhang, Q. Yuan, J. Han, and J. Wang. Collaborative multi-level embedding learning from reviews for rating prediction. In IJCAI. ACM, 2016.
[57] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 83–92. ACM, 2014.
[58] Y. Zhuang, Y. Rui, T. S. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised clustering. In International Conference on Image Processing (ICIP), volume 1, pages 866–870. IEEE, 1998.