ieee transactions on multimedia, vol. 16, no. 1, january ...homepage.tudelft.nl/z0q8q/cavva.pdf ·...

9
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational Affective Video-in-Video Advertising Karthik Yadati, Harish Katti, and Mohan Kankanhalli Abstract—Advertising is ubiquitous in the online commu- nity and more so in the ever-growing and popular online video delivery websites (e.g., YouTube). Video advertising is becoming increasingly popular on these websites. In addition to the existing pre-roll/post-roll advertising and contextual ad- vertising, this paper proposes an in-stream video advertising strategy—Computational Affective Video-in-Video Advertising (CAVVA). Humans being emotional creatures are driven by emo- tions as well as rational thought. We believe that emotions play a major role in inuencing the buying behavior of users and hence propose a video advertising strategy which takes into account the emotional impact of the videos as well as advertisements. Given a video and a set of advertisements, we identify candidate adver- tisement insertion points (step 1) and also identify the suitable advertisements (step 2) according to theories from marketing and consumer psychology. We formulate this two part problem as a single optimization function in a non-linear 0–1 integer programming framework and provide a genetic algorithm based solution. We evaluate CAVVA using a subjective user-study and eye-tracking experiment. Through these experiments, we demon- strate that CAVVA achieves a good balance between the following seemingly conicting goals of (a) minimizing the user disturbance because of advertisement insertion while (b) enhancing the user engagement with the advertising content. We compare our method with existing advertising strategies and show that CAVVA can enhance the user’s experience and also help increase the moneti- zation potential of the advertising content. Index Terms—Ad-insertion, affect, arousal, contextual adver- tising, marketing and consumer psychology, valence. I. INTRODUCTION E XPONENTIAL growth in the availability of public online digital video collections has given users the exibility to watch a video of his/her choice at any time. There is an ever increasing viewer base developing for such online video collec- tions. These video collections cater to a variety of geographic and topic-wise user groups [1]. As a result, video sharing web- sites are becoming valuable resources for people to sharing not Manuscript received January 01, 2013; revised April 03, 2013 and June 27, 2013; accepted June 28, 2013. Date of publication September 16, 2013; date of current version December 12, 2013. This research was partially carried out at the SeSaMe Centre. It was supported by the Singapore NRF under its IRC@SG Funding Initiative and administered by the IDMPO. The associate editor coor- dinating the review of this manuscript and approving it for publication was Prof. Nicu Sebe. K. Yadati and M. Kankanhalli are with the School of Computing, National University of Singapore (e-mail: [email protected]; [email protected]). H. Katti is with the Center for Neuroscience, Indian Institute of Science (e-mail: [email protected]). Color versions of one or more of the gures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identier 10.1109/TMM.2013.2282128 Fig. 1. Example of state-of-the-art advertising strategies. (1) Pre-roll/Pos-roll advertising: advertisements are inserted at the beginning and at the end of the video. (2) Contextual advertising [2]: advertisements are inserted by matching low-level features of the video and advertisements at scene change points. (3) CAVVA: Our method where advertisements are inserted according to estab- lished psychological rules, by assessing emotional impact of program content as well as that of candidate advertisements. only information, but also life experiences [1] and in turn be- coming lucrative markets for advertisement of products and ser- vices. In this paper, we propose an in-stream video-in-video ad- vertisement insertion strategy. The user and the advertiser constitute two major players in any advertising scenario. Though advertisements in between videos might be annoying from a user’s perspective, it makes free hosting and access to such video content possible. At the same time, an advertiser would like to maximize the reach of his/her advertisements. A successful advertising strategy should address the challenges posed by the conicting needs of the user and the advertiser listed below. 1) Ad-insertion should result in minimal disturbance for the user. 2) The placement of advertisements should result in an in- creased viewer engagement. Fig. 1 visualizes two existing video advertising strate- gies—pre-roll/post-roll advertisements (YouTube), where advertisements are inserted in the beginning or at the end of the video; contextual video advertisement insertion (e.g., VideoSense [2]), where contextually relevant advertisements are inserted at strategically chosen points in the video. The third row in Fig. 1 visualizes our affect based advertising strategy—CAVVA, where the advertisements are inserted based on the affective relevance between the video and the advertisement. 1520-9210 © 2013 IEEE

Upload: others

Post on 20-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15

CAVVA: Computational AffectiveVideo-in-Video AdvertisingKarthik Yadati, Harish Katti, and Mohan Kankanhalli

Abstract—Advertising is ubiquitous in the online commu-nity and more so in the ever-growing and popular onlinevideo delivery websites (e.g., YouTube). Video advertising isbecoming increasingly popular on these websites. In additionto the existing pre-roll/post-roll advertising and contextual ad-vertising, this paper proposes an in-stream video advertisingstrategy—Computational Affective Video-in-Video Advertising(CAVVA). Humans being emotional creatures are driven by emo-tions as well as rational thought. We believe that emotions play amajor role in influencing the buying behavior of users and hencepropose a video advertising strategy which takes into account theemotional impact of the videos as well as advertisements. Givena video and a set of advertisements, we identify candidate adver-tisement insertion points (step 1) and also identify the suitableadvertisements (step 2) according to theories from marketingand consumer psychology. We formulate this two part problemas a single optimization function in a non-linear 0–1 integerprogramming framework and provide a genetic algorithm basedsolution. We evaluate CAVVA using a subjective user-study andeye-tracking experiment. Through these experiments, we demon-strate that CAVVA achieves a good balance between the followingseemingly conflicting goals of (a) minimizing the user disturbancebecause of advertisement insertion while (b) enhancing the userengagement with the advertising content. We compare our methodwith existing advertising strategies and show that CAVVA canenhance the user’s experience and also help increase the moneti-zation potential of the advertising content.

Index Terms—Ad-insertion, affect, arousal, contextual adver-tising, marketing and consumer psychology, valence.

I. INTRODUCTION

E XPONENTIAL growth in the availability of public onlinedigital video collections has given users the flexibility to

watch a video of his/her choice at any time. There is an everincreasing viewer base developing for such online video collec-tions. These video collections cater to a variety of geographicand topic-wise user groups [1]. As a result, video sharing web-sites are becoming valuable resources for people to sharing not

Manuscript received January 01, 2013; revised April 03, 2013 and June 27,2013; accepted June 28, 2013. Date of publication September 16, 2013; date ofcurrent version December 12, 2013. This research was partially carried out atthe SeSaMe Centre. It was supported by the Singapore NRF under its IRC@SGFunding Initiative and administered by the IDMPO. The associate editor coor-dinating the review of this manuscript and approving it for publication was Prof.Nicu Sebe.K. Yadati and M. Kankanhalli are with the School of Computing,

National University of Singapore (e-mail: [email protected];[email protected]).H. Katti is with the Center for Neuroscience, Indian Institute of Science

(e-mail: [email protected]).Color versions of one or more of the figures in this paper are available online

at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TMM.2013.2282128

Fig. 1. Example of state-of-the-art advertising strategies. (1) Pre-roll/Pos-rolladvertising: advertisements are inserted at the beginning and at the end of thevideo. (2) Contextual advertising [2]: advertisements are inserted by matchinglow-level features of the video and advertisements at scene change points. (3)CAVVA: Our method where advertisements are inserted according to estab-lished psychological rules, by assessing emotional impact of program contentas well as that of candidate advertisements.

only information, but also life experiences [1] and in turn be-coming lucrative markets for advertisement of products and ser-vices. In this paper, we propose an in-stream video-in-video ad-vertisement insertion strategy.The user and the advertiser constitute two major players in

any advertising scenario. Though advertisements in betweenvideos might be annoying from a user’s perspective, it makesfree hosting and access to such video content possible. At thesame time, an advertiser would like to maximize the reach ofhis/her advertisements. A successful advertising strategy shouldaddress the challenges posed by the conflicting needs of the userand the advertiser listed below.1) Ad-insertion should result in minimal disturbance for theuser.

2) The placement of advertisements should result in an in-creased viewer engagement.

Fig. 1 visualizes two existing video advertising strate-gies—pre-roll/post-roll advertisements (YouTube), whereadvertisements are inserted in the beginning or at the endof the video; contextual video advertisement insertion (e.g.,VideoSense [2]), where contextually relevant advertisementsare inserted at strategically chosen points in the video. Thethird row in Fig. 1 visualizes our affect based advertisingstrategy—CAVVA, where the advertisements are insertedbased on the affective relevance between the video and theadvertisement.

1520-9210 © 2013 IEEE

Page 2: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

16 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014

Advertising has been existent for a very long time and ex-tensive research literature is available in the fields of marketingand consumer psychology. Insights from psychology [3] sug-gest that emotions play a major role in influencing human de-cision processes. The experience of an emotion is termed as af-fect and it can be measured in a discrete or a continuous space.One of the ways to represent affect is the circumplex model [4]of affect which is a dimensional representation, where affect ismeasured in two dimensions—arousal, referring to the intensityof the emotion and valence, referring to the type of emotion. Wechoose this continuous representation of affect as it is more ap-propriate in the context of video analytics.We highlight the contributions of our paper as follows:1) Devise concise rules with the help of insights and exper-imental results from consumer psychology literature [3],[5], [6]. Video-in-video advertising has mostly focused onrelevance; we address the important aspect of affect as thishas been shown to have significant impact on the successof advertising campaigns.

2) Realize the computational rules in a mathematical frame-work so that we could build a robust and scalable automaticvideo-in-video advertisement insertion system.

3) A thorough and systematic user evaluation to demonstratethe effectiveness of CAVVA, when compared to the ex-isting video-in-video advertisement insertion strategies.

CAVVA is different from the existing work on contextual ad-vertising (ex. VideoSense [2]), as our paper considers the emo-tional impact of the video to identify the advertisement inser-tion points and the appropriate advertisements, whereas currentstate-of-the-art focuses on contextual relevance of video seg-ment and the advertisement. A detailed comparison betweenCAVVA and VideoSense [2] is presented in the section on Re-lated work (II)Our paper is organized as follows: Section II presents the

related work on contextual advertising which is followed bythe problem formulation and it’s solution in Section III; Ex-perimental design and the list of experiments are explained inSection IV, followed by experimental results and discussion inSection V; Summary of the work and probable future researchdirections are briefed in Section VI.

II. RELATED WORK

Online advertising has evolved from (a) content agnosticrandom advertisement placement on web pages to (b) placingrelevant advertisements based on keywords in the web page(contextual advertising) and more recently (c) placing adver-tisements based on the semantic analysis of the text in the webpage (semantic advertising).Contextual advertising is popular in text-based advertising

(ex: Google’s AdSense) and is slowly entering the video adver-tising arena as well. The standard pre-roll, mid-roll and post-rollvideo advertising (as in YouTube) does not take the context ofthe video into consideration. Context plays an important role inthe way an advertisement is perceived by the viewer. Google’sAdSense and Microsoft’s VideoSense [2] deal with contextualadvertisements in the text and video domain respectively.In VideoSense[2], advertisement insertion points are identi-

fied as points having high discontinuity and low attractiveness

from the viewer’s perspective. Discontinuity measures the dis-similarity between consecutive shots of a video and attractive-ness is defined as the importance or the interestingness of thevideo shot from a user’s perspective. The contextually rele-vant advertisements are identified using two aspects: global andlocal relevance. Global relevance is obtained by calculating thetextual relevance between the web page containing the videoand the keywords associated with the advertisement. Local rel-evance is computed by measuring the similarities between thevideo and the advertisement at the insertion point with respectto the following features—motion content, tempo, color. Theinsertion point selection and advertisement placement is thentreated as a 0–1 non-linear optimization problem and solvedusing a greedy approach. We acknowledge the contribution ofVideoSense in adapting 0–1 NIP and definitions to the video-in-video advertising problem in a clear and concise manner, thusproviding a natural framework to describe work such as ours.This sets VideoSense apart from other existing research suchas [7], in which rule based heuristics are used to perform inter-active video-in-video advertisement insertion. In retrospect, theoverall objective of [7] fits in naturally into 0–1 NIP as well.Another work in the area of contextual advertising is pro-

posed in AdImage [8]. AdImage provides a framework forplacing contextually relevant video and image advertisementsby measuring the content based relevance and the semanticrelevance of the advertisements.Though VideoSense [2] and AdImage [8] rely heavily on se-

mantics and contextual relevance, they do not address the af-fective nature of advertising. Owing to the importance of af-fect in advertising [3], our method explores the emotional as-pects of video-in-video advertising, supported by empirical re-sults from the marketing and consumer psychology literature.We have earlier explored video-in-video advertisement inser-tion in a completely personalized human-in-loop setting, whereaffective state of the viewer is estimated using simple fusion anddecision strategies to select and insert advertisements on-the-fly,into the video stream [7]. This is very different, both philosoph-ically and in practice, from CAVVA, where we explore a fullyautomated content driven and scalable video-in-video advertise-ment insertion. As [7] requires a human-in-loop, we do not useit for benchmarking against CAVVA and instead choose appro-priate content analysis based methods such as VideoSense [2]and pre/post-roll. Table I provides a summary of the compar-ison of VideoSense [2] and CAVVA.

III. PROPOSED METHOD

A. Extracting Rules From Consumer Psychology

Studies in the marketing and consumer psychology literatureexplain the user’s responses to advertisements in different affec-tive contexts of the video being shown. We summarize insightsgained from the literature, in the form of the following rules:1) In a low arousal, low valence (unpleasant) programcontext, viewers treat the subsequent advertisements aspleasant, opposite to their evaluation of the program [5].

2) In a high arousal, high valence (pleasant) program context,viewers treat the subsequent advertisements as pleasant,similar to their evaluation of the program [5].

Page 3: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

YADATI et al.: CAVVA: COMPUTATIONAL AFFECTIVE VIDEO-IN-VIDEO ADVERTISING 17

TABLE ICOMPARISON OF VIDEOSENSE AND CAVVA

3) A positive commercial viewed in the context of a positiveprogram is treated as pleasant, when compared to the samecommercial viewed in a negative program context [6].

4) Human beings try to overcome their negative mood andthey try to maintain their positive mood. [3]

From rule 4, we can say that advertisement insertion shouldhelp the user transition to a higher valence state through theadvertisement, which implies that the scene following the ad-vertisement should be of higher valence when compared to thescene preceding the advertisement. Advertisements contain af-fective content, which can induce a positive or a negative moodin the user. An advertisement which can induce a positive moodin the user is termed as a positive advertisement and an adver-tisement which induces a negative mood in the user is termed asa negative advertisement. Our method aims to choose advertise-ment insertion points in the video stream so as to minimize dis-ruption and simultaneously select advertisements that will notonly be evaluated favorably, but also recalled well later. Com-bining rules 2, 3 and 4, we find that a positive advertisementwould be appropriate in a context where the scene preceding theadvertisement is of high arousal and high valence and the scenefollowing the advertisement falling into the category of highvalence. This would enable a high valence state being main-tained. Similarly, we place a negative advertisement at a pointwhere there is a low arousal, low valence scene preceding theadvertisement and a high valence scene following the advertise-ment. By doing this, we achieve a gradual transition from a neg-ative scene to a positive scene through a negative advertisement,where the user experiences an initial inertia to come out of thenegative state caused because of the preceding scene. Fig. 2 pro-vides a visualization of the advertisement insertion scenarios.From the figure, we can say that the advertisement should beemotionally similar to the scene preceding the advertisement.

B. Problem Formulation

Given a video and a set of advertisements, we have deviseda sequence of steps which result in an output video with adver-tisements inserted in it. Fig. 3 gives a schematic of the sequenceof steps involved in our advertisement insertion strategy.

Fig. 2. Visualizations of transition in valence. (a) A transition from low valenceto high valence through the advertisement, indicating the initial inertia to comeout of the negative mood. (b) Maintaining a high valence before and after theadvertisement.

Fig. 3. Affect based advertising strategy—CAVVA. (1) Input video, (2) Scenechange detection, (3) Affective analysis, (4) Optimization framework, (5)Output video.

1) We select an input video, in which we would like to placeadvertisements and we have a pool of video advertisementsin the ad inventory.

2) We consider each scene change point as a probable adver-tisement insertion point and hence we perform a scene seg-mentation of the input video as per [9]

3) We employ the circumplex model [4] to represent affect(valence and arousal) and use the method [10] to com-pute valence and arousal scores for the video as well as theadvertisements. We use motion activity, shot change fre-quency and average energy in the audio stream to modelarousal. Similarly, we use a HSV color histogram and pitchfrom the audio signal to model valence. The numericalvalues of valence and arousal are normalized between 0and 100, with 0 indicating minimum valence/arousal and100 indicating maximum valence/arousal.

4) We take the valence and arousal scores, for scenes andadvertisements, as input to the optimization framework toidentify the advertisement insertion points and select theappropriate advertisements.

5) An output video, with advertisements inserted in anin-stream manner, is created.

Given a video of duration seconds. Let the video be seg-mented into scenes , . As thereare scenes, there would be scene change points(excluding the first and the last scene). We consider each scene

Page 4: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

18 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014

change point as a potential advertisement insertion point andthat would result in probable advertisement insertion points.Each scene of the video has an arousal score anda valence score , ; and

associated with it.Additionally given are a set of advertisements

, with an associatedvalue of arousal and valence with

and .Now, we introduce the following decision variables ,

, , , and, where and indicate

whether insertion point and advertisement are selected; . We adopt the nonlinear 0–1 integer program-

ming (NIP) paradigm to design an optimization framework toincorporate the above stated rules. The function AI defines thecost for identifying the advertisement insertion points and thefunction AS defines the cost for selecting the advertisements.The function AS includes a term , whichcomputes content based relevance between the scene and theadvertisement based on the following features—motion,audio tempo and color. This definition of relevance is similarin spirit to local relevance in VideoSense [2]. Our final opti-mization equation is defined as , which is a summationof the terms involving the cost of identifying advertisement in-sertion points (AI) and the selection of suitable advertisements(AS). The final constraints in the optimization equation accountfor the uniform distribution of advertisements across the video,by ensuring that there is at least one advertisement in a group ofscene change points ( being the number of advertisements

to be inserted in the video). We expect our constraint on uniformplacement of advertisements to perform better with increase inthe frequency of advertisements inserted into the video stream.Though this constraint performs satisfactorily for our currentdataset, we might have to redesign this constraint on compa-rable lines as the distribution entropy measure in [11]

(1)

(2)

(3)

(4)

Our optimization function identifies the advertisement inser-tion points and also the suitable advertisements to be inserted

TABLE IIVARIABLES USED IN THE OPTIMIZATION FRAMEWORK

at those points in a unified framework simultaneously. The AIterm formalizes rules 1,2 and 4. Furthermore, we require adver-tisements that can help maintain or transition to high valencestates, this is implemented in the term. A summaryof different variables used in the optimization equation is pre-sented in Table II

C. Efficiency and Quality of CAVVA

Considering potential insertion points in a video and a totalof advertisements, we need to identify points in the videowhere advertisements can be inserted. As a result, the searchspace becomes . We employ an efficient ge-netic algorithm based solver [12], in order to solve the optimiza-tion function. We use the global optimization toolbox availablein Matlab® to formulate the above stated optimization functionand obtain the solution. Though the idea of using genetic algo-rithm is mentioned in VideoSense [2], they implement a greedyselection strategy in their paper. The genetic algorithm workson a population, which represents a set of points in the solu-tion space. A solution to our problem would be of the form

, . A typical solution vectorwould be of length 212 (for 200 advertisements and 12 scenechange points on an average), with and for thechosen insertion points, advertisements and everything else as0. An initial population of vectors is chosen randomly by the al-gorithm and all the subsequent generations (population at eachiteration) are generated computing the value of the fitness func-tion ((3)). An important parameter, in identifying a global op-timum, is the size of the population which represents the numberof individuals present in each generation. As the size of the pop-ulation increases, the chances for obtaining a globally optimalsolution also increase. But, the search process in a large popu-lation increases the running time of the algorithm. In order toobtain a trade-off, we used different population sizes and foundthat a population size of 800 gives better performance for ourdata. Another important parameter is the number of generationswhich are produced before finding the solution. We limit thenumber of generations to 100 in order to keep the running timelow.We also implement brute force algorithm Alg. 1 to evaluatethe quality of solutions obtained from the genetic algorithm.Comparing the results of the genetic algorithm and Alg. 1, wefind an 85% agreement and hence we can say that the geneticalgorithm results in a near optimal solution.

Page 5: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

YADATI et al.: CAVVA: COMPUTATIONAL AFFECTIVE VIDEO-IN-VIDEO ADVERTISING 19

Algorithm 1 Brute-force advertisement insertion

1) Initialize a matrix of size , whererows represent the scenes of the video and the columnsrepresent the advertisements.

2) Initialize another matrix of size , whereis the desired number of advertisements to be inserted in

the video.

3) Compute for each of the scenes and theadvertisements and fill the matrix

4) Compute as the maximum value across thematrix and update the matrix withthe corresponding and Fill in a large negative numberin the rows of , to ensure uniformdistribution of ads. Similarly, to avoid repetition of ads, we filla large negative number in the column of

5) Repeat steps 4 & 5 until the desired number of adsare selected.

IV. EXPERIMENTS

In this section, we describe the different experimentsconducted in order to demonstrate the effectiveness of ouradvertisement insertion method (CAVVA). The experimentshave been designed keeping in view the two main objectives ofan advertising strategy—reduced disturbance for the user andincreased engagement with the advertising content. For a sub-jective evaluation of CAVVA, we conduct a user-study whichgathers user responses to a systematically chosen question-naire. We also perform an objective evaluation by conductingan eye-tracking experiment to assess the users’ arousal duringthe advertisements. Additionally, we also conduct a qualitativerecall based experiments which measure how well the userremembers the advertisement.In order to compare our method with the existing strategies,

we use the following different advertising strategies.• PRPR: Pre-roll/Post-roll advertising. In this method, ad-vertisements are inserted at the beginning and the end ofthe video. (ex: YouTube)

• VideoSense[2]: Contextually relevant advertisements areplaced at strategically chosen points of high discontinuityand low attractiveness in users’ perspective. For a fair com-parison, we ignore the global relevance between the videoand the advertisement as mentioned in [2].

• CAVVA: Our method, in which we employ the rules on af-fect from consumer psychology and marketing literature.Advertisement insertion points and the advertisementsthemselves are chosen according to the optimization equa-tion.

A. Data Collection

We utilize the popular online video collections Tellyads (ad-vertisements) and Youtube (videos) for compiling our dataset.We collect 15 videos of total duration 165 minutes with an av-erage duration of 12 minutes per video clip. After scene seg-mentation, we have a total of 185 scenes and an average of 12

TABLE IIIVIDEO DATA USED FOR EXPERIMENTS.

scenes per video clip. We include videos from various genreslike tv shows (3 videos with 40 scenes), movie clips (3 videoswith 44 scenes), news broadcasts (2 videos with 26 scenes), an-imated clips (2 videos with 22 scenes), documentaries (2 videoswith 32 scenes) and user generated videos (3 videos with 21scenes), which are a part of the videos available on YouTube.Table III summarizes the video data used in our experiments.In terms of advertisements, we collected 200 unique video ad-vertisements which belong to different product categories. Eachcategory and the percentage it constitutes in our dataset is de-scribed as follows—Sports goods (8%), Food products (29%),Life insurance (8%), Babycare products (6%), Clothing (6%),Beauty products (10%), Automobiles (10%), Consumer elec-tronics (16%) and Public service announcement (7%).

B. User-Study

In the user-study, we invite a group of users to participate inthe experiment by watching a set of videos and provide sub-jective responses to systematically chosen questions. In total,there were 48 ( and ) participantswhich included undergraduate, graduate students from the uni-versity in the age-group of and an average age of 24.Each subject was compensated monetarily for participating inthe experiment. The users had no knowledge about the under-lying advertisement insertion method being used in the videothey are watching.After each user watches a video, he/she is presented with

a questionnaire which judges the user’s viewing experience.The advertising strategies and the questions have a significantoverlap or are similar to those asked in VideoSense [2], thesequestions effectively quantify user experience and also help uscompare with VideoSense as a baseline. Following four ques-tions are posed to the user at the end of each video. The usersprovide a rating over the scale of 1 (bad)—5 (good) to each ofthe four questions.1) Are the advertisements uniformly distributed across thevideo?

2) How disturbing are the advertisement insertion points tothe flow of the video?

3) How relevant are the advertisements to the video at theinsertion point?

4) How do you rate the overall viewing experience?We have a total of 15 different videos on which we apply

the three different advertisement insertion strategies- PRPR,VideoSense, CAVVA to obtain the output videos. We use ablock-based design where each user watches 4 different outputvideos from one of the three advertising strategies. Across allthe videos and users, we have samples which

Page 6: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

20 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014

is a significant sample size for analyzing the average ratingsfor each of the 4 questions. As the sample size increases, thestandard error of mean would decrease as it is inversely relatedto the square-root of the sample size. To avoid subject bias,we make sure that each of the videos is watched by at least 4people. The selection of the videos and also the order in whicheach user watches the videos is randomized so as to not induceany kind of bias in the experimental results.

C. Advertisement/Brand Recall

In addition to the subjective questionnaire, the users are alsoasked to recall the advertisement and the brand. Advertisement/brand recall is an important metric to measure advertisement ef-fectiveness [13]. We chose to measure the recall based metricsto check how well the user remembers the advertisement/brand,with an assumption that a user who remembers the advertise-ment is more likely to buy the product. We collect the followingrecall based metrics from the users:• Cued recall: The users are asked to recall advertisements/brands seen during the video, from a given list of adver-tisements/brands.

• Uncued recall: The user is asked to describe the advertise-ment/brand (in his/her own words) which he/she has seenduring the video.

• Immediate recall: Immediate recall is measured immedi-ately after the user has watched the video.

• Day-after recall: This recall metric is measured a day afterthe user has watched the videos.

D. Eye-Tracking Experiment: Measuring Pupillary Dilationas a Proxy for Arousal

Pupillary dilation (PD) is a metric used in eye-trackingstudies to measure the variation in sizes of the pupil in the eyewhen exposed to certain kinds of stimuli. Relevant studies asreported in the literature- [14], [15] and [16] suggest that pupil-lary dilation gives reliable estimates of physiological arousal.According to [17], variations in the size of pupil also measuresuser’s interest and engagement levels with the stimuli.Wemeasure the pupillary dilation of the users while watching

the videos using the eye-tracker, which gives us the length ofthe major axis of the pupil and the length of theminor axis . In order to measure the pupil size, wecompute the area of the ellipse using

. After computing the elliptical area, we find thatthe PD signal is a highly fluctuating signal.We smoothen the PDsignal with a moving average filter. The smoothened PD signalis then normalized to get the value between 0 and 1. Highervalues of pupillary dilation imply higher engagement levels.The eye-tracking experiment is conducted in a non-intrusive

manner on the same set of users who participate in the user-study. All subjects had normal or corrected to normal vision.The user is seated in front of a monitor where the video is beingplayed and the eye-tracker is placed in front of the monitor.We use a binocular infra-red based remote eye-tracking devicecalled SMI RED 250 from SMI, which can record eye-move-ment data at a frequency of 250 Hz.

Fig. 4. Self-Assessment Manikin [18], used to obtain ground-truth valence,arousal data for the videos and the advertisements.

E. Ground Truth Data

In CAVVA, we identify scenes with high/low arousal/va-lence to insert advertisements. In this experiment, we generateground-truth labels for arousal and valence for the scenes ofa video, from an independent group of users, using the SAM[18]. As illustrated in Fig. 4, SAM is a 9-point scale for arousaland valence. A scene with an average score of above 7 on thetwo dimensions is labelled as a high arousal/valence scene.Similarly, a scene with an average score of below 3 on thetwo dimensions is labelled as a low arousal/valence scene. Wecompare the ground-truth arousal labels with the arousal labelsgenerated using the pupillary dilation signal [17] and obtainan agreement level of 69%. We also compare the ground-trutharousal and valence labels with the labels generated using thecontent-based arousal and valence scores [7] and obtain anagreement level of 74% for arousal and 71% for valence. Theseannotators do not participate in the experiments for evaluatingCAVVA.

V. RESULTS AND DISCUSSION

This section provides the results for each of the experimentsconducted and also discuss the interesting findings from the ex-periments. The optimization framework identifies the advertise-ment insertion points and also selects the appropriate advertise-ments simultaneously. Fig. 5 depicts the example advertisementinsertion points and the corresponding advertisements for thethree different advertising strategies—Pre-roll/post-roll adver-tising, VideoSense [2] and CAVVA, applied on a video from ourdataset. For CAVVA, we plot the arousal and valence scores foreach of the scenes and the advertisements. Observing the va-lence-arousal plot, we can say that our advertisement insertionstrategy indeed makes a transition from a low valence state to ahigher valence state. In order to extend the observation, we pro-vide a visualization of valence scores at 15 randomly chosenadvertisement insertion points, from different videos, in Fig. 6.The y-axis in the figure represents the valence score and thex-axis represents the current scene, an advertisement followedby the next scene. A maximum valence of 65 (y-axis limit inFig. 6) is observed in our dataset. We see in Fig. 6, that our op-timization framework brings about the desired lower to higher

Page 7: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

YADATI et al.: CAVVA: COMPUTATIONAL AFFECTIVE VIDEO-IN-VIDEO ADVERTISING 21

Fig. 5. Frames from the result of applying the three different advertisementinsertion strategies—PRPR (row 1), VideoSense (row 2), CAVVA (row 3), onan example video. The graph plots the valence, arousal scores for CAVVA (row4).

Fig. 6. Visualization for transition in valence for 15 randomly chosen adver-tisement insertion points.

valence transition in accordance with the rules 1–4 described inSection III.In the following sub-sections, we demonstrate how CAVVA

fares when compared to the existing advertising strategies.The results record the subjective user experience, advertise-ment/brand recall and results from the eye-tracking experi-ments.

A. Subjective User Experience

User experience is recorded in terms of the ratings for thefollowing systematically chosen subjective parameters: uniformdistribution of advertisements, disturbance to the program flow,relevance of the advertisement and the overall viewing experi-ence. Fig. 7 presents the average ratings along with the standarddeviations from the user-study based on the above parameters,where the X-axis represents the label for each of the subjec-tive questions and the Y-axis represents the average rating toeach of the questions with 1 representing a bad rating and 5representing the best rating. Each color code is for one of thethree advertising strategies—PRPR, VideoSense and CAVVA.From Fig. 7, we observe that CAVVA performs better, in termsof average ratings, when compared to the pre-roll/post-roll ad-vertising and VideoSense [2]. In order to establish the statis-

Fig. 7. Average ratings to (1) Uniform distribution of videos, (2) Disturbanceto the flow of the video, (3) Relevance of the advertisement and (4) Overallviewing experience for each of the three different advertising strategies.

Fig. 8. Average ratings for the most liked video (left) and the most dislikedvideo (right).

tical significance of the results, we perform a two-sample Kol-mogorov-Smirnov test (KS test) at three different significancelevels— . We chose KS test as it does not as-sume an underlying normal distribution for the data. Table IVpresents the results from the significance test, where columns2–5 present the significance levels for each of the subjectivequestions and column 1 gives the two distributions being com-pared. The null hypothesis for the test is that the responses, toeach question, for different advertising strategies are from thesame distribution and the alternate hypothesis is that they arefrom different distributions. We can reject the null hypothesis atthe significance levels mentioned in Table IV. We have used avariety of videos and different users would like different videos.Program liking is considered to be an important factor in deter-mining the user’s reactions to advertisements as mentioned in[19]. In order to ensure that liking for a video did not influencethe ratings given to the subjective questions, we asked each userto give a rating to the videos based on their liking. Based on thisinformation, we selected top 20% most liked videos and least20% liked videos and present the average ratings to the ques-tions. Fig. 8 presents the results for the most liked and the mostdisliked video. Observing Fig. 8, we find that there is a trendwhich shows that CAVVA performs on par or better than theother two advertising strategies.

B. Advertisement/Brand Recall

We measure the various dimensions [20] of advertisement/brand recall (cued/uncued, immediate/day-after), which quan-

Page 8: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

22 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014

TABLE IVRESULTS FROM THE TWO-SAMPLE KOLMOGOROV-SMIRNOVTEST FOR THE FOUR SUBJECTIVE QUESTIONS: Q1-UNIFORMDISTRIBUTION OF ADVERTISEMENTS, Q2-DISTURBANCE TO THEPROGRAM FLOW, Q3-RELEVANCE OF THE ADVERTISEMENT,

Q4-OVERALL VIEWING EXPERIENCE

Fig. 9. Average immediate recall (left) measured as—(1) Uncued advertise-ment recall, (2) Uncued brand recall, (3) Cued advertisement recall and (4) Cuedbrand recall for the three different advertising strategies. Average day-after re-call (right) measured as (1) Cued advertisement recall and (2) Cued brand recall.

tifies how much of the advertisement content was assimilatedby the user and how well the user remembers the advertise-ment/brand immediately after the session and after some timehas passed (day-after recall). Fig. 9 presents the results for im-mediate recall (left) and day-after recall (right). Each group ofbars in the figure, on the left, represents a dimension of the im-mediate recall—uncued advertisement recall, uncued brand re-call, cued advertisement recall, cued brand recall in the sameorder. Each group of bars in the figure, on the right, representsa dimension of day-after recall—cued advertisement recall andcued brand recall in that order. The Y-axis represents the av-erage recall in each of the figures, with 1 representing com-plete recall of the advertisements. For the immediate recall,we observe that the recall metrics are greater or on par withVideoSense [2] and greater than the pre-roll/post-roll adver-tising (left of Fig. 9). We also observe that there is only a slightdifference between the different advertising strategies, in termsof average ratings, for immediate cued advertisement/brand re-call. This behavior is because the user can associate the given listof advertisements/brands and remember them better as he/shehas just seen the video. The error lines on top of the bars repre-sents the standard deviation of the recall scores and we observethat it is low. We see a clear increasing trend across the threeadvertising strategies for the day-after recall—cued advertise-ment/brand recall. We do not measure the uncued recall metricsa day after the users have watched the video because it wouldbe difficult for the user to write a description on his/her ownwithout help.

C. Eye-Tracking Experiment

In the previous sub-sections, we have seen that CAVVAimproves the subjective experience of the user and also facil-itates advertisement/brand recall. In the current eye-tracking

Fig. 10. Average pupillary dilation (arousal) during advertisements.

experiment, we measure pupillary dilation as highlighted inSection IV-D. Since pupillary dilation measures user’s in-terest and engagement levels [17], a new advertising strategyshould induce similar or enhanced engagement levels, withthe advertisements, when compared to the existing advertisingstrategies. In Fig. 10, we plot the average normalized (0–1)pupillary dilation during advertisements for all the users in fivedifferent categories—All advertisements, familiar advertise-ments, unfamiliar advertisements, pleasant advertisements andunpleasant advertisements. For the category of all advertise-ments, we observe that there is a clear trend which shows thatCAVVA induces higher user engagement, when compared tothe other advertising strategies. The trend is increasing fromPre-roll/post-roll advertising, VideoSense [2] to CAVVA. Inorder to ensure that there is no user bias towards particularkind of advertisements, we also present the average pupillarydilation values for familiar vs. unfamiliar advertisements andpositive vs. negative advertisements.

VI. SUMMARY AND CONCLUSION

In this paper, we explored a novel and effective video-in-video advertising strategy—Computational AffectiveVideo-in-Video Advertising (CAVVA), which takes intoaccount the affective impact of a video and a set of adver-tisements. An optimization framework is proposed using0–1 Non-linear Integer Programming (NIP) paradigm, toidentify the advertisement insertion points and select suit-able advertisements simultaneously. We propose a solutionto the problem using the evolutionary genetic algorithms. Avariety of experiments are conducted to demonstrate the ef-fectiveness of our advertising strategy (CAVVA) and comparethe results with existing video-in-video advertising strate-gies—Pre-roll/Post-roll advertising and VideoSense [2]. Theresults from the experiments suggest that CAVVA enhancesuser experience (subjective user-study), improves user recall(recall experiments) and an increased user engagement with theadvertising content (eye-tracking experiment), when comparedto the other advertising strategies. In CAVVA, we have con-sidered the affective angle to advertising and our future workincludes exploring the semantics perspective to video adver-tising. Exploring the semantics perspective and also utilizingthe user-attention towards the video content to improve the

Page 9: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY ...homepage.tudelft.nl/z0q8q/cavva.pdf · IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014 15 CAVVA: Computational

YADATI et al.: CAVVA: COMPUTATIONAL AFFECTIVE VIDEO-IN-VIDEO ADVERTISING 23

advertisement insertion strategy remains our future focus. Theability to create personalized advertising is another direction ofinterest to us and is an ongoing work.

REFERENCES

[1] L. Kennedy and M. Naaman, “Less talk, more rock: automated or-ganization of community-contributed collections of concert videos,”in Proc. ACM 18th Int. Conf. World Wide Web, ser. WWW ’09, NewYork, NY, USA, 2009, pp. 311–320.

[2] T. Mei, X.-S Hua, L. Yang, and S. Li, “Videosense: towards effectiveonline video advertising,” in Proc. 15th Int. Conf. Multimedia., 2007,pp. 1075–1084.

[3] E. Plessis, The Advertised Mind: Ground-Breaking Insights Into HowOur Brains Respond to Advertising. London, U.K.: Kogan Page,2005.

[4] J. A. Russell, “A circumplex model of affect,” J. Personal. Social Psy-chol., vol. 39, no. 6, pp. 1161–1178, 1980.

[5] J. Broach, V. Carter, J. Page, J. Thomas, and R. D. Wilson, “Televisionprogramming and its influence on viewers’ perceptions of commer-cials: The role of program arousal and pleasantness,” J. Advertising,vol. 24, no. 4, pp. 45–54, 1995.

[6] M. A. Kamins, L. J. Marks, and D. Skinner, “Television commercialevaluation in the context of program inducedmood: Congruency versusconsistency effects,” J. Advertising, vol. 20, pp. 1 –14, 1991.

[7] K. Yadati, H. Katti, and M. Kankanhalli, “Interactive video-in-videoadvertising: A multi-modal affective approach,” in Proc. Int. Conf.Multimedia Modelling (MMM), Jan. 2013.

[8] W.-S. Liao, K.-T. Chen, and W. H. Hsu, “Adimage: video advertisingby image matching and ad scheduling optimization,” in Proc. 31stAnnu. Int. ACM SIGIR Conf. Research and Development in Informa-tion Retrieval, 2008, pp. 767–768.

[9] Z. Rasheed and M. Shah, “Scene detection in Hollywood movies andtv shows,” in Proc. IEEE Computer Society Conf. Computer Vision andPattern Recognition, 2003, Jun. 2003, vol. 2, pp. II–343-8.

[10] A. Hanjalic and L.-Q. Xu, “Affective video content representation andmodeling,” IEEE Trans. Multimedia, vol. 7, no. 1, pp. 143–154, Feb.2005.

[11] T. Mei, X.-S. Hua, and S. Li, “Videosense: a contextual in-video ad-vertising system,” IEEE Trans. Ciruits Syst. Video Technol., vol. 19,no. 12, pp. 1866–1879, Dec. 2009.

[12] D. Whitley, “A genetic algorithm tutorial,” Statist. Comput., vol. 4, pp.65–85, 1994.

[13] P. W. Farris, N. T. Bendle, P. E. Pfeifer, and D. J. Reibstein, MarketingMetrics: The Definitive Guide to Measuring Marketing Performance,2nd ed. Philadelphia, PA, USA: Wharton School Publishing, 2010.

[14] E. H. Hess and J. M. Polt, “Pupil size as related to interest value ofvisual stimuli,” Science, vol. 132, no. 3423, pp. 349–350, 1960.

[15] M.M. Bradley, L. Miccoli, M. A. Escrig, and P. J. Lang, “The pupil as ameasure of emotional arousal and autonomic activation,” Psychophys-iology, vol. 45, no. 4, pp. 602–607, 2008.

[16] I. Ólafadóttir, I. De Lemos J, G. Sadeghnia, and O. Jensen, “Measuringemotions using eye tracking,” inProc. 6th Int. Conf. Methods and Tech-niques in Behavioral Research, 2008, A. Spink (ed.).

[17] H. Katti, K. Yadati, M. Kankanhalli, and T. S. Chua, “Affective videosummarization and story board generation using pupillary dilation andeye gaze,” in Proc. 2011 IEEE Int. Symp. Multimedia (ISM), Dec. 2011,pp. 319–326.

[18] M. M. Bradley and P. J. Lang, “Measuring emotion: The self-assess-ment Manikin and the semantic differential,” J. Behavior Therapy Ex-periment. Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.

[19] K. S. Coulter, “The effects of affective responses to media context onadvertising evaluations,” J. Advertising, vol. 27, no. 4, pp. 41–51, 1998.

[20] R. Adany, S. Kraus, and F. Ordonez, “Allocation algorithms for per-sonal tv advertisements,” Multimedia Syst., vol. 19, no. 2, pp. 79–93,2013.

Karthik Yadati received his B.Tech. degree incomputer science and engineering from the In-ternational Institute of Information Technology,Hyderabad (India) in 2007. He worked as a softwareengineer during 2007-2010 in the Enterprise ContentManagement wing of IBM India Software Labs. Hethen completed his Masters degree in Computingfrom the National University of Singapore in 2012.He is currently an MSc. candidate at the NUSSchool of Computing. His research interests includecomputational multimedia advertising, multimedia

analysis, eye-tracking and its applications.

Harish Katti received his PhD in computer sciencefrom the National University of Singapore, in 2012and a B. Engg degree from Karnatak University,in 2000. He also received a Masters degree inBio-Medical Engineering from the Indian Instituteof Technology, Bombay, in 2006. He worked inopen standards based multimedia software devel-opment during 2000–2004 and was involved in thedesign and development of middleware to supportmultimedia applications. His research interests liebroadly at the intersection of cognition and media

and more specifically in experimental and computational vision research. He iscurrently a post-doctoral fellow at the Center for Neuroscience, Indian Instituteof Science, Bangalore.

Mohan Kankanhalli is a Professor at the De-partment of Computer Science of the NationalUniversity of Singapore. He is also the AssociateProvost for Graduate Education at NUS. Before that,he was the Vice-Dean for Academic Affairs andGraduate Studies at the NUS School of Computingduring 2008-2010 and Vice-Dean for Researchduring 2001-2007. Mohan obtained his BTech fromIIT Kharagpur and MS & PhD from the RensselaerPolytechnic Institute. He first joined the Institute ofSystems Science, NUS in Singapore as a researcher.

He then was a faculty member at the Department of Electrical Engineeringof the Indian Institute of Science, Bangalore. He is with the NUS Schoolof Computing since 1998. His current research interests are in MultimediaSystems (content processing, retrieval) and Multimedia Security (surveillanceand privacy). He has been recently awarded a large grant by Singapore’sNational Research Foundation to set up the Centre for “Sensor-enhanced SocialMedia”. Mohan is actively involved in organizing of many major conferencesin the area of Multimedia. He is on the editorial boards of several journalsincluding the ACM Transactions on Multimedia Computing, Communications,and Applications, Springer Multimedia Systems Journal, Pattern RecognitionJournal and Multimedia Tools & Applications Journal.