
216 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 17, NO. 2, FEBRUARY 2015

Multimedia Summarization for Social Events in Microblog Stream

Jingwen Bian, Yang Yang, Hanwang Zhang, and Tat-Seng Chua

Abstract—Microblogging services have revolutionized the way people exchange information. Confronted with the ever-increasing numbers of social events and the corresponding microblogs with multimedia contents, it is desirable to provide visualized summaries that help users quickly grasp the essence of these social events for better understanding. While existing approaches mostly focus only on text-based summaries, microblog summarization with multiple media types (e.g., text, image, and video) is scarcely explored. In this paper, we propose a multimedia social event summarization framework to automatically generate visualized summaries from the microblog stream of multiple media types. Specifically, the proposed framework comprises three stages, as follows. 1) A noise removal approach is first devised to eliminate potentially noisy images. An effective spectral filtering model is exploited to estimate the probability that an image is relevant to a given event. 2) A novel cross-media probabilistic model, termed Cross-Media-LDA (CMLDA), is proposed to jointly discover subevents from microblogs of multiple media types. The intrinsic correlations among these different media types are explored and exploited for reinforcing the cross-media subevent discovery process. 3) Finally, based on the cross-media knowledge of all the discovered subevents, a multimedia microblog summary generation process is designed to jointly identify both representative textual and visual samples, which are further aggregated to form a holistic visualized summary. We conduct extensive experiments on two real-world microblog datasets to demonstrate the superiority of the proposed framework as compared to the state-of-the-art approaches.

Index Terms—Microblog, multimedia summarization, social event.

I. INTRODUCTION

RECENT years have witnessed the emergence of microblogging services that change the way people live, work and communicate. For example, Sina Weibo,¹ one of the largest microblogging platforms on the Web, has attracted more than 500 million registered users, and the average number of daily active users reached 46 million by the end of 2012. The contents of microblogs are becoming more multimedia, with close to 37% of Sina Weibo microblogs containing images [1]. With wide availability of information sources, rapid information propagation and ease of use, microblogging has quickly become one of the most important media for sharing, distributing and consuming interesting contents and topics.

Manuscript received March 30, 2014; revised September 30, 2014; accepted November 08, 2014. Date of publication December 22, 2014; date of current version January 15, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Cees G. M. Snoek.
J. Bian, H. Zhang, and T.-S. Chua are with the School of Computing, National University of Singapore, Singapore 119077 (e-mail: [email protected]; [email protected]; [email protected]).
Y. Yang is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610051, China (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2014.2384912
1520-9210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

¹ [Online]. Available: http://www.weibo.com

Fig. 1. Illustration of social events on Sina Weibo.

One of the most important functionalities of microblogging services is to monitor hot trends, also known as social events. Given the streaming data, various techniques ([2]–[4]) have been proposed for social event detection. Currently, most microblogging platforms provide a list of ongoing social events, a potentially useful service that helps users conveniently gain a quick and concise impression of the current hot social events. For example, Twitter provides the Trends service, and Sina Weibo provides the Hot Topics service (as presented in Fig. 1). However, social event detection is not the end of the story. It only provides cues of the existence of a new event, together with a tremendous volume of unorganized microblog posts, which usually offer too many details to browse. Without an effective summarization mechanism, users are often confronted with incomplete, irrelevant and duplicate information, which makes it difficult to capture the essence of the event and easy to miss valuable information. Therefore, it would be of great benefit if an effective mechanism could be provided for summarizing the detected social events. In this paper, we focus on the step after social event detection: given the microblog posts related to a detected social event, we aim to mine the different divisions under this event (denoted as subevents in this paper), and to summarize these subevents precisely and concisely.

It is natural to formulate microblog summarization as a multi-document summarization (MDS) [5] task, which has been widely studied in information retrieval. MDS is an automatic procedure aimed at extracting information from multiple texts about the same topic. The resulting summary report allows readers to quickly capture the essential information contained in a large cluster of documents. Most of the previous MDS techniques are designed for well-organized texts, such as news articles. However, two challenges make it very difficult to directly apply the traditional MDS techniques to summarize microblogs: 1) word restriction (e.g., a maximum of 140 characters for Twitter²) and 2) unreliable microblog contents. Recently, several attempts ([6]–[8]) have been made to summarize microblog text by considering the above limitations. Nonetheless, the performance is still far from satisfactory because they did not address the lack of descriptive power caused by word restriction and noisy content.

² [Online]. Available: http://twitter.com

Different from traditional documents that contain only textual objects, microblogs comprise contents of various media types, such as images and video links. For instance, of the 310,097 microblogs in our Social Trends dataset (refer to Section V for details about the dataset), 114,426 (36.9%) contain images. Such a high proportion of multimedia contents is a potentially precious resource for handling different problems ([9], [10]). The benefit of incorporating different media types into summarization is three-fold: 1) In many cases, images contain essential information which cannot be completely expressed by the microblog texts. Therefore, the visual information is of great significance for summarizing the event and compensating for the limited descriptive power of short texts. In addition, when the emphasis of an event lies in the visual part, a purely textual summary is not meaningful. 2) Multimedia contents can facilitate subevent discovery. Intuitively, given a social event, multimedia contents from different subevents should have lower visual similarity while those within the same subevent should have higher visual similarity. Thus, discriminative information embedded in the visual information of multimedia contents can be exploited as critical cues for subevent discovery. 3) Incorporating concrete multimedia exemplars into summarization can help users gain a more visualized understanding of interesting events.

It is non-trivial to integrate textual and visual information to generate comprehensive summaries in the context of microblogs. If the intrinsic correlations between textual and visual information are not well explored, they may exert a negative influence on each other. In our previous work [11], we targeted this problem and proposed a framework to automatically generate a multimedia summary for a trending topic in microblogs. By comparing with many state-of-the-art summarization methods, we demonstrated the superiority of the proposed framework in terms of effectiveness and practicality. In the meantime, however, we also discovered another severe problem regarding the quality of microblog posts. The input of our problem is a stream of microblog posts related to a detected social event, and the detection process is usually text-based, with no visual information taken into consideration. As a result, inconsistency between the textual part and the visual part of the same microblog is very common, and we are faced with irrelevant microblog texts and/or images in the input data stream. For instance, according to our statistics, the percentage of relevant images in our datasets is only 67.1% (refer to Section V for details). Directly utilizing such noisy images may severely degrade the performance of subevent discovery and summarization. Taking this issue into consideration, we extend the previous work by adding an important component to our framework, which removes irrelevant data before the succeeding procedures of subevent discovery and summary generation.

To sum up, in this paper, we propose a novel multimedia social event summarization framework to generate a holistic visualized summary from microblogs with multiple media types. Specifically, the proposed framework comprises three stages: removal of irrelevant data, cross-media subevent discovery and multimedia summary generation. First, we devise a data cleansing approach to automatically eliminate irrelevant/noisy images. An effective spectral filtering model is exploited to estimate the probability that an image is relevant to a given event. In the second stage, we propose a novel cross-media probabilistic model, termed Cross-Media-LDA (CMLDA), to jointly exploit the microblogs of multiple media types for discovering subevents. The CMLDA model not only explores and exploits the intrinsic correlations among different media types, but also simultaneously characterizes both the general distribution and the subevent-specific distribution from the microblog data of various media types for reinforcing the subevent discovery process. In addition, this step can also handle the noise in the input data and remove noisy microblog examples from the next summarization step. Finally, based on the cross-media distribution knowledge of all the discovered subevents, we generate a holistic visualized summary for the social events by pinpointing both the representative textual and visual samples in a joint fashion. In particular, by utilizing the cross-media distributions of microblog text, we specify three criteria, namely coverage, significance and diversity, to measure the summarization capability of individual textual samples. We then devise a greedy algorithm for identifying the representative microblog texts based on the combination of the three criteria.
For visual summarization, we employ the cross-media knowledge of the subevents as the prior knowledge for ranking the visual samples and selecting the most representative ones. In order to improve the descriptive power and the diversity of viewpoints, we first partition the images within a subevent into groups via spectral clustering. Then, for each group we apply a manifold algorithm with the cross-media prior knowledge as initial ranking scores to identify the top-ranked image as representative. It is worth noting that both the textual and visual summarization processes utilize the cross-media knowledge of the discovered subevents and thus are intrinsically connected to reinforce each other.

The rest of the paper is organized as follows. Related works are briefly summarized and discussed in Section II. Section III introduces the problem formulation and framework overview. We elaborate the details of the proposed multimedia social event summarization framework in Section IV, including noise removal, subevent discovery and microblog summary generation. Experimental results are reported and discussed in Section V, followed by the conclusion in Section VI.

II. RELATED WORK

Multi-document summarization has drawn much attention in the past two decades. MDS methods can generally be divided into two types: extractive summarization and abstractive summarization. The former usually ranks the sentences in the documents according to their salience scores calculated from a set of predefined features and then extracts the top-ranked sentences, while the latter involves information fusion, sentence compression and reformulation. Over the years, many methods have been proposed for MDS, and most of them are extractive methods. In this work, we focus on extractive summarization, and take each microblog as the extractive unit instead of a single sentence.

Fig. 2. Flowchart of the proposed multimedia social event summarization framework.

Notable MDS methods include SumBasic [12] and centroid-based algorithms. The underlying premise of SumBasic is that words which occur more frequently across documents have a higher probability of being selected for human-created multi-document summaries than words that occur less frequently. MEAD [13], one implementation of the centroid-based method, discovers the centroid of each document cluster, and extracts the sentences closest to the centroid. The maximal marginal relevance (MMR) [14] method is adopted to remove redundant sentences. Another direction is to use graph-based methods to rank sentences based on the votes between each other. TextRank [15] and LexPageRank [16] use algorithms like PageRank and HITS to compute sentence importance. These methods first construct a graph representing the relationship between sentences and then evaluate the importance of each sentence based on the topology of the graph. Some other methods are designed to identify semantically important sentences for summary generation. For example, [17] uses the latent semantic analysis (LSA) technique to select highly ranked sentences; and a hierarchical LDA-style model is utilized in [18] to represent content specificity as a hierarchy of topic vocabulary distributions, and sentences are then selected according to these distributions. Other methods include NMF-based topic-specific summarization [19], conditional random fields (CRF) based summarization [20], and a hidden Markov model (HMM) based method [21]. Most recently, Wang et al. [22] proposed a framework to summarize multiple documents via sentence-level semantic analysis and symmetric matrix factorization. Since all these methods are designed for well-organized formal texts, directly applying them to microblog datasets is not very suitable.

With the development of microblogging, many works have shifted their focus to processing microblog data. Most of the prior work on Twitter data summarization is about topic-level summarization. Sharifi et al. [8] summarized Twitter hot topics by finding the most commonly used phrase that encompasses the topic phrase. In [6], clustering algorithms are introduced for Twitter topic summarization to select multiple posts that convey information about a given topic without being redundant. Chakrabarti et al. [7] formalized the problem of tweet summarization for some highly structured and recurring events, and offered a solution based on learning the underlying hidden state representation of the event via HMM. Lin et al. [23] generated the storyline of an ongoing event via graph optimization for microblogs. The temporal information is utilized for event representation, and this framework is only suitable for relatively long-term events. However, the hot period of most social events to be summarized for our task is usually one day only, which makes the temporal information less valuable for such a short term. One problem for all the approaches above is that they only focus on the textual information, while the precious multimedia resources are neglected.

Several previous works have leveraged the importance of multimedia resources [24], [25] and proposed methods for multimodal information representation. Yan et al. exploited the fact that news documents are often accompanied by pictures and proposed a graph-based framework to generate a timeline summary for a news topic in [26]. Mutual Summarization is proposed in [27] to summarize images by text and visualize text utilizing images. The target domain for these previous efforts is news. Compared with user-generated microblog contents, news articles are formalized and unified across multiple documents. Besides, the number of corresponding images in news articles is usually small, and the images are of high quality and high relevance. Hence, multimedia summarization techniques designed for the news domain cannot be directly adopted for microblog data.

III. FRAMEWORK OVERVIEW

Suppose we are given a microblog stream $\mathcal{M} = \{m_1, m_2, \dots, m_{|\mathcal{M}|}\}$ related to the same social event $E$, which can be either provided by the social event service of an online microblog platform or detected by any social event detection method. Each microblog $m_i \in \mathcal{M}$ consists of two components: a textual component $t_i$ and a visual component $v_i$. Note that $v_i$ may be empty, which means $m_i$ contains no visual sample. $|\cdot|$ denotes the cardinality of a set. The objective of our framework is to automatically generate a multimedia summary (i.e., both textual and visual) from the microblog collection $\mathcal{M}$ for revealing multiple subevents of the event $E$. For event $E$, we define its event-level summary as the union of all its subevents' summaries. For each subevent, a subevent-level summary comprises both textual and visual exemplars selected from $\mathcal{M}$. The flowchart of the proposed multimedia social event summarization framework is illustrated in Fig. 2. As can be seen, there are three main stages in the whole process. In the first stage, we preprocess both the textual and visual data of microblogs and eliminate irrelevant images, which results in a more reliable microblog collection. In the second stage, we discover subevents from microblogs by using the proposed CMLDA model. This model determines the subevent assignment for each microblog, as well as the cross-media distribution knowledge for all subevents. Then, the next stage exploits the cross-media knowledge of each individual subevent for jointly identifying both the representative textual and visual exemplars, which form the subevent-level summary. Finally, we aggregate all the subevent-level summaries to derive the holistic summary for the social event.

IV. MULTIMEDIA SOCIAL EVENT SUMMARIZATION

In this section, we elaborate the details of the proposed multimedia social event summarization framework, including the removal of irrelevant data, the cross-media subevent discovery and the multimedia summary generation.

A. Removal of Irrelevant Data

As a kind of user-generated content, the quality of microblogs cannot be guaranteed. It has been observed that many microblog images are irrelevant to their corresponding texts (e.g., spam images). Directly applying our framework to such a noisy image set may severely degrade the performance of the summarization. Since the input microblog collection is gathered with text-based methods, the problem of noisy images is more severe than that of noisy texts. Therefore, it is necessary to first pre-filter microblog images to eliminate the noisy ones. The problem of noisy texts will be addressed in the following subevent discovery procedure.

Specifically, we develop the noise removal procedure by exploiting a spectral filtering model [28]. Without loss of generality, suppose we have $n$ microblog images $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$ corresponding to all the non-empty images of the given social event $E$, where $x_i \in \mathbb{R}^{d}$ and $d$ is the dimension of visual space. We first build a neighborhood graph $G = (\mathcal{V}, \mathcal{E}, W)$, where $\mathcal{V}$ is a vertex set composed of $n$ vertices representing our images in $X$, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is an edge set connecting neighboring vertices, and $W \in \mathbb{R}^{n \times n}$ is a weighting matrix measuring the strength of the edges, i.e., the similarity between two data points. There are various methods to compute $W$. In this work, we adopt the widely used $k$-Nearest-Neighbors ($k$NN) similarity graph

$$W_{ij} = \begin{cases} \exp\left(-\dfrac{\mathrm{dist}(x_i, x_j)^2}{2\sigma^2}\right), & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases}$$

where $\mathrm{dist}(\cdot,\cdot)$ is a distance measure such as Euclidean distance, and $\sigma$ is the bandwidth parameter. $N_k(x_i)$ denotes the set of $k$ nearest neighbors of $x_i$ in $X$. We further define $D$ as a degree matrix whose diagonal elements are $D_{ii} = \sum_{j} W_{ij}$. With $W$ and $D$, the normalized graph Laplacian is defined as

$$L = I - D^{-1/2} W D^{-1/2}.$$
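The graph construction can be illustrated with a minimal pure-Python sketch. It assumes a Gaussian kernel with bandwidth `sigma` over Euclidean distance and the symmetric normalized Laplacian $I - D^{-1/2} W D^{-1/2}$; the function names and the toy points are hypothetical and are not taken from the paper.

```python
import math


def knn_similarity_graph(points, k, sigma):
    """Symmetric k-NN graph with Gaussian weights (kernel form is an assumption)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point, excluding the point itself.
    neighbours = [
        set(sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k])
        for i in range(n)
    ]
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Keep the edge if either point is among the other's neighbours.
            if i != j and (j in neighbours[i] or i in neighbours[j]):
                weights[i][j] = math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2))
    return weights


def normalized_laplacian(weights):
    """L = I - D^{-1/2} W D^{-1/2}, where D is the diagonal degree matrix."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * weights[i][j] * inv_sqrt[j]
         for j in range(n)]
        for i in range(n)
    ]


# Two well-separated pairs of toy "image features".
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
W = knn_similarity_graph(toy_points, k=1, sigma=1.0)
L = normalized_laplacian(W)
```

On real image features one would typically rely on a numerical library for the distance computation and the eigendecomposition; the nested lists here only keep the sketch dependency-free.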

As previously discussed, an intuition is that images depicting the same subevent should be visually similar to each other. Therefore, it is reasonable to assume that the truly relevant images reside in multiple high-density regions, while the irrelevant images present a more random distribution. It has been demonstrated [29] that when data points form clusters, each high-density region implicitly corresponds to a certain low-frequency (smooth) eigenvector. The data points which belong to the region take relatively large absolute values in that eigenvector, while for data points elsewhere, the values are close to zero. With this assumption, we can exploit the spectrum of the $k$NN similarity graph $G$, which is the set of eigenvalue/eigenvector pairs $\{(\lambda_i, u_i)\}_{i=1}^{n}$ of the normalized graph Laplacian $L$, to find the high-density regions. For simplicity, we assume that the eigenvalues are sorted in nondecreasing order, thus the top eigenvectors have the lowest frequency.

Let $y \in \mathbb{R}^{n}$ be a label vector indicating the relevance of each image to the given event. Ideally, $y_i$ takes the value of 1 for all relevant images and 0 for noisy ones. Consider the top $r$ smoothest eigenvectors $u_2, \dots, u_{r+1}$ as eigenbases ($u_1$ is eliminated because it is nearly constant and $\lambda_1 = 0$ when the graph is connected, thus it does not form any region). According to the multi-region assumption, $y$ should lie in the subspace spanned by these eigenbases. Let $U = [u_2, \dots, u_{r+1}]$, $\Lambda = \mathrm{diag}(\lambda_2, \dots, \lambda_{r+1})$, and $y = \mathbf{1}$ initially. The spectral filter reconstructs the noisy label vector $y$ with the sparse eigenbases by solving the following problem:

$$\min_{a} \; \|y - Ua\|_2^2 + \alpha \|a\|_1 + \beta \|\Lambda^{1/2} a\|_2^2 \tag{1}$$

where $a \in \mathbb{R}^{r}$ is the sparse coefficient vector and $\|\cdot\|_1$ is the $\ell_1$-norm. $\alpha$ and $\beta$ are two regularization parameters. Note that the last term $\|\Lambda^{1/2} a\|_2^2 = a^\top \Lambda a$, which is actually a weighted $\ell_2$-norm, imposes that smoother eigenbases with smaller eigenvalues are preferred in the reconstruction of $y$.

Once the solution $a^{*}$ to Eq. (1) is obtained, the truly relevant label vector is set to $\tilde{y} = f(Ua^{*})$, where the function $f(\cdot)$ is applied elementwise and defined as follows:

$$f(x) = \begin{cases} 1, & \text{if } x \geq \tau \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $\tau$ is the threshold indicating the confidence level for a data point to be regarded as relevant. With the final label vector $\tilde{y}$, we eliminate those image samples with $\tilde{y}_i = 0$, and retain a more reliable image set.
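As a hedged illustration of the filtering step, suppose the objective is a least-squares reconstruction of the label vector plus an $\ell_1$ penalty (weight `alpha`) plus an eigenvalue-weighted $\ell_2$ penalty (weight `beta`), and suppose the eigenbases are orthonormal. Under these assumptions the problem decouples per coefficient, because $\|y - Ua\|_2^2 = \|y\|_2^2 - 2a^\top U^\top y + a^\top a$, and each coefficient is a soft-thresholded projection shrunk by its eigenvalue. The sketch below is not the authors' solver; the function names, the parameter values and the simple `>= tau` rule are assumptions.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal map of the l1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def spectral_filter(y, eigvecs, eigvals, alpha, beta, tau):
    """Closed-form solution under the assumption of orthonormal eigenbases.

    eigvecs is a list of r eigenvectors (each a list of length n) and
    eigvals holds the matching eigenvalues.
    """
    coeffs = []
    for u, lam in zip(eigvecs, eigvals):
        projection = sum(ui * yi for ui, yi in zip(u, y))  # entry of U^T y
        coeffs.append(soft_threshold(projection, alpha / 2.0) / (1.0 + beta * lam))
    # Reconstruct the smoothed label vector U a.
    recon = [sum(a * u[i] for a, u in zip(coeffs, eigvecs)) for i in range(len(y))]
    labels = [1 if value >= tau else 0 for value in recon]
    return coeffs, recon, labels


# Toy case: one smooth eigenbasis supported on the first two images.
s = 1.0 / math.sqrt(2.0)
coeffs, recon, labels = spectral_filter(
    y=[1.0, 1.0, 1.0, 1.0],
    eigvecs=[[s, s, 0.0, 0.0]],
    eigvals=[0.1],
    alpha=0.2, beta=1.0, tau=0.5,
)
```

In this toy case only the two images inside the dense region keep the label 1. When the eigenbases are not exactly orthonormal, an iterative solver such as proximal gradient descent would be needed instead of the closed form.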

B. Cross-Media Subevent Discovery

In this subsection, we propose a novel cross-media probabilistic model, termed Cross-Media-LDA (CMLDA), to discover subevents by jointly exploring the intrinsic correlation between the textual and visual aspects of microblogs. The CMLDA model characterizes the multiple facets of the social event by exploring two underlying properties of various media types, i.e., inter-media consistency and intra-media discrimination. In addition, this model is capable of eliminating possibly noisy microblog posts from the data collection gathered for the subsequent summarization process.

Inter-Media Consistency: It is observed that the microblogs associated with the same social event contain various inter-correlated media types, such as texts and images. If we can properly capture and model the intrinsic correlations among these media types, we may achieve a better understanding of the social event. Intuitively, different media types of the same event should be related to certain common topics or share some common high-level semantics. In other words, the semantics should be consistent across different media types. Based on this analysis, we model the common semantics shared among different media types via a subevent indicator $z$, which is able to jointly generate both the textual words and visual words in the microblogs. With the cross-media subevent indicator, we manage to capture the inter-media correlations for effective subevent discovery. It is worth noting that while the traditional latent Dirichlet allocation (LDA) [30] model assigns multiple topics to each individual document and one topic to each word, the proposed CMLDA model is designed to associate only one topic (subevent) with each individual microblog. The underlying reason is that microblog content is usually short and focused, and thus it is reasonable to assume that each microblog is related to only one subevent.

Fig. 3. Graphical model representation of the CMLDA model.

Intra-Media Discrimination: Within each individual media type, it is non-trivial to directly employ a traditional topic modeling approach (e.g., LDA) to discover the subevents within the same social event, because the semantics of different subevents may heavily overlap, while we aim to discover the discriminative knowledge of each subevent.

Normally, we may assume that all subevents of the same event share certain general words indicating common semantics related to the social event, while each individual subevent uniquely possesses certain specific semantics, which distinguish it from other subevents. Take “Lushan Earthquake” as an example: “earthquake”, “Lushan” and “death” are more likely to be general words, while words like “hypocenter”, “collapse” and “Premier” are more likely to appear in different subevents. If the proportion of general contents is large, they may dominate the result. In order to exclude the influence of general contents and discover discriminative cues for each subevent, two new latent variables $r^{T}$ and $r^{V}$ are introduced to guarantee intra-media discrimination in the generation of textual and visual words, respectively. For each textual (visual) word, $r^{T}$ ($r^{V}$) indicates whether it is generated from the general distribution or from the specific distribution corresponding to its subevent.

CMLDA Modeling and Inference: In this part, we elaborate the details of the modeling and inference processes of the CMLDA model. Fig. 3 illustrates the graphical model representation, and the key notations are listed in Table I. The generation process is as follows.

TABLE I
LIST OF KEY NOTATIONS

1. For the event $E$, draw $\phi^{TG} \sim \mathrm{Dir}(\beta^{TG})$ and $\phi^{VG} \sim \mathrm{Dir}(\beta^{VG})$, indicating the general textual and visual distribution, respectively. Then draw $\theta \sim \mathrm{Dir}(\alpha)$, which indicates the distribution of subevents over the microblog collection corresponding to $E$.
2. For each subevent, draw $\phi^{TS}_{k} \sim \mathrm{Dir}(\beta^{TS})$ and $\phi^{VS}_{k} \sim \mathrm{Dir}(\beta^{VS})$, $k = 1, \dots, K$, corresponding to the specific textual and visual distribution.
3. For each microblog $m_i$, draw $z_i \sim \mathrm{Mult}(\theta)$, corresponding to the subevent assignment for $m_i$. Then draw $\pi^{T}_{i} \sim \mathrm{Dir}(\gamma^{T})$, indicating the general/specific textual word distribution of $m_i$. Similarly, draw $\pi^{V}_{i} \sim \mathrm{Dir}(\gamma^{V})$.
4. For each textual word position $j$ of $m_i$, draw a variable $r^{T}_{ij} \sim \mathrm{Mult}(\pi^{T}_{i})$:
• If $r^{T}_{ij}$ indicates General ($G$ for short), then draw a word $w_{ij} \sim \mathrm{Mult}(\phi^{TG})$.
• If $r^{T}_{ij}$ indicates Specific ($S$ for short), then draw a word from the $z_i$-th specific distribution, $w_{ij} \sim \mathrm{Mult}(\phi^{TS}_{z_i})$.
5. The generation of visual words is similar to step 4.

In the CMLDA model, the subevent indicator $z$ as well as the general/specific indicators $r^{T}$ and $r^{V}$ are latent variables to be inferred from observations, i.e., textual and visual words. We use Gibbs sampling for the inference due to its efficiency and effectiveness in handling high-dimensional data. The update rules for the latent variables are expressed with the following notation: $\mathcal{W}^{T}$ and $\mathcal{W}^{V}$ denote the textual and visual vocabulary, respectively. The variables with subscript $i$ correspond to the $i$-th microblog $m_i$, while subscript $j$ corresponds to the $j$-th textual/visual word. $N$ stores the number of samples satisfying certain requirements during the iterative sampling process. For example, $N^{TS}_{k,w,\neg i}$ represents the number of occurrences of word $w$ (excluding the words in $m_i$) in the $k$-th specific textual distribution.

After Gibbs sampling, we obtain the latent variables. Besides, the specific distributions $\phi^{TS}_{k}$ and $\phi^{VS}_{k}$ can also be easily computed. For a textual word $w$, $\phi^{TS}_{k}(w)$ measures the probability of $w$ appearing in the $k$-th specific distribution. It is similar for the visual distribution $\phi^{VS}_{k}$. Therefore, they can be evaluated as follows:

$$\phi^{TS}_{k}(w) = \frac{N^{TS}_{k,w} + \beta^{TS}}{\sum_{w' \in \mathcal{W}^{T}} N^{TS}_{k,w'} + |\mathcal{W}^{T}|\,\beta^{TS}} \tag{3}$$

$$\phi^{VS}_{k}(w) = \frac{N^{VS}_{k,w} + \beta^{VS}}{\sum_{w' \in \mathcal{W}^{V}} N^{VS}_{k,w'} + |\mathcal{W}^{V}|\,\beta^{VS}} \tag{4}$$

With the CMLDA model, textual and visual components facilitate each other in discovering the cross-media knowledge of the subevents hidden in the event. The obtained textual/visual distribution pair ($\phi^{TS}_{k}$, $\phi^{VS}_{k}$) depicts the discriminative multimedia cues for each subevent. According to the subevent indicator $z_i$ of each microblog, the CMLDA model partitions the microblog collection $\mathcal{M}$ into $K$ subsets $\{\mathcal{M}_1, \dots, \mathcal{M}_K\}$ corresponding to $K$ subevents, where each subset $\mathcal{M}_k$ contains both a textual part $T_k$ and a visual part $I_k$. Intuitively, if a subevent contains a small number of textual or visual samples, the topic of this subevent may not be important or related to the event. We argue that such subevents are probably composed of noisy microblogs and should be removed. In our work, we remove all subsets whose sizes are smaller than a preset threshold. In the following subsection, we will employ the cross-media knowledge obtained with CMLDA for the summarization.
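The generative process of steps 1 to 5 can be sketched as a small forward sampler. This is only an illustration under stated assumptions: symmetric Dirichlet priors with hypothetical concentrations `alpha`, `beta` and `gamma`, a fixed number of words per post, and every post carrying visual words. The function names are invented for this sketch, and it does not implement the Gibbs sampling inference.

```python
import random


def dirichlet(rng, concentration):
    """Sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(rng, probs):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def draw_words(rng, n_words, switch_probs, general, specific):
    """Steps 4-5: each word comes from the general or the subevent-specific distribution."""
    words = []
    for _ in range(n_words):
        use_specific = categorical(rng, switch_probs) == 1  # 0 = General, 1 = Specific
        words.append(categorical(rng, specific if use_specific else general))
    return words


def generate_event(rng, n_posts, n_subevents, text_vocab, visual_vocab,
                   words_per_post=8, alpha=1.0, beta=0.5, gamma=1.0):
    # Step 1: general distributions and the subevent mixture of the event.
    general_text = dirichlet(rng, [beta] * text_vocab)
    general_visual = dirichlet(rng, [beta] * visual_vocab)
    theta = dirichlet(rng, [alpha] * n_subevents)
    # Step 2: one specific textual and visual distribution per subevent.
    specific_text = [dirichlet(rng, [beta] * text_vocab) for _ in range(n_subevents)]
    specific_visual = [dirichlet(rng, [beta] * visual_vocab) for _ in range(n_subevents)]
    posts = []
    for _ in range(n_posts):
        # Step 3: a single subevent per post, plus general/specific switch priors.
        z = categorical(rng, theta)
        pi_text = dirichlet(rng, [gamma, gamma])
        pi_visual = dirichlet(rng, [gamma, gamma])
        posts.append({
            "subevent": z,
            "text": draw_words(rng, words_per_post, pi_text,
                               general_text, specific_text[z]),
            "visual": draw_words(rng, words_per_post, pi_visual,
                                 general_visual, specific_visual[z]),
        })
    return posts


posts = generate_event(random.Random(0), n_posts=5, n_subevents=3,
                       text_vocab=20, visual_vocab=30)
```

The key structural difference from standard LDA is visible in step 3: the subevent `z` is drawn once per post and shared by its textual and visual words, rather than drawn per word.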

C. Multimedia Summary Generation

In this subsection, we explore how to utilize the cross-media distribution knowledge of all the discovered subevents to facilitate the generation of the holistic visualized summary with various media types for social events.

Cross-Media Summarization for Microblog Texts: In this part, we propose a method for text summarization based on the cross-media distribution information inferred from both the textual and visual aspects of the microblogs in the subevent discovery procedure. Specifically, a greedy algorithm is developed to sequentially select representative samples based on a novel selection criterion, which takes three fundamental requirements into consideration:

Coverage: Intuitively, if a summary is able to “cover” the information of its corresponding subevent well, then the word distributions over both of them should be close to each other. We use the similarity of the word distributions over a summary and its corresponding subevent for measuring coverage. Denote $S$ as the current summary set consisting of the selected samples; then the word distribution over $S$, denoted as $p_S$, can be estimated as

$$p_S(w) = \frac{tf(w, S)}{\sum_{w' \in \mathcal{W}^{T}} tf(w', S)} \tag{5}$$

where $tf(w, S)$ denotes the term frequency of word $w$ in $S$. We use $\phi^{TS}_{k}$ as the word distribution over the corresponding subevent, which is the distribution estimated in the learning process of the CMLDA model [Eq. (3)]. We employ the Kullback-Leibler (KL) divergence to measure the distance between the two distributions $p_S$ and $\phi^{TS}_{k}$

$$D_{KL}\left(p_S \,\|\, \phi^{TS}_{k}\right) = \sum_{w \in \mathcal{W}^{T}} p_S(w) \log \frac{p_S(w)}{\phi^{TS}_{k}(w)} \tag{6}$$

Given the current summary set $S$, the new sample $m$ to be selected should be the one which makes the new summary (i.e., $S \cup \{m\}$) achieve the best coverage (i.e., minimize the distance between $p_{S \cup \{m\}}$ and $\phi^{TS}_{k}$). Therefore, the coverage of each candidate $m$ can be measured by the following criterion:

$$Cov(m) = D_{KL}\left(p_{S \cup \{m\}} \,\|\, \phi^{TS}_{k}\right) \tag{7}$$
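The coverage criterion can be sketched as a greedy loop that, at each step, adds the candidate whose inclusion brings the summary's word distribution closest to the subevent distribution. The sketch assumes the KL divergence is taken from the summary distribution to the subevent distribution, and that the subevent distribution is smoothed so every candidate word has nonzero probability. It uses coverage alone and ignores significance and diversity; the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def summary_distribution(summary_posts):
    """Normalized term frequencies over all posts in the summary."""
    counts = Counter(word for post in summary_posts for word in post)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(p, q):
    """D_KL(p || q); words with p(w) = 0 contribute nothing to the sum."""
    return sum(pw * math.log(pw / q[word]) for word, pw in p.items() if pw > 0)


def select_by_coverage(candidates, subevent_dist, size):
    """Greedily pick posts that minimize the divergence of the grown summary."""
    summary, remaining = [], list(candidates)
    while remaining and len(summary) < size:
        best = min(
            remaining,
            key=lambda post: kl_divergence(
                summary_distribution(summary + [post]), subevent_dist),
        )
        summary.append(best)
        remaining.remove(best)
    return summary


# Toy subevent distribution and tokenized candidate posts.
subevent_dist = {"quake": 0.5, "rescue": 0.4, "spam": 0.1}
candidates = [["spam", "spam"], ["quake", "rescue"], ["quake", "quake"]]
picked = select_by_coverage(candidates, subevent_dist, size=2)
```

Each step recomputes the summary distribution for every remaining candidate, which costs time quadratic in the number of candidates; maintaining running word counts would avoid the recomputation.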

Significance: In the context of microblogging, each microblog can propagate between users through the repost action. In general, the popularity of a microblog is reflected by its repost number. A large repost number implies that the microblog attracts a lot of attention and interest from other users, and hence indirectly indicates the importance of this microblog. Users will be more satisfied if more of these hot microblogs are shown in the summary. Therefore, we use a smooth function over the repost number to measure the significance of a candidate

(8)

Diversity: Information diversity is favored in the final summary, so we take information redundancy into consideration for sample selection. For a candidate, the redundancy it brings to the summary set can be measured by the similarity between this candidate and the previously generated summary, which is

(9)

Overall Selection Score: The overall selection score is defined as a weighted linear combination of the scores of coverage, significance, and diversity. Since a small distance indicates high coverage, the overall selection score is computed with three trade-off parameters, which sum to 1, weighting the criteria, and with a logistic increasing function normalizing all the scores to a common interval.

With the above selection score for all the microblog samples, we may derive a greedy algorithm for representative sample selection. In each iteration, we select the sample with the largest score from all the remaining samples.

Cross-Media Summarization for Microblog Images: Consider the visual subset of a given subevent, which contains all images related to that subevent. The objective of this step is to employ the cross-media knowledge of the discovered subevents to reinforce the selection of the most representative image samples. The selected images should provide enough visually descriptive power as well as diverse viewpoints. We develop a two-step approach to automatically select representative images satisfying the above two criteria. We first partition the images within a subevent into groups via spectral clustering.


Then, for each group we apply a manifold ranking algorithm, with the cross-media prior knowledge as initial ranking scores, to identify the top-ranked image as the representative.

Clustering Step: With the similarity matrix previously constructed in the noise removal step, the similarity matrix for the visual subset of a subevent can be directly obtained by extracting the columns and rows corresponding to the images in that subset. Then normalized cut is applied to the image set, and visual diversity is achieved across clusters.

Ranking Step: In order to discover the images with the best representative ability within each cluster, we adopt the manifold ranking algorithm to rank the images. Given the vector of ranking scores, manifold ranking defines an iterative update process as follows:

(10)

where the vector of initial ranking scores is an all-one vector in the standard manifold ranking setting. However, in our scenario, we expect the initial scores to carry prior knowledge of the importance of each image. Recall that with the CMLDA model, we have obtained the discriminative visual information for this subevent. Intuitively, if an image is more consistent with this visual information, it has better descriptive ability for the whole subevent image set and should gain more emphasis. Therefore, instead of an all-one vector, which weights all images equally, we express the initial scores as prior knowledge measured by the KL divergence between an image and the visual information of the subevent. By integrating the prior knowledge into the ranking scheme, the descriptive ability for the cluster as well as for the subevent image set are both taken into consideration. Note that the ranking score vector has a closed form when the update process converges:

(11)

Finally, the image with the largest ranking value is selected from each cluster to construct the visual summarization set.
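The ranking step can be sketched in a few lines of Python. The sketch assumes the commonly used manifold ranking update, in which the scores are repeatedly propagated over a symmetrically normalized similarity matrix and mixed with the initial scores through a damping factor `alpha`; whether this matches Eq. (10) term by term is an assumption. Turning a KL divergence into a prior through `exp(-KL)` is likewise a hypothetical choice, made only so that more consistent images start with larger scores.

```python
import math


def prior_from_kl(kl_values):
    """Hypothetical mapping: a smaller KL divergence gives a larger prior score."""
    return [math.exp(-d) for d in kl_values]


def manifold_rank(W, prior, alpha=0.85, iters=200):
    """Iterate f <- alpha * S f + (1 - alpha) * prior until (near) convergence.

    W is a symmetric, non-negative similarity matrix given as a list of lists;
    S is its symmetric normalization D^{-1/2} W D^{-1/2}.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    f = list(prior)
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * prior[i]
             for i in range(n)]
    return f


# Toy cluster of three images: image 0 is similar to both others
# and is also the most consistent with the subevent (KL = 0).
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.2],
     [1.0, 0.2, 0.0]]
scores = manifold_rank(W, prior_from_kl([0.0, 0.69, 0.69]))
representative = max(range(len(scores)), key=scores.__getitem__)
print(representative)
```

Because `alpha` is below 1, the iteration contracts toward a fixed point, which is what makes a closed-form solution available in the converged case.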

V. EXPERIMENTS

A. Dataset and Experimental Settings

We conducted the evaluation of our framework on two datasets that were collected by ourselves: (1) Social Trends, which includes 20 trending topics that were listed as hot trends in February 2013 by Sina Weibo. For each trending topic, we crawled the related microblogs in the life cycle of this topic using the trending topic API provided by Sina Weibo. The total number of microblogs is 310,097, of which 114,426 contain images; (2) Product Events, which was collected by Gao et al. [31]. It includes 20 product-related events and 13,932 microblogs, of which 11,736 contain images. The detailed event lists of both datasets are shown in the appendix. Due to the limited information appended to repost actions, only the original microblogs are included in our datasets. In order to evaluate the quality of the generated summaries, five volunteers were invited to individually and manually generate a textual summary for each event as the golden standard. Each manually generated summary consists of 50 microblogs selected from the microblog datasets.

In the text pre-processing procedure, we first segmented Chinese words using IKAnalyzer,3 and then removed the stop words, low-frequency words with a document frequency of less than 5, and mentions (@somebody) from the textual vocabulary. Texts containing fewer than 3 words were also eliminated. For visual feature extraction, scale-invariant feature transform (SIFT) descriptors were first extracted from each image. Then we trained a codebook of 1,000 visual words with descriptors sampled from images of all events. With the trained codebook, each descriptor was quantized into a visual word. Each image was further represented as a 1,000-dimensional normalized bag-of-visual-words feature vector.

3 [Online]. Available: http://code.google.com/p/ik-analyzer

Fig. 4. Effects of irrelevant image removal. (a) Social Trends. (b) Product Events.

When constructing the image similarity graph, we set the

number of nearest neighbors to 20 and the bandwidth parameter to 0.1. For the spectral filtering model in Eq. (1), we use eigenbases for label vector reconstruction and fix the remaining parameters. For the concentration parameters in the CMLDA model, as stated in [18], the more specific a distribution is meant to be, the smaller its parameter. Accordingly, the concentration parameters are set individually for each distribution. For the final representative image selection procedure, the parameter is set to 0.85. A size threshold is applied, and all subevents smaller than this threshold are removed. The total number of selected microblogs is chosen to be 50, which is the same as the number of microblogs in the golden standards. The quota of 50 microblogs is assigned to the remaining subevents in proportion to the number of microblogs in each subevent.
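The proportional quota assignment can be made concrete with a short Python sketch. The largest-remainder rounding used to make the quotas add up to exactly 50, the `min_size` value in the example, and the function name are assumptions for illustration; the text above only states that the quota is proportional to subevent size and that small subevents are removed.

```python
def allocate_quota(subevent_sizes, total=50, min_size=0):
    """Split a summary budget across subevents in proportion to their sizes.

    Subevents smaller than min_size are dropped first. Rounding uses the
    largest-remainder rule (an assumption) so that quotas sum to total.
    """
    kept = {name: n for name, n in subevent_sizes.items() if n >= min_size}
    size_sum = sum(kept.values())
    if size_sum == 0:
        return {}
    exact = {name: total * n / size_sum for name, n in kept.items()}
    quota = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(quota.values())
    by_remainder = sorted(kept, key=lambda name: exact[name] - quota[name], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    return quota


# Toy subevent sizes; "s4" falls below the hypothetical size threshold of 10.
quota = allocate_quota({"s1": 600, "s2": 300, "s3": 100, "s4": 5}, total=50, min_size=10)
print(quota)
```

With these toy sizes the three remaining subevents receive 30, 15, and 5 microblogs, respectively.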

B. Capability of Irrelevant Image Removal

We demonstrate the capability of our irrelevant image removal component in Fig. 4. The percentage of relevant images before and after the removal procedure is listed for each event. As mentioned above, the original image collections of all the social events contain many images that are irrelevant to the corresponding event. The average percentage of relevant images across all events is 67.1% and 69.9% for the two datasets, respectively. We apply spectral filtering on the image collection of each individual event separately. As shown in Eq. (2), one important factor that controls the performance of spectral filtering is the parameter of the function in that equation. There is a tradeoff between the performance of irrelevant image removal and the number of remaining images: in general, the higher the relevance percentage, the smaller the number of remaining images. In our framework, the quality of the image collections is crucial for the cross-media subtopic discovery and summarization. In our experiments, the controlling parameter is set to 0.5 for the Social Trends dataset, which results in a relatively high relevance percentage (88.4%) as well as a reasonable number of images (54,800 images, or 51% of the original collection size). Similarly, we set the parameter to 0.6 for Product Events, and obtained 6,570 remaining images, 91.5% of which are relevant. On average, our proposed method improves the percentage of relevant images by around 21 percentage points. Fig. 5 shows several examples of removed and remaining images for Event #1 of Social Trends and Event #13 of Product Events. As can be clearly seen, for both events, the exemplars with high rank orders are truly relevant to the corresponding events, while the images with low ranks are noisy.

Fig. 5. Illustrative examples of removed images and those remaining after irrelevant image removal.
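The tradeoff between relevance percentage and the number of remaining images can be illustrated with a small Python sketch. It assumes, purely for illustration, that every image carries an estimated relevance probability and that images below a cutoff are discarded; whether the controlling parameter of Eq. (2) acts exactly like such a cutoff is an assumption, and the scores and labels below are toy values, not data from the experiments.

```python
def filter_tradeoff(scores, is_relevant, cutoffs):
    """For each cutoff, report (cutoff, images kept, fraction of kept images that are relevant)."""
    rows = []
    for cutoff in cutoffs:
        kept = [rel for score, rel in zip(scores, is_relevant) if score >= cutoff]
        precision = sum(kept) / len(kept) if kept else 0.0
        rows.append((cutoff, len(kept), precision))
    return rows


# Toy relevance estimates and ground-truth labels (1 = relevant).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0]
rows = filter_tradeoff(scores, labels, cutoffs=[0.0, 0.5, 0.75])
for cutoff, kept, precision in rows:
    print(cutoff, kept, round(precision, 3))
```

Raising the cutoff keeps fewer images while the share of relevant ones grows, which mirrors the tradeoff described above.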

C. Summarization Performance

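In this subsection, we evaluate the effectiveness of our proposed framework as compared to several summarization approaches. For fairness of evaluation, we select 50 microblogs for all the compared approaches to form the summaries. As the evaluation metric, we employ the ROUGE evaluation toolkit [32], which automatically determines the quality of a summary as compared to human-generated golden standards. In particular, F-measure scores of ROUGE-1, ROUGE-2, ROUGE-W (with W set to 1.2), and ROUGE-SU are reported. Take ROUGE-N as an example. Denote the golden standards as $G$ and the generated summary as $S$. ROUGE-N-Recall is an N-gram recall metric computed as follows:

$$\text{ROUGE-N-Recall} = \frac{\sum_{gram_N \in G} \mathrm{Count}_{match}(gram_N)}{\sum_{gram_N \in G} \mathrm{Count}(gram_N)}$$

and ROUGE-N-Precision is an N-gram precision metric as follows:

$$\text{ROUGE-N-Precision} = \frac{\sum_{gram_N \in S} \mathrm{Count}_{match}(gram_N)}{\sum_{gram_N \in S} \mathrm{Count}(gram_N)}$$

where $\mathrm{Count}_{match}(gram_N)$ is the number of times an N-gram co-occurs in the generated summary and the golden standards. For the ROUGE-N value reported in our experimental results, we adopt the F score of the above recall-based and precision-based metrics:

$$\text{ROUGE-N} = \frac{2 \times \text{ROUGE-N-Recall} \times \text{ROUGE-N-Precision}}{\text{ROUGE-N-Recall} + \text{ROUGE-N-Precision}}$$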
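The N-gram recall, precision, and F score can be computed directly from token lists. The Python sketch below handles a single golden standard and uses clipped N-gram matches; it is a simplified illustration, not the ROUGE toolkit itself, and how scores are aggregated over several golden standards (for example by averaging) is left as an assumption outside the sketch.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the N-grams that occur in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (recall, precision, F score) of candidate against one reference."""
    ref, cand = ngram_counts(reference, n), ngram_counts(candidate, n)
    # Clipped matches: an N-gram counts at most as often as it occurs in both texts.
    match = sum(min(count, cand[gram]) for gram, count in ref.items())
    ref_total, cand_total = sum(ref.values()), sum(cand.values())
    recall = match / ref_total if ref_total else 0.0
    precision = match / cand_total if cand_total else 0.0
    denom = recall + precision
    f_score = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f_score


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n(reference, candidate, n=1))
print(rouge_n(reference, candidate, n=2))
```

In this toy pair, five of the six unigrams and three of the five bigrams of the reference are matched.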

We compare our proposal with the following multi-document summarization approaches.

• RANDOM: which selects all samples randomly.

• LSA [17]: which first conducts SVD on the sample-by-term matrix and then, starting from the most significant left singular vector, selects the samples with the highest entry values.

• NMF [19]: which performs NMF on the sample-by-term matrix and selects the samples that best represent the discovered bases.

• SNMF [22]: which first constructs the sample-sample similarity matrix, clusters all samples with Symmetric Non-negative Matrix Factorization (SNMF), and extracts the central sentences from the clusters.

• KMEANS [13]: which performs K-means clustering over the dataset and selects the samples nearest to the cluster centers.

• NCUT [33]: which is similar to KMEANS, but uses normalized cut as the clustering method.

Besides, the following text-based microblog summarization approaches are also compared.

• PR [34]: the Phrase Reinforcement algorithm, which generates summaries by looking for the most commonly occurring phrases.

• HTF-IDF [6]: which selects summary posts by their Hybrid TF-IDF weights and filters redundant posts with a similarity threshold.

• CLUSTER [6]: another method proposed by Inouye et al. [6]. Similar to the traditional clustering-based multi-document summarization approach, this method first conducts k-means to cluster the data samples. When selecting summary posts from each cluster, the above HTF-IDF is utilized to assign weights to the samples.

For our proposed approach, three specific variants are evaluated for comparison:

• MMES: the proposed multimedia social event summarization (MMES) framework, which uses both text and visual contents in building the CMLDA model.

• MMES-I: MMES without utilizing the visual information. In the subevent discovery stage, when applying the CMLDA model, all microblog samples are assumed to be comprised of texts only.

• MMES-R: MMES without the process of irrelevant image removal, where the whole noisy image collections are used. This is the method adopted in our previous work [11].


TABLE II
COMPARISON AMONG DIFFERENT SUMMARIZATION APPROACHES ON THE SOCIAL TRENDS DATASET. AVERAGE RESULTS OF THE 20 EVENTS ARE REPORTED FOR ALL EVALUATION MEASUREMENTS

The overall comparison of the proposed MMES, MMES-I, and MMES-R with other approaches is shown in Table II and Table III. In addition, the detailed ROUGE-1 performance for each event is shown in Fig. 6. For conciseness, only seven selected methods are shown in the figure. As can be seen from the results, the proposed MMES outperforms the other methods for all events as well as on all evaluation measurements. The good performance of MMES benefits from the following three aspects.

First of all, MMES explores the joint correlation between the textual and visual aspects of microblogs. The impact of multimedia knowledge can be demonstrated by comparing the results of MMES and MMES-I. The latter approach differs from MMES only in lacking the visual component. The performance gap illustrates the degradation of summarization ability when only a single media type is used. In addition, comparing the results of MMES and MMES-R (which uses the noisy image collections) clearly demonstrates the necessity of removing irrelevant images from the original datasets.

Secondly, MMES and MMES-I discover subevents before the summarization procedure. As a result, all important branches of an event are covered in the final summarization. Although some of the compared methods also consider the coverage of the summarization for the dataset, the coverage is only considered at the event level rather than the subevent level. In case a subevent contains a small number of microblogs, there is a high probability that the microblogs related to this subevent will be ignored by the compared methods. The high performance of MMES-I as compared to all the baseline methods demonstrates the effectiveness of subevent discovery for enhancing summarization performance.

Thirdly, three criteria are specified in MMES for generating the summary of each subevent, namely coverage, significance, and diversity. These three criteria further facilitate the summary generation. We conduct a further experiment to evaluate the effectiveness of each individual component by removing each of the three criteria from our framework. The results are shown in Table IV and Table V. MMES-C denotes the method using only significance and diversity, without taking coverage into consideration. Similarly, MMES-S is the method without considering significance, and MMES-D represents our method without considering diversity. For each of these variants, the parameter corresponding to the removed criterion was set to 0, while the parameters for the other two factors were kept unchanged (the parameter values are described in the next subsection). As can be seen, the performance degrades when any criterion is removed, which illustrates that all components are necessary for our framework.

An example of our summarization result is shown in Fig. 9. This is a summary of Event #1 of the Social Trends dataset. As shown, five subevents are discovered. Due to space limitations, only the top 3 images and top 5 texts for each subevent are listed. This example demonstrates the ability of our proposed framework in: 1) organizing the messy microblogs into structured subevents; 2) generating a high-quality textual summary at the subevent level; and 3) selecting the most representative images for summarizing the event.

TABLE III
COMPARISON AMONG DIFFERENT SUMMARIZATION APPROACHES ON THE PRODUCT EVENTS DATASET. AVERAGE RESULTS OF THE 20 EVENTS ARE REPORTED FOR ALL EVALUATION MEASUREMENTS

D. Parameter Tuning

The overall selection score is a weighted linear combination of the three criteria: coverage, significance, and diversity. In this part, we examine the effects of the corresponding weighting parameters to find the optimal parameter setting. Keeping the other two parameters fixed to 1, we vary the remaining one from 0 to 10 to examine its influence on the final results, and select the value that achieves the best F-score for the ROUGE values. After obtaining the corresponding values for the three parameters, we rescale them so that the three weighting parameters sum to 1. With this procedure, a separate set of weights is selected for the Social Trends dataset and for the Product Events dataset. To verify that the resulting combination is optimal, we further fix two of the parameters at their selected values and vary the third one. According to the results shown in Fig. 7, every parameter performs best at its selected value; e.g., one of the weights on the Social Trends dataset performs best at 0.4, which is consistent with our result and thus supports the tuned parameter values.

Fig. 6. Detailed performance (ROUGE-1) of MMES, MMES-I, and five selected comparing approaches over all events. (a) Social Trends. (b) Product Events.

TABLE IV
EFFECTS OF COVERAGE, SIGNIFICANCE AND DIVERSITY CRITERIA IN SUBEVENT DISCOVERY ON THE SOCIAL TRENDS DATASET

TABLE V
EFFECTS OF COVERAGE, SIGNIFICANCE AND DIVERSITY CRITERIA IN SUBEVENT DISCOVERY ON THE PRODUCT EVENTS DATASET

Another important parameter is the number of subevents. Fig. 8 shows the performance of MMES with various subevent numbers in terms of ROUGE-1. A very small number fails to achieve satisfactory performance, as the ability to discover subevents is not fully utilized in this situation. However, a large number does not lead to significant growth in the summarization performance, and may exert a negative influence. From a detailed observation of our dataset, we can see that the microblog discussion for the same event is usually limited to a few directions, which means the number of subevents will not be too large in our specific scenario. If we set the subevent number to an improperly large value, less important topic branches will be extracted, and the corresponding microblogs will be included in the final summary, which will hurt the summarization performance. Furthermore, too many subevents will violate the "concise" principle of summarization. Taking the above points into consideration, we set the subevent number to 10 for Social Trends and 7 for Product Events.
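The weight tuning procedure described at the beginning of this subsection (vary one weight from 0 to 10 while the other two stay at 1, keep the best value, then rescale the three values to sum to 1) can be sketched in Python as follows. The `evaluate` callback stands in for running the summarizer and computing a ROUGE F-score; it, the integer grid, and the toy objective in the example are hypothetical placeholders.

```python
def tune_weights(evaluate, grid=range(0, 11)):
    """Coordinate-wise search over three weights followed by normalization.

    evaluate(weights) must return a quality score (higher is better) for a
    3-tuple of weights; in practice it would wrap a ROUGE evaluation.
    """
    best_values = []
    for index in range(3):
        def score(value):
            weights = [1.0, 1.0, 1.0]  # the other two weights stay fixed to 1
            weights[index] = float(value)
            return evaluate(tuple(weights))
        best_values.append(float(max(grid, key=score)))
    total = sum(best_values)
    if total == 0:
        return (1 / 3, 1 / 3, 1 / 3)
    return tuple(value / total for value in best_values)


# Toy objective with a known optimum at (4, 3, 3) before normalization.
def toy_evaluate(weights):
    target = (4.0, 3.0, 3.0)
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


weights = tune_weights(toy_evaluate)
print(weights)
```

A one-at-a-time search like this is cheap but ignores interactions between the weights, which is why a follow-up check that varies each weight around the selected setting is a sensible complement.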

Fig. 7. Performance of the weighting parameters on the two datasets. (a) Social Trends. (b) Product Events.

Fig. 8. Summarization performance of MMES with various subevent numbers.


Fig. 9. Illustrative example of multimedia social event summarization on Event #1 in Social Trends dataset.

VI. CONCLUSION

In this paper, we present a multimedia social event summarization framework that automatically generates holistic visualized summaries from microblogs of various media types. The proposed framework features the exploration of the intrinsic correlations among different media types for enhancing the summarization performance. In particular, we develop three major stages to accomplish the summarization. First, we devise an effective approach for eliminating the potentially noisy images from the raw microblog image collection. Then, we propose a novel Cross-Media-LDA (CMLDA) model to discover subevents from microblogs of different media types. Finally, we generate a multimedia summary for social events utilizing the cross-media distribution knowledge of all the discovered subevents. We conducted extensive experiments on two real-world microblog datasets collected by ourselves to show the superiority of our proposed method as compared to the state-of-the-art approaches. In the future, we intend to extend the cross-media framework to automatically detect social events and retrieve related candidate microblogs. In addition, we will also explore personalized microblog summarization based on user profiles.


APPENDIX

The contents of the events in the two datasets are listed in Fig. 10.

Fig. 10. Contents of the events in the two datasets. (a) Event list of Social Trends dataset. (b) Event list of Product Events dataset.

REFERENCES

[1] T.-S. Chua, H. Luan, M. Sun, and S. Yang, “NExT: NUS-Tsinghua center for extreme search of user-generated content,” IEEE MultiMedia Mag., vol. 19, no. 3, pp. 81–87, Jul.-Sep. 2012.

[2] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users: Real-time event detection by social sensors,” in Proc. WWW, 2010, pp. 851–860.

[3] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. ICWSM, 2011.

[4] Y. Chen, H. Amiri, Z. Li, and T.-S. Chua, “Emerging topic detection for organizations from microblogs,” in Proc. SIGIR, 2013, pp. 43–52.

[5] K. Spärck Jones, “Automatic summarising: The state of the art,” Inf. Process. Manage., vol. 43, no. 6, pp. 1449–1481, 2007.

[6] D. Inouye and J. K. Kalita, “Comparing Twitter summarization algorithms for multiple post summaries,” in Proc. SocialCom, 2011, pp. 298–306.

[7] D. Chakrabarti and K. Punera, “Event summarization using tweets,” in Proc. ICWSM, 2011, pp. 66–73.

[8] B. Sharifi, M.-A. Hutton, and J. Kalita, “Summarizing microblogs automatically,” in Proc. NAACL HLT, 2010, pp. 685–688.

[9] Y. Yang, Z.-J. Zha, Y. Gao, X. Zhu, and T.-S. Chua, “Exploiting web images for semantic video indexing via robust sample-specific loss,” IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1677–1689, Oct. 2014.

[10] J. Bian, Y. Yang, and T.-S. Chua, “Predicting trending messages and diffusion participants in microblogging network,” in Proc. SIGIR, 2014, pp. 537–546.

[11] J. Bian, Y. Yang, and T.-S. Chua, “Multimedia summarization for trending topics in microblogs,” in Proc. CIKM, 2013, pp. 1807–1812.

[12] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova, “Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion,” Inf. Process. Manage., vol. 43, no. 6, pp. 1606–1618, 2007.

[13] D. R. Radev, H. Jing, M. Styś, and D. Tam, “Centroid-based summarization of multiple documents,” Inf. Process. Manage., vol. 40, no. 6, pp. 919–938, 2004.

[14] J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell, “Summarizing text documents: Sentence selection and evaluation metrics,” in Proc. SIGIR, 1999, pp. 121–128.

[15] R. Mihalcea and P. Tarau, “A language independent algorithm for single and multiple document summarization,” in Proc. IJCNLP, 2005.

[16] G. Erkan and D. R. Radev, “LexPageRank: Prestige in multi-document text summarization,” in Proc. EMNLP, 2004, pp. 365–371.

[17] Y. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis,” in Proc. SIGIR, 2001, pp. 19–25.

[18] A. Haghighi and L. Vanderwende, “Exploring content models for multi-document summarization,” in Proc. NAACL HLT, 2009, pp. 362–370.

[19] S. Park, J.-H. Lee, D.-H. Kim, and C.-M. Ahn, “Multi-document summarization based on cluster using non-negative matrix factorization,” in Proc. SOFSEM, 2007, pp. 761–770.

[20] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, “Document summarization using conditional random fields,” in Proc. IJCAI, 2007, vol. 7, pp. 2862–2867.

[21] J. M. Conroy and D. P. O’Leary, “Text summarization via hidden Markov models,” in Proc. SIGIR, 2001, pp. 406–407.

[22] D. Wang, T. Li, S. Zhu, and C. Ding, “Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization,” in Proc. SIGIR, 2008, pp. 307–314.

[23] C. Lin, C. Lin, J. Li, D. Wang, Y. Chen, and T. Li, “Generating event storylines from microblogs,” in Proc. CIKM, 2012, pp. 175–184.

[24] Y. Yang, Y. Yang, and H. Shen, “Effective transfer tagging from image to video,” TOMCCAP, vol. 9, no. 2, 2013.

[25] Y. Yang, Y. Yang, Z. Huang, H. Shen, and F. Nie, “Tag localization with spatial correlations and joint group sparsity,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 881–888.

[26] R. Yan, X. Wan, M. Lapata, W. X. Zhao, P.-J. Cheng, and X. Li, “Visualizing timelines: Evolutionary summarization via iterative reinforcement between text and image streams,” in Proc. CIKM, 2012, pp. 275–284.

[27] P. Li, J. Ma, and S. Gao, “Learning to summarize web image and text mutually,” in Proc. ICMR, 2012, p. 28.

[28] W. Liu, Y.-G. Jiang, J. Luo, and S.-F. Chang, “Noise resistant graph ranking for improved web image search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 849–856.

[29] T. Shi, M. Belkin, and B. Yu, “Data spectroscopy: Eigenspaces of convolution operators and clustering,” Ann. Statist., vol. 37, no. 6B, pp. 3960–3984, 2009.

[30] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.

[31] Y. Gao, F. Wang, H. Luan, and T.-S. Chua, “Brand data gathering from live social media streams,” in Proc. ICMR, 2014, p. 169.

[32] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out: ACL-04 Workshop, 2004, pp. 74–81.

[33] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.

[34] B. Sharifi, M.-A. Hutton, and J. K. Kalita, “Experiments in microblog summarization,” in Proc. SocialCom, 2010, pp. 49–56.

Jingwen Bian received the B.S. degree in computer science from Peking University, Beijing, China, in 2010, and is currently working toward the Ph.D. degree at the School of Computing, National University of Singapore, Singapore.
Her research interests include social network analysis and social media processing.

Yang Yang received the B.S. degree from Jilin University, Changchun, China, in 2006, the M.E. degree from Peking University, Beijing, China, in 2009, and the Ph.D. degree from The University of Queensland, Brisbane, Qld., Australia, in 2013.
He was a Research Fellow with the School of Computing, National University of Singapore. He is currently with the University of Electronic Science and Technology of China, Chengdu, China. His research interests include multimedia information retrieval, social media analysis, and machine learning.


Hanwang Zhang received the B.Eng. (Hons.) degree in computer science from Zhejiang University, Hangzhou, China, in 2009, and the Ph.D. degree in computer science from the National University of Singapore, Singapore, in 2014.
His main research interests include multimedia and computer vision, focusing on developing techniques for efficient search and recognition in image contents.

Tat-Seng Chua is currently the KITHCT Chair Professor with the School of Computing, National University of Singapore (NUS), Singapore, where he was the Acting and Founding Dean of the School of Computing from 1998 to 2000. He spent three years as a Research Staff Member with the Institute of Systems Science starting in 1980. He joined NUS in 1983. He is the Independent Director of two listed companies in Singapore. His main research interests include multimedia information retrieval, multimedia question answering, and the analysis and structuring of user-generated contents.
Dr. Chua has organized and served as a Program Committee Member of numerous international conferences in the areas of computer graphics, multimedia, and text processing. He was the Conference Co-Chair of ACM Multimedia in 2005, the Conference on Image and Video Retrieval in 2005, and the ACM SIGIR in 2008, and was the Technical PC Co-Chair of SIGIR in 2010. He serves on the Editorial Boards of the ACM Transactions on Information Systems, Foundations and Trends in Information Retrieval, The Visual Computer, and Multimedia Tools and Applications. He is on the Steering Committee of the International Conference on Multimedia Retrieval, Computer Graphics International, and Multimedia Modeling Conference Series. He serves as a Member of the international review panels of two large-scale research projects in Europe.
