

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1943–1948, Lisbon, Portugal, 17–21 September 2015. ©2015 Association for Computational Linguistics.

An Unsupervised Bayesian Modelling Approach to Storyline Detection from News Articles

Deyu Zhou†‡   Haiyang Xu†   Yulan He§

† School of Computer Science and Engineering, Southeast University, China
‡ State Key Laboratory for Novel Software Technology, Nanjing University, China
§ School of Engineering and Applied Science, Aston University, UK

[email protected], [email protected], [email protected]

Abstract

Storyline detection from news articles aims at summarizing events described under a certain news topic and revealing how those events evolve over time. It is a difficult task because it requires first the detection of events from news articles published in different time periods and then the construction of storylines by linking events into coherent news stories. Moreover, each storyline has different hierarchical structures which are dependent across epochs. Existing approaches often ignore the dependency of hierarchical structures in storyline generation. In this paper, we propose an unsupervised Bayesian model, called the dynamic storyline detection model, to extract structured representations and evolution patterns of storylines. The proposed model is evaluated on a large scale news corpus. Experimental results show that our proposed model outperforms several baseline approaches.

1 Introduction

The rapid development of online news media sites is accompanied by the generation of tremendous numbers of news reports. Facing such a massive amount of news articles, it is crucial to develop an automated tool which can provide a temporal summary of events and their evolution related to a topic in news reports. Therefore, storyline detection, aiming at summarising the development of certain related events, has been studied in order to help readers quickly understand the major events reported in news articles, and it has attracted great attention recently. Kawamae (2011) proposed a trend analysis model which used the difference between temporal words and other words in each document to detect topic evolution over time. Ahmed et al. (2011) proposed a unified framework to group temporally and topically related news articles into the same storylines in order to reveal the temporal evolution of events. Tang and Yang (2012) developed a topic-user-trend model, which incorporates user interests into the generative process of web contents. Radinsky and Horvitz (2013) built storylines based on text clustering and entity entropy to predict future events. Huang and Huang (2013) developed a mixture-event-aspect model to model sub-events into local and global aspects and utilized an optimization method to generate storylines. Wang et al. (2013) proposed an evolutionary multi-branch tree clustering method for streaming text data, in which the tree construction is cast as an online posterior estimation problem considering both the current tree and the previous tree simultaneously.

With the fast development of social media platforms, newsworthy events are widely scattered not only on traditional news media but also on social media (Zhou et al., 2015). For example, Twitter, one of the most widely adopted social media platforms, appears to cover nearly all newswire events (Petrovic et al., 2013). Therefore, approaches have also been proposed for storyline summarization on social media. Given a user input query of an ongoing event, Lin et al. (2012) extracted the storyline of an event by first obtaining relevant tweets and then generating storylines via graph optimization. In (Li and Li, 2013), an evolutionary hierarchical Dirichlet process was proposed to capture the topic evolution pattern in storyline summarization.

However, most of the aforementioned approaches do not represent events in a structured form. More importantly, they ignore the dependency between the hierarchical structures of events at different epochs in a storyline. In this paper, we propose a dynamic storyline detection model to overcome the above limitations.



We assume that each document could belong to one storyline $s$, which is modelled as a joint distribution over some named entities $e$ and a set of topics $z$. Furthermore, to link events at different epochs and detect different types of storylines, the weighted sum of the storyline distributions of previous epochs is employed as the prior of the current storyline distribution. The proposed model is evaluated on a large scale news corpus. Experimental results show that our proposed model outperforms several baseline approaches.

2 Methodology

To model the generation of a storyline in consecutive time periods for a stream of documents, we propose an unsupervised latent variable model, called the dynamic storyline detection model (DSDM). The graphical model of DSDM is shown in Figure 1.

[Figure 1: plate diagram of DSDM, showing the storyline indicator $s$, topic assignments $z$, words $w$ and named entities $e$, with plates over the $D$ documents, $S$ storylines, $K$ topics, $E_d$ entities and $N_d$ word positions per document, and the $M$ previous epochs.]

Figure 1: The Dynamic Storyline Detection model.

In this model, we assume that the storyline-topic-word, storyline-topic and storyline-entity probabilities at time $t$ are dependent on the previous storyline-topic-word, storyline-topic and storyline-entity distributions in the last $M$ epochs. For a certain period of time, we assume that each document could belong to one storyline $s$, which is modelled as a joint distribution over some named entities $e$ and a set of topics $z$. This assumption essentially encourages documents published around a similar time that involve the same named entities and discuss similar topics to be grouped into the same storyline. As the storyline distribution is shared across documents with the same named entities and similar topics, it essentially preserves the ambiguity that, for example, documents comprising the same person and location may or may not belong to the same storyline.

The generative process of DSDM is shown below. For each time period $t$ from 1 to $T$:

• Draw a distribution over storylines $\pi^t \sim \text{Dirichlet}(\gamma^t)$.
• For each storyline $s \in \{1...S\}$:
  – Draw a distribution over topics $\theta^t_s \sim \text{Dirichlet}(\alpha^t_s)$.
  – Draw a distribution over named entities $\omega^t_s \sim \text{Dirichlet}(\epsilon^t_s)$.
  – For each topic $k \in \{1...K\}$, draw a word distribution $\phi^t_{s,k} \sim \text{Dirichlet}(\beta^t_{s,k})$.
• For each document $d \in \{1...D\}$:
  – Choose a storyline indicator $s^t_d \sim \text{Multinomial}(\pi^t)$.
  – For each named entity position $e \in \{1...E_d\}$: choose a named entity $e \sim \text{Multinomial}(\omega^t_s)$.
  – For each of the other word positions $n \in \{1...N_d\}$: choose a topic $z_n \sim \text{Multinomial}(\theta^t_s)$, then choose a word $w_n \sim \text{Multinomial}(\phi^t_{s,z_n})$.
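To make the generative story concrete, the following is a minimal Python/NumPy sketch simulating one epoch of DSDM. Symmetric priors stand in for the history-dependent priors of Equations (1)-(3) below, and all sizes and variable names are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): storylines, topics, word vocab, entity vocab
S, K, V, E = 4, 3, 50, 20
D = 10                                 # documents in the current epoch

# Symmetric priors stand in for the dynamic priors of Eqs. (1)-(3)
gamma, alpha, beta, eps = 0.1, 0.1, 0.01, 0.01

pi = rng.dirichlet(np.full(S, gamma))               # storyline proportions
theta = rng.dirichlet(np.full(K, alpha), size=S)    # storyline-topic
omega = rng.dirichlet(np.full(E, eps), size=S)      # storyline-entity
phi = rng.dirichlet(np.full(V, beta), size=(S, K))  # storyline-topic-word

docs = []
for d in range(D):
    s = rng.choice(S, p=pi)                         # one storyline per document
    entities = rng.choice(E, size=3, p=omega[s])    # named-entity slots
    z = rng.choice(K, size=8, p=theta[s])           # one topic per word slot
    words = np.array([rng.choice(V, p=phi[s, k]) for k in z])
    docs.append((s, entities, z, words))
```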

We define an evolutionary matrix of storyline indicator $s$ and topic $z$, $\sigma^t_{s,z}$, where each column $\sigma^t_{s,z,m}$ denotes the storyline-topic-word distribution of storyline indicator $s$ and topic $z$ at epoch $m$; an evolutionary topic matrix of storyline indicator $s$, $\tau^t_s$, where each column $\tau^t_{s,m}$ denotes the storyline-topic distribution of storyline indicator $s$ at epoch $m$; and an evolutionary entity matrix of storyline indicator $s$, $\upsilon^t_s$, where each column $\upsilon^t_{s,m}$ denotes the storyline-entity distribution of storyline indicator $s$ at epoch $m$.

We attach a vector of $M+1$ weights $\mu^t_{s,z} = \{\mu^t_{s,z,m}\}_{m=0}^{M}$ ($\mu^t_{s,z,m} > 0$, $\sum_{m=0}^{M} \mu^t_{s,z,m} = 1$), whose components represent the weights that each $\sigma^t_{s,z,m}$ contributes to calculating the priors of $\phi^t_{s,z}$. We do the same for $\theta^t_s$ and $\omega^t_s$. The Dirichlet priors for the storyline-topic-word distribution, the storyline-topic distribution and the storyline-entity distribution at epoch $t$ are, respectively:

$$\beta^t_{s,z} = \sum_{m=0}^{M} \mu^t_{s,z,m} \, \sigma^t_{s,z,m} \quad (1)$$

$$\alpha^t_s = \sum_{m=0}^{M} \mu^t_{s,m} \, \tau^t_{s,m} \quad (2)$$

$$\epsilon^t_s = \sum_{m=0}^{M} \mu^t_{s,m} \, \upsilon^t_{s,m} \quad (3)$$

In our experiments, the weight parameters are set to be the same regardless of storylines or topics. They only depend on the time window, via an exponential decay function $\mu_m = \exp(-0.5\,m)$, where $m$ stands for the $m$th epoch counting backwards in the past $M$ epochs. That is, more recent documents have a relatively stronger influence on the model parameters in the current epoch than earlier documents. It is also possible to estimate the weights directly from data; we leave this as future work.

The storyline-topic-word distribution $\phi^t_{s,z}$, the storyline-topic distribution $\theta^t_s$ and the storyline-entity distribution $\omega^t_s$ at the current epoch $t$ are generated from Dirichlet distributions parameterized by $\beta^t_{s,z}$, $\alpha^t_s$ and $\epsilon^t_s$: $\phi^t_{s,z} \sim \text{Dir}(\beta^t_{s,z})$, $\theta^t_s \sim \text{Dir}(\alpha^t_s)$, $\omega^t_s \sim \text{Dir}(\epsilon^t_s)$. With this formulation, we ensure that the mean of the Dirichlet parameter for the current epoch is proportional to the weighted sum of the word, topic and entity distributions at previous epochs.
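To make Equations (1)-(3) and the subsequent Dirichlet draws concrete, here is a small NumPy sketch. The paper states that the weights sum to one and uses $\mu_m = \exp(-0.5\,m)$, so the explicit normalisation below, like all sizes and names, is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def decay_weights(M):
    """mu_m proportional to exp(-0.5*m) for m = 0..M, normalised to sum
    to one (our assumption, matching the constraint on the weights)."""
    mu = np.exp(-0.5 * np.arange(M + 1))
    return mu / mu.sum()

def dynamic_prior(history):
    """Weighted sum over the last M+1 epochs, as in Eqs. (1)-(3).
    history[m] is the distribution estimated m epochs back (m = 0 is now)."""
    mu = decay_weights(len(history) - 1)
    return np.tensordot(mu, history, axes=1)

# e.g. storyline-topic-word history for one (s, z) pair: shape (M+1, V)
V, M = 50, 7
sigma_history = rng.dirichlet(np.full(V, 0.1), size=M + 1)
beta_t = dynamic_prior(sigma_history)   # Eq. (1): prior for the current epoch
phi_t = rng.dirichlet(beta_t)           # phi ~ Dir(beta_t)
```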

3 Inference and Parameter Estimation

We use collapsed Gibbs sampling (Griffiths and Steyvers, 2004) to infer the parameters of the model given the observed data $\mathcal{D}$. Gibbs sampling is a Markov chain Monte Carlo method which allows us to repeatedly draw samples from a Markov chain whose stationary distribution is the posterior of interest, here over $s^t_d$ and $z^t_{d,n}$, by sampling each variable from its distribution given the current values of all other variables and the data. Such samples can be used to empirically estimate the target distribution. Letting the subscript $-d$ denote a quantity that excludes counts in document $d$, the conditional posterior for $s_d$ is:

$$P(s^t_d = j \mid \boldsymbol{s}^t_{-d}, \boldsymbol{z}, \boldsymbol{w}, \Lambda) \propto \frac{\{N_j\}_{-d} + \gamma}{D_{-d} + S\gamma} \times \prod_{e=1}^{E} \frac{\prod_{b=1}^{n^{(d)}_{j,e}} \left(n_{j,e} - b + \epsilon^t_{j,e}\right)}{\prod_{b=1}^{n^{E(d)}_{j}} \left(n^E_j - b + \sum_{e=1}^{E} \epsilon^t_{j,e}\right)} \times \prod_{k=1}^{K} \frac{\prod_{b=1}^{n^{(d)}_{j,k}} \left(n_{j,k} - b + \alpha^t_{j,k}\right)}{\prod_{b=1}^{n^{(d)}_{j}} \left(n_j - b + \sum_{k=1}^{K} \alpha^t_{j,k}\right)} \times \prod_{v=1}^{V} \frac{\prod_{b=1}^{n^{(d)}_{j,k,v}} \left(n_{j,k,v} - b + \beta^t_{j,k,v}\right)}{\prod_{b=1}^{n^{(d)}_{j,k}} \left(n_{j,k} - b + \sum_{v=1}^{V} \beta^t_{j,k,v}\right)},$$

where $N_j$ denotes the number of documents assigned to storyline indicator $j$ in the whole corpus, $D$ is the total number of documents, $n_{j,e}$ is the number of times named entity $e$ is assigned storyline indicator $j$, $n^E_j$ denotes the total number of named entities with storyline indicator $j$ in the document collection, $n_{j,k}$ is the number of times a word is assigned topic label $k$ and storyline indicator $j$, $n_j$ is the total number of words (excluding named entities) in the corpus with storyline indicator $j$, $n_{j,k,v}$ is the number of occurrences of word $v$ with storyline indicator $j$ and topic label $k$ in the document collection, and counts carrying the $(d)$ superscript relate to document $d$ only.

Letting the index $x = (d, n)$ denote the $n$th word in document $d$ and the subscript $-x$ denote a quantity that excludes data from the $n$th word position in document $d$, we sample a topic $z_x$ only if the $n$th word is not a named entity, based on the following conditional posterior:

$$P(z^t_x = k \mid s_d = j, \boldsymbol{z}_{-x}, \boldsymbol{w}, \Lambda) \propto \frac{\{n^t_{j,k}\}_{-x} + \alpha^t_{j,k}}{\{n_j\}_{-x} + \sum_{k=1}^{K} \alpha^t_{j,k}} \times \frac{\{n^t_{j,k,w_n}\}_{-x} + \beta^t_{j,k,v}}{\{n^t_{j,k}\}_{-x} + \sum_{v=1}^{V} \beta^t_{j,k,v}}.$$

Once the latent variables $s$ and $z$ are known, we can easily estimate the model parameters $\pi$, $\theta$, $\phi$ and $\omega$. We set the hyperparameters $\alpha = \gamma = 0.1$ and $\beta = \epsilon = 0.01$ for the current epoch (i.e., $m = 0$), gather statistics in the previous 7 epochs (i.e., $M = 7$) to set the Dirichlet priors for the storyline-topic-word distribution $\phi^t_{s,z}$, the storyline-topic distribution $\theta^t_s$ and the storyline-entity distribution $\omega^t_s$ in the current epoch $t$, and run the Gibbs sampler for 1000 iterations, stopping early once the log-likelihood of the training data converges under the learned model.
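As a rough illustration of how the counts in the two conditionals above drive the sampler, here is a simplified collapsed-Gibbs sketch for the topic assignments $z$ only, with storyline assignments held fixed and symmetric priors; the storyline-indicator and entity factors are omitted, and all function and variable names are hypothetical.

```python
import numpy as np

def sample_topics(docs, s_of, S, K, V, alpha=0.1, beta=0.01, iters=100, seed=0):
    """docs[d] is a list of word ids; s_of[d] is the fixed storyline of doc d."""
    rng = np.random.default_rng(seed)
    n_jk = np.zeros((S, K))       # words with storyline j and topic k
    n_jkv = np.zeros((S, K, V))   # occurrences of word v with storyline j, topic k
    z = []
    for d, words in enumerate(docs):          # random initialisation
        j = s_of[d]
        zd = rng.integers(K, size=len(words))
        z.append(zd)
        for w, k in zip(words, zd):
            n_jk[j, k] += 1
            n_jkv[j, k, w] += 1
    for _ in range(iters):
        for d, words in enumerate(docs):
            j = s_of[d]
            for n, w in enumerate(words):
                k = z[d][n]                   # remove the current assignment
                n_jk[j, k] -= 1
                n_jkv[j, k, w] -= 1
                # topic part x word part of the second conditional posterior
                p = (n_jk[j] + alpha) / (n_jk[j].sum() + K * alpha) \
                    * (n_jkv[j, :, w] + beta) / (n_jk[j] + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                   # add the new assignment back
                n_jk[j, k] += 1
                n_jkv[j, k, w] += 1
    return z
```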

4 Experiments

4.1 Dataset

We crawled and parsed the GDELT Event Database1, which contains news articles published in May 2014. We manually annotated one week of data, containing 101,654 documents, and identified 77 storylines for evaluation. We also report the results of our model on the one-month data containing 526,587 documents, but we only report the precision and not the recall of the extracted storylines, since it is time-consuming to identify all the true storylines in such a large dataset. In our experiments, we used the Stanford Named Entity Recognizer to identify the named entities. In addition, we removed common stopwords and only kept tokens which are verbs, nouns or adjectives in these news articles.

1 http://data.gdeltproject.org/events/index.html
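For illustration, a preprocessing pipeline along these lines could look as follows. The paper used the Stanford Named Entity Recognizer; this sketch substitutes spaCy purely for convenience, and the choice of entity types to keep is our assumption.

```python
import spacy

# spaCy stands in here for the Stanford NER used in the paper
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    # named entities; the type filter is an illustrative assumption
    entities = [ent.text for ent in doc.ents
                if ent.label_ in {"PERSON", "ORG", "GPE"}]
    # drop stopwords; keep only verbs, nouns and adjectives
    tokens = [tok.lemma_.lower() for tok in doc
              if tok.pos_ in {"VERB", "NOUN", "ADJ"} and not tok.is_stop]
    return entities, tokens
```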

4.2 Baselines

We chose the following three methods as the baseline approaches.

1. K-Means + Cosine Similarity (KMCS): this method first applies K-Means to cluster news documents for each day, then links storylines detected on different days based on the cosine similarity measure.

2. LDA + Cosine Similarity (LDCS): this method first splits news documents on a daily basis, then applies the Latent Dirichlet Allocation (LDA) model to detect the latent storylines for the documents in each day, in which each storyline is modelled as a joint distribution over named entities and words, and finally links storylines detected on different days using the cosine similarity measure.

3. Dynamic LDA (DLDA)2: this is the dynamic LDA model (Blei and Lafferty, 2006), in which the topic-word distributions are linked across epochs based on the Markovian assumption; that is, the topic-word distribution at the current epoch is only influenced by the topic-word distribution in the previous epoch.

4.3 Evaluation Metric

To evaluate the performance of the proposed approach, we use precision, recall and F-score, which are commonly used in evaluating information extraction systems. The precision is calculated based on the following criteria: 1) the entities and keywords extracted refer to the same storyline; 2) the duration of the storyline is correct. We assume that the start date (or end date) of a storyline is the publication date of the first (or last) news article about it.
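Counting which extracted storylines are correct under these two criteria is done manually against the annotation; given those counts, the scores reduce to the standard definitions, as in this small sketch (names are ours):

```python
def prf(correct: int, extracted: int, gold: int) -> tuple:
    """Standard precision/recall/F-score; 'correct' is the number of
    extracted storylines matching an annotated one under criteria 1) and 2)."""
    p = correct / extracted
    r = correct / gold
    return p, r, 2 * p * r / (p + r)
```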

4.4 Experimental Results

The proposed model is compared against the baseline approaches on the annotated one-week data, which contain 77 storylines. The number of storylines, S, and the number of topics, K, are both set to 100. The number of historical epochs, M, which is taken into account for setting the Dirichlet priors for the storyline-topic-word, the storyline-topic and the storyline-entity distributions, is set to 7. The evaluation results of our proposed approach in comparison to the three baselines are presented in Table 1.

2 The topic number is set to 100 for both DLDA and LDCS; the cluster number is also set to 100 for KMCS.

Method   Precision   Recall   F-score
KMCS     22.73       32.47    26.74
LDCS     34.29       31.17    32.66
DLDA     62.67       61.03    61.84
DSDM     70.67       68.80    69.27

Table 1: Performance comparison of the storyline extraction results in terms of Precision (%), Recall (%) and F-score (%).

It can be observed from Table 1 that simply using K-Means to cluster news articles in each day and linking similar stories across different days in the hope of identifying storylines gives the worst results. Using LDA to detect stories in each day improves the precision dramatically. The dynamic LDA model assumes that topics (or stories) in the current epoch evolve from the previous epoch, and further improves the storyline detection results significantly. Our proposed model aims to capture long-distance dependencies, in which the statistics gathered in the past 7 days are taken into account to set the Dirichlet priors of the storyline-topic-word, storyline-topic and storyline-entity distributions in the current epoch. It gives the best performance and outperforms dynamic LDA by nearly 7% in F-measure.

To study the impact of the number of topics on the performance of the proposed model, we conducted experiments on the one-month data with the number of topics varying between 100 and 200. In all these experiments, the number of storylines, S, is set to 200, based on the speculation that about 40 storylines in the annotated one-week data last for one month and about 40 new storylines occur each week. Table 2 shows the precision of the proposed method under different numbers of topics. It can be observed that the performance of the proposed approach is quite stable across different numbers of topics.

K           100     150     200
Precision   69.6%   70.2%   69.9%

Table 2: The precision of our method with various numbers (K) of topics.



[Figure 2 shows six daily panels, May 1 to May 6, each listing the storyline's top entities (e.g., Samsung, Apple, Google, U.S., San Jose) and top topic keywords (e.g., patent, trial, jury, verdict, infringe).]

Figure 2: Storyline about the patent infringement case between Apple and Samsung extracted by the proposed model.

4.5 Structured Browsing

We illustrate the evolution of storylines using structured browsing, from which the structured information (entities, topics, keywords) about storylines and the duration of storylines can be easily observed. Figure 2 shows the storyline about "the patent infringement case between Apple and Samsung". It can be observed that in the first two days, the hierarchical structure consists of entities (Apple, Samsung) and keywords (trial, patent, infringe). The case gained significant attention in the next three days, when a US jury ordered Samsung to pay Apple $119.6 million. The stories in these three days also consist of entities (Apple, Samsung), but with different keywords (award, patent, win). The last day's story gives an overall summary and consists of entities (Apple, Samsung) and keywords (jury, patent, company).

To further investigate the storylines detected by the proposed model, we randomly selected three detected storylines. The first one is about "the patent infringement case between Apple and Samsung"; it is a short-term storyline lasting for 6 days, as shown in Figure 3. The second one is about "India election", a long-term storyline lasting for one month. The third one is about "Pistorius shoot Steenkamp", an intermittent storyline lasting for a total of 22 days but with no relevant news reports on certain days, as shown in Figure 3. It can be observed that the proposed model can detect not only continuous but also intermittent storylines, which further demonstrates its advantage.

[Figure 3 plots, for each day of the month (x-axis: date; y-axis: number of documents related to the storyline, 0-400), the document counts for the three storylines: Apple VS Samsung, India Election, and Pistorius shoot Steenkamp.]

Figure 3: The number of documents on each day relating to the three storylines.

5 Conclusions and Future Work

In this paper, we have proposed an unsupervised Bayesian model to extract storylines from a news corpus. Experimental results show that our proposed model is able to extract both continuous and intermittent storylines and outperforms a number of baselines. In future work, we will consider modelling background topics explicitly and investigate more principled ways of setting the weight parameters of the statistics gathered in the historical epochs. Moreover, we will also explore the impact of different scales of dependency on historical epochs on the distributions of the current epoch.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (61528302), the State Education Ministry, the Fundamental Research Funds for the Central Universities, and Innovate UK under grant number 101779.



References

Amr Ahmed, Qirong Ho, Jacob Eisenstein, Eric Xing, Alexander J. Smola, and Choon Hui Teo. 2011. Unified analysis of streaming news. In Proceedings of the 20th International Conference on World Wide Web, pages 267–276. ACM.

David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120. ACM.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101 (Suppl. 1), pages 5228–5235.

Lifu Huang and Lian'en Huang. 2013. Optimized event storyline generation based on mixture-event-aspect model. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 726–735. ACL.

Noriaki Kawamae. 2011. Trend analysis model: trend consists of temporal words, topics, and timestamps. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 317–326. ACM.

Jiwei Li and Sujian Li. 2013. Evolutionary hierarchical Dirichlet process for timeline summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 556–560. ACL.

Chen Lin, Chun Lin, Jingxuan Li, Dingding Wang, Yang Chen, and Tao Li. 2012. Generating event storylines from microblogs. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 175–184. ACM.

Sasa Petrovic, Miles Osborne, Richard McCreadie, Craig Macdonald, Iadh Ounis, and Luke Shrimpton. 2013. Can Twitter replace newswire for breaking news? In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media.

Kira Radinsky and Eric Horvitz. 2013. Mining the web to predict future events. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 255–264. ACM.

Xuning Tang and Christopher C. Yang. 2012. TUT: a statistical model for detecting trends, topics and user interests in social media. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 972–981. ACM.

Xiting Wang, Shixia Liu, Yangqiu Song, and Baining Guo. 2013. Mining evolutionary multi-branch trees from text streams. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 722–730. ACM.

Deyu Zhou, Liangyu Chen, and Yulan He. 2015. An unsupervised framework of exploring events on Twitter: filtering, extraction and categorization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI 2015), Austin, Texas, USA, January 25–30, 2015, pages 2468–2474.
