Introduction
Motivation
Research Questions
Datasets: Collection Alignment
News Density in Wikipedia
Lag Analysis
Entity Lag
Event Lag
Conclusions
How much is Wikipedia lagging behind news?
Besnik Fetahu Abhijit Anand Avishek Anand
L3S Research Center, Leibniz Universität Hannover
July 1, 2015
1 Introduction
2 Motivation
3 Research Questions
4 Datasets: Collection Alignment
5 News Density in Wikipedia
6 Lag Analysis (Entity Lag, Event Lag)
7 Conclusions
Introduction
1 Wikipedia as a backbone for many real-world applications (e.g., search, entity disambiguation)
2 Real-world entities and events in Wikipedia with continuous evolution
3 Collaboratively created and edited encyclopedia
4 Entity and event pages as an aggregation of facts from multiple external sources (web pages, news, video transcriptions, etc.)
5 Constant trade-off between keeping up with data streams (i.e., daily news) and maintaining a fresh and consistent state for applications relying on Wikipedia
Motivation: Why Wikipedia and News?
Why Wikipedia?
• Text Categorization
• Entity Disambiguation
• Entity Search
• Knowledge Bases etc.
Why news?
• Authoritative sources
• Professionally edited, high-quality source of information
• Inherent importance of reported events and facts about entities in Wikipedia
• Second most cited source of information in Wikipedia
Research Questions: Aim of this analysis
Research Questions
1 What fraction of external references in entity pages are news articles?
2 How much does Wikipedia lag behind news articles? How has this lag evolved over time?
3 Which categories or classes of entities in news lead or lag Wikipedia?
4 How much do Wikipedia event pages lag behind the events reported by news articles?
5 What is the influence of reported events on the creation of entities in Wikipedia?
Datasets: Collection Alignment
Wikipedia
• 6 million articles (entities, events, etc.)
• Version history from 2001 to the present
• Categorized entities and events
• Rich editor network
News: New York Times
• 1.8 million news articles
• Daily news between 1987–2007
• 506k disambiguated entities (using TagMe!)
• Temporally aligned articles and entities
[Figure: Number of entities appearing in the corresponding years (2001 to 2007) in Wikipedia and in the NYT corpus; y-axis: frequency.]
(RQ1) News Density in Wikipedia
News Reference Density (NRD)
The NRD of an entity page is the fraction of news references over all references of all types in the page.
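The NRD definition can be illustrated with a minimal Python sketch. It assumes each reference on a page has already been labeled with a citation type such as "news" or "web"; the function name, the sample labels, and the choice to return 0.0 for a page without references are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter


def news_reference_density(reference_types):
    """Fraction of references labeled 'news' among all references of a page."""
    counts = Counter(reference_types)
    total = sum(counts.values())
    if total == 0:
        # Assumption: a page without any references gets an NRD of 0.0.
        return 0.0
    return counts["news"] / total


# Hypothetical entity page with four references, two of them news articles.
page_refs = ["news", "web", "news", "book"]
print(news_reference_density(page_refs))
```

Averaging this value over the pages of one entity category would give a per-category density of the kind plotted on this slide, although the exact aggregation used by the authors is not stated.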
[Figure: Share of reference types (news, book, court, journal, web, thesis) per entity category, from ComicsCreator to Location; y-axis from 0 to 1.]

cite type       #references
web             375596075
news            140432947
journal         8200496
book            3548469
court           48566
visual          32044
pressrelease    22308
thesis          19198
speech          17511
techreport      3345
(RQ1) NRD Dynamics in Wikipedia
Citation density across years (2009-2014)
[Figure: Citation density by reference type (news, journal, web, book, thesis, court) for the entity categories Criminal, Automobile, OfficeHolder, SoccerManager, LegalCase, Election, and Location, shown in three panels; y-axis from 0 to 1.]
(RQ2) Entity Lag
Entity Lag
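Entity lag, lag(e), is the delay between the first appearance of an entity page in Wikipedia and the first mention of the entity in a news article:

$$\mathrm{lag}(e) = t_w - t_n$$

where $t_w$ is the time of the first revision of the Wiki page and $t_n$ is the time of the first news article mentioning $e$. The lag is grouped into three classes:

$$\text{lag class}(e) = \begin{cases} \text{low}, & \mathrm{lag}(e) \le 30\ \text{d} \\ \text{medium}, & \mathrm{lag}(e) \le 12\ \text{m} \\ \text{high}, & \mathrm{lag}(e) > 1\ \text{y} \end{cases}$$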
Example: Angela Merkel
Wikipedia: first revision on 17 August 2002.
NYT: first appearance on 5 January 2001 (counting only articles published since Wikipedia started).
Note: before 2001, 58 NYT news articles already mentioned Angela Merkel.
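As a rough illustration of the definition lag(e) = tw - tn and the Angela Merkel dates above, the following Python sketch computes the lag in days and assigns a lag class. The function names are hypothetical. Two assumptions go beyond the slides: "12 m" and "1 y" are both treated as 365 days, and the class thresholds are applied to the magnitude of the lag so that negative lags (Wikipedia ahead of news) are classified too.

```python
from datetime import date

LOW_MAX_DAYS = 30
# Assumption: the slide's "12 m" and "1 y" bounds are both taken as 365 days.
MEDIUM_MAX_DAYS = 365


def entity_lag_days(first_wiki_revision, first_news_mention):
    """lag(e) = t_w - t_n in days; positive when Wikipedia trails the news."""
    return (first_wiki_revision - first_news_mention).days


def lag_class(lag_days):
    """Map a lag in days to 'low', 'medium', or 'high'."""
    # Assumption: thresholds apply to the absolute value of the lag.
    magnitude = abs(lag_days)
    if magnitude <= LOW_MAX_DAYS:
        return "low"
    if magnitude <= MEDIUM_MAX_DAYS:
        return "medium"
    return "high"


# Dates taken from the Angela Merkel example on the slide.
merkel_lag = entity_lag_days(date(2002, 8, 17), date(2001, 1, 5))
print(merkel_lag, lag_class(merkel_lag))
```

With these dates the Wikipedia page appears more than a year after the first counted NYT mention, so the entity falls into the high-lag class.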
[Figure: Histograms of entity lag in months (-11 to 11), one panel per year from 2001 to 2006, each with the series EE and NEE; y-axis: number of entities.]
Entity lag in months. The emergent entities are shown in red; they are determined by filtering out all entities from the NYT subset that already appear in years before 2001.
(RQ3) Lag for Entity Categories
[Figure: Lag distribution of different entity types over the lag classes high-pos, high-neg, low-pos, and low-neg; y-axis from 0 to 1. (a) Overall: Person, Organisation, Work, Place, Others. (b) Person: athlete, musical artist, politician, scientist.]
(RQ4) Event Lag
Event Definition
An Event Page is a Wikipedia article that refers to a real-world event, e.g., U.S. Elections 2004.
[Figure: Event news reference lag (in years, -5 to 5) in Wikipedia.] Most Wikipedia events fall into the low-lag class, showing the high dynamics of reporting real news events in Wikipedia.
(RQ5) Emerging Entities in Event Pages
Emerging Entity Density in Event Pages
Entities that were created after the event page are referred to as emerging entities; the emerging entity density of an event page is the fraction of its entities that are emerging.
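The density described above can be sketched in a few lines of Python. The sketch assumes that the creation date of the event page and of every entity page it mentions is known; the function name, the sample dates, and the 0.0 result for an event page without entities are illustrative assumptions.

```python
from datetime import date


def emerging_entity_density(event_page_created, entity_created_dates):
    """Fraction of mentioned entities whose page was created after the event page."""
    if not entity_created_dates:
        # Assumption: an event page that mentions no entities has density 0.0.
        return 0.0
    emerging = sum(1 for created in entity_created_dates if created > event_page_created)
    return emerging / len(entity_created_dates)


# Hypothetical event page created on 2 November 2004 that mentions four entities.
event_created = date(2004, 11, 2)
entity_dates = [date(2003, 1, 1), date(2005, 3, 1), date(2004, 11, 2), date(2006, 1, 1)]
print(emerging_entity_density(event_created, entity_dates))
```

An entity page created on the same day as the event page is not counted as emerging here; whether the original analysis treats ties this way is not stated.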
[Figure: (c) Emerging entity density in event pages (entities created after events) per year, 2001 to 2010; y-axis from 0 to 1. (d) Emerging entity categories (Person, Organisation, Work, Place) per year, 2001 to 2010.]
Conclusions
1 Approximately 20% of all external references in entity pages are news articles.
2 The bootstrapping period of Wikipedia takes roughly 3 years.
3 Wikipedia establishes itself as an information source only after 3 years.
4 Entity lag follows a distinct normal distribution and shows that Wikipedia has been catching up on news ever since it was introduced.
5 Unlike entities, events are quickly reflected in Wikipedia as soon as they are reported in news.
6 Events are responsible for the creation of emergent entities, with 12% of the entities mentioned in event pages being created after the creation of the event page.
Thank you! Questions?
e-mail: [email protected]
twitter: @FetahuBesnik
Limitations
• Lag distribution may vary across different localized Wikipedias and news collections.
• Entity linking and disambiguation tools are trained on specific Wikipedia snapshots, hence entities with temporal roles may be incorrectly linked.
• The remaining portion of ‘web’ references remains unanalyzed due to their lack of quality (language, format, authority, etc.)