1
On the Value of Temporal Anchor Texts in
Wikipedia
Nattiya Kanhabua and Wolfgang Nejdl
L3S Research CenterHannover, Germany
SIGIR 2014 Workshop on Temporal, Social and Spatially-aware Information Access (#TAIA2014), Gold Coast, Australia
2
Motivation Entity Evolution: Implications for IR Our Approach: Model and Ranking Experimental Results Conclusions and Future Work
Outline
3
Computer Science and interdisciplinary research on all aspects of the Web
Information Access: Knowledge on and through the Web
Community: Supporting communities for research, education, production and entertainment
Society: Requirements (technological, social, legal) for the Web
Internet: Communication and Networks
Selected projects
Web Science @ L3STemporal Retrieval,
Exploration and Analytics in Web Archives
LivingKnowledge: Diversity, opinion and
bias on the Web
CUbRIK: Searching by computers and humans
Real-time data processing for finance predictions
Privacy, Property and Internet Governance
ForgetIT: Concise Preservation via
Managed Forgetting
MAPPING
4
Computer Science and interdisciplinary research on all aspects of the Web
Information Access: Knowledge on and through the Web
Community: Supporting communities for research, education, production and entertainment
Society: Requirements (technological, social, legal) for the Web
Internet: Communication and Networks
Selected projects
Web Science @ L3S
ALEXANDRIA: Temporal Retrieval, Exploration and Analytics in Web Archives
ERC Advanced Grant (2014-2018, 2.5 MEuro)
Cooperation: Internet Archive, German National Library,
British Library, Rutgers University, et al.
Q1: How to link web archive content against multiple entity and event collections evolving over time?Ioannou, E., Nejdl, W., Niederée, C. and Velegrakis, Y. 2011. LinkDB: A Probabilistic Linkage Database System. SIGMOD’2011Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T. and Nejdl, W. 2012. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. WSDM’2012
Q2: How to support time-sensitive and entity-based queries?Kanhabua, N. and Nørvåg, K. 2010. Exploiting time-based synonyms in searching document archives. JCDL’2010Nguyen, T., and Kanhabua, N. 2014. Leveraging dynamic query subtopics for time-aware search result diversification. ECIR’2014
Entity Evolution and Implications for IR
6
march madnessbegan
14/03/2006
ncaa women tournament began
18/03/2006 01/04/2006
final four began
query: ncaa
Change of Subtopic Popularities over Time
S. R. Kairam, M. R. Morris, J. Teevan, D. J. Liebling, and S. T. Dumais. Towards supporting search over trending events with social media. ICWSM'2013Nguyen, T., and Kanhabua, N. 2014. Leveraging dynamic query subtopics for time-aware search result diversification. ECIR’2014
Motivation
Anchor texts are complementary descriptionfor target pages, widely used to improve
search
Characteristics: Short summary (a few words) of target pages Collective wisdom of people other than
authors Similar behavior to real-world queries and
titles Capturing aboutness or what a document is
about
Our idea: Temporal anchor texts mined from the edit
history of Wikipedia as a hook for tracking entity evolution
Large-scale analysis and a more robust discovery of evolving information using limited resources
N. Eiron and K. S. McCurley: Analysis of Anchor Text for Web Search. SIGIR‘2003http://googleresearch.blogspot.de/2013/03/learning-from-big-data-40-million.html
Mining Temporal Anchor Texts
1. Partition Wikipedia revisions using the one-month granularity
2. For each Wikipedia snapshot, identify named entity articles/pages
3. Extract anchor texts from all articles linking to an entity page
4. Rank aggregated entity-anchor relationships at a particular time t
N. Kanhabua and K. Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010
Mining Temporal Anchor Texts
1. Partition Wikipedia revisions using the one-month granularity
2. For each Wikipedia snapshot, identify named entity articles/pages
3. Extract anchor texts from all articles linking to an entity page
4. Rank aggregated entity-anchor relationships at a particular time t
N. Kanhabua and K. Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010
Mining Temporal Anchor Texts
1. Partition Wikipedia revisions using the one-month granularity
2. For each Wikipedia snapshot, identify named entity articles/pages
3. Extract anchor texts from all articles linking to an entity page
4. Rank aggregated entity-anchor relationships at a particular time t
President_of_the_United_States
President Bush (43)
Time: 10/2005
Barack Obama
Time: 11/2008
George W. Bush
Time: 11/2004
N. Kanhabua and K. Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010
1. Multi-word title with all words capitalized, except prepositions, determiners, etc.
E.g., President_of_the_United_States => entity
2. Single-word titles with multiple capital letters
E.g., UNICEF and WHO => entities
3. 75% of occurrences in the article text itself are capitalized (not beginning of sentence)
Recognizing Named Entity Articles
R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. EACL2006
Weight anchor texts by importance with respect to a target entity at particular time:
• Link-independent : inlink pages are independent and equally important to the target page
• Compute based on the whole collection of Wikipedia entity pages at particular time t
• Two variants: 1) article links, and 2) distinct pages
Temporal Anchor Weighting
Z. Dou, R. Song, J.-Y. Nie, and J.-R. Wen. Using anchor texts with their hyperlink structure for web search. SIGIR'2009
Weight anchor texts by importance with respect to a target entity at particular time:
• Link-independent : inlink pages are independent and equally important to the target page
• Compute based on the whole collection of Wikipedia entity pages at particular time t
• Two variants: 1) article links, and 2) distinct pages
Temporal Anchor Weighting
Z. Dou, R. Song, J.-Y. Nie, and J.-R. Wen. Using anchor texts with their hyperlink structure for web search. SIGIR'2009
Experiments
Data collection:• A dump of English Wikipedia edit history
(2.8 TB)• All pages and revisions 03/2001 to 03/2008• 85 snapshots + 4 additional snapshots
(24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)
Tools:• Preprocess/store revisions using MWDumper
http://www.mediawiki.org/wiki/Mwdumper• Store anchor texts: mySQL databases
Evolving Context
“Barack Obama”
time
05/2008 03/2009
1. Senator Barack Obama2. Senator Obama's
legislative accomplishments
3. Illinois4. U.S. Sen. Barack Obama
1. Senator Barack Obama2. Illinois Senator Barack
Obama3. Barack Hussein Obama II4. Senator Obama's
legislative accomplishments
07/2008 10/2008
Evolving Context
“Barack Obama”
time
05/2008 03/2009
1. Senator Barack Obama2. Senator Obama's
legislative accomplishments
3. Illinois4. U.S. Sen. Barack Obama
1. Senator Barack Obama2. Illinois Senator Barack
Obama3. Barack Hussein Obama II4. Senator Obama's
legislative accomplishments
07/2008
1. Senator Barack Obama2. Illinois Senator Barack
Obama3. Barak Obama, U.S.
Senator, Illinois, 2008 Democratic nominee for U.S. President
4. presidential candidacy announcement
1. President Barack Obama
2. Senator Barack Obama 3. U.S. President Barack
Obama 4. 44th President of the
United States5. Obama Administration
10/2008
Our Findings
Evolving information & context• Role changes for political entities• Geographic name changes for
locations• Trend or things in vogue for
celebrities• Products in demand for technology
Our Findings
Evolving information & context• Role changes for political entities• Geographic name changes for
locations• Trend or things in vogue for
celebrities• Products in demand for technology
Our Findings
Evolving information & context• Role changes for political entities• Geographic name changes for
locations• Trend or things in vogue for
celebrities• Products in demand for technology
Our Findings
Evolving information & context• Role changes for political entities• Geographic name changes for
locations• Trend or things in vogue for
celebrities• Products in demand for technology
Our Findings
Evolving information & context• Role changes for political entities• Geographic name changes for
locations• Trend or things in vogue for
celebrities• Products in demand for technology
Our Findings
Evolving information & context• Role changes for political entities• Geographic name changes for
locations• Trend or things in vogue for
celebrities• Products in demand for technology
Our Findings
Evolving information & context• Role changes for political entities• Geographic name changes for
locations• Trend or things in vogue for
celebrities• Products in demand for technology
Conclusions and Future Work• Analysis of temporal anchor texts extracted from Wikipedia edits
• A temporal model for Wikipedia and a probabilistic ranking method
• A hook for tracing entity evolution: evolving information and context
Future work:
• A distinction between contemporaneous and historical events/entities
• Adding semantic information to anchor texts, e.g., temporal expressions
Q1: How to link web archive content against multiple entity and event collections evolving over time?Ioannou, E., Nejdl, W., Niederée, C. and Velegrakis, Y. 2011. LinkDB: A Probabilistic Linkage Database System. SIGMOD (New York, New York, USA, Jun. 2011)
Q2: How to maintain entity and event information and indexes for web-scale archives?Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T. and Nejdl, W. 2012. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. WSDM (New York, NY, USA, 2012), 53–62.Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C. and Nejdl, W. 2012. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. TKDE. (2012).
Evolution-Aware Entity-Based Enrichment and Indexing
Huge and Heterogeneous Information Spaces
Voluminous, (semi-)structured datasets DBPedia 3.4: 36,5 million triples and 2,1 million
entities BTC09: 1,15 billion triples and 182 million entities
Users are free to insert not only attribute values but also attribute names high levels of heterogeneity
DBPedia 3.4: 50,000 attribute names Google Base:100,000 schemata and 10,000 entity
typesLarge portion of data stemming from automatic information extraction noise, tag-style values
and this does neither involve time nor entity evolution …
Temporal Retrieval and Ranking
Q3: How to support time-sensitive and entity-based query formulation?Kanhabua, N. and Nørvåg, K. 2010. Exploiting time-based synonyms in searching document archives. JCDL (New York, New York, USA, Jun. 2010)Nguyen, T., and Kanhabua, N. 2014. Leveraging dynamic query subtopics for time-aware search result diversification. ECIR (Amsterdam, April 2014)
Q4: How to improve result ranking and clustering for time-sensitive and entity-based queries?Kanhabua, N., Blanco, R. and Matthews, M. 2011. Ranking related news predictions. SIGIR (New York, New York, USA, Jul. 2011)G. Demartini, C. Firan, T. Iofciu, R. Krestel, W. Nejdl: Why finding entities in Wikipedia is difficult, sometimes. Inf. Retr. 13(5): 534-567 (2010)