
  • Temporal Dynamics in Information Retrieval

    Stewart William Whiting

    School of Computing Science

    College of Science and Engineering

    University of Glasgow, Scotland, UK.

    A thesis submitted for the degree of

    Doctor of Philosophy (Ph.D)

    October, 2015


  • I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other University.

    This dissertation is the result of my own work, under the supervision of Professor Joemon M. Jose and Dr Gethin Norman, and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text.

    Permission to copy without fee all or part of this thesis is granted provided that the copies are not made or distributed for commercial purposes, and that the name of the author, the title of the thesis and date of submission are clearly visible on the copy.

    Stewart William Whiting

    October, 2015

  • Abstract

    The passage of time is unrelenting. Time is an omnipresent feature of our existence, serving as a context to frame change driven by events and phenomena in our personal lives and social constructs. Accordingly, various elements of time are woven throughout information itself, and information behaviours such as creation, seeking and utilisation.

    Time plays a central role in many aspects of information retrieval (IR). It can not only distinguish the interpretation of information, but also profoundly influence the intentions and expectations of users' information-seeking activity. Many time-based patterns and trends (namely, temporal dynamics) are evident in streams of information behaviour by individuals and crowds. A temporal dynamic refers to a periodic regularity, or, a one-off or irregular past, present or future of a particular element (e.g., word, topic or query popularity) driven by predictable and unpredictable time-based events and phenomena.

    Several challenges and opportunities related to temporal dynamics are apparent throughout IR. This thesis explores temporal dynamics from the perspective of query popularity and meaning, and word use and relationships over time. More specifically, the thesis posits that temporal dynamics provide tacit meaning and structure of information and information seeking. As such, temporal dynamics are a two-way street since they must be supported, but also conversely, can be exploited to improve time-aware IR effectiveness.

    Real-time temporal dynamics in information seeking must be supported for consistent user satisfaction over time. Uncertainty about what the user expects is a perennial problem for IR systems, further confounded by changes over time. To alleviate this issue, IR systems can: (i) assist the user to submit an effective query (e.g., error-free and descriptive), and (ii) better anticipate what the user is most likely to want in relevance ranking. I first explore methods to help users formulate queries through time-aware query auto-completion, which can suggest both recent and always popular queries. I propose and evaluate novel approaches for time-sensitive query auto-completion, and demonstrate state-of-the-art performance of up to 9.2% improvement above the hard baseline. Notably, I find results are reflected across diverse search scenarios in different languages, confirming the pervasive and language-agnostic nature of temporal dynamics. Furthermore, I explore the impact of temporal dynamics on the motives behind users' information seeking, and thus how relevance itself is subject to temporal dynamics. I find that temporal dynamics have a dramatic impact on what users expect over time for a considerable proportion of queries. In particular, I find the most likely meaning of ambiguous queries is affected over short and long-term periods (e.g., hours to months) by several periodic and one-off event temporal dynamics. Additionally, I find that for event-driven multi-faceted queries, relevance can often be inferred by modelling the temporal dynamics of changes in related information.

    In addition to real-time temporal dynamics, previously observed temporal dynamics offer a complementary opportunity as a tacit dimension which can be exploited to inform more effective IR systems. IR approaches are typically based on methods which characterise the nature of information through the statistical distributions of words and phrases. In this thesis I look to model and exploit the temporal dimension of the collection, characterised by temporal dynamics, in these established IR approaches. I explore how the temporal dynamic similarity of word and phrase use in a collection can be exploited to infer temporal semantic relationships between the terms. I propose an approach to uncover a query topic's chronotype terms, that is, its most distinctive and temporally interdependent terms, based on a mix of temporal and non-temporal evidence. I find exploiting chronotype terms in temporal query expansion leads to significantly improved retrieval performance in several time-based collections.

    Temporal dynamics provide both a challenge and an opportunity for IR systems. Overall, the findings presented in this thesis demonstrate that temporal dynamics can be used to derive tacit structure and meaning of information and information behaviour, which is then valuable for improving IR. Hence, time-aware IR systems which take temporal dynamics into account can better satisfy users consistently by anticipating changing user expectations, and maximising retrieval effectiveness over time.

  • Acknowledgements

    I dedicate this thesis to my parents, Richard and Pamela Whiting. Their love, support and encouragement have been unwavering. Without them I would never have been able to follow this opportunity in my life. I can never express quite how truly appreciative I am.

    Writing this thesis has been a long journey. I am deeply thankful to all my friends, family and colleagues who have aided me throughout. I would like to thank my siblings Marc, Rebecca and Emma Whiting for being there for me, especially during the difficult past two years. My aunty and uncle, Sandy and Brian Talbot, opened the doors that led me down this path of discovery. Their help, guidance and inspiration changed the way I look at the world, and for that I am ever grateful. I will be forever indebted to my first mentor, Wayne Kerridge, who provided the early inspiration and support that allowed me to develop a career in technology.

    My colleagues have been a source of much inspiration and comedy over my years as a PhD student. Guido Zuccon and Teerapong Leelanupab helped me to become established when I first started. Ke Zhou, Jesus Rodriguez Perez, Philip McParlane, James McMinn, Horatiu Bota, Rami Alkhawaldeh, Fajie Yuan and Stefan Raue: their company and collaboration have been a pleasure.

    I am extremely appreciative of my supervisor, Joemon Jose, for handing me the opportunity to do a PhD funded by the EPSRC DTA scheme. He granted me the freedom to follow many new research ideas, yet provided the counsel when needed. I would also like to take this opportunity to thank my viva examiners, Arjen de Vries and Milad Shokouhi, for their hard work in providing excellent comments and suggestions to improve the thesis.

    From the very start of my time as a PhD student, it has been an absolute privilege to have Yashar Moshfeghi as my mentor. He voluntarily took a central role in helping me shape my research ideas and this thesis, and for that I will be forever appreciative.

    Special thanks must go to Omar Alonso, who gave me the incredible opportunity to join Microsoft Research in Silicon Valley as an intern in 2012, and again in 2014. Omar and my other supervisors at Microsoft, Aditi (Shubha) Nabar and Alex Dow, mentored me to develop a research and development approach that has laid the foundations of my career.

    Finally, I need to thank my partner, Jodie Clarke, for supporting me during this journey. We have gone through this together, and for that I can never quite thank her enough.

  • The only constant is change:

    All is flux, nothing is stationary; no man ever steps in the same river twice,

    for it is not the same river and he is not the same man.

    Plato, around 369 BC.

  • Contents

    Glossary

    I Introduction and Background

    1 Introduction
      1.1 Preface
      1.2 Thesis Statement
      1.3 Motivation
      1.4 Research Questions
      1.5 Outline
      1.6 Publications

    2 General IR Background
      2.1 Introduction
        2.1.1 Chapter Outline
      2.2 Information Retrieval Tasks
      2.3 Retrieval Models
        2.3.1 Boolean Model
        2.3.2 Vector Space Model
        2.3.3 Probabilistic Models
        2.3.4 Language Models
      2.4 Experimental Methodologies
        2.4.1 System-o