

Task Identification Using Search Engine Query Logs

Horace Li

Introduction

Search engines have traditionally been limited to returning a set of webpages that potentially contain the information a user requires, and their usefulness has been constrained by requiring users to formulate their own queries and keywords, breaking the purpose of their search down into strings the engine can digest. Since the 1990s, therefore, there has been work to reduce the burden of analysing one's own search goal by attempting to derive the goal of a user's searches, or tasks, automatically from a sample of their queries, combined with analysis of a corpus of query logs from other users, to identify the user's greater search goal.

Traditionally, such work has drawn on two previously unrelated areas of research: statistical analysis of query strings, with the grouping of such queries into categories, and the construction of large ontologies of data on the web. In recent years, however, these areas have increasingly been combined, so that individual search queries can be used not only to identify the category of a user's search goal but also, with ontologies of interrelated data, to derive the user's specific task or purpose, and to suggest potentially related searches or information via the corresponding relations within the ontologies.

This report aims to summarise the development of task identification techniques, with a focus on their derivation from search engine query logs, their approaches and shortcomings, and potential future improvements in this area of research. It looks at how keywords and user intents can be identified, how the two can be combined to triangulate the area being researched, and how these two strands of research could be brought together in an ideal implementation.

Identifying Tasks

Prior to the abundance of personal computing, library search systems tended to identify an individual task as the series of searches performed within a single session, delimited by user logins. As search has gradually moved to the web, the main approach to grouping searches into tasks has similarly been to distinguish tasks by time period, though the period has tended to be based on a fixed timeout value rather than on when a user logs in or out [1]. However, tested at a variety of time lengths, the precision of such methods only reaches about 70%.
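
As a minimal sketch of this baseline, the following assumes a chronologically ordered log of (timestamp, query) pairs and an illustrative 30-minute timeout; neither detail is taken from the cited work:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # illustrative cutoff, not from [1]

def segment_sessions(log):
    """Split a chronologically ordered list of (timestamp, query)
    pairs into sessions wherever the gap exceeds TIMEOUT."""
    sessions, current = [], []
    last_time = None
    for ts, query in log:
        if last_time is not None and ts - last_time > TIMEOUT:
            sessions.append(current)  # gap too large: close the session
            current = []
        current.append(query)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

log = [
    (datetime(2015, 8, 18, 9, 0), "hiking san francisco"),
    (datetime(2015, 8, 18, 9, 5), "hiking san francisco bay area"),
    (datetime(2015, 8, 18, 13, 0), "flights to london"),
]
print(segment_sessions(log))
# [['hiking san francisco', 'hiking san francisco bay area'], ['flights to london']]
```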

It has been found that in real-life scenarios many users interleave their searches, so that the traditional model of grouping tasks into blocks of time does not suffice. To address this next level of sophistication, dissecting interleaved tasks, efforts have been made to group tasks using a combination of Bayesian networks and a set of categories of search transitions [2]. However, that approach was limited in that the user's intention could only be narrowed down to refinement patterns such as generalization and specialization, together with query categories, rather than to an 'informational goal'.
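
A toy heuristic in the spirit of those categories might label consecutive query pairs by their token-set relationship; Lau and Horvitz's actual model is a Bayesian network over richer features, so this is illustrative only:

```python
def classify_transition(prev_query, next_query):
    """Label a query pair with a refinement category, loosely in the
    spirit of the categories used by Lau and Horvitz [2]."""
    a, b = set(prev_query.split()), set(next_query.split())
    if a == b:
        return "repeat"
    if a < b:            # next query adds terms: narrows the search
        return "specialization"
    if b < a:            # next query drops terms: broadens the search
        return "generalization"
    return "reformulation"

print(classify_transition("hiking san francisco", "hiking san francisco bay area"))
# specialization
```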

Further work on identifying a user's task or session has been based on combining common query words with other contextual data, such as link revisits [3] and the time proximity between consecutive or potentially related searches [4]. However, these and similar methods have at best yielded a user's search topic, rather than a more specific intent or goal.

A more recent development has been to overcome this challenge by identifying user search missions and goals within a derived search topic [1]. Here a goal is defined as an information need, and a mission as a set of related goals. Instead of identifying merely the topic of related searches, each query can thus be sorted into a hierarchy of tasks, each falling under a goal and, in turn, a mission. It is important to note that this approach differs from previous analyses of search queries by asking not just how searches are interleaved, but whether they fall into a pyramid of sub-tasks. An example used by Jones and Klinkner is 'hiking san francisco' followed shortly by 'hiking san francisco bay area': a minor addition qualifies the search, and can be used to identify that the later query relates to the earlier one as a sub-task.
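
As an illustrative sketch (Jones and Klinkner train classifiers over many features rather than using this simple containment rule), a query whose terms strictly extend an earlier query's can be attached to it as a sub-task:

```python
def build_task_tree(queries):
    """Attach each query as a sub-task of the most recent earlier query
    whose terms it strictly extends; otherwise start a new task."""
    roots, seen = [], []          # seen: (token_set, node), in arrival order
    for q in queries:
        node = {"query": q, "subtasks": []}
        tokens = set(q.split())
        # most recent earlier query whose tokens are a strict subset
        parent = next((n for t, n in reversed(seen) if t < tokens), None)
        (parent["subtasks"] if parent else roots).append(node)
        seen.append((tokens, node))
    return roots

tree = build_task_tree(["hiking san francisco", "hiking san francisco bay area"])
print(tree[0]["subtasks"][0]["query"])   # hiking san francisco bay area
```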

Whilst the most common method of identifying user intents is to identify sessions or tasks, this only yields the entity or domain the user has searched for. Contextual data, whilst insufficient on its own to derive a user session or task, can be used to identify subtler hints of intent, or in some cases even to predict the exact resource the user is seeking. An obvious example is when users re-find web material. Teevan, Adar, Jones and Potts explore re-finding behaviour and how it can reveal user intent, by tracking links that are accessed multiple times and queries that are repeated [5]. They found that 40% of all queries result in a repeated link click, and that 12% of all searches are navigational (queries that seek a particular website). Whilst of limited use for identifying intent behind informational queries (queries about a general topic), they were able to predict the link to be clicked for navigational queries with an accuracy of up to 96%.
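
A minimal sketch of this kind of re-finding analysis, assuming a hypothetical log of (user, query, clicked_url) records rather than the format used in [5], might count clicks on results the user has clicked before:

```python
from collections import defaultdict

def refinding_stats(log):
    """Count queries that lead to a click on a result the same user
    has already clicked: a simple signal of re-finding behaviour."""
    clicked = defaultdict(set)    # user -> urls clicked so far
    refinding = 0
    for user, query, url in log:
        if url in clicked[user]:
            refinding += 1        # repeat click, possibly via a new query
        clicked[user].add(url)
    return refinding, len(log)

log = [
    ("u1", "bbc news", "bbc.co.uk"),
    ("u1", "bbc", "bbc.co.uk"),          # re-found via a different query
    ("u1", "london attractions", "visitlondon.com"),
]
print(refinding_stats(log))   # (1, 3)
```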

Different queries can be grouped into tasks by identifying repeated phrases or synsets, sets of words that share a semantic concept or meaning. Grouping tasks into goals, and indeed missions, however, requires a higher-level understanding of the context and semantic meaning of the query phrases. In the same way that synsets can be identified using a database of words matching the same definition, a dataset can also be used to identify related synsets or entities and the relations between them.
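
For instance, assuming the NLTK WordNet corpus is available (via nltk.download('wordnet')), one can test whether two terms belong to a common synset:

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def share_synset(word_a, word_b):
    """Do two terms belong to at least one common synset, i.e. can
    they express the same concept?"""
    return bool(set(wn.synsets(word_a)) & set(wn.synsets(word_b)))

print(share_synset("film", "movie"))    # True  (both in movie.n.01)
print(share_synset("film", "palace"))   # False
```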

Knowledge Bases

To identify relations between different entities, one requires a large dataset of knowledge: relationships and axioms that apply to semantic concepts and facts. Such datasets are known as ontologies. In the same way that a database can be used to look up that 'LA' and 'Los Angeles' belong to the same synset, an ontology can be used to identify that 'Los Angeles' and 'California' are related by location, with one being a geographic constituent of the other.
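
Minimally, such an ontology can be pictured as a set of (subject, relation, object) triples; the relation names below are illustrative rather than any ontology's exact vocabulary:

```python
# A tiny triple store; relation names are hypothetical.
FACTS = {
    ("LA", "meansSameAs", "Los Angeles"),
    ("Los Angeles", "isLocatedIn", "California"),
    ("California", "isLocatedIn", "United States"),
}

def related(subject, relation):
    """Look up all objects linked to `subject` by `relation`."""
    return [o for s, r, o in FACTS if s == subject and r == relation]

print(related("Los Angeles", "isLocatedIn"))   # ['California']
```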

There are currently a substantial number of knowledge bases online, mostly either derived from openly licensed references such as Wikipedia, or themselves freely editable databases of information, such as Freebase. A substantial recent development has been YAGO, a large ontology derived from Wikipedia and WordNet [6]. Whilst many alternatives have developed over the years, YAGO's architecture and data-extraction methods are relatively novel (2008), and it proves a suitable starting point for comparison with competing methodologies and ontologies.

The majority of the facts within YAGO are defined by the infoboxes (tables of important facts and statistics common to related articles of the same type) within Wikipedia articles and the categories those articles fall under, whilst the relations between entities are defined by the synsets from WordNet. Where discrepancies arise, unless manual exceptions or heuristics have been defined, the WordNet entry takes priority and the Wikipedia entity is discarded. For the many users whose sought-after information can be found on Wikipedia, as is increasingly the case, a search engine whose result rankings are partly determined by cross-checking past queries against data derived from Wikipedia itself would be beneficial.

The YAGO data model is an extension of RDFS (Resource Description Framework Schema, a specification for describing and structuring information on the web), combining much of the latter's expressiveness with the additional ability to form acyclic relations [7]. It is therefore suitable for forming hierarchies of data, as required by the nature of Wikipedia categories and the sub-concept/super-concept relations of WordNet. This comes at the expense of the ability to make logical assertions, such as the negation of facts, or more complicated logical constructs. Whilst this is of little consequence for YAGO's purpose of creating a network of semantic factual data, it limits YAGO's power to describe propositions, or indeed any statement that is not an indisputable fact. Furthermore, its dependence on existing web resources means that quality assurance still requires substantial manual intervention, so its scalability (in semantic breadth, rather than raw data size and count) is somewhat limited.
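
The acyclicity such a hierarchy relies on is straightforward to verify; a small sketch over hypothetical sub-concept/super-concept edges:

```python
from collections import defaultdict

def is_acyclic(edges):
    """Check that (sub_concept, super_concept) links form a hierarchy
    with no cycles, via depth-first search for back edges."""
    graph = defaultdict(list)
    for sub, sup in edges:
        graph[sub].append(sup)

    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def dfs(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:          # back edge: cycle found
                return False
            if color[nxt] == WHITE and not dfs(nxt):
                return False
        color[node] = BLACK
        return True

    return all(color[n] == BLACK or dfs(n) for n in list(graph))

print(is_acyclic([("palace", "building"), ("building", "structure")]))  # True
print(is_acyclic([("a", "b"), ("b", "a")]))                             # False
```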

YAGO2 [8], in contrast to YAGO, aims not only to organize factual data but to apply semantic meaning to it, storing it as knowledge through the addition of spatial and temporal context (location and time) and associated metadata. Each fact can therefore carry relations that describe a span of time, rather than discrete events on certain dates; similarly, entities are assigned a location where possible. Were YAGO2 used as the principal ontology for analysing and relating user search queries, these additions would allow users to make queries about (current) events, an ever more important capability in the era of Web 2.0. Multilingual information and contextual keywords are also associated with the relevant entities, though their usefulness is comparatively minor. Although YAGO2 works towards a more semantic and contextually useful version of YAGO, its underlying architecture is still very similar, and it thus fails to solve many of the issues of the original YAGO.
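
A sketch of what such temporally anchored facts enable, with illustrative relation and field names rather than YAGO2's actual vocabulary:

```python
# Hypothetical spatio-temporally anchored facts in the style of YAGO2.
facts = [
    {"subject": "London Olympics", "relation": "happenedIn",
     "object": "London", "since": 2012, "until": 2012},
    {"subject": "Elizabeth II", "relation": "holdsPosition",
     "object": "Queen of the United Kingdom", "since": 1952, "until": None},
]

def valid_in(year):
    """Return facts whose time span covers the given year: the kind of
    query over (current) events that temporal context makes possible."""
    return [f for f in facts
            if f["since"] <= year and (f["until"] is None or year <= f["until"])]

print([f["subject"] for f in valid_in(2013)])   # ['Elizabeth II']
```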

In a further development, YAGO2s [9] provides a modular, and hence scalable, approach to gathering and extracting information and populating the YAGO database. This allows YAGO2s to draw on more than the original sources (notably Wikipedia, WordNet, and GeoNames); indeed, any combination of data sources can be collated through a process not unlike multiple iterations of MapReduce [10]: complete datasets are transformed into 'themes', from which attributes or subsets of the combined data are derived by 'extractors', modules of code that each extract a specific type of data (such as facts from Wikipedia only). The highlights of YAGO2s are its improved modular architecture and its open design, which allow anyone to implement their own extractors and so potentially tap knowledge from a far wider variety of sources. As current-generation search engines move towards a real-time web and the indexing of current news is increasingly taken for granted, such a modular approach could allow YAGO2s to scrape data from social media or live sources such as Twitter (although the value of such data is another matter).
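
The extractor/theme idea can be sketched as follows, with hypothetical extractor and theme names; YAGO2s's real extractors are of course far more involved:

```python
# Each extractor consumes named datasets ('themes') and emits new ones,
# so extractors chain like MapReduce stages. All names are illustrative.
def wikipedia_infobox_extractor(themes):
    articles = themes["wikipedia_articles"]
    return {"raw_infobox_facts": [fact for a in articles
                                  for fact in a.get("infobox", [])]}

def deduplication_extractor(themes):
    return {"clean_facts": sorted(set(themes["raw_infobox_facts"]))}

def run_pipeline(themes, extractors):
    for extractor in extractors:
        themes.update(extractor(themes))   # later stages see earlier output
    return themes

themes = {"wikipedia_articles": [
    {"title": "London", "infobox": [("London", "isLocatedIn", "England")]},
]}
out = run_pipeline(themes, [wikipedia_infobox_extractor, deduplication_extractor])
print(out["clean_facts"])   # [('London', 'isLocatedIn', 'England')]
```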

A point of concern is YAGO's susceptibility to a single point of failure in its data sources, since it is still mostly powered by WordNet and Wikipedia. As a consequence of measures implemented to maintain data quality, a gradual decline in the number of Wikipedia editors has been observed over the past several years [11]. Although Wikipedia has proved a reliable source of data for YAGO and other high-profile projects such as DBpedia [6][12], there are concerns that it will not be able to maintain both its accuracy and its breadth of knowledge in the long term. Viable alternatives exist that match Wikipedia in accuracy [13], but Wikipedia's breadth remains at least an order of magnitude greater than that of its nearest competitor.

Research has proceeded on multiple fronts. Projects such as YAGO and DBpedia draw data from a select few sources, relying on semi-structured data such as Wikipedia's infoboxes. Others utilize existing ontologies published by independent parties to a common standard such as the Web Ontology Language (OWL) [14], or indeed networks of such ontologies [15], leveraging large datasets and machine learning tools to deal with noise, inconsistencies, and gaps in the data [16]. Such projects tend not to have the single point of failure noted above for Wikipedia. However, a balance has to be struck between the breadth of the data and its accuracy, especially when that breadth makes further manual curation unfeasible.

With the development of artificial intelligence, many projects now use a combination of existing structured data, semi-structured data, natural language processing, machine learning, and other recently developed big-data techniques [17]. Although YAGO proved a promising starting point, especially with the modular approach of YAGO2s, the best results for looking up relations between entities appear to come from large overlapping datasets combined with machine learning and big-data techniques, potentially allowing scalable designs without a proportional number of manually defined exceptions, as were needed in the original YAGO. Current trends point in this direction, and with big data gradually maturing, they are set to continue.

Implementation

In a potential, ideal implementation, a user would begin a search mission as they normally would on a conventional search engine, and as they enter queries, the engine would run each query through both the traditional search system and a query-analysis system. As the latter converges on the field being searched, or the user's intent, the returned results would be increasingly influenced by the query analyser.
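
A minimal sketch of such blending, with entirely hypothetical scoring functions; the analyser's weight grows with its confidence in the inferred task:

```python
def keyword_score(doc, query):
    """Stand-in for a conventional ranking function such as BM25."""
    terms = query.lower().split()
    words = doc.lower().split()
    return sum(words.count(t) for t in terms) / (len(words) or 1)

class TaskModel:
    """Hypothetical query analyser whose confidence in the inferred
    task grows as more of the user's queries match it."""
    def __init__(self, task_terms):
        self.task_terms = set(task_terms)
        self.confidence = 0.0                      # in [0, 1]

    def observe(self, query):
        hits = self.task_terms & set(query.lower().split())
        self.confidence = min(1.0, self.confidence + 0.2 * len(hits))

    def relevance(self, doc):
        words = set(doc.lower().split())
        return len(self.task_terms & words) / len(self.task_terms)

def blended_score(doc, query, model):
    w = model.confidence               # analyser's influence grows with w
    return (1 - w) * keyword_score(doc, query) + w * model.relevance(doc)

model = TaskModel(["london", "travel", "flights", "hotels"])
model.observe("flights to london")     # confidence rises to 0.4
print(blended_score("cheap hotels in london", "flights to london", model))
```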

Just as current-generation search engines are expected to search an up-to-date index, an ideal query analyser (and the knowledge base it draws on) would be expected to hold up-to-date knowledge, in both breadth and depth. This requirement for a (near) real-time indexer is an issue for many ontology architectures such as YAGO2, which currently relies on a batch-based processing system with a predefined schedule and dependency graph. Whilst that mode of operation is perhaps adequate as a proof of concept, the two-stage MapReduce model it resembles dates back to at least 2003. Since then, the explosion of data and the associated change in data-manipulation strategies [18] have given rise to systems that can incrementally process updates and additions to large datasets, such as Google's Percolator [19]. Furthermore, recent technological advances and the need to cater for data-processing tasks outwith the traditional batch model have encouraged the development of frameworks that can process and query large amounts of data in short amounts of time [20], ideal for machine learning and data processing in real time. If YAGO or ontologies of a similar design were to scale up for public use, in ways akin to Google's recent well-publicized efforts such as the Knowledge Graph [21], they would be expected to provide data that is both accurate and up to date. One might even expect this ideal implementation to consume the Twitter firehose (to take an extreme example) and incorporate the latest news into its vast ontology. The current batch-based design would not suffice at scale against the exponential growth of data and the public's expectation of real-time indexing; a reimplementation using some of the aforementioned technologies would be required.
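
The contrast, reduced to a toy example (neither system's actual interface): a batch pipeline reprocesses the whole corpus each run, whereas an incremental one folds in only the new facts as they arrive:

```python
def batch_rebuild(all_facts):
    """Batch model: reprocess the entire corpus every run, so latency
    grows with corpus size (as in a YAGO2-style pipeline)."""
    index = {}
    for s, r, o in all_facts:
        index.setdefault(s, []).append((r, o))
    return index

def incremental_update(index, new_facts):
    """Incremental model: touch only the delta, in the spirit of
    Percolator [19], so latency tracks the update size instead."""
    for s, r, o in new_facts:
        index.setdefault(s, []).append((r, o))
    return index
```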

Whilst many data-retrieval fields, such as traditional search and data analytics, have begun to make use of newer, non-batch data-processing methods, semantic knowledge extraction and storage is still in its infancy, and efforts in knowledge extraction have continued to focus on data quality and sourcing rather than on the technical implementation of the processing itself. Beyond ontology size and coverage, then, one of the most pressing concerns in this area is the accuracy of an ontology's data. Whilst the current generation of ontologies use Wikipedia, a resource shown to have very few inaccuracies [13][22], their reliability will undoubtedly suffer if they incorporate a much wider range of online sources. Traditional manual editing will not scale, and it falls largely to machine learning algorithms to extract data from around the web. Progress is underway in this area of research, especially around the said issues of extracting data reliably and merging disparate (and potentially conflicting) datasets [23], [24].

Conclusion

Search engines have come a long way since AltaVista and even the first incarnations of Google late last century. Whilst Google, Wolfram Alpha and the like can answer certain questions directly, no search engine is yet smart enough to predict our intent, or what we need, from analysing a couple of search queries. Work in this area has mostly been based on deriving a core set of synsets for each task (or sub-task) the user has been searching for, establishing an ontology of semantic data and contextual knowledge, and combining the two to identify information that has eluded the user (but is related to many of the topics they have been searching). Whilst methods both for sorting user search queries into tasks and for storing data within ontologies are maturing, there has yet to be an implementation that successfully combines the two and can 'predict' the missing piece of information the user requires, based on their identified intent. Research in this field is therefore expected to centre on improving the way search query logs can be used to build a more detailed hierarchy of tasks using ontologies, generating a prediction of what the user needs based on their current task and this hierarchy, and eventually implementing a proof of concept.

References

[1] R. Jones and K. L. Klinkner, "Beyond the session timeout: Automatic hierarchical segmentation of search topics in query logs," in Proc. 17th ACM Conference on Information and Knowledge Management (CIKM), 2008.

[2] T. Lau and E. Horvitz, "Patterns of search: Analyzing and modeling web query refinement," in Proc. 7th International Conference on User Modeling (UM), 1999.

[3] A. Montgomery and C. Faloutsos, "Identifying web browsing trends and patterns," Computer, vol. 34, no. 7, pp. 94–95, 2001.

[4] D. He, A. Göker, and D. J. Harper, "Combining evidence for automatic web session identification," Information Processing & Management, vol. 38, no. 5, pp. 727–742, 2002.

[5] J. Teevan, E. Adar, R. Jones, and M. Potts, "History repeats itself: Repeat queries in Yahoo's logs," in Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.

[6] F. Suchanek, G. Kasneci, and G. Weikum, "YAGO: A large ontology from Wikipedia and WordNet," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 6, no. 3, pp. 203–217, 2008.

[7] P. Hayes, "RDF Semantics," W3C Recommendation, Feb. 2004.

[8] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, "YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia," Artificial Intelligence, vol. 194, pp. 28–61, 2013.

[9] J. Biega, E. Kuzey, and F. M. Suchanek, "Inside YAGO2s: A transparent information extraction architecture," in Proc. 22nd International Conference on World Wide Web (WWW Companion), 2013.

[10] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[11] A. Halfaker, R. S. Geiger, J. T. Morgan, and J. Riedl, "The rise and decline of an open collaboration system: How Wikipedia's reaction to popularity is causing its decline," American Behavioral Scientist, vol. 57, no. 5, pp. 664–688, 2013.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann, "DBpedia: A crystallization point for the Web of Data," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3, pp. 154–165, 2009.

[13] J. Giles, "Internet encyclopaedias go head to head," Nature, vol. 438, no. 7070, pp. 900–901, 2005.

[14] D. L. McGuinness and F. van Harmelen, "OWL Web Ontology Language Overview," W3C Recommendation, Feb. 2004.

[15] C. Bizer, T. Heath, and T. Berners-Lee, "Linked Data: The story so far," International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.

[16] M. Nickel, V. Tresp, and H.-P. Kriegel, "Factorizing YAGO: Scalable machine learning for linked data," in Proc. 21st International Conference on World Wide Web (WWW), pp. 271–280, 2012.


[17] F. Suchanek and G. Weikum, "Knowledge harvesting in the big-data era," in Proc. 2013 ACM SIGMOD International Conference on Management of Data, pp. 933–937, 2013.

[18] P. Ranganathan, "From microprocessors to nanostores: Rethinking data-centric systems," Computer, vol. 44, no. 1, pp. 39–48, 2011.

[19] D. Peng and F. Dabek, "Large-scale incremental processing using distributed transactions and notifications," in Proc. 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.

[20] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proc. 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2010.

[21] S. Gallagher, "How Google and Microsoft taught search to 'understand' the Web," Ars Technica, 2012.

[22] K. A. Clauson, H. H. Polen, M. N. K. Boulos, and J. H. Dzenowagis, "Scope, completeness, and accuracy of drug information in Wikipedia," Annals of Pharmacotherapy, vol. 42, no. 12, pp. 1814–1821, 2008.

[23] C. Knoblock, K. Lerman, S. Minton, and I. Muslea, "Accurately and reliably extracting data from the web: A machine learning approach," in Intelligent Exploration of the Web, Springer, 2003.

[24] A. Doan, P. Domingos, and A. Halevy, "Reconciling schemas of disparate data sources: A machine-learning approach," ACM SIGMOD Record, vol. 30, no. 2, 2001.

Presentation Slides

Slide 1: (title slide)

Slide 2: Outline
- Demonstrate the problem
- What we're working towards; the expected end result
- How to solve it
- Further developments

Slide 3: Example from the briefing: London.

Slide 4: Want to travel to London; attractions.

Slide 5: Flights.

Slide 6: Where to stay.

Slide 7: Accommodation. Other problems?

Slide 8: Experienced traveller?

Slide 9: Haven't been robbed at the airport. Visual walkthrough.

Slide 10: What else?

Slide 11: Takes too long; need to think of each aspect individually. What is this area of research?

Slide 12: The search engine knows what we want, not just what we type; we don't need to make the queries.

Slide 13:
- Search a couple of the following
- Identify what you're doing
- Give you results on related matters
- Suggest areas to look further into

Slide 14:
- Google does this, but nowhere near as comprehensively
- It looks for past records, rather than really understanding what you're looking for
- Statistical analysis: 100 people search 'attractions' followed by 'London map'
- A long way to go

Slide 15: Identify the task, mission, goal.

Slide 16: Library systems: analyse queries within the same session.

Slide 17: Primitive methods: all queries within a session are part of the same task.

Slide 18:
- De-interleave queries from the session
- Common words
- Determine the category of search
- Determine search patterns: generalization, specialization, reformulation, repeat

Slide 19: Link revisits; time proximity.

Slide 20:
- Continuing with contextual data, once de-interleaved
- Interleaved searches: sub-searches?
- Analyse tasks and sub-tasks

Slide 21:
- Mission: London
- Goal: London attractions
- Task: Buckingham Palace
- Sub-task: tickets
- How to get relations between queries?

Slide 22: How to determine relations: Buckingham Palace is in London.

Slide 23: Previous queries alone are 'dumb'; the system doesn't 'understand' anything.

Slide 24: Datasets: collections of ordered knowledge. Wikipedia, but computer-readable. Known as an ontology.

Slide 25:
- A database of Wikipedia: run queries, understand relations; a network of data
- YAGO, DBpedia, BabelNet, Freebase
- Some derived from Wikipedia, some custom-curated

Slide 26: Lists in Wikipedia. A computer can understand the corresponding ontology; it understands relations.

Slide 27:
- Can combine ontologies with queries
- Can identify the domain or area searched
- Produce a hierarchy of tasks
- Still no idea I'm travelling to London, but starting to 'understand' what I'm looking for

Slide 28:
- Understand user behaviour; how the user reformulates queries
- Contextual data as before
- Need to make more use of the ontology's knowledge to triangulate the main item being searched
- Need to identify not just the area searched, but also search goals and intent

Slide 29: (closing slide)