Personalizing Web Search Results by Reading Level

Kevyn Collins-Thompson, Paul N. Bennett, Ryen W. White
Microsoft Research, One Microsoft Way, Redmond, WA 98052
{kevynct, pauben, ryenw}@microsoft.com

Sebastian de la Chica
Microsoft Bing, One Microsoft Way, Redmond, WA 98052
[email protected]

David Sontag
Microsoft Research New England, One Memorial Drive, Cambridge, MA 02142
[email protected]

ABSTRACT

Traditionally, search engines have ignored the reading difficulty of documents and the reading proficiency of users in computing a document ranking. This is one reason why Web search engines do a poor job of serving an important segment of the population: children. While there are many important problems in interface design, content filtering, and results presentation related to addressing children's search needs, perhaps the most fundamental challenge is simply that of providing relevant results at the right level of reading difficulty. At the opposite end of the proficiency spectrum, it may also be valuable for technical users to find more advanced material, or to filter out material at lower levels of difficulty, such as tutorials and introductory texts.

We show how reading level can provide a valuable new relevance signal for both general and personalized Web search. We describe models and algorithms to address the three key problems in improving relevance for search using reading difficulty: estimating user proficiency, estimating result difficulty, and re-ranking based on the difference between user and result reading level profiles. We evaluate our methods on a large volume of Web query traffic and provide a large-scale log analysis that highlights the importance of finding results at an appropriate reading level for the user.

Categories and Subject Descriptors: H.3.3 [Information Retrieval]: Retrieval Models
General Terms: Algorithms, Experimentation
Keywords: Reading difficulty, re-ranking, personalization

1. INTRODUCTION

Our goal is to show how modeling the reading proficiency of users and the reading difficulty of documents can be used to improve the relevance of Web search results. This goal is motivated by the fact that content on the Web is written at a wide range of reading levels: from easy introductory texts and material written specifically for children, to difficult, highly technical material for experts that requires advanced vocabulary knowledge to comprehend. Web users differ widely in their reading proficiency and ability to understand vocabulary, depending on factors such as age, educational background, and topic interest or expertise. Web search engines, however, typically use algorithms optimized for the 'average' user, not specific individuals. Together, these facts impair users' ability to carry out successful searches by finding material at an appropriate level of reading difficulty.

As an example of the need, and potential, for personalization by reading level, consider the query [insect diet], whose actual top-ranked results from a major search engine are shown in Table 1. While a younger child may be more likely to have chosen query terms like [bug diet] or [what do bugs eat?], the choice of [insect diet] could have been made by a child doing a class project, or by an elementary school teacher searching for low-difficulty material; parents or middle-school science students may require intermediate material, and more advanced high school and college users may require sites describing entomology research, as the top results do here. Only one result (www.tutorvista.com), at rank position eight, is at a low level of difficulty (American school grade level of 5.0). Several results, however, including the top two, contain highly technical research-oriented content most appropriate for specialists. Clearly there is room for improvement in ranking search results at an appropriate level of reading difficulty.

To address this problem, we describe a tripartite approach based on user profiles, document difficulty, and re-ranking. First, we discuss how snippets and Web pages can be labeled with reading level and combined with Open Directory Project (ODP, www.dmoz.org) category predictions. Second, we describe how a user's reading proficiency profile may be estimated automatically from their current and past search behavior. Third, we use this profile to train a re-ranking algorithm that combines both relevance and difficulty in a principled way, and which generalizes easily to broader tasks such as expertise-based re-ranking. In this view, the overall relevance of a document is a combination of two factors: a general relevance factor, provided by an existing ranking algorithm, and a user-specific reading difficulty model, based on the gap between a user's proficiency level and a document's difficulty level. While users may self-identify their desired level of result difficulty, such information may not always be provided. We therefore investigate methods for estimating a reading proficiency profile for users based on their online search interaction patterns.

We structure our study as follows. In Section 2 we review related work in reading difficulty prediction, modeling user expertise, and search systems for children. We then describe three key problems that must be addressed in using reading level to improve Web search relevance: estimating a profile of the user's reading proficiency, estimating a document's reading difficulty, and re-ranking algorithms that can effectively combine relevance and difficulty signals to improve search quality. Section 3 then develops the theoretical models and algorithms we use to address each of these three areas. In Section 4 we contrast the search behavior of two groups, users looking for 'kids'-related material and general users, by performing a large-scale log analysis of query, session, and result properties. This analysis helps provide insights into features that may be useful for improving re-ranking by reading level. Section 5 evaluates the effectiveness of those algorithms according to implicit relevance judgments obtained from actual Web search logs comprising queries, search result clicks, and post-click navigation events. Finally, in Sections 6 and 7 we discuss areas for future research and summarize our findings.

Rank  URL domain                  Title                                   Category               Reading level (grade)
1     insectdiets.com             Insect Diet & Rearing Research          Technical/Research     High (10.0)
2     imfc.cfl.rncan.gc.ca        Insect Diet                             Technical/Research     High (10.0)
3     www.sugar-glider-store.com  Insect-Eater Diet                       Commercial             Medium (7.0)
4     insectdiets.com             Insect Rearing Research                 Technical/Research     High (10.0)
5     insectrearing.com           Bio-Serv Entomology Division            Commercial, Technical  Medium (8.0)
6     www.ehow.com                Aquatic Insects & Diet                  Educational            Medium (7.0)
7     www.exoticnutrition.com     Insect-eater Diet...                    Commercial             Medium (6.0)
8     www.tutorvista.com          Insect diet: Questions & Answers        Educational            Low (5.0)
9     www.encarta.msn.com         Dictionary                              Not relevant (empty)   N/A
10    deltafarmpress.com          Producers may put fish on insect diet   Technical/News         High (10.0)

Table 1: Top ten results, in rank order, for the query [insect diet] from a commercial search engine, showing the wide variation in reading level that can occur for material retrieved on the same query. Reading level is estimated using the statistical model described in Section 3.1 and shown in the last column. (Query issued on January 20, 2011.)

2. RELATED WORK

Personalizing search using reading level touches on several research areas with relevant prior work: search systems for children and students; modeling user expertise and topic familiarity; algorithms for predicting reading difficulty; and personalization or re-ranking methods based on additional user-specific relevance signals.

Effective search systems for children and students have been the focus of increased interest in recent years. Progress in improved user interfaces, crawling and indexing strategies, and models of child-centered relevance are all important in creating a better search experience for children. The PuppyIR project [15] has begun to examine these types of important questions around the design of search engines, especially user interfaces and finding appropriate Web sites for children [7]. To date, however, there has been little, if any, published work on user modeling and re-ranking algorithms based on reading level, or on their deployment and evaluation in a commercial-scale search engine. Gyllstrom and Moens [10] proposed a binary labeling of Web documents, material for children versus adults, where the label is inferred using a PageRank-inspired graph walk algorithm called AgeRank. For queries, recent work has explored query expansion methods for queries formulated by children [18]. Our approach operates at a lower level and assumes that common operations such as spelling correction have already been performed by the retrieval system to obtain the top-ranked documents, although such additional processing may well be improved using the methods and features that we develop in this paper. Torres et al. [17] performed an analysis of the AOL query log to characterize so-called 'Kids' queries. A query was labeled as a Kids query if and only if it had a corresponding clicked document whose domain was listed as an ODP entry in the 'Kids&Teens' ODP top-level category. Our study includes analysis based on the same definition, but we also explore the important dimension of reading level.

An emerging community of human-factors researchers has been focusing on children's experiences in searching for information online. Hirsh [11] carried out a detailed study of the relevance criteria that children employ when searching for information online. The study offers important findings on what criteria matter to children as they search the Web (e.g., topicality was viewed as much more important than authority). Bilal [2] investigated children's cognitive, affective, and physical behaviors as they used the Yahooligans! search engine to find information on a specific search task. Bilal found that children's search processes were ineffective, inefficient, and of low quality, suggesting that children need to be better trained in how to search, or that search engines need to adapt, as we advocate in this paper. More recently, Druin et al. [6] studied how children use keyword-based Web search and found that children exhibit a number of different roles (e.g., content searcher, distracted searcher) that have implications for the design of new search interfaces tailored toward children's information needs and search behaviors.

Search engines have attempted to adapt to children's use in other ways. For example, many search engines provide a degree of parental-control filtering, which blocks inappropriate material. Other sites provide a corpus of high-quality but highly controlled 'white-listed' sites that is curated by human editors, but limited in scope and recency compared to the standard Web. Since these other areas are either existing technologies or outside the scope of our research, we focus on the core problem of improving search result relevance for users, given an estimate of their reading proficiency.

Estimating reading difficulty has been studied for decades, but traditional formulae such as Flesch-Kincaid provide only a crude combination of vocabulary and syntactic difficulty estimates [5]. Recent progress has been made in applying statistical modeling and machine learning to improve reading difficulty estimation for non-traditional documents [5, 12] such as Web pages or short snippets. An earlier study on predicting query readability level [14] attempted automatic recognition of reading levels from user queries by using Support Vector Machines with syntactic and vocabulary-based features. A study by Clarke et al. [4] showed that the features of snippets provided by search engines can significantly influence clickthrough behavior. Their study included an ad-hoc readability measure for each snippet: the percentage of words that occur in a list of the 100 most frequent English words. They did not experimentally validate this measure, but it is related to the Dale readability feature we include (Section 3.1.2).

Reading difficulty is related to, but separate from, the topic familiarity of a document to a user. Kumaran et al. [13] examined re-ranking Web search results with respect to topic familiarity. Their results suggested that traditional reading difficulty features and formulas such as Flesch-Kincaid alone could not predict whether a document was an introductory or advanced text for a given topic. Our study, however, retains a focus on general language proficiency, where it is somewhat easier to distinguish between levels. We also do not rely exclusively on traditional difficulty measures, as these have been shown to perform poorly for Web texts [5]. Instead, we apply a robust statistical modeling approach that is able to capture detailed, per-word distinctions in usage across grades.

White et al. [22] examined how domain expertise influences Web search behavior, and their analysis of numerous query and session features inspired our own choice of features for result re-ranking. Earlier work by Teevan et al. [16] examined general personalization based on a variety of user-behavior and content-based features, and re-ranked using a simple interpolation formula. A number of studies have investigated combining a base relevance score with auxiliary user- or group-associated features to perform personalized re-ranking. This includes subtopic coverage and novelty [24], which gives a multiplicative re-ranking update but also assumes a particular retrieval model. We emphasize that this paper is about solving multiple problems to provide an end-to-end solution that assumes little about the base ranking score or underlying retrieval model, and thus can be applied in a wide variety of systems.

With this research we extend previous work in a number of ways. First, we introduce a document labeling methodology that assigns reading level and ODP category predictions to both documents and the corresponding query-biased snippets that searchers use when making search engine result page (SERP) clickthrough decisions. These labels play an important role in re-ranking and in the evaluation of re-ranking. Second, we describe how a user's reading proficiency profile may be estimated automatically from their current and past search behavior. Finally, we train and evaluate at Web scale a re-ranking algorithm that combines both relevance and difficulty in a principled way, and allows proficiency profile features to be used to re-rank Web search results.

3. PROBLEM FORMULATION

There are three key problems to solve in order to incorporate reading level as a relevance signal for Web search: (i) estimating the reading level of documents and snippets, (ii) estimating the reading proficiency of users, and (iii) ranking documents based on the reading levels of users and documents. We now treat each of these in turn.

3.1 Estimating reading difficulty of documents and snippets

We represent the reading difficulty of a document or text as a random variable Rd taking values in the range [1, 12]. In this study, these values correspond to American school grade levels, although they could easily be modified for finer or coarser distinctions in level, or for different tasks or populations. We computed reading level predictions for two different representations of a page: the combined title and summary text, which we call a 'snippet', that appears for that page in the search engine results page; and the full body text extracted from the HTML of the underlying page. The snippet text and full-page text are complementary sources of information. While the snippet provides a relatively short sample of content for the underlying page, it is query-specific, and is what users see when choosing whether the corresponding page may be relevant and thus whether to click the result. The full-page text, in contrast, is not affected by a query, and is what users see after clicking on a result hyperlink on the search result page. We were interested in the interaction of snippet and page level estimates as well as in their individual effectiveness. Indeed, we discovered a strong interaction of the snippet-page difficulty difference with page dwell time (see Section 4.4 for more details).

3.1.1 Prediction using language models

The reading difficulty prediction method that we use for this study, summarized in this section, has been shown to be effective for both short, noisy texts and full-page Web texts. Unlike traditional measures that compute a single numeric score, methods based on statistical language modeling provide extra information about score reliability by computing the likely distribution over levels, which can be used to compute confidence estimates. Moreover, language models are vocabulary-centric and can capture fine-grained patterns in individual word behavior across levels. Thus, they are ideal for the noisy, short, fragmented text that occurs on the Web in queries, titles, result snippets, and image or table captions. Because of this short, noisy nature of Web snippets, we applied a robust, vocabulary-oriented reading difficulty prediction method that is a hybrid of the original smoothed unigram approach [5] and a more recent model based on estimated age of word acquisition [12].

Following [12], we say that a document D has (r, s)-reading level t if at least s percent of the words in D are familiar to at least r percent of the general population at grade t.¹ We say that a word has r-acquisition level µw(r) if r percent of the population have acquired the word by grade µw. For a fixed (but large) vocabulary V of distinct words, we define an approximate age of acquisition for all words w ∈ V using a truncated normal distribution with parameters (µw, σw). We estimated (µw, σw) from a corpus of labeled Web content from [5]. With these word parameters, we can then apply the above definition of (r, s)-readability. To compute the readability distribution of a text passage, we accumulate individual word predictions into a stepwise cumulative density function (CDF); each word contributes in proportion to its frequency in the passage. The reading level of the text is then the grade level corresponding to the s-th percentile of the text's word acquisition CDF. Details are given in Kidwell et al. [12].

¹In that study, the authors found that setting r = 0.80 and s = 0.65 provided the greatest reduction in training error, and so we use the same settings for r and s here.
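To make the percentile computation concrete, the following minimal Python sketch (ours, not the authors' implementation) estimates a text's reading level from per-word acquisition parameters; acquisition_params is an assumed dictionary mapping each word to its estimated (µw, σw), and the truncation of the normal distribution is omitted for simplicity.

```python
# A minimal sketch, assuming a hypothetical acquisition_params dictionary
# of per-word (mu_w, sigma_w) estimates; truncation is omitted for brevity.
from collections import Counter
from statistics import NormalDist

def reading_level(tokens, acquisition_params, r=0.80, s=0.65):
    """Grade level of a text: the s-th percentile of its frequency-weighted
    word-acquisition CDF, where word w is acquired by r percent of the
    population at the r-quantile of Normal(mu_w, sigma_w)."""
    counts = Counter(t.lower() for t in tokens if t.lower() in acquisition_params)
    if not counts:
        return None  # no in-vocabulary words to estimate from
    # Grade by which r percent of the population has acquired each word.
    items = sorted(
        (NormalDist(*acquisition_params[w]).inv_cdf(r), freq)
        for w, freq in counts.items()
    )
    # Walk the stepwise CDF until it passes the s-th percentile.
    total = sum(freq for _, freq in items)
    cum = 0
    for grade, freq in items:
        cum += freq
        if cum / total >= s:
            return min(max(grade, 1.0), 12.0)  # clamp to grade range [1, 12]
    return 12.0
```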

3.1.2 Prediction using traditional semantic variables

We also compute a traditional measure of vocabulary-based difficulty: the fraction of unknown words in a query or snippet (which we call the 'Dale readability measure') relative to the Dale 3000 word list, which is the semantic component of the Dale-Chall reading difficulty measure [3].

3.1.3 Category prediction

We used automatic classification techniques to assign a category label to each page. A logistic regression classifier with an L2 regularizer was trained for each of the ODP topics identified: for the experiments reported in this study, we had available the 219 topical categories from the top two levels of the ODP hierarchy, although we focus primarily on the Kids&Teens category. Our classifier assigned one or more labels using an approach similar to that described in [1]. We chose the Open Directory Project (ODP) for classification because of its broad, general-purpose topic coverage; the availability of reasonably high-quality training data; and, of special interest, its Kids&Teens category, which was created in November 2000 with its own set of editorial guidelines and with the goal of providing kid-safe content for the under-18 age group.
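As an illustration of this kind of classifier (the authors' training pipeline is not described in detail, so this is only a hypothetical scikit-learn sketch), one binary L2-regularized logistic regression model can be trained per ODP category and used to assign one or more labels:

```python
# Illustrative sketch only, not the paper's implementation. X is a numpy
# document-feature matrix; y_by_category maps a category name to binary labels.
from sklearn.linear_model import LogisticRegression

def train_odp_classifiers(X, y_by_category, C=1.0):
    """One binary L2-regularized logistic regression per ODP category."""
    models = {}
    for category, y in y_by_category.items():
        clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
        models[category] = clf.fit(X, y)
    return models

def predict_labels(models, x, threshold=0.5):
    """Assign one or more category labels to a single document vector x."""
    return [c for c, m in models.items()
            if m.predict_proba(x.reshape(1, -1))[0, 1] >= threshold]
```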

3.2 Estimating reading proficiency of users

One approach is to have users self-identify their level of reading proficiency. For example, this is the approach that Google has recently taken as part of its advanced search tools: users may choose to filter the results to show only a 'basic', 'intermediate', or 'advanced' reading level. This approach has as its advantages both simplicity of user interaction and transparency of search behavior. One disadvantage is that it may be difficult for users to properly calibrate their own reading level. Also, reading proficiency may change over time, and it may depend on the actual query issued. Thus, we consider ways to construct a reading proficiency profile automatically from search behavior, which may include the previous queries and click-throughs in either the session or the user's long-term history. We discuss one such approach now.

To match the difficulty distribution p(Rd) of a document given in Section 3.1.1, we define a proficiency profile for user u to be a distribution p(Ru) over levels, allowing us to compute the probability that a user understands a document. As with the document, Ru can take values in the range [1, 12]. Here, we give a generative model for a user's search sessions that we will use in estimating p(Ru) from a user's search behavior. Although the prior distribution p(Ru) is assumed to be the same for all of a user's search sessions, the posterior p(Ru | query) depends on the query and may differ between sessions. Let Q denote the set of queries that the user has issued in this session, and let Dq denote the documents that the user clicks on in response to query q. We make the assumption that a user clicks and dwells on a document only if they can understand it, or more generally, if they like the page because the reading level is appropriate to their intent.² A session is generated as follows:

1. rd ∼ p(Rd)  (given in Section 3.1.1)

2. ru ∼ p(Ru)  (to estimate)

3. For all q ∈ Q:

   (a) q ∼ p(query | ru)

   (b) For all d ∈ Dq: SAT-click = 1 ∼ p(u likes reading level of d | ru, rd)

²A widely-used dwell-time satisfaction threshold in Web search is 30 seconds, termed a 'satisfied' click or SAT-click. In reality, dwell time may vary with a number of factors, such as the age or reading proficiency of the user.

Typically, users will like reading documents whose difficulty is at or below their average proficiency level, and will dislike documents more and more as their difficulty increases above this average level. To reflect this, we may choose a definition such as

p(u likes level of d | ru, rd) = exp(−max(0, rd − ru)).    (1)

However, the appropriate form of Eq. 1 may vary depending on the user and their intent. For example, some expert users looking for high-difficulty technical material may actually want to penalize easier documents, to avoid introductory material. Other users, such as students, may want material that is neither too difficult nor too easy relative to their level. Thus, we can instead define

p(u likes level of d | ru, rd) = exp(−(rd − ru)²).

In this study we use Eq. 1, but exploring other forms, or learning the form from data, is a topic for future work.
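To illustrate how Eq. 1 turns observed clicks into an estimate of p(Ru), the following sketch computes a grid posterior over grade levels. It is our simplification of the generative model above: it conditions only on the levels of SAT-clicked documents, treats those levels as point estimates, and ignores the p(query | ru) term.

```python
# Sketch of a posterior over a user's proficiency level from SAT clicks.
# prior is a dict {grade: p(Ru = grade)}; clicked_levels are point estimates
# of the reading levels r_d of the user's satisfied-click documents.
import math

def user_level_posterior(prior, clicked_levels):
    """p(Ru | clicks) ∝ p(Ru) · Π_d p(u likes level of d | ru, rd), per Eq. 1."""
    posterior = {}
    for ru, p_ru in prior.items():
        likelihood = 1.0
        for rd in clicked_levels:
            likelihood *= math.exp(-max(0.0, rd - ru))  # Eq. (1)
        posterior[ru] = p_ru * likelihood
    z = sum(posterior.values())
    return {ru: p / z for ru, p in posterior.items()}

# Example: a uniform prior over grades 1-12, sharpened by two SAT clicks.
prior = {g: 1 / 12 for g in range(1, 13)}
print(user_level_posterior(prior, clicked_levels=[5.0, 7.0]))
```

Because Eq. 1 is flat for ru ≥ rd, this posterior mainly shifts mass away from levels below the clicked documents' levels; a squared-gap likelihood like the one above would instead concentrate mass around them.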

The distribution p(query | Ru) would ideally be a language model estimated directly from query logs. An alternative is to use the language model that we developed for document classification. However, query readability may be very different from document readability: the words a user recognizes may differ from the words they choose to use in queries. A much simpler approach is to model only the length of the query, ignoring the actual words. As we show in Figure 1, query length can be informative about a user's reading proficiency.

We use the above ideas to compute a session-based query difficulty feature based on the average reading level of the satisfied clicks that a user makes for previous queries within the session.

If we also have access to the set of Web sites that a user frequently visits, we could use these to help predict the user's reading proficiency. For example, Club Penguin and Funbrain are two sites often visited by children, but less frequently visited by adults. This is a topic for future work.

3.3 Re-ranking based on reading difficulty

To learn effective re-rankings and to explore the importance of features related to reading level, we use the LambdaMART [23] algorithm, a state-of-the-art ranking algorithm based on boosted regression trees. Compared with other ranking approaches, LambdaMART is typically more robust to sets of features with widely varying ranges of values, such as categorical features. Since LambdaMART produces a tree-based model, it can also be used as a feature selection algorithm or to rank features by their importance (Section 5.1.2). Based on the results of our analysis in Section 4, along with evidence on the effectiveness of specific features gathered by other authors in previous studies of expertise [22], we chose the following set of features for study.

Query features. These features rely only on the query string and include query length in characters and query length in space-delimited words.

Query/session features. If previous queries were present in a session, we estimate a dynamic reading level for a user by taking the average reading level of the clicked snippets from previous queries in the same user search session. Because of the sparse nature of clicks, we also compute a confidence value for this query level that increases with the sample size of clicked snippets. We also include a measure of the length of a session, in terms of the number of previous queries.

Snippet features. We compute the estimated reading difficulty of a page's snippet using the algorithm described in Section 3.1.1, and the Dale-Chall semantic variable from Section 3.1.2. We also include the relative difficulty of the snippet compared to the levels of the other top-ranked result snippets: snippets are sorted by descending reading level, and then the reciprocal rank of the snippet is computed with respect to that ranking (a sketch of this computation follows Table 2).

Page features. Using the same reading level prediction algorithm used for snippets, we compute reading difficulty for the body text of the Web page corresponding to a snippet. We also include a confidence feature for this prediction.

Snippet-page features. We compute the (signed) difference between the full page reading level and the snippet reading level.

Query-page features. These features capture the strength of a query-document match: the main signals here are the normalized ranker score from the search engine, and the reciprocal rank of a page in the top ten results.

Query-snippet features. We compute the absolute and signed differences between the estimated user reading level and the estimated snippet reading level.

Table 2 summarizes the set of features used for ranking.

Source           Feature Name                         Description
Query            query_char_len                       Query length (in characters)
Query            query_word_len                       Query length (in words)
Query (Session)  session_user_level                   Session-based user reading level estimate
Query (Session)  session_user_level_confidence        Confidence estimate for user reading level
Query (Session)  prev_queries_in_session              Number of previous queries in current search session
Snippet          snippet_difficulty                   Reading level of snippet
Snippet          relative_snippet_difficulty          Relative snippet difficulty in top 10 results
Snippet          dale_snippet_difficulty              Dale difficulty level
Page             full_page_difficulty                 Reading level of page body text
Page             full_page_difficulty_confidence      Confidence level for full-page reading level
Snippet-Page     snippet_page_level_difference        Difference between reading levels of snippet and full page
Snippet-Page     snippet_page_difference_confidence   Confidence level for snippet-page level difference
Query-Page       norm_production_score                Normalized ranker score for a page
Query-Page       reciprocal_rank_score                Reciprocal rank of page
Query-Snippet    snippet_query_diff                   Signed difference in reading level between query and snippet
Query-Snippet    snippet_query_diff_abs               Absolute difference in reading level between query and snippet

Table 2: The features used by LambdaMART re-ranking. Features that potentially make use of previous queries in a session are denoted (Session).
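As one concrete example of these features, the relative_snippet_difficulty feature from Table 2 can be computed as follows; this is a sketch under the assumption that the snippet levels for the top-ranked results are given as a list:

```python
# Hypothetical sketch of relative_snippet_difficulty: snippets are sorted by
# descending reading level, and each snippet's feature value is its reciprocal
# rank within that ordering.
def relative_snippet_difficulty(snippet_levels):
    """Map each result index to 1/rank in the descending-difficulty order."""
    order = sorted(range(len(snippet_levels)),
                   key=lambda i: snippet_levels[i], reverse=True)
    return {i: 1.0 / (rank + 1) for rank, i in enumerate(order)}

# Example: the hardest snippet gets 1.0, the next 0.5, and so on.
print(relative_snippet_difficulty([7.0, 10.0, 5.0]))  # {1: 1.0, 0: 0.5, 2: 0.33...}
```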

4. LARGE-SCALE QUERY LOG ANALYSIS

In this section we perform a summary analysis of search log data in order to contrast the properties of Kids and non-Kids users and data sources. We conjectured that Kids sessions and queries would exhibit differences, especially with respect to the reading level of preferred result snippets. Our analysis also estimates coverage for Kids queries to assess the likely impact of personalization in this area, and shows that having both snippet- and page-level reading level predictions is valuable.

4.1 Data set and evaluation methodology

The primary source of data for this study was a proprietary data set containing the anonymized logs of URLs visited by users who consented to provide interaction data through a widely-distributed browser plug-in. The data set contained browser-based logs with both searching and browsing episodes, from which we extract search-related data. These data provide us with examples of real-world search behavior that may be useful in understanding and modeling kids-related search. Log entries include a browser identifier, a timestamp for each page view, and the URL of the Web page visited. To remove variability caused by geographic and linguistic variation in search behavior, we only include log entries generated in the English-speaking United States locale. The results described in this paper are based on URL visits during the first week of October 2010, representing millions of Web page visits from hundreds of thousands of unique users. From these data we extracted search sessions from a major commercial Web search engine, using a session extraction methodology similar to [20]. Search sessions begin with a query, occur within the same browser and tab instance (to lessen the effect of any multi-tasking that users may perform), and terminate following 30 minutes of user inactivity.
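A simplified sketch of this sessionization logic (the production pipeline is proprietary, and the field names here are hypothetical):

```python
# Log entries are assumed to be dicts with 'browser_tab', 'timestamp' (seconds),
# and a precomputed 'is_query' flag marking search-result-page visits.
from collections import defaultdict

TIMEOUT = 30 * 60  # 30 minutes of inactivity ends a session

def extract_sessions(entries):
    sessions = []
    by_tab = defaultdict(list)
    for e in sorted(entries, key=lambda e: e["timestamp"]):
        by_tab[e["browser_tab"]].append(e)
    for tab_entries in by_tab.values():
        current, last_time = None, None
        for e in tab_entries:
            timed_out = last_time is not None and e["timestamp"] - last_time > TIMEOUT
            if e["is_query"] and (current is None or timed_out):
                current = [e]          # a session must begin with a query
                sessions.append(current)
            elif current is not None and not timed_out:
                current.append(e)      # continue the current session
            elif timed_out:
                current = None         # wait for the next query to start anew
            last_time = e["timestamp"]
    return sessions
```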

From these search sessions we extracted search queries, and for each query we obtained the top ten search results retrieved by the Web search engine, along with the titles and snippets for each result that were displayed on the search engine's result page at query time. We then estimated the grade level distribution for each of those results using the snippet text and the full text of the corresponding Web page, per the method described in Section 3.1.1.

In addition, we obtained binary relevance judgments for each result in the top 10 using a methodology similar to that in [9]. We define a satisfied (SAT) click in a similar way to previous work [19] (i.e., a click with either a post-click dwell time of at least 30 seconds, or that is the last SERP click in the session). Advantages of these log-based judgments are that many judgments can be easily gathered, and that they are personalized to the user and the query, which is important in the evaluation of personalized search algorithms. With these judgments, we define two evaluation scenarios: 'Last-SAT', which assigns a positive judgment to one of the top 10 URLs if it is the last satisfied SERP click in the session (by click time); and 'All-SAT', which assigns a positive judgment to any satisfied click in a session. (In both cases, the remaining top-ranked URLs receive a negative judgment.) While Last-SAT has been shown to be highly indicative of a user's goal [9], we also examine All-SAT since informational queries, which are more likely to have multiple relevant results, may exhibit different performance qualities.

For the Last-SAT case, the Mean Reciprocal Rank (MRR) of the positive judgment is used to evaluate retrieval performance before and after re-ranking. For All-SAT, we evaluate average precision on the top 10 results.
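These are the standard definitions of reciprocal rank and average precision; a minimal sketch, assuming each query's top-10 results are given as booleans marking positive (SAT) judgments in rank order. MRR and MAP are then the means of these per-query values.

```python
def reciprocal_rank(judgments):
    """1/rank of the first positive (Last-SAT) judgment, 0 if absent."""
    for rank, positive in enumerate(judgments, start=1):
        if positive:
            return 1.0 / rank
    return 0.0

def average_precision(judgments):
    """Average precision over all positive (All-SAT) judgments in the top 10."""
    hits, total = 0, 0.0
    for rank, positive in enumerate(judgments[:10], start=1):
        if positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```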

Queries for which we cannot assign a positive judgment to any top-10 URL are excluded from the feature set. We also excluded queries corresponding to twelve very-high-frequency navigational queries.³ Although there are benefits to including these queries, such as detecting engine-switching behavior [21], their highly predictable nature across users makes them less interesting for a personalization study, and removing them also greatly reduces data processing demands.

³facebook, google, myspace, gmail, bing, yahoo, yahoomail, craigslist, youtube, ebay, aol, hotmail.

After this filtering, 759,671 queries remained. Of these, 555,048 queries (73%) had no previous queries in the same session and 205,623 had at least one previous query, leaving 27% of queries potentially amenable to session-based improvements. This 27% is low, since previous work has found that 60% of search sessions contain multiple queries [20]. The difference may be explained in part by differences in the search behavior of users issuing Kids queries, and perhaps also by the removal of the small set of high-frequency navigational queries.

Figure 1: Log-odds of query length (in words) for Kids vs. All queries, showing that single-word queries and queries with eight words or more are more likely for Kids queries than general queries.

4.2 Query properties

We used our development set (Section 5.1) to extract 29,498 total queries and 19,601 unique queries having at least one click on a SERP result labeled with the Kids&Teens ODP category. We compared this set to the properties of the full development set, with no filtering, which we call the All queries set.

Many of the most frequent queries found in a recent AOL query log analysis by Torres et al. [17] also appeared highly ranked in our list in various forms, such as [nick jr], [starfall.com], [coloring pages], and [dora]. We found that the distribution of query lengths for Kids sessions was also similar to that of Torres et al., with longer queries being more likely on average, as shown in Figure 1, possibly due to a greater frequency of natural-language queries. Query lengths above the line (positive log-odds) are more likely for Kids queries, and those below the line are more likely for All queries. Our data also showed that single-word queries were more likely for Kids queries, as the result of a larger proportion of navigational queries.
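The log-odds in Figure 1 can be computed per length bucket; a minimal sketch, assuming the two query samples are available as lists of strings (lengths absent from either sample are skipped rather than smoothed):

```python
import math
from collections import Counter

def length_log_odds(kids_queries, all_queries):
    """Log-odds of each query length (in words) in Kids vs. All queries."""
    kids = Counter(len(q.split()) for q in kids_queries)
    alls = Counter(len(q.split()) for q in all_queries)
    nk, na = sum(kids.values()), sum(alls.values())
    return {
        length: math.log((kids[length] / nk) / (alls[length] / na))
        for length in sorted(set(kids) & set(alls))
    }
```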

4.3 Reading level distribution of snippets

Of all SERP clicks in our Kids sessions sample, 3.6% were satisfied clicks. The distribution of snippet reading levels for all SAT clicks for Kids and All queries is shown in Figure 2. The figure shows that the estimated reading difficulty of snippets associated with satisfied clicks is skewed toward lower grade levels for queries coming from Kids sessions, compared to all queries.

As an example of the connection between simple syntactic features within a page's URL string and the estimated reading level of its snippet, we extracted URLs containing the substring 'kids' and another set containing the substring 'physics'. Figure 3 shows that the two reading level distributions are very different: the 'physics' distribution is shifted toward a much higher level of difficulty, while the All distribution lies in an intermediate area between the two. Our comparison here assumes that documents about physics are of higher difficulty whereas the vast majority of kids documents are of lower difficulty; indeed, this is reflected in the actual distributions in Figure 3.

Figure 2: The estimated reading difficulty of snippets for pages with satisfied clicks, for Kids sessions and all sessions.

Figure 3: The estimated reading difficulty of snippets of pages with different substrings in their URL: 'kids' and 'physics' have sharply different difficulty distributions, with the All distribution lying in between.

Note that Figure 2 is computed from user clicks, while Figure 3 is computed from properties of URL strings, which are user-agnostic. Thus, Figure 2 shows whether users take reading difficulty into consideration in deciding whether they like a page. For example, it could be that for the query [physics] all users click on documents with reading level 10; this would be captured in the type of histogram shown in Figure 2 but not in Figure 3.

4.4 Snippet-page difficulty gap predicts average dwell time

To assess the importance of using both snippet and page representations of reading level, we examined the correlation between the difference in the predicted reading levels of a SERP snippet vs. the full text of the corresponding Web page, and the average user dwell time (in seconds) for that Web page. We found a strong relationship between these quantities: the more difficult the underlying page is compared to the clicked snippet for that page, the more likely it became that the user would be unsatisfied and leave that page quickly (e.g., spend less than 30 seconds reading it). This is summarized in Figure 4. The means and confidence intervals for each dwell time, conditioned on snippet-page reading level difference, are shown in Figure 4 and were computed by the bootstrap method [8], using 100 iterations over the search log data. Each iteration sampled N = 630,000 click instances with replacement. The squared Pearson correlation coefficient over all dwell times of 120 seconds or less was r² = 0.69: a strong signal about the behavior of a user population. This is an important finding with a number of implications. First, it provides clear evidence that modeling both snippet and page level provides a valuable generic relevance signal. As we show later in Section 5.1.2, the difference between snippet and page level does have high relative weight among effective ranking features. Second, it suggests that snippet quality could be improved by constraining a snippet's expected reading level to closely match the underlying page's reading level.

Figure 4: Average user dwell time is strongly predicted by the difference between the query-specific snippet reading difficulty and the underlying page reading difficulty. As actual page difficulty increases relative to the displayed snippet, users are more likely to abandon the page quickly. Error bars denote 95% confidence intervals.
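The bootstrap procedure just described is straightforward to sketch; the parameters mirror the ones reported above, but the data layout is assumed:

```python
# A minimal bootstrap sketch for a 95% confidence interval on mean dwell time
# within one snippet-page level-difference bucket. samples is assumed to be a
# list of dwell times in seconds; the paper used 100 iterations.
import random

def bootstrap_mean_ci(samples, iterations=100, alpha=0.05):
    means = sorted(
        sum(resample) / len(resample)
        for resample in (random.choices(samples, k=len(samples))  # with replacement
                         for _ in range(iterations))
    )
    lo = means[int((alpha / 2) * iterations)]
    hi = means[min(int((1 - alpha / 2) * iterations), iterations - 1)]
    return sum(samples) / len(samples), (lo, hi)
```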

4.5 Coverage for Kids-related queries

To understand what percentage of queries may be related to children's information needs, and thus might be candidates for reading-level-based personalization, we extracted all URLs from the Kids&Teens categories in the ODP and considered them to be children-friendly URLs, given the ODP editing guidelines for these categories.

We used a query-click bipartite graph from anonymized search data, containing a node for each query and each clicked URL, for one month of 2010 query traffic in the English-speaking United States locale from a large commercial search engine. Query nodes are connected via edges to the URLs users actually clicked on for those queries during a search session. We looked up the ODP children-friendly URLs in this graph and extracted the queries leading to clicks on these URLs. While it is impossible to know whether these queries were issued by users in the age range targeted by ODP with the Kids&Teens categories, we used this method to characterize such queries as having some degree of Kids&Teens intent. Using this technique, we estimated the coverage of Kids&Teens queries to be 13.62% of non-bot traffic. Summary statistics are shown in Table 3.

ODP Kids&Teens URLs                                 45,879
Unique queries (with clicks on these URLs)       1,360,341
Impressions from these queries                 286,174,932
Coverage                                            13.62%

Table 3: Kids&Teens-related traffic statistics (one month).

The dataset used to identify queries with Kids&Teens intent includes impressions collected from multiple revisions of the underlying Web search ranking algorithm, meaning that there was significant variance in the Web search results returned for the same queries. As a result, we consider the 13.62% coverage presented here to be a ceiling on the opportunity for personalizing the search experience for children. In addition, our earlier analysis (from Section 4.1) shows that at least 27% of those Kids&Teens queries might be amenable to additional session-based improvements.

5. EVALUATION

Our evaluation consists of three parts: re-ranking results, an analysis of relative feature importance, and a robustness evaluation.

5.1 Re-ranking performance

In this section we confirm the effectiveness of re-ranking Web search results using reading level features. We examine not only basic retrieval performance (Section 5.1.1) but also the relative importance of different features (Section 5.1.2), and the robustness of rank gains and losses (Section 5.1.3).

We partitioned our log data as described in Section 4 randomly into two sets: 25% was used as a development set for LambdaMART parameter optimization, and the remaining 75% was used for training/test splits using 10-fold cross-validation. To avoid bias toward longer sessions, these sets were further subsampled to one random query per session.

5.1.1 Basic retrieval performance

To account for the possibility of query sets with relevant documents biased lower in the ranking (such as difficult queries), we trained a learned rank-only baseline re-ranking model using two features: the commercial search engine ranker score, and the rank position of a document. This model always performs at least as well as the original commercial ranker score. As described in Section 4, our main evaluation measures are the change in Mean Reciprocal Rank (MRR) of the last satisfied click ('Last-SAT'), and the change in Mean Average Precision (MAP) using all satisfied clicks ('All-SAT'). In practice, because the vast majority of queries have only one click, there turns out to be almost no difference in performance between Last-SAT and All-SAT on this dataset, but we report both numbers for the full-query-set experiment for completeness. Because of the proprietary nature of our search engine system, we do not report absolute MRR, but instead report the relative change in MRR on a point scale from 0 to 100: (MRR_MODEL − MRR_BASE) · 100, where MRR_BASE is the learned rank-only baseline MRR for all queries. We note that the baseline ranking, representing a highly-tuned commercial search engine, is very competitive, and even a 1-point change in MRR, when statistically significant, is considered a notable gain.

Experiment               Last-SAT (MRR change)   All-SAT (MAP change)
Rank-only Baseline                0.0                     0.0
Query+Session                    +0.7*                   +0.6*
Query+Session+Page               +0.8*                   +0.8*
Query+Session+Snippet            +1.0*                   +0.9*
All Features                     +1.2*                   +1.1*

Table 4: Summary of the relative performance of different feature sets on the full query set, in terms of change in points of MRR (Last-SAT) and MAP (All-SAT). The Rank-only Baseline is used as the baseline for comparison and thus set to 0.0. Superscripts * and + denote significance of the change at p < 0.01 and p < 0.10, respectively, using a paired t-test.

Query subset   Num. queries   % Total   Baseline Rel. MRR   Model Rel. MRR   Gain
Kids                 15,796      4.4%               −4.1             −3.1   +1.0*
Science              23,059      6.8%               −9.0             −4.7   +4.3*
Sports               41,139     11.6%               +7.2             +8.2   +1.0*
Health               21,581      6.1%               −7.3             −7.3    0.0
All                 545,255      100%                0.0             +1.2   +1.2*

Table 5: Summary of gains in MRR for the Last-SAT click from re-ranking with reading-level features. To show the relative difficulty of different query sets, the Baseline Relative MRR change is computed relative to the Rank-only Baseline for all queries in the test set, so that the relative MRR for the 'All' query set is zero. Similarly, the Model Relative MRR change is also computed relative to the Rank-only Baseline for all queries. Gain is the difference between Model and Baseline. Superscripts * and + denote significance of the change at p < 0.01 and p < 0.10, respectively, using a paired t-test.

Table 4 compares the performance of the learned rank-only baseline to our model, for different feature sets. Across the full query set, there was a statistically significant +1.2 point MRR gain for Last-SAT clicks, and a +1.1 point MAP gain for All-SAT clicks. Starting with the Rank-only baseline, we added a basic set of Query+Session features that made no use of snippet or page-level reading difficulty predictions, which gave an MRR gain of +0.7. To compare the relative utility of page vs. snippet features, we then added either all snippet-based features or all page-based features to the basic Query+Session set. The snippet-based features gave a slightly larger gain (+1.0) compared to adding full-page features (+0.8). Finally, we achieved the best overall performance when both snippet and page features were used together with the other features.

To investigate the effect of re-ranking using reading difficulty features on queries of different topics, we chose the ODP Kids&Teens, Science, Sports, and Health categories. For each of these categories, we extracted individual queries having at least one click on a URL belonging to that ODP category.⁴ Table 5 shows gains and losses achieved by re-ranking for the Kids, Science, Sports, and Health subsets of queries.⁵ An increasingly negative (resp. positive) value for Baseline Relative MRR indicates a harder (resp. easier) retrieval task compared to the All queries scenario.

⁴For classifying queries in this way, the click did not have to be a satisfied click.
⁵Note that, according to our definition, a query can belong to multiple ODP categories depending on the user's clicked pages, so the query counts in Table 5 do not sum to the total number of all queries. Also, the total queries reported here are about 70% of the total queries reported in Section 4, due to the combined effects of query subsampling to reduce session-length bias and the 10-fold cross-validation split.

Across several useful subclasses of queries, re-ranking with reading level features gave consistent, statistically significant gains in MRR compared to the learned rank-only baseline. The most challenging query set in terms of low baseline (Science) had a rank-only relative baseline of −9.0 compared to the rank-only baseline of the full query set, while the 'easiest' subclass was Sports, whose rank-only baseline change was +7.2 over the full query set. After adding reading level features, our ranking model achieved net MRR point gains of +1.0 for Kids queries, +4.3 for Science queries, and +1.0 for Sports queries. Somewhat surprisingly, Health queries showed no gain; the reasons for this require further study.

It is encouraging to find a class of queries, like Science queries, that obtains a particularly large benefit from reading level features. The Science category contains a higher proportion of technical material than most other ODP categories,⁶ and thus search results for those queries might be expected to have higher reading difficulty entropy, leading to greater potential for personalization. Kids&Teens pages, in contrast, are typically already tailored for children and thus have less variation in reading level, which may explain the reduced effect of reading level features on those queries.

⁶For example, the average reading difficulty level estimated for snippets was 4.53 for Kids&Teens pages, 4.79 for Sports, and 5.34 for Science.

5.1.2 Relative effectiveness of ranking features

To understand the relative contribution of our query, session, snippet, and page-based reading difficulty features to re-ranking effectiveness, we used LambdaMART to obtain scores representing relative feature importance. These scores are computed as the average reduction in residual squared error when applying the given feature, averaged over all trees and all splits. The scores are then normalized relative to the most informative feature, which receives a score of 1.000. Table 6 lists the top-scoring features resulting from our main experiment using all features over all queries, averaged over 10 cross-validation splits.

Feature                                        Rel. weight
Reciprocal rank score                          1.000
Relative snippet difficulty                    0.295
Query length in characters                     0.237
Session-based user reading level               0.216
Snippet-page reading level difference          0.207
Dale snippet difficulty level                  0.186
Normalized ranker score                        0.183
Query-snippet reading level difference         0.142
Query length in words                          0.116
Snippet-page level difference confidence       0.081
Snippet reading level                          0.076
Page reading level                             0.048
Number of previous queries in session          0.030
Session-based user reading level confidence    0.019

Table 6: The relative importance of features computed by LambdaMART re-ranking for all queries. Feature importance in the right column is the average reduction in residual squared error over all trees and all splits, relative to the most informative feature.

Examining the top five features in this list, the highest-weighted feature, perhaps not surprisingly, was the reciprocal rank score of a page, reflecting the extreme bias toward top-ranked clicks that is typical of Web search results. Relative snippet difficulty was the second-most important feature (0.295), which matches what we have observed informally: when picking from a list of otherwise similar results, users tend to pick the snippet with the lowest reading difficulty. Query length in characters was more influential (0.237) than query length in words (0.116), although both contributed to prediction performance. The predictive power of query length is in accordance with the log-odds length distribution shown in Figure 1. Estimating the user's reading proficiency from their previous session queries proved to be another highly-ranked feature (0.216), showing the importance of user-specific personalization. As predicted by our analysis in Section 4.4, the reading level difference between a snippet and its corresponding Web page also carries valuable information, and this is reflected in its position among the top five features (0.207).

Among the remaining features, the Dale difficulty feature of the snippet, with a relative weight of 0.186, proved moderately effective as a complementary level prediction. Several base features, such as snippet-page level difference, have corresponding confidence features: one of these (for snippet-page level difference) did appear in the top 10 (0.081), while the others had much weaker feature scores.
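Tree-based importances of this kind can be reproduced today with open-source LambdaMART implementations; as an analogue (not the authors' system), the following sketch uses LightGBM's lambdarank objective and its gain-based importance, normalized to the top feature as in Table 6:

```python
# Illustrative analogue only: gain-based feature importance from a
# LambdaMART-style ranker. X is a feature matrix, y holds relevance labels,
# and group_sizes gives the number of results per query.
import lightgbm as lgb

def ranked_feature_importance(X, y, group_sizes, feature_names):
    ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
    ranker.fit(X, y, group=group_sizes)
    gains = ranker.booster_.feature_importance(importance_type="gain")
    rel = gains / max(gains.max(), 1e-12)  # normalize: top feature -> 1.000
    return sorted(zip(feature_names, rel), key=lambda t: -t[1])
```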

5.1.3 Robustness of re-ranking

While improving a retrieval metric's average gain across queries is important, in the case of re-ranking an existing set of results, the variance of relative gains and losses compared to the initial ranking is also critical to measure. An algorithm may improve average performance, but also increase the number of queries dramatically hurt by re-ranking.

Figure 5 shows the distribution of gains and losses acrossall queries for the re-ranking model used in Section 3.3, asmeasured by the change in rank position of the last satis-fied click. A robust algorithm is one that is able to achievegood on-average results, while having a minimal failure pro-file – as measured by a statistic like the number of querieshurt by re-ranking (the left half of histogram in Figure 5).Out of a total of 545,245 queries, 450,921 queries (82.7%)had no change; 51,759 (9.4%) were helped (rank of the lastSAT click increased at least one position); and 42,565 (7.8%)were hurt (rank of the last SAT click decreased by at leastone rank position). While more work is needed on methodsto reduce the loss side of the histogram, the multi-position

Figure 5: Histogram showing the variance of losses (left tail) and gains (right tail) using re-ranking by reading level features on All queries. The loss or gain in rank position of the last satisfied click is given on the x-axis. The y-axis denotes the number of queries (note the log scale).

While more work is needed on methods to reduce the loss side of the histogram, the multi-position gains we obtain are encouraging. In particular, for changes of 6 or more rank positions, we achieve a ratio of queries helped to queries hurt of greater than 2 to 1.
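The helped/hurt statistics above follow directly from per-query changes in the rank of the last SAT click. A minimal sketch (the input array and the sign convention, positive meaning the SAT click moved up, are illustrative assumptions, not our logging format):

    import numpy as np

    def robustness_profile(deltas: np.ndarray) -> dict:
        """Summarize re-ranking robustness from per-query rank changes.

        deltas[i] > 0: last SAT click for query i moved up (helped);
        deltas[i] < 0: it moved down (hurt); 0: no change.
        """
        n = len(deltas)
        big_losses = max((deltas <= -6).sum(), 1)  # avoid divide-by-zero
        return {
            "no_change": (deltas == 0).sum() / n,
            "helped":    (deltas > 0).sum() / n,
            "hurt":      (deltas < 0).sum() / n,
            # Ratio of large wins to large losses (6+ positions).
            "helped_to_hurt_6plus": (deltas >= 6).sum() / big_losses,
        }

    print(robustness_profile(np.array([0, 0, 3, -1, 7, 0, -6, 9])))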

6. DISCUSSION

Personalization by reading level can be considered somewhat orthogonal, but complementary, to other dimensions of personalization such as location, domain expertise, or topical interest. Our study is, to our knowledge, the first to evaluate personalization by reading level as an aspect of large-scale Web search. From our work it is clear that some problems, such as automatically estimating a reliable user profile of reading proficiency, and finding more accurate or effective features for re-ranking, are non-trivial and require further research. Furthermore, while our study included some features defined over a single user session, we believe learning and applying a longer-term user history could also be quite valuable.

Many improvements and interesting directions remain to help users find and understand material appropriate to their needs and reading level. For example, while personalization seeks to adapt the content to the user, we can also consider the reverse goal: adapting the user to the content. By this we mean applying models of reading level and vocabulary difficulty to identify learning opportunities that would help reduce the gap between the user's reading proficiency and document reading level. For example, a search engine might identify critical 'words to learn' on a topic, such as bronchitis for coughs in a home medical care page. A similar idea could be used to help non-native language learners.
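A minimal sketch of the 'words to learn' idea follows, assuming a hypothetical lexicon of per-word acquisition grade levels (the lexicon entries, the margin parameter, and the example sentence are illustrative, not output of our models):

    import re

    # Hypothetical lexicon: word -> estimated grade level of acquisition.
    WORD_LEVELS = {"cough": 2.0, "medicine": 3.5, "bronchitis": 9.0, "acute": 8.0}

    def words_to_learn(page_text: str, user_level: float, margin: float = 1.0):
        """Return words whose estimated level exceeds the user's by > margin."""
        tokens = set(re.findall(r"[a-z]+", page_text.lower()))
        return sorted(
            (w for w in tokens if WORD_LEVELS.get(w, 0.0) > user_level + margin),
            key=lambda w: -WORD_LEVELS[w],
        )

    print(words_to_learn("Acute bronchitis often follows a cough.", user_level=5.0))
    # -> ['bronchitis', 'acute']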

The role of reading level features in improving query and document representations is also a rich area for further study. As our findings on dwell time in Section 4.4 suggest, identifying differences in vocabulary or reading level distribution between the different representation streams of a Web page, such as anchor text and captions, and the underlying page itself could help identify problems with snippet or document quality, or even distinctions in usage between different user groups. The variation of dwell times across different ages or reading proficiency levels would also be interesting to investigate further. When processing a likely Kids query, the search engine could provide more child-appropriate snippets or query suggestions for that query. Since the reading level distribution of a page can be pre-computed and stored in the index, improvements in reranking that require reading level could be applied quickly and reliably. We also believe that the interaction of reading level with topic is important and intend to explore this combination in future work.
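As an illustrative sketch of comparing level distributions between representation streams, assuming each stream's reading level is stored as a distribution over grade levels, one could flag large divergences between a snippet stream and its page (the symmetric KL measure and the example distributions below are assumed choices, not a method prescribed above):

    import numpy as np

    def sym_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
        """Symmetric KL divergence between two reading level distributions."""
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    # Distributions over American grade levels 1..12 (hypothetical values).
    snippet_levels = np.array([0, .05, .15, .30, .25, .15, .05, .05, 0, 0, 0, 0])
    page_levels    = np.array([0, 0, .02, .05, .08, .15, .20, .20, .15, .10, .05, 0])

    # A large divergence may signal a misleading snippet or caption.
    print(f"snippet-page level divergence: {sym_kl(snippet_levels, page_levels):.3f}")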

In general, reading difficulty level can serve as a valuable contextual signal to improve the ranking of documents presented to individual users during a search session. Document relevance may be improved using the language models in this paper because they can characterize user intent based on vocabulary usage during and across search sessions. Search engines could further leverage these models to personalize all aspects of the user experience, including captions, ads, images, videos, or even pure presentation features such as font size, site previews, and page composition. While the actual impact of such personalization efforts on overall user satisfaction remains a point for further investigation, reading level modeling may provide a powerful tool to bridge the vocabulary gap between search engines and their users.
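A minimal sketch of reading level as a contextual ranking signal, assuming a generic base ranker score and scalar user and document level estimates (the absolute-gap penalty, its weight, and the example sites are illustrative, not the trained combination used in our experiments):

    def personalized_score(base_score: float, user_level: float,
                           doc_level: float, lam: float = 0.1) -> float:
        # Penalize base relevance by the user-document reading level gap.
        return base_score - lam * abs(user_level - doc_level)

    # Re-rank a hypothetical result list for a user at grade level 5.
    results = [("research.example", 2.1, 11.0),
               ("intro.example",    1.8,  5.0),
               ("kids.example",     1.7,  4.0)]
    reranked = sorted(results,
                      key=lambda r: personalized_score(r[1], 5.0, r[2]),
                      reverse=True)
    print([name for name, _, _ in reranked])  # intro, kids, research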

7. CONCLUSION

We have shown how incorporating reading level features for users and documents can provide a valuable new signal for relevance in Web search. We explored three key problems in improving relevance for search using reading difficulty: estimating models of user reading proficiency, estimating models of result difficulty, and combining relevance and difficulty signals to re-rank based on the difference between user and result reading level. We also provided a large-scale analysis of log data to characterize certain aspects of user behavior and classes of features and queries that were likely to be effective in personalization using reading difficulty predictions. Our results show that statistically significant gains may be obtained with a commercial search engine, even for general queries, by incorporating reading difficulty features. Furthermore, we found specific sub-classes of queries, such as science-oriented queries, that are particularly amenable to improvement. Our work could easily be generalized to model domain expertise in specific subject areas, such as those defined in the Open Directory Project. For example, Web search results could be ranked with the introductory material first, followed by increasingly technical material. Other advances such as level-appropriate query suggestions, result snippets, and site recommendations are also possible.

Acknowledgements

We thank Dan Liebling and Misha Bilenko for technical assistance, and Sue Dumais, Jaime Teevan, and the anonymous reviewers for their valuable suggestions.

8. REFERENCES

[1] P. N. Bennett, K. Svore, and S. T. Dumais. Classification-enhanced ranking. In Proc. of WWW 2010, 111–120.

[2] D. Bilal. Children's use of the Yahooligans! Web search engine: cognitive, physical, and affective behaviors on fact-based search tasks. J. Am. Soc. Inf. Sci., 51(7):646–665, 2000.

[3] J. Chall and E. Dale. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books, 1995.

[4] C. L. Clarke, E. Agichtein, S. Dumais, and R. W. White. The influence of caption features on clickthrough patterns in web search. In Proc. of SIGIR 2007, 135–142.

[5] K. Collins-Thompson and J. P. Callan. A language modeling approach to predicting reading difficulty. In Proc. of HLT-NAACL 2004, 193–200.

[6] A. Druin, E. Foss, H. Hutchinson, E. Golub, and L. Hatley. Children's roles using keyword search interfaces at home. In Proc. of CHI 2010, 413–422.

[7] C. Eickhoff, P. Serdyukov, and A. de Vries. A combined topical/non-topical approach to identifying web sites for children. In Proc. of WSDM 2011, 505–514.

[8] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, New York, 1994.

[9] J. Gao, W. Yuan, X. Li, K. Deng, and J.-Y. Nie. Smoothing clickthrough data for web search ranking. In Proc. of SIGIR 2009.

[10] K. Gyllstrom and M.-F. Moens. Wisdom of the ages: toward delivering the children's web with the link-based AgeRank algorithm. In Proc. of CIKM 2008, 159–168.

[11] S. Hirsh. Children's relevance criteria and information seeking on electronic resources. JASIST, 50(14):1265–1283, 1999.

[12] P. Kidwell, G. Lebanon, and K. Collins-Thompson. Statistical estimation of word acquisition with application to readability prediction. In Proc. of EMNLP 2009.

[13] G. Kumaran, R. Jones, and O. Madani. Biasing web search results for topic familiarity. In Proc. of CIKM 2005, 271–272.

[14] X. Liu, W. B. Croft, P. Oh, and D. Hart. Automatic recognition of reading levels from user queries. In Proc. of SIGIR 2004, 548–549.

[15] PuppyIR. PuppyIR: An open source environment to construct information services for children. 2011. http://www.puppyir.eu/.

[16] J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing search via automated analysis of interests and activities. In Proc. of SIGIR 2005, 449–456.

[17] S. D. Torres, D. Hiemstra, and P. Serdyukov. An analysis of queries intended to search information for children. In Proc. of IIiX 2010, 235–244.

[18] M. van Kalsbeek, J. de Wit, D. Trieschnigg, P. van der Vet, T. Huibers, and D. Hiemstra. Automatic reformulation of children's search queries. Technical Report TR-CTIT-10-23, June 2010.

[19] K. Wang, T. Walker, and Z. Zheng. Estimating relevance ranking quality from web search clickthrough data. In Proc. of SIGKDD 2009, 1355–1364.

[20] R. W. White, P. N. Bennett, and S. T. Dumais. Predicting short-term interests using activity-based search context. In Proc. of CIKM 2010, 1009–1018.

[21] R. W. White and S. T. Dumais. Characterizing and predicting search engine switching behavior. In Proc. of CIKM 2009, 87–96.

[22] R. W. White, S. T. Dumais, and J. Teevan. Characterizing the influence of domain expertise on web search behavior. In Proc. of WSDM 2009, 132–141.

[23] Q. Wu, C. J. C. Burges, K. M. Svore, and J. Gao. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254–270, 2010.

[24] C. X. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In Proc. of SIGIR 2003, 10–17.