Language Identification of Search Engine Queries
Hakan Ceylan, Department of Computer Science, University of North Texas, Denton, TX 76203
Yookyung Kim, Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA
ACL 2009
Outline
• Introduction
• Data Generation
• Language Identification
• Conclusions and Future Work
Introduction(1)
• Language identification: deciding in which language a given text is written
• The problem is heavily studied
• It is of critical importance to search engines when applied to queries
• Challenge: the lack of any standard or publicly available data set
Introduction(2)
• In some cases, a correct identification of the language is not necessary.
  Example: the query "homo sapiens", entered by a user from Spain.
  Solution: add a non-linguistic feature to the system.
Introduction(3)
Data Generation(1)
• Data set: constructed from queries with clicked URLs
  Source: the Yahoo! Search Engine, for each language
  Time span: a three-month period
Data Generation(2)
• Preprocessing:
  • Remove any numbers, special characters, or extra spaces.
  • Lowercase all the letters of the queries.
  • Calculate the frequencies of the URLs for each query.
• A web page is 474 words long on average
• Identify the language of each web page using one of the existing methods.
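The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `normalize_query` and `url_frequencies` and the `(query, clicked_url)` log format are hypothetical, and "special characters" is interpreted here as anything that is not a Unicode letter or whitespace.

```python
import re
from collections import Counter

# Assumption: "special characters" means punctuation, symbols, digits and
# the underscore, i.e. everything that is neither a letter nor whitespace.
_NON_LETTER = re.compile(r"[^\w\s]|[\d_]")


def normalize_query(query: str) -> str:
    """Drop numbers and special characters, collapse spaces, lowercase."""
    cleaned = _NON_LETTER.sub(" ", query)
    return " ".join(cleaned.split()).lower()


def url_frequencies(click_log):
    """Count clicks per URL for each normalized query.

    click_log is a hypothetical iterable of (query, clicked_url) pairs.
    """
    freqs = {}
    for query, url in click_log:
        freqs.setdefault(normalize_query(query), Counter())[url] += 1
    return freqs
```

Queries that differ only in case, digits, or punctuation collapse to the same key, so their click counts are pooled.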
Data Generation(3)
• Two tables, Table 1 (T1) and Table 2 (T2), store the above information:
  T1: [q, u, f_u]
  T2: [u, l]
  q: a query
  u: a unique URL
  f_u: the frequency of u
  l: the language identified for u
• T1 and T2 are combined into T3:
  T3: [q, l, f_l, c_{u,l}]
  l: a language
  f_l: the count of clicks for l
  c_{u,l}: the count of unique URLs in language l
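As a rough sketch of how T1 and T2 could be combined into T3, the Python function below joins the two tables on the URL and aggregates per (query, language) pair. The in-memory representation (a list of tuples for T1, a dict for T2) and the name `build_t3` are assumptions made for illustration only.

```python
from collections import defaultdict


def build_t3(t1, t2):
    """Join T1 rows (q, u, f_u) with T2 (a dict u -> l) into T3 rows.

    Returns a sorted list of (q, l, f_l, c_ul) tuples, where f_l is the
    click count for language l and c_ul the number of unique URLs in l.
    """
    clicks = defaultdict(int)
    urls = defaultdict(set)
    for q, u, f_u in t1:
        lang = t2.get(u)
        if lang is None:  # URL without an identified language: skip it
            continue
        clicks[(q, lang)] += f_u
        urls[(q, lang)].add(u)
    return [(q, lang, clicks[(q, lang)], len(urls[(q, lang)]))
            for (q, lang) in sorted(clicks)]
```

For example, a query whose clicks fall on two English URLs and one German URL yields two T3 rows, one per language.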
Data Generation(4)
• The data contains a lot of noise.
  1. A query maps to more than one language.
     Solution: assign a weight w_{q,l} to each query for a language and set a threshold parameter W; if w_{q,l} < W, remove the query.
  2. Navigational queries.
     Example: "ACL 2009"
Data Generation(5)
Solution: set two threshold parameters, F and U; if F_q > F or U_q < U, remove the query.
• Algorithm
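The two noise filters could be sketched as below. The slides do not define w_{q,l}, F_q, or U_q, so this sketch assumes w_{q,l} is the share of a query's clicks that fall on language l, F_q is the query's total click count, and U_q is its number of unique clicked URLs. The removal condition follows the wording on the slide, and the default thresholds are the values the slides report (W = 1, F = 50, U = 5). The name `filter_queries` is hypothetical.

```python
def filter_queries(t3, W=1.0, F=50, U=5):
    """Keep queries that pass both filters; return {query: language}.

    t3 is a list of (q, l, f_l, c_ul) rows.
    Assumptions (not stated on the slides):
      w_{q,l} = f_l / F_q, F_q = total clicks, U_q = total unique URLs.
    """
    per_query = {}
    for q, lang, f_l, c_ul in t3:
        per_query.setdefault(q, []).append((lang, f_l, c_ul))

    kept = {}
    for q, rows in per_query.items():
        F_q = sum(f_l for _, f_l, _ in rows)
        U_q = sum(c_ul for _, _, c_ul in rows)
        if F_q == 0:
            continue
        lang, f_best, _ = max(rows, key=lambda row: row[1])
        if f_best / F_q < W:       # query maps to more than one language
            continue
        if F_q > F or U_q < U:     # condition as worded on the slide
            continue
        kept[q] = lang
    return kept
```

With W = 1, a query survives the first filter only if all of its clicks land on pages in a single language.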
Data Generation(6)
• How to tune the parameters: the values depend on the size of the data set (Silverstein et al., 1999)
  W = 1, F = 50, U = 5
• How many queries are filtered out: 5% to 10% of the queries
• 500 randomly picked queries are annotated by humans:
  Category-1: the query does not contain any foreign terms.
  Category-2: there are some foreign terms, but the query would still be expected to bring web pages in the same language.
  Category-3: the query belongs to another language, or all the terms are foreign to the annotator.
Data Generation(7)
• How much of this multi-linguality does the parameter selection eliminate?
  Result: Category-1: 47.6%; Category-1+2: 60.2%
Language Identification(1)
• Implement three models, each using a different existing feature:
  1. Statistical model
  2. Knowledge-based model
  3. Morphological model
• Training corpora: EuroParl
• Combine all three models in a machine learning framework using a novel approach
• Add a non-linguistic feature
Language Identification(2)
• Test set: 3500 human-annotated queries
Statistical model
• Character-based n-gram features (n = 1 to 7)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
• The above can be done with the SRILM Toolkit, using Kneser-Ney discounting and interpolation
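To make the idea of a character n-gram model concrete, here is a minimal Python sketch. It is not the SRILM setup named above: it swaps Kneser-Ney discounting and interpolation for simple add-one smoothing, sums log-probabilities over all orders instead of interpolating them, and is meant for toy corpora rather than EuroParl. All function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Character n-grams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpora, max_n=3):
    """corpora: {language: training text}. Returns (counts, vocab_sizes)."""
    counts = {
        lang: [Counter(char_ngrams(text, n)) for n in range(1, max_n + 1)]
        for lang, text in corpora.items()
    }
    # Number of distinct n-grams of each order, shared across languages.
    vocab_sizes = [
        len(set().union(*(per_order[i] for per_order in counts.values())))
        for i in range(max_n)
    ]
    return counts, vocab_sizes


def log_score(per_order, vocab_sizes, query):
    """Sum of add-one smoothed log-probabilities over all n-gram orders."""
    total = 0.0
    for n, (ngram_counts, v) in enumerate(zip(per_order, vocab_sizes), start=1):
        # The extra +1 reserves probability mass for unseen n-grams.
        denom = sum(ngram_counts.values()) + v + 1
        for gram in char_ngrams(query, n):
            total += math.log((ngram_counts[gram] + 1) / denom)
    return total


def identify(counts, vocab_sizes, query):
    """Return the language whose model gives the query the highest score."""
    return max(counts, key=lambda lang: log_score(counts[lang], vocab_sizes, query))
```

A call such as `identify(*train({"en": english_text, "de": german_text}), "weather tomorrow")` returns the best-scoring language key; the two training texts here are placeholders for whatever corpora are available.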
Knowledge based model
• Word-based n-gram features (n = 1)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
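A word-unigram model of this kind can be illustrated as follows. The sketch assumes add-one smoothing and a uniform prior over languages, neither of which is stated on the slide, and the function names are hypothetical. It returns a normalized distribution over languages rather than a single label.

```python
from collections import Counter


def train_word_models(corpora):
    """corpora: {language: training text}. Returns (counts, vocab_size)."""
    counts = {lang: Counter(text.lower().split())
              for lang, text in corpora.items()}
    vocab_size = len(set().union(*counts.values()))
    return counts, vocab_size


def language_distribution(counts, vocab_size, query):
    """Probability of each language given the query's words.

    Assumptions: add-one smoothing, uniform prior over languages.
    """
    likelihoods = {}
    for lang, word_counts in counts.items():
        denom = sum(word_counts.values()) + vocab_size + 1
        p = 1.0
        for word in query.lower().split():
            p *= (word_counts[word] + 1) / denom
        likelihoods[lang] = p
    norm = sum(likelihoods.values())
    return {lang: p / norm for lang, p in likelihoods.items()}
```

How peaked this distribution is gives a rough sense of how confident the model is in its top choice.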
Morphological model
• Gather the affix information from the corpora in an unsupervised fashion (Harald Hammarström, 2006)
• Assign a score to each affix
Language Identification(3)
• Performance
Decision tree classification
• Each model can complement the others in certain cases
• Training data: the automatically annotated data set
• Features: confidence scores
• Use the kurtosis measure
Decision tree classification
• An example: for the query "the sovereign individual", the statistical model identifies the language as English. Since k = 7.6 > 6.43 = (4.47 + 1.96), this query's confidence score is "en-HIGH".
• The DT classifier is implemented with the Weka Machine Learning Toolkit (Witten and Frank, 2005)
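The confidence feature could be computed along these lines. The sketch assumes that k is the kurtosis of a model's probability distribution over languages and that the threshold is the sum of the two values shown in the example (4.47 + 1.96). It uses the plain fourth-moment (Pearson) definition of kurtosis, which may differ from the one used by the authors, and the "LOW" label is an assumption since the slide only shows "HIGH".

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment over squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def confidence_label(lang, k, threshold):
    """Discretize a kurtosis value into a label such as 'en-HIGH'.

    The 'LOW' label is a hypothetical counterpart to 'HIGH'.
    """
    return f"{lang}-{'HIGH' if k > threshold else 'LOW'}"


# Values taken from the example on the slide.
label = confidence_label("en", 7.6, 4.47 + 1.96)
```

A distribution concentrated on one language has a higher kurtosis than a flat one, which is why it can serve as a confidence signal for the decision tree.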
Decision tree classification
• The decision tree classifier outperforms all the individual models for each size on average
Decision tree classification
M_{l_i,l_j}: language l_i misclassified by the system as l_j
non-linguistic feature
• The non-linguistic feature is the language information of the country
• It helps the search engine guess the language
  Example: the query "how to tape for plantar fasciits" (labelled as Category-2) is classified as a Portuguese query
non-linguistic feature
• Increase test set size to 430 queries
Conclusions
• A completely automated method to generate a reliable data set
• Built a decision tree classifier that improves the results on average
• Built a second classifier that takes into account the geographical information of the users
Future Work
• Improve the accuracy of data generation
• Examine the parameter values more carefully
• Extend the number of languages in the data set
• Consider other alternatives to the decision tree framework