The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis Alexandros Ntoulas 1,2 Gerald Chao 1 Junghoo Cho 2 1 Infocious Inc. {ntoulas, gerald}@infocious.com 2 University of California Los Angeles {ntoulas, cho}@cs.ucla.edu


The Infocious Web Search Engine:

Improving Web Searching Through Linguistic Analysis

Alexandros Ntoulas (1,2), Gerald Chao (1), Junghoo Cho (2)

(1) Infocious Inc., {ntoulas, gerald}@infocious.com

(2) University of California Los Angeles, {ntoulas, cho}@cs.ucla.edu


April 10, 2023 WWW 2005 Chiba Japan

Motivation

Current Web search engines identify relevant pages based on keyword matching

Example: jaguar

Jaguar Cars
Official worldwide web site of Jaguar Cars.
www.jaguar.com/


Motivation

Is keyword matching enough?

• Natural languages are inherently ambiguous

• Example: jaguar
  • The car brand?
  • Apple Mac OS X 10.2?
  • The animal?
  • Chemical software?
  • …


The Infocious Web Search Engine

Uses Language Analysis techniques to:

• Resolve ambiguities inside Web pages

• Rank the Web pages based on the coherence (quality) of the text

• Help users organize the results in intuitive ways through categorization

• Provide suggestions for query refinement


What is different about Infocious?

• Search engines today do not apply Language Analysis to the level Infocious does

• It is not simply a matter of applying existing algorithms: optimizations are needed for Web scale

• Features made possible only through language analysis

• Makes Language Analysis features intuitive (yet powerful) for the user


Architecture


Architecture

Crawler

• Follows links to discover Web pages

• Refreshes changed pages using sampling [VLDB’02]

• Can download pages from the Hidden Web [JCDL’05]


Architecture

Linguistic Processing

• Resolves language ambiguities [COLING’02]

• Annotates Web pages

• Extracts concepts

• Extracts named entities

• Operates at crawl speed


Linguistic Processing: Disambiguation

• Part-of-speech (POS) tagging

• Example: house plants

  ... most/Adj house/Noun plants/Noun are/Verb hybrids/Noun of/Prep plant/Noun species/Noun

  ... garden/Noun built/VerbD to/Inf house/Verb our/PronP most/Adv valuable/Adj plants/Noun ...

• Done probabilistically: given a sentence S and a set of tags T, find

  $T_{best}(S) = \arg\max_{T} P(T \mid S)$
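One common way to compute an arg max over tag sequences is Viterbi decoding over a hidden Markov model. The sketch below is only an illustration of that idea: the slides do not say which model Infocious uses, and the tag set, the probability tables, and the function names are all made up for this example. It maximizes the joint probability P(T, S), which selects the same tag sequence as maximizing P(T | S) because P(S) is constant for a given sentence.

```python
# Toy HMM parameters. Every number here is invented for illustration only.
TAGS = ["Noun", "Verb", "Inf"]
START = {"Noun": 0.5, "Verb": 0.2, "Inf": 0.3}
TRANS = {
    "Noun": {"Noun": 0.6, "Verb": 0.3, "Inf": 0.1},
    "Verb": {"Noun": 0.7, "Verb": 0.1, "Inf": 0.2},
    "Inf":  {"Noun": 0.1, "Verb": 0.8, "Inf": 0.1},
}
EMIT = {
    "Noun": {"house": 0.5, "plants": 0.5},
    "Verb": {"house": 0.6, "plants": 0.4},
    "Inf":  {"to": 1.0},
}
UNSEEN = 1e-6  # small probability for (tag, word) pairs missing from EMIT


def emit(tag, word):
    return EMIT[tag].get(word, UNSEEN)


def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # Each column maps tag -> (probability of the best path ending in tag, previous tag).
    columns = [{tag: (START[tag] * emit(tag, words[0]), None) for tag in TAGS}]
    for word in words[1:]:
        last = columns[-1]
        column = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: last[p][0] * TRANS[p][tag])
            column[tag] = (last[prev][0] * TRANS[prev][tag] * emit(tag, word), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["house", "plants"]))        # "house" tagged as a noun
print(viterbi(["to", "house", "plants"]))  # "house" tagged as a verb
```

With these toy tables the same word "house" receives a different tag depending on its context, which is the behaviour the two example sentences illustrate.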


Linguistic Processing: Disambiguation

• POS information stored inside the index

• User can manually specify POS at query time (or click on examples)

Query: N:house N:plants

GreenPatio.Com – Tips for buying house plants.
Why keep natural indoor plants.... Tips for buying house plants. Care for indoor plants....
www.greenpatio.com/tips.shtml

Low Light Plants for the House
Is a common name for plants in the species Dieffenbachia.... As with most house plants …
www.plantsgalore.com/articles/houseplants/houseplants-low-light-plantfacts.htm


Linguistic Processing: Disambiguation

• POS information stored inside the index

• User can manually specify POS at query time (or click on examples)

Query: V:house N:plants

Over Wintering Bonsai
… One method is to build a cold frame to house your plants in the winter. ...
www.evergreengardenworks.com/overwint.htm

Keeping Your Sunroom Cozy
… And if you want to house a hot tub or plants, think about enclosing the …
doityourself.com/sunroom/sunroomcozy.htm
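The N: and V: query prefixes above suggest an index whose keys carry the part of speech alongside the word. The following is a minimal sketch of that idea, not the actual Infocious index layout: the (word, POS) key format, the function names, and the two tiny documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(tagged_docs):
    """Map (word, pos) and (word, None) keys to the set of documents containing them."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, pos in tokens:
            index[(word.lower(), pos)].add(doc_id)
            index[(word.lower(), None)].add(doc_id)  # supports queries without a POS prefix
    return index


def parse_query(query):
    """Turn 'N:house plants' into [('house', 'N'), ('plants', None)]."""
    terms = []
    for token in query.split():
        if ":" in token:
            pos, word = token.split(":", 1)
            terms.append((word.lower(), pos))
        else:
            terms.append((token.lower(), None))
    return terms


def search(index, query):
    """AND-semantics: a document must match every query term."""
    postings = [index.get(term, set()) for term in parse_query(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical, hand-tagged snippets modelled on the two result pages above.
docs = {
    "greenpatio": [("tips", "N"), ("for", "P"), ("buying", "V"),
                   ("house", "N"), ("plants", "N")],
    "bonsai": [("frame", "N"), ("to", "I"), ("house", "V"),
               ("your", "PR"), ("plants", "N")],
}
index = build_index(docs)
print(search(index, "N:house N:plants"))
print(search(index, "V:house N:plants"))
print(search(index, "house plants"))
```

Storing an untagged key next to each tagged key keeps plain keyword queries working while letting a POS-qualified query narrow the result set.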


Linguistic Processing: Disambiguation

• Word-sense disambiguation

• Previous example: jaguar

• Approach through Web page categorization
  • Use the categories of DMOZ (~600,000)
  • Given a set of categories C and a page d, find $\max_{c \in C} P(c \mid d)$

• In Infocious a page may belong to multiple categories
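To make the P(c | d) formulation concrete, here is a small sketch that scores a page against a few categories with a multinomial Naive Bayes model and keeps every category whose posterior clears a threshold, so a page can land in more than one category. The training snippets, the uniform prior, the threshold value, and all function names are assumptions for illustration; the slides state elsewhere that Infocious uses a different, improved classifier.

```python
import math
from collections import Counter

# Hypothetical training text per category (three DMOZ-style labels only).
TRAINING = {
    "Recreation/Autos": ["jaguar car dealer engine", "car engine repair"],
    "Computers": ["jaguar mac os software", "apple mac software update"],
    "Science/Biology": ["jaguar cat animal habitat", "animal habitat rainforest"],
}


def train(training):
    counts = {c: Counter(w for doc in docs for w in doc.split())
              for c, docs in training.items()}
    vocab = {w for wc in counts.values() for w in wc}
    return counts, vocab


def posteriors(page, counts, vocab):
    """Return P(c | d) for every category, using add-one smoothing."""
    log_scores = {}
    for c, wc in counts.items():
        total = sum(wc.values())
        score = math.log(1.0 / len(counts))  # uniform prior (an assumption)
        for w in page.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((wc[w] + 1) / (total + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def categorize(page, counts, vocab, threshold=0.3):
    """All categories with posterior >= threshold, most probable first."""
    post = posteriors(page, counts, vocab)
    keep = [c for c, p in post.items() if p >= threshold]
    return sorted(keep, key=lambda c: -post[c])


counts, vocab = train(TRAINING)
print(categorize("jaguar mac software", counts, vocab))
print(categorize("jaguar car software", counts, vocab))
```

The first page is assigned to a single category, while the second clears the threshold for two, which mirrors the multi-category behaviour mentioned on the slide.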


Categorization

• The category of a result is highlighted onMouseOver()

• Allow users to restrict search within a category: jaguar cat:Computers

• Can also be done by clicking on a category

Jaguar Cars
Official worldwide site of jaguar cars
www.jaguar.com/

Apple Mac OS X
The Apple Mac OS Product page
www.apple.com/macosx/

Categories shown: Computers, Recreation/Autos
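A query such as jaguar cat:Computers can be read as ordinary keywords plus a category filter. The sketch below shows one way to parse and apply such a filter; the page records, their text, and the prefix-matching rule for subcategories are assumptions for illustration, not a description of how Infocious implements the cat: operator.

```python
def parse_query(query):
    """Split 'jaguar cat:Computers' into (['jaguar'], ['Computers'])."""
    keywords, categories = [], []
    for token in query.split():
        if token.startswith("cat:"):
            categories.append(token[len("cat:"):])
        else:
            keywords.append(token.lower())
    return keywords, categories


def in_category(page_categories, wanted):
    # A page matches the category itself or any of its subcategories.
    return any(c == wanted or c.startswith(wanted + "/") for c in page_categories)


def search(pages, query):
    keywords, categories = parse_query(query)
    hits = []
    for page in pages:
        words = page["text"].lower().split()
        if all(k in words for k in keywords) and \
           all(in_category(page["categories"], c) for c in categories):
            hits.append(page["url"])
    return hits


# Hypothetical page records; the text fields are invented for this example.
pages = [
    {"url": "www.jaguar.com/", "text": "Official worldwide site of Jaguar cars",
     "categories": ["Recreation/Autos"]},
    {"url": "www.apple.com/macosx/", "text": "Apple Mac OS X Jaguar product page",
     "categories": ["Computers"]},
]
print(search(pages, "jaguar"))
print(search(pages, "jaguar cat:Computers"))
print(search(pages, "jaguar cat:Recreation"))
```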


Linguistic Processing: Concept Extraction

• More accurate phrase identification

• Identify concepts through a set of rules (pre-specified or automatically learned)

• Example: VerbPhrase-PrepPhrase-NounPhrase
  • lightly tossed with salad dressing
  • tossed with oil and vinegar dressing
  • tossed immediately with blue-cheese dressing
  • Reduced to concept: tossed with dressing

In the profession of cooking

oil is an important ingredient
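The reduction of the three "tossed with ... dressing" phrases to a single concept can be sketched as a rule that matches a VerbPhrase-PrepPhrase-NounPhrase chunk sequence and keeps only the head word of each chunk. The chunk and tag labels, the "last word with the matching tag is the head" heuristic, and the function names are assumptions for illustration; the slides do not specify how the rules are represented.

```python
# Which POS tag identifies the head word of each chunk type (an assumption).
HEAD_POS = {"VP": "Verb", "PP": "Prep", "NP": "Noun"}
PATTERN = ["VP", "PP", "NP"]


def head(chunk_type, tokens):
    """Return the last token in the chunk whose tag matches the chunk's head tag."""
    matches = [word for word, pos in tokens if pos == HEAD_POS[chunk_type]]
    return matches[-1] if matches else None


def reduce_to_concept(chunks):
    """Reduce a VP-PP-NP chunk sequence to 'verb prep noun', or None if no match."""
    if [chunk_type for chunk_type, _ in chunks] != PATTERN:
        return None
    heads = [head(chunk_type, tokens) for chunk_type, tokens in chunks]
    return None if None in heads else " ".join(heads)


# Hand-chunked versions of the three example phrases.
phrases = [
    [("VP", [("lightly", "Adv"), ("tossed", "Verb")]),
     ("PP", [("with", "Prep")]),
     ("NP", [("salad", "Noun"), ("dressing", "Noun")])],
    [("VP", [("tossed", "Verb")]),
     ("PP", [("with", "Prep")]),
     ("NP", [("oil", "Noun"), ("and", "Conj"), ("vinegar", "Noun"), ("dressing", "Noun")])],
    [("VP", [("tossed", "Verb"), ("immediately", "Adv")]),
     ("PP", [("with", "Prep")]),
     ("NP", [("blue-cheese", "Adj"), ("dressing", "Noun")])],
]
for phrase in phrases:
    print(reduce_to_concept(phrase))
```

All three phrases collapse to the same string, so they can be indexed and counted as one concept.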


Answering a query

• Default is AND-semantics

• Query disambiguation (e.g. in the query "train a pet" Infocious knows "train" has to be a verb)

• Ranking takes into account a variety of factors
  • Presence of keywords, proximity
  • Title, URL, formatting, font size, coloring etc.
  • Popularity of a page measured by in/out links
  • TextQuality
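The slide lists the ranking factors but not how they are combined. As a rough illustration only, the sketch below combines per-page signals with a weighted sum; the signal names, the weights, the assumption that each signal is already normalized to [0, 1], and the candidate pages are all hypothetical.

```python
# Hypothetical weights; the slides do not give any.
WEIGHTS = {
    "keywords": 0.3,      # presence of query keywords
    "proximity": 0.2,     # how close the keywords appear to each other
    "markup": 0.1,        # title, URL, formatting, font size, coloring
    "popularity": 0.2,    # derived from in/out links
    "text_quality": 0.2,  # coherence of the text
}


def score(signals):
    """Weighted sum of signals, each assumed to be normalized to [0, 1]."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(candidates):
    """Return candidate URLs ordered from highest to lowest score."""
    return sorted(candidates, key=lambda url: score(candidates[url]), reverse=True)


# Two made-up pages that both satisfy the AND-semantics keyword match.
candidates = {
    "spammy.example/page": {"keywords": 1.0, "proximity": 0.9, "markup": 0.5,
                            "popularity": 0.2, "text_quality": 0.1},
    "coherent.example/page": {"keywords": 1.0, "proximity": 0.6, "markup": 0.5,
                              "popularity": 0.4, "text_quality": 0.9},
}
print(rank(candidates))
```

In this toy setup the page with coherent text outranks the keyword-stuffed one even though its keyword proximity is lower.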


Architecture

TextQuality

• Summarize probabilities from Linguistic Processing into one metric

• Promote coherent text

• Demote incoherent text
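The slides say only that TextQuality summarizes the probabilities produced by linguistic processing into one metric. One simple way to do that, shown here purely as an assumption, is the geometric mean of the per-token probabilities, which stays in [0, 1] and does not shrink merely because a page is long. The function name and the sample probabilities are invented for this sketch.

```python
import math


def text_quality(token_probs):
    """Geometric mean of per-token probabilities; 0.0 for an empty page."""
    if not token_probs:
        return 0.0
    # Work in log space and clamp to avoid log(0).
    log_sum = sum(math.log(max(p, 1e-12)) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


# Made-up tagger confidences for a coherent sentence and a keyword-stuffed one.
coherent = [0.90, 0.95, 0.85, 0.90]
keyword_stuffed = [0.40, 0.30, 0.35, 0.30]
print(text_quality(coherent) > text_quality(keyword_stuffed))
```

A page whose sentences parse confidently scores higher than one made of repeated keyword fragments, which matches the promote and demote behaviour listed above.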


TextQuality (disabled)

• Promotes well-written pages (preferable from the user perspective)

Results with TextQuality DISABLED:

Britney Spears Pictures – britney spears pictures …
picture of britney spears, hot pictures of britney spears …
britney-spears-pictures.hotyoungstars.com/nude/

Hot Britney Spears Pics - hot britney spears pics,...
britney spears, new hot pics of britney spears,...
hot-britney-spears-pics.hotyoungstars.com/nude/

Britney Spears Photos – britney spears photos …
spears, britney spears nude photos, nude photos of …
britney-spears-photos.hotyoungstars.com/nude/


TextQuality (enabled)

• Promotes well-written pages (preferable from the user perspective)

Results with TextQuality ENABLED:

Is Britney Spears over the edge?
Is Britney Spears over the edge? … Britney Spears is a singer …
azwestern.edu/modern_lang/esl/cjones/mag/spring2004/britney.htm

IMPERSONATORS – BRITNEY SPEARS
Is Proud to Present! Contact: Gary Shortall Back…
www.impersonators.com/brittany/brit.html

Britney Spears’ Coke Habit
Britney Spears’ Coke Habit Destroys Her…
www.emptyv.org/britney_spears.htm


Other Language Analysis-Enhanced Features

• Key phrases: Present a list of the salient concepts within the results

• Related topics: Concepts related to the present query

• Hone your search: Suggestion of more specific queries

• Spell Checking

• Personalization: I like Sports but not Politics


Evaluation of Categorization

• Using Naïve Bayes (NB) classifiers for illustration: Language Analysis improves accuracy
  • Infocious actually employs an improved classification technique (76% accuracy)

• We used four different flavors of NB on 100,000 Web pages:
  • C1: Words
  • C2: Words + POS tags
  • C3: Words + extracted concepts
  • C4: Words + POS + extracted concepts
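The four classifier flavors differ only in which features are fed to Naive Bayes. A minimal sketch of how such feature sets could be assembled from a tagged page is shown below; encoding POS information as word/TAG strings, prefixing concepts with "concept:", and the function name are assumptions, since the slides do not describe the feature encoding.

```python
def features(tokens, concepts, use_pos=False, use_concepts=False):
    """Build a feature list for one page.

    tokens:   list of (word, pos) pairs
    concepts: list of extracted concept strings
    """
    feats = [word.lower() for word, _ in tokens]                    # C1: words
    if use_pos:
        feats += [f"{word.lower()}/{pos}" for word, pos in tokens]  # adds POS (C2, C4)
    if use_concepts:
        feats += [f"concept:{c}" for c in concepts]                 # adds concepts (C3, C4)
    return feats


tokens = [("tossed", "Verb"), ("with", "Prep"), ("dressing", "Noun")]
concepts = ["tossed with dressing"]

c1 = features(tokens, concepts)
c2 = features(tokens, concepts, use_pos=True)
c3 = features(tokens, concepts, use_concepts=True)
c4 = features(tokens, concepts, use_pos=True, use_concepts=True)
print(len(c1), len(c2), len(c3), len(c4))
```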


Evaluation of Categorization

C1: Words only
C2: Words + POS tags
C3: Words + extracted concepts
C4: Words + POS + extracted concepts

[Figure: "Accuracy of NB classifiers", bar chart. x-axis: Classifier (C1, C2, C3, C4); y-axis: Accuracy, 60% to 68%.]

3% accuracy increase – 8% error reduction
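The two summary figures can be related with a short calculation, assuming the 3% is an absolute gain in percentage points and the 8% is the relative drop in error rate. Both are rounded on the slide and the individual bar heights are not recoverable from the transcript, so the baseline value below is only an approximate consistency check; the symbols $a_{\text{base}}$ and $a_{\text{best}}$ are introduced here for illustration.

```latex
\[
\text{relative error reduction}
  = \frac{(1 - a_{\text{base}}) - (1 - a_{\text{best}})}{1 - a_{\text{base}}}
  = \frac{a_{\text{best}} - a_{\text{base}}}{1 - a_{\text{base}}}
\]
\[
\frac{0.03}{1 - a_{\text{base}}} \approx 0.08
\;\Longrightarrow\;
1 - a_{\text{base}} \approx \frac{0.03}{0.08} = 0.375,
\qquad a_{\text{base}} \approx 62.5\%
\]
```

A baseline accuracy of roughly 62.5% lies inside the 60% to 68% range of the chart's accuracy axis, so the two figures are mutually consistent under this reading.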


User Interface


Conclusion

Infocious uses language analysis to improve Web search:

• Resolves language ambiguities

• Incorporates text coherence in the ranking

• Provides query suggestions and refinements

• Organizes information intuitively through categorization


Related Work

• Web Search Engines: Google, Yahoo!, MSNSearch, Ask/Teoma, Altavista, Looksmart, Vivisimo, …

• Enterprise Search: Autonomy, Inquira, Inxight, iPhrase, …

• Answer Engines: START@MIT, BrainBoost, …


Ongoing work

Increase index size (currently ~1 billion pages) through surface & hidden Web-crawls

Apply our Language Analysis algorithms to additional languages

Leverage our Language-annotated repository for additional features (e.g. summarization, machine translation,…)

Investigate how to use Language Analysis to improve relevance in advertisements


Thank you !

You can check out our Search Engine at:

www.infocious.com