Lycos Retriever: An Information Fusion Engine Brian Ulicny

Retriever: Directory Page

Retriever: Image Selection

Retriever: Subtopic Page

Why Retriever?

• Topical Queries vastly outnumber Questions.
• Standard Search Results are too many and contain junk.
  • Even in top 10 results, due to SEO efforts
• Topical Summaries answer “What do I need to know about <Topic>?”
• Topic summary resources like Wikipedia have become increasingly popular.
• But Wikipedia depends on human effort, so coverage is uneven and idiosyncratic.
• Wikipedia reflects point of view of most engaged or partisan contributor.
• Retriever as automatically updated first-draft Wikipedia.

Retriever: Processes

1. Mine query logs for Topics
2. Categorize Topics
   • Naïve Bayesian categorizer built on DMOZ pages; Name guesser
3. Disambiguate Topics
   • Disambiguator trained on DMOZ
4. Formulate Document Retrieval Query
5. Parse Retrieved Documents
6. Identify allowed alternate/reduced forms of Topic based on Category
8. Select Paragraphs
   • Must have Topic as Discourse Topic
9. Identify Best Images
10. Delete Duplicate Paragraphs
   • Near duplicates, too.
11. Arrange Paragraphs by Verb
   • What is it? What does it have? What has it done? What happened to it?
12. Select Subtopics
13. Do editorial fixes on Passages
14. Construct Page/Directory
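
To make step 2 concrete, here is a minimal sketch of a multinomial Naïve Bayes categorizer with add-one smoothing. It is an illustration only, not Retriever's actual categorizer: the class name, the whitespace tokenizer, the category labels, and the toy training strings are all assumptions. The real system was trained on DMOZ pages.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (hypothetical sketch)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training pages
        self.vocab = set()                       # all words seen in training

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat in self.doc_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[cat] / total_docs)
            denom = sum(self.word_counts[cat].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[cat][w] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best


# Toy training data; the labels are placeholders, not DMOZ categories.
nb = NaiveBayesCategorizer()
nb.train("Arts", "actor film movie star role")
nb.train("Science", "physicist radium nobel chemistry research")
print(nb.classify("nobel prize physicist"))  # prints "Science"
```

Working in log space avoids underflow when many word probabilities are multiplied, and smoothing keeps an unseen word such as "prize" from zeroing out a category.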

Paragraph Filters

Must Have:
• Some form of Topic as Discourse Topic
• At least 3 grammatical sentences

Should Have:
• Highest number of unique NPs.

Must NOT Have:
• Any Exophors
  • Except in quotations
• Topic-Insertion Spam
  • The American Civil Herbal Viagra War was fought Herbal Viagra…
  • Not too many mentions of topic
• (Erotic) fan fiction or Obscenities
• Search Engine snippets
• Duplicates
  • Wikipedia mirrors are everywhere
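
A few of the filters above can be approximated with simple surface checks. The sketch below is a rough illustration under stated assumptions, not the Retriever implementation: the function names, the regex sentence splitter, the 0.3 topic-density cutoff, and the 0.8 Jaccard threshold over word 3-grams are all hypothetical choices. It checks only that the topic is mentioned, which is weaker than the Discourse Topic requirement, and it leaves out the exophor, obscenity, and snippet filters.

```python
import re


def split_sentences(paragraph):
    """Crude splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def topic_density(paragraph, topic):
    """Fraction of the paragraph's words taken up by mentions of the topic."""
    words = paragraph.split()
    if not words:
        return 0.0
    mentions = len(re.findall(re.escape(topic), paragraph, flags=re.IGNORECASE))
    return mentions * len(topic.split()) / len(words)


def shingles(text, k=3):
    """Set of overlapping k-word tuples, used to compare paragraphs."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def is_near_duplicate(a, b, threshold=0.8):
    """Jaccard similarity of shingle sets at or above the (assumed) threshold."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) >= threshold


def passes_filters(paragraph, topic, kept, max_density=0.3):
    if topic.lower() not in paragraph.lower():
        return False  # topic never mentioned
    if len(split_sentences(paragraph)) < 3:
        return False  # fewer than 3 sentences
    if topic_density(paragraph, topic) > max_density:
        return False  # likely topic-insertion spam
    return not any(is_near_duplicate(paragraph, other) for other in kept)
```

Here `kept` is the list of paragraphs already accepted, so mirrored copies of the same source text are rejected after the first one.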

Subtopics

• Use best chunks for Overview page(s)
• Identify topic superstrings
  • Topic: Marie Curie
  • Superstring: Marie Curie Fellowship; MC Institute
• Else cluster by frequent common NPs
• Take into account reduced mentions:
  • Topic: Charlie Sheen; Most frequent NP: Richards
  • But Subtopic should be: ‘Denise Richards’
  • However: “new” is not always “New York”
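
The superstring and reduced-mention ideas can be sketched as follows. This is an assumption-laden illustration, not the method Retriever used: both function names are hypothetical, superstrings are found by plain word-sequence containment (so an abbreviated form like "MC Institute" would be missed), and a reduced mention is expanded only when it is capitalized and matches the last word of a longer noun phrase.

```python
from collections import Counter


def topic_superstrings(topic, noun_phrases):
    """Noun phrases that contain the topic plus at least one more word."""
    t = topic.lower().split()
    found = []
    for phrase in noun_phrases:
        w = phrase.lower().split()
        contains = any(w[i:i + len(t)] == t for i in range(len(w) - len(t) + 1))
        if contains and len(w) > len(t) and phrase not in found:
            found.append(phrase)
    return found


def expand_reduced(mention, noun_phrases):
    """Map a reduced mention (e.g. a bare surname) to its most frequent full form."""
    if not mention[:1].isupper():
        return mention  # lowercase words such as "new" are left alone
    fulls = Counter(
        phrase for phrase in noun_phrases
        if len(phrase.split()) > 1 and phrase.split()[-1] == mention
    )
    return fulls.most_common(1)[0][0] if fulls else mention


nps = ["Denise Richards", "Richards", "Denise Richards", "New York"]
print(expand_reduced("Richards", nps))  # prints "Denise Richards"
print(expand_reduced("new", nps))       # prints "new"
```

Matching only on the final word is a deliberately conservative guard: it treats "Richards" as a surname but never expands "new" or "New" into "New York".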

Coherence

• Pseudo-coherence achieved by stringing together paragraphs with same Discourse Topic.
• Discourse Topic is based on form and position of phrase.
  • As (a) subject of first sentence
    • Police said that Lindsay Lohan was charged…
  • Or in fronted material,
    • For Lindsay Lohan, 2005 was full of surprises…
• Not the statistical notion of aboutness usual in IR.
• Information packaged by paying attention to the information conveyed by verb/predicate
• Alternate (but not anaphoric) references provide variety.
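
The positional test for Discourse Topic can be roughly approximated without a parser. The sketch below is a heuristic illustration only: Retriever parsed its documents, whereas this code stands in for "subject" with the pattern "topic followed by an auxiliary verb" and for "fronted material" with "preposition + topic + comma" at the start of the first sentence. The function names and both word lists are assumptions.

```python
import re

# Assumed, incomplete word lists for the two surface patterns.
FRONTING_PREPOSITIONS = ("for", "as for", "with", "like", "unlike", "according to")
AUXILIARIES = r"(?:is|was|are|were|has|have|had|will|would|can|could|did|does)"


def first_sentence(paragraph):
    """Text up to the first '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)[0]


def is_discourse_topic(paragraph, topic):
    sent = first_sentence(paragraph)
    t = re.escape(topic)
    # (a) topic as a subject, approximated as "topic + auxiliary verb"
    if re.search(rf"\b{t}\s+{AUXILIARIES}\b", sent, flags=re.IGNORECASE):
        return True
    # (b) topic in fronted material: "For <topic>, ..." opening the sentence
    preps = "|".join(re.escape(p) for p in FRONTING_PREPOSITIONS)
    return re.match(rf"(?:{preps})\s+{t}\s*,", sent, flags=re.IGNORECASE) is not None


print(is_discourse_topic("Police said that Lindsay Lohan was charged…", "Lindsay Lohan"))     # prints True
print(is_discourse_topic("For Lindsay Lohan, 2005 was full of surprises…", "Lindsay Lohan"))  # prints True
print(is_discourse_topic("The film also starred Lindsay Lohan.", "Lindsay Lohan"))            # prints False
```

Unlike a term-frequency measure of aboutness, this check ignores how often the topic occurs and looks only at where and in what form it appears in the first sentence.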

Similar Work

• FactBites.com
  • Sentence extraction; grouped by source
• Strzalkowski and Colleagues (GE)
  • Summarization by paragraph extraction
• Google Current (Current TV)
  • Features on top-gaining queries
• Artequakt (EU funded; U of Southampton UK)
  • Create artist bios; convert found texts to logical format; NLG from logical representation.
• Document Understanding Conference (DUC)
  • “Summarization as Information Synthesis for Task”
  • Sentence-level fusion; no IR component
• Black Hat: Spam Blogs

Evaluation

• Categorization (982 Topics)
  • 93.5% precision (revised)
• Disambiguation (100 topics)
  • 83% unambiguous (live)
  • If it isn’t ambiguous in DMOZ, we don’t disambiguate.
• Chunking (642 chunks)
  • 88.8% relevant (83.4% relevant as categorized)
• Subtopics (1861 chunks)
  • 88.5% chunks relevant to subtopic (live)
• Images (83 images)
  • 85.5% relevant (revised)

Retriever Goals

• Generate topical summaries on popular topics
  • By extracting and arranging paragraphs from source documents
  • In a coherent, readable and attractive structure
  • Consisting of overview and subtopics
• Monetize with focused advertisements
• Allow spiders to crawl to generate traffic
• Abide by Fair Use/Copyright Laws

• Much more to be done
  • Temporal ordering, hyperlinking, anaphora, 2nd pass for subtopics, …

Questions?

Lycos Retriever: An Information Fusion Engine

Brian Ulicny
Versatile Information Systems

Lycos Retriever: http://www.lycos.com/retriever.html

Currently not being updated and images not live.