
Lecture 6: EITN01 Web Intelligence and Information Retrieval

Anders Ardö

EIT – Electrical and Information Technology, Lund University

February 26, 2013

A. Ardö, EIT Lecture 6: EITN01 Web Intelligence and Information Retrieval February 26, 2013 1 / 46


Outline

1 Reiteration

2 Recommender systems

3 Indexing, searching

4 Example IR systems


Previous lecture

Web Search
Metasearch engines
Web crawling
Browsing vs search


Web Search

Challenges
  Distributed, dynamic data
  Large volume
  Unstructured, heterogeneous data
Size, coverage
General vs focused
Special functions, User interface
Ranking
Limited overlap between search engines


Search Engine - Basic structure

[Figure: search engine structure. A web robot fetches web pages from the Web over HTTP into a database. A web browser sends a query over HTTP to an interface (CGI script) in front of the database and receives an answer. Concerns noted: size, efficiency, response time.]

software crawling the web (much like a human clicking on links)
collect all found web pages into a database (IR system)
offer a web interface to that database


Google

Started in the late 1990s
Estimated 450,000 low-cost commodity servers (2006)
1 trillion links to web pages (July 2008)
“over 8 billion web pages”
estimate 40 billion pages?
Goal is to index all the world’s data
Google Flu Trends


Metasearch engines

Simultaneously search several individual search engines
Query translation
Result merging
  Simple merge
  Duplicate detection
  tf-idf/similarity ranking
  Position based
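
The result-merging step can be illustrated with a small position-based merge that also detects duplicates. This is only a sketch: the function name `merge_results`, the reciprocal-rank scoring rule and the example URLs are assumptions made for illustration, not something the slides prescribe.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists from several engines (position-based sketch).

    Each URL scores 1/rank for every engine that returned it (an assumed
    scoring rule). Duplicates are detected by exact URL match and their
    scores are summed.
    """
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            if url in seen:  # ignore repeats within one engine's list
                continue
            seen.add(url)
            scores[url] += 1.0 / rank
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))


engine_a = ["http://a.example/", "http://b.example/", "http://c.example/"]
engine_b = ["http://b.example/", "http://d.example/"]
merged = merge_results([engine_a, engine_b])
print(merged)
```

A page returned by both engines accumulates score from each list, so it tends to rise above pages found by only one engine.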


Web Robot - Basic architecture

Spider, Crawler, Robot, agent, ...

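[Figure: web robot architecture. Seed URLs initialise the frontier (list of unvisited pages). The robot gets a URL from the frontier, fetches the web page from the Web, analyzes it, and saves it in the database (repository of visited pages). Links found during analysis are added to the frontier as new URLs.]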


Focused Crawling

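[Figure: focused crawler architecture. Same components as the basic web robot (seed URLs, frontier of unvisited pages, get URL, fetch web page, analyze, save to the database of visited pages), plus a URL focus filter on extracted links and a focus filter on analyzed pages. Pages within the focus are saved; pages not in focus are discarded.]

Focus:
  Domain
  Project
  Country
  Region
  Topic
  Subject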


Basic Algorithm

Add good start pages (seeds) to frontier
LOOP:
  Choose a page among links
  Page OK?
    Save page
    Add all links to frontier
  Go to LOOP

Save (database(s)):
  All relevant pages (search engine database)
  All analyzed pages (seen pages)
  All new links (frontier)
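
The loop above can be sketched in a few lines of Python. To keep the example runnable without network access, fetching and link extraction are replaced by a hypothetical callback `fetch_links`, and the "Page OK?" test by a hypothetical predicate `page_ok`; the breadth-first choice of the next page and the toy link graph are also assumptions.

```python
from collections import deque


def crawl(seeds, fetch_links, page_ok, max_pages=100):
    """Breadth-first sketch of the basic crawling loop.

    fetch_links(url) -> list of outgoing links (stands in for HTTP fetch
    plus link extraction). page_ok(url) -> bool (stands in for the
    relevance or focus test).
    """
    frontier = deque(seeds)   # all new links
    seen = set(seeds)         # every URL that has ever been queued
    saved = []                # all relevant pages
    while frontier and len(saved) < max_pages:
        url = frontier.popleft()          # choose a page among links
        if not page_ok(url):              # page OK?
            continue
        saved.append(url)                 # save page
        for link in fetch_links(url):     # add all links to frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved


# Toy in-memory "web" so the sketch runs offline.
toy_web = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["spam"],
    "c": [],
    "spam": ["a"],
}
pages = crawl(["seed"], lambda u: toy_web.get(u, []), lambda u: u != "spam")
print(pages)
```

Keeping a `seen` set alongside the frontier is what prevents the loop from revisiting pages when links form cycles.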


Browsing

No idea how to formulate a query
Willing to invest some time
Structure: flat vs hierarchy
  Manual vs automatic classification
  Lack of standard classification/terminology
Everything vs Quality assessed
  Precision - NOT recall


Browsing vs search

Search
  LOTS of data
  Unstructured
  Unrelated items clutter results

Browsing
  Small amounts of data
  Hierarchically structured
  Quality assessed


Lecture 6 agenda

Chapter 9 in “Modern Information Retrieval”
G. Adomavicius, A. Tuzhilin: “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”; Sections 1 - 2

1 Reiteration

2 Recommender systems

3 Indexing, searching

4 Example IR systems


Recommender systems

[Figure: recommender system overview. Media content (text, image, audio, video), context, and the user (profiles, preferences, usage history) each pass through a representation step into the recommender system, which returns recommendations to the user.]


Recommender systems

Make machines understand
  media
    annotation - metadata
  context
    ?
  user
    usage history
    profiles


Recommender systems

Content based filtering
  based on items similar to what the user has liked in the past

Collaborative filtering
  based on opinions of other users (user/item matrix)
  (user-user similarity, item-item similarity)
  find like-minded users (neighborhood)
  predictions for unseen items

Hybrid systems


Recommender systems

[Figure: a USERS × ITEMS matrix of known ratings (x) is the input to a recommendation algorithm (collaborative, content-based, ...). The algorithm produces a matrix of predicted ratings (p), from which a sparse set of recommendations (r) is selected.]


Content based filtering

Try to predict a rating based on my own ratings
Represent items as a set of features
  $\mathrm{item}_j = (w_{1j}, \ldots, w_{kj})$
Users rank items → user profile in feature space
  $\mathrm{user}_c = (w_{c1}, \ldots, w_{ck})$
Vector space! (feature/item matrix, tf-idf, similarity (cosine, Pearson), ...)
User profile used as query
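
As a rough illustration of "user profile used as query", the sketch below ranks items by cosine similarity to a user profile in the same feature space. The item names, feature weights and profile values are invented for the example; in practice the weights might be tf-idf values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical feature weights for three items, with k = 3 features.
items = {
    "item1": [1.0, 0.0, 0.0],
    "item2": [0.0, 1.0, 1.0],
    "item3": [1.0, 1.0, 0.0],
}
user_profile = [1.0, 0.2, 0.0]  # assumed profile derived from past ratings

# Use the profile as a query: rank items by similarity to it.
ranking = sorted(items, key=lambda name: cosine(user_profile, items[name]), reverse=True)
print(ranking)
```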


Collaborative filtering

Try to predict rating based on other users' ratings

Memory based
  Make rating based on entire collection
  Ex user-user: $\mathrm{rating}_{c,s} = k \sum_{c' \in C} \mathrm{sim}(c, c') \cdot \mathrm{rating}_{c',s}$
    User $c$, Item $s$
    $C$: set of users most similar to $c$
    $k$: normalizing factor (usually $1 / \sum_{c' \in C} |\mathrm{sim}(c, c')|$)
  Ex item-item: $\mathrm{rating}_{c,s} = k \sum_{s' \in S} \mathrm{sim}(s, s') \cdot \mathrm{rating}_{c,s'}$
    User $c$, Item $s$
    $S$: set of items most similar to $s$
    $k$: normalizing factor (usually $1 / \sum_{s' \in S} |\mathrm{sim}(s, s')|$)

Model based
  Try to learn a model to be used for predicting ratings
  Ex: Probabilistic model, Machine learning, ...
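
The user-user formula can be written out directly. The sketch below is a minimal illustration: the user names, ratings and similarity values are invented, similarities are supplied rather than computed, and skipping neighbours who have not rated the item is an assumption the formula itself does not address.

```python
def predict_user_user(target, item, ratings, sim, neighbours):
    """rating_{c,s} = k * sum over c' in C of sim(c, c') * rating_{c',s}.

    ratings: dict user -> dict item -> rating
    sim: function (user, user) -> similarity
    neighbours: the set C of users most similar to the target user
    """
    # Assumption: neighbours without a rating for the item are left out.
    rated = [c for c in neighbours if item in ratings[c]]
    norm = sum(abs(sim(target, c)) for c in rated)
    if norm == 0:
        return None  # no usable neighbour
    k = 1.0 / norm   # normalizing factor
    return k * sum(sim(target, c) * ratings[c][item] for c in rated)


# Hypothetical data for illustration only.
ratings = {"bob": {"film": 4}, "eve": {"film": 2}, "dan": {}}
sims = {("ann", "bob"): 0.75, ("ann", "eve"): 0.25, ("ann", "dan"): 0.5}
prediction = predict_user_user(
    "ann", "film", ratings, lambda a, b: sims[(a, b)], ["bob", "eve", "dan"]
)
print(prediction)
```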


Collaborative filtering – Item-Item – I

The Movie – Users matrix: Users' ratings (1-5) of movies

Movie   Ratings by users 1-12 (unrated cells omitted)
m1      1 3 5 5 4
m2      5 4 4 2 1 3
m3      2 4 1 2 3 4 3 5
m4      2 4 5 4 2
m5      4 3 4 2 2 5
m6      1 3 3 2 4


Collaborative filtering – Item-Item – II

Estimate User 5's rating of movie m1?

Movie   Ratings by users 1-12 (unrated cells omitted)
m1      1 3 ?? 5 5 4
m2      5 4 4 2 1 3
m3      2 4 1 2 3 4 3 5
m4      2 4 5 4 2
m5      4 3 4 2 2 5
m6      1 3 3 2 4


Collaborative filtering – Item-Item – III

Estimate User 5's rating of movie m1?
Neighbor selection – movies most similar to m1 → m3, m6, m5

Movie   Ratings by users 1-12 (unrated cells omitted)   sim(m1,mx)
m1      1 3 ?? 5 5 4                                    1.0
m2      5 4 4 2 1 3                                     0.26
m3      2 4 1 2 3 4 3 5                                 0.52
m4      2 4 5 4 2                                       0.28
m5      4 3 4 2 2 5                                     0.40
m6      1 3 3 2 4                                       0.48
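
The slides do not say how sim(m1, mx) is computed. One plausible reading is plain cosine similarity over the rating rows with unrated cells treated as 0. The sketch below tests that reading under a strong assumption: the column (user) positions of the ratings are not recoverable from the flattened table, so the layout used here is a guess that merely keeps each movie's ratings in order and agrees with user 5's ratings as used on the following slide.

```python
import math

# ASSUMPTION: user positions are guessed; 0 means "not rated".
R = {
    "m1": [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    "m2": [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    "m3": [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    "m4": [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    "m5": [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    "m6": [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
}


def cosine(u, v):
    """Cosine similarity of two rating rows."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


sims = {m: cosine(R["m1"], row) for m, row in R.items() if m != "m1"}
neighbours = sorted(sims, key=sims.get, reverse=True)[:3]
for m in sorted(sims):
    print(m, f"{sims[m]:.4f}")
print(neighbours)
```

Under this assumed layout the computed values agree with the similarity column to within about 0.01, and the three nearest neighbours are m3, m6 and m5.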


Collaborative filtering – Item-Item – IV

Estimate User 5's rating of movie m1?
Neighbor selection – movies most similar to m1 → m3, m6, m5
Predict rating $r_{m1,5}$ as

$r_{m1,5} = \frac{\mathrm{sim}(m1,m3) \cdot r_{m3,5} + \mathrm{sim}(m1,m6) \cdot r_{m6,5} + \mathrm{sim}(m1,m5) \cdot r_{m5,5}}{\mathrm{sim}(m1,m3) + \mathrm{sim}(m1,m6) + \mathrm{sim}(m1,m5)}$

$r_{m1,5} = \frac{0.52 \cdot 2 + 0.48 \cdot 3 + 0.40 \cdot 4}{0.52 + 0.48 + 0.40} = 2.9$

Movie   Ratings by users 1-12 (unrated cells omitted)   sim(m1,mx)
m1      1 3 2.9 5 5 4                                   1.0
m2      5 4 4 2 1 3                                     0.26
m3      2 4 1 2 3 4 3 5                                 0.52
m4      2 4 5 4 2                                       0.28
m5      4 3 4 2 2 5                                     0.40
m6      1 3 3 2 4                                       0.48
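
The arithmetic of this prediction can be checked with a few lines of code. The similarity values and user 5's ratings are taken from the slide; the function name `predict_item_item` is just an illustrative choice.

```python
def predict_item_item(sims, user_ratings):
    """Similarity-weighted average of the user's ratings of neighbour items."""
    numerator = sum(sims[m] * user_ratings[m] for m in sims)
    denominator = sum(abs(s) for s in sims.values())
    return numerator / denominator


sims_to_m1 = {"m3": 0.52, "m6": 0.48, "m5": 0.40}  # sim(m1, mx) from the slide
user5 = {"m3": 2, "m6": 3, "m5": 4}                # user 5's ratings of the neighbours
r_m1_5 = predict_item_item(sims_to_m1, user5)
print(round(r_m1_5, 1))  # 2.9
```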


Hybrid systems

Content based filtering + Collaborative filtering
  Combining separate recommenders
  Adding content based characteristics to collaborative filtering
  Adding collaborative characteristics to content based filtering


Examples

Amazon, Course Recommender


Introduction

Sequential search
  Small databases
  Volatile data

Indexes
  Large databases
  Semi-static data

Inverted files


How to represent indexed documents?

[Figure: indexing pipeline. Documents (text) → assign document IDs (document numbers and *field numbers) → break into words (lexical analysis) → stoplist (non-stoplist words) → stemming* (stemmed words) → term weighting* (terms with weights) → index / database. * indicates an optional operation.]


Inverted files

Principal data structure
  Effective
  Allows fast searching
  Substantial storage overhead
    Speed more important than storage

For each term
  List of document IDs
  (Term frequency in each document)
  (Position in document)

Used for
  Boolean searches
  Vector space ranking
  Proximity, phrases


Inverted files

docs   t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0

(From J. W. Schneider: “Informetrics & Scientometrics”)


Inverted files

(From R. Baeza-Yates, B. Ribeiro-Neto: “Modern Information Retrieval”, 2nd Ed, 2010)


Creation of inverted files

For each term in the dictionary
  store IDs of documents containing that word

Lexical analysis ⇒ terms
Save terms with document ID
Sort alphabetically ⇒ dictionary
(Calculate tf and idf)
Create posting list (list of document IDs per term)


Example

Document A:
Now is the time for men to come to the aid of their country.

Document B:
It was a dark night in the country. The time was past midnight.

Dictionary (terms in order of occurrence):
Term      DocID
time      A
men       A
aid       A
country   A
dark      B
night     B
country   B
time      B
midnight  B

Dictionary (sorted alphabetically):
Term      DocID
aid       A
country   A
country   B
dark      B
men       A
midnight  B
night     B
time      A
time      B


Example cont’d


Inverted file:
Dictionary        Postings
Term      Docs    ID ID ...
aid       1       A
country   2       A B
dark      1       B
men       1       A
midnight  1       B
night     1       B
time      2       A B
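
The creation steps (lexical analysis, saving terms with document IDs, sorting, building posting lists) can be sketched for Documents A and B as follows. The stoplist is an assumption, chosen so that the output matches the dictionary shown on the slides; stemming and tf-idf weighting are left out.

```python
import re
from collections import defaultdict

# ASSUMPTION: stoplist picked to reproduce the slide's dictionary.
STOPLIST = {"now", "is", "the", "for", "to", "come", "of", "their",
            "it", "was", "a", "in", "past"}


def build_inverted_file(docs):
    """docs: dict doc_id -> text. Returns dict term -> sorted list of doc IDs."""
    pairs = []
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):  # lexical analysis
            if word not in STOPLIST:
                pairs.append((word, doc_id))              # term with document ID
    postings = defaultdict(set)
    for term, doc_id in sorted(pairs):                    # sort alphabetically
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    "A": "Now is the time for men to come to the aid of their country.",
    "B": "It was a dark night in the country. The time was past midnight.",
}
index = build_inverted_file(docs)
for term, ids in index.items():
    print(term, len(ids), *ids)
```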


Example cont’d


Query: time AND dark

time ⇒ posting list P1 = {A, B}
dark ⇒ posting list P2 = {B}
P1 ∩ P2 = {A, B} ∩ {B} = {B}
Result: Document B
(Do ranking)


Example cont’d


Query: time OR dark

time ⇒ posting list P1 = {A, B}
dark ⇒ posting list P2 = {B}
P1 ∪ P2 = {A, B} ∪ {B} = {A, B}
Result: Documents A, B
(Do ranking)
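
Both Boolean queries map directly onto set operations over posting lists. The sketch below copies the postings from the example inverted file; the helper names `boolean_and` and `boolean_or` are illustrative, and ranking of the result is not shown.

```python
# Postings copied from the example inverted file.
postings = {
    "aid": {"A"}, "country": {"A", "B"}, "dark": {"B"}, "men": {"A"},
    "midnight": {"B"}, "night": {"B"}, "time": {"A", "B"},
}


def boolean_and(t1, t2):
    """Documents containing both terms: intersection of posting lists."""
    return postings.get(t1, set()) & postings.get(t2, set())


def boolean_or(t1, t2):
    """Documents containing either term: union of posting lists."""
    return postings.get(t1, set()) | postings.get(t2, set())


print(sorted(boolean_and("time", "dark")))  # ['B']
print(sorted(boolean_or("time", "dark")))   # ['A', 'B']
```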


Phrase search

D0 = "it is what it is", D1 = "what is it", D2 = "it is a banana"

"a": [2]
"banana": [2]
"is": [0], [1], [2]
"it": [0], [1], [2]
"what": [0], [1]

Q: “what is it”? $([0], [1]) \cap ([0], [1], [2]) \cap ([0], [1], [2]) = ([0], [1])$


Phrase search

D0 = "it is what it is", D1 = "what is it", D2 = "it is a banana"

"a": [2]
"banana": [2]
"is": [0], [1], [2]
"it": [0], [1], [2]
"what": [0], [1]

Q: “what is it”? $([0], [1]) \cap ([0], [1], [2]) \cap ([0], [1], [2]) = ([0], [1])$

As a phrase? Store word positions in the postings:

"a": [2, [2]]
"banana": [2, [3]]
"is": [0, [1,4]], [1, [1]], [2, [1]]
"it": [0, [0,3]], [1, [2]], [2, [0]]
"what": [0, [2]], [1, [0]]

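With positions stored in the postings, a phrase query can be answered by first intersecting the posting lists and then checking that the words occur at consecutive positions. The sketch below is one straightforward way to do this; the function names and the whitespace tokenisation are assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs: list of strings. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return IDs of documents containing the words of `phrase` consecutively."""
    words = phrase.split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents: intersection of all posting lists.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        # A start position p matches if word i occurs at position p + i.
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            hits.append(doc_id)
    return hits


docs = ["it is what it is", "what is it", "it is a banana"]
idx = build_positional_index(docs)
print(phrase_search(idx, "what is it"))
```

For the example documents, D0 and D1 both contain all three words, but only D1 has them in the order and adjacency the phrase requires.
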

Zebra

IndexData: http://www.indexdata.dk/zebra/
high-performance, general-purpose structured text indexing and retrieval engine
free, GPL license
index records in XML, SGML, MARC, e-mail archives, ...
combination of Boolean searching and relevance ranking (tf-idf)
supports SRU/CQL, Z39.50, ZOOM


Zebra - XML indexing


Zebra features

supports large databases
  tens of millions of records
  tens of gigabytes of data
regular expression queries
fuzzy queries (spelling correction)
index scans
faceted browsing
sorting


Lucene/Solr

Apache: http://lucene.apache.org/
Lucene: high-performance, full-featured text search engine library
Solr: enterprise search server based on Lucene
free, open source
index records via XML over HTTP
query via HTTP GET, XML results


Lucene/Solr features

scalability - efficient replication
highlighted context snippets
faceted searching
spelling suggestions for user queries
'More Like This' suggestions for given document

Example queries:
q=video&fl=name,id,score
q=video&sort=inStock asc, score desc
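
Since Solr is queried via HTTP GET, the two example queries become URL query strings. The sketch below only builds the URLs with the standard library. The host, port and `/select` path are assumptions about a local installation, and `solr_url` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode

# ASSUMPTION: a local Solr instance answering queries at this address.
SOLR_SELECT = "http://localhost:8983/solr/select"


def solr_url(**params):
    """Build a GET URL for a Solr query from keyword parameters."""
    return SOLR_SELECT + "?" + urlencode(params)


# Return only the name, id and score fields.
url_fields = solr_url(q="video", fl="name,id,score")
# Sort by inStock ascending, then by score descending.
url_sorted = solr_url(q="video", sort="inStock asc, score desc")
print(url_fields)
print(url_sorted)
```

`urlencode` percent-encodes the commas and turns spaces into `+`, which is why the sort parameter can be written in its readable form.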
