SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

Prof. Fabio A. Schreiber
Dipartimento di Elettronica e Informazione
Politecnico di Milano

Fabio A. Schreiber© Inf. retrieval 1

INFORMATION SEARCH AND RETRIEVAL

PRESENTATION SCHEMA

• GOALS AND ARCHITECTURES OF INFORMATION RETRIEVAL SYSTEMS

• PHYSICAL AND LOGICAL STORAGE STRUCTURES

• AUTOMATIC TEXT ANALYSIS AND INDEX BUILDING

• INTERNET SEARCHING

INFORMATION MANAGEMENT TECHNOLOGIES

• EMBEDDED SYSTEMS
• INFORMATION SYSTEMS ANALYSIS
• DATA INTEGRATION
• DISTRIBUTED HETEROGENEOUS DATA MANAGEMENT
• DATA WAREHOUSE
• DATA MINING
• WEB INFORMATION SYSTEMS
• INFORMATION RETRIEVAL SYSTEMS
• DECISION SUPPORT SYSTEMS
• NON STRUCTURED, SEMISTRUCTURED AND MULTIMEDIA INFORMATION
• MOBILE AND CONTEXT-AWARE COMPONENTS
• REAL-TIME, MAIN MEMORY AND TEMPORAL DATABASES

MANAGEMENT INFORMATION SYSTEMS

• INFORMATION
  – COMPLEX
  – HIGHLY STRUCTURED

• QUERIES
  – COMPLEX
  – MOSTLY RECURRENT

• UPDATES
  – FREQUENCY IS RANDOM, BUT HIGH
  – OFTEN ON-LINE

• USED TECHNOLOGY
  – DATABASE MANAGEMENT SYSTEMS

INFORMATION SEARCH

• INFORMATION
  – SIMPLE (authors, keywords, colours, patterns, ...)
  – POORLY STRUCTURED

• QUERIES
  – COMPLEX
    • CLAUSES ARE LOGICALLY CONNECTED
    • PARTIALLY SPECIFIED
    • ITERATIVE REFINEMENT
  – NOT FORESEEABLE

INFORMATION SEARCH

• UPDATES
  – MOSTLY PERIODIC, WITH LOW FREQUENCY
  – OFTEN OFF-LINE

• USED TECHNOLOGY
  – INDEXING AND SEARCHING BY KEYWORDS
  – DIRECT SEARCH ON TEXT
    • FULL TEXT
    • ABSTRACT
    • SIGNATURE

NON STRUCTURED INFORMATION

DOCUMENT: ANY COLLECTION OF INFORMATION THAT CAN BE SEARCHED BY ITS CONTENT
  – TEXTS
  – STATISTICAL DATA
  – IMAGES
  – SOUNDS

FUNCTIONAL ARCHITECTURE OF AN INFORMATION RETRIEVAL SYSTEM (IRS)

[Diagram: block labels only]

• QUERIES
• DOCUMENTS
• FORMAL LANGUAGE
• INDEXED DOCUMENTS
• SIMILARITY ASSESSMENT
• SIMILAR ITEMS EXTRACTION
• SEARCH FORMULATION PROCESS
• DOCUMENTS STORAGE PROCESS

DOCUMENT SPACE W.R.T. A QUERY RESULT

THE SETS "RELEVANT" AND "RETRIEVED" SPLIT THE SET OF ALL DOCUMENTS INTO FOUR REGIONS:

• RETRIEVED AND RELEVANT (RITRIL)
• RETRIEVED, BUT NOT RELEVANT (RITNRIL)
• NOT RETRIEVED, BUT RELEVANT (NRITRIL)
• NOT RETRIEVED AND NOT RELEVANT (NRITNRIL)

INFORMATION RETRIEVAL SYSTEMS

GOAL OF AN IRS IS TO EFFECTIVELY RETRIEVE ALL THE DOCUMENTS WHICH ARE RELEVANT TO A GIVEN QUERY AND ONLY THEM

• PERFORMANCE INDEXES

  – RECALL: EFFECTIVENESS IN FINDING THE USEFUL MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RELEVANT DOCUMENTS)

\[ \mathrm{RECALL} = \frac{\mathrm{RITRIL}}{\mathrm{RITRIL} + \mathrm{NRITRIL}} \]

  – PRECISION: EFFECTIVENESS IN REMOVING THE USELESS MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RETRIEVED DOCUMENTS)

\[ \mathrm{PRECISION} = \frac{\mathrm{RITRIL}}{\mathrm{RITRIL} + \mathrm{RITNRIL}} \]
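The two indexes can be checked on a toy example. The sketch below assumes that documents are identified by hypothetical string ids and that the sets of retrieved and relevant documents are both known, which in practice holds only for test collections.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for two sets of document identifiers."""
    ritril = len(retrieved & relevant)    # retrieved and relevant
    nritril = len(relevant - retrieved)   # relevant, but not retrieved
    ritnril = len(retrieved - relevant)   # retrieved, but not relevant
    recall = ritril / (ritril + nritril) if relevant else 0.0
    precision = ritril / (ritril + ritnril) if retrieved else 0.0
    return recall, precision


# Hypothetical query result: 4 documents retrieved, 5 actually relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5", "d6", "d7"}
recall, precision = recall_precision(retrieved, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here two of the five relevant documents are found and two of the four retrieved documents are useful.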

INFORMATION RETRIEVAL SYSTEMS

EXPERIMENTAL FINDING: THE USER IS (PSYCHOLOGICALLY) HAPPY WITH LOW RECALL (~20%) VALUES, BUT HIGH PRECISION (~80%) IS REQUIRED

STORAGE STRUCTURES

THEY DEPEND ON THE PHYSICAL NATURE OF THE DOCUMENT (text, image, ...) AND ON THE INTENDED USAGE

– TEXT
  • INVERTED FILES
    – FOR EACH TERM OR ATTRIBUTE VALUE A DENSE INDEX TO THE FILE IS BUILT
    – THE SET OF ALL THE INDEXES CONSTITUTES THE INVERTED FILE
  • BIT MAPS

– GRAPHICS
  • QUADTREES OF DIFFERENT TYPES
    – THE IMAGE SPACE IS RECURSIVELY DECOMPOSED INTO SQUARES UNTIL A SQUARE CONTAINS A SINGLE MEANINGFUL ELEMENT
    – THE RESULTING TREE IS CODED AND STORED IN A COMPACT FORMAT
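As an illustration of the inverted file idea, the sketch below builds one posting list per term over a tiny, made-up collection. Whitespace tokenization and integer document ids are simplifying assumptions, not part of the slides.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical three-document collection.
documents = {
    1: "database management system",
    2: "information retrieval system",
    3: "database index",
}
inverted_file = build_inverted_file(documents)
print(inverted_file["system"])
print(inverted_file["database"])
```

Each dictionary entry plays the role of one index; the whole dictionary is the inverted file.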

INVERTED FILES

[Diagram: block labels only]

• PHYSICAL ARCHITECTURE
• LOGICAL STRUCTURE
• KEYWORDS (CONTROLLED VOCABULARY)
• THESAURUS
  – SYNONYMS
  – HOMONYMS
  – DIFFERENT SPELLINGS
  – SEMANTIC LINKS (CROSS REFERENCE, KWIC)
  – HIERARCHICAL RELATIONS (GENERALIZATION-SPECIALIZATION)
• INVERSION INDEX
• INVERTED FILE
• FILE SYSTEM
• DOCUMENT REPOSITORY

STAIRS STORAGE STRUCTURE

[Diagram: four linked files]

• TERMS DICTIONARY: TERM, POINTER TO THE INVERSION FILE, POINTER TO SYNONYMS
• INVERSION FILE: # OF DOCUMENTS, # OF OCCURRENCES, OCCURR. 1, OCCURR. 2, ..., OCCURR. n
  – EACH OCCURRENCE: UPPER/LOWERCASE, N° OF THE DOCUMENT, SECTION CODE, N° OF THE SENTENCE, N° OF THE WORD
• INDEX TO TEXT: DOCUMENT ADDRESS, PRIVACY CODE, FORMATTED FIELDS
• TEXT FILE: DOCUMENT HEADER, HEADER OF §1, HEADER OF §2, TEXT §1, TEXT §2, ...

FROM: SALTON 89

REGION QUADTREE

[Figure: a region decomposed into square blocks (labels B, F-J, L-O, Q, 37-40, 57-60) and the corresponding quadtree, with nodes A to Q and leaves 37-40 and 57-60]

FROM: SAMET 90
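The recursive decomposition behind a region quadtree can be sketched as follows. This is a minimal version that assumes a square binary image whose side is a power of two; the nested-list output and the NW, NE, SW, SE child order are illustrative choices, not the compact coding mentioned in the slides.

```python
def build_quadtree(image, x=0, y=0, size=None):
    """Recursively split a square binary image until each block is uniform.

    Returns 0 or 1 for a uniform block (a leaf), otherwise a list with the
    four sub-quadrants in the order NW, NE, SW, SE.
    """
    if size is None:
        size = len(image)
    values = {image[row][col]
              for row in range(y, y + size)
              for col in range(x, x + size)}
    if len(values) == 1:
        return values.pop()
    half = size // 2
    return [
        build_quadtree(image, x, y, half),                # NW
        build_quadtree(image, x + half, y, half),         # NE
        build_quadtree(image, x, y + half, half),         # SW
        build_quadtree(image, x + half, y + half, half),  # SE
    ]


# Hypothetical 4x4 image; only the lower-left quadrant is not uniform.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
tree = build_quadtree(image)
print(tree)
```

Uniform quadrants collapse into a single leaf, so only the mixed quadrant is split further.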

BITMAP SUPERIMPOSED CODING

• IN ITS BASIC FORM, EACH DOCUMENT IS REPRESENTED BY A ROW IN A BINARY ARRAY, THE COLUMNS OF WHICH REPRESENT THE b RELEVANT TERMS (very expensive)

• THE SUPERIMPOSED VARIANT CODES EACH DOCUMENT WITH A SHORTER (n<<b) BIT STRING

• EACH RELEVANT TERM IS CODED WITH AN n-BIT STRING IN WHICH k (k<n) BITS ARE SET TO 1; THE TERM CODES ARE THEN OR-ed (false drops, i.e. coding synonyms, are generated)

• THE RESULT OF SUPERIMPOSING THE TERM CODES IS THE DOCUMENT SIGNATURE

BITMAP SUPERIMPOSED CODING

Data         0000 0010 0000 1000
base         0100 0010 0000 0000
management   0000 0100 0001 0000
system       0000 0000 0101 0000
SIGNATURE    0100 0110 0101 1000

IN LARGE DOCUMENT REPOSITORIES, DENSE INDEXES CAN BE BUILT ON THE MAIN TABLE
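The signature in the table can be reproduced by OR-ing the four term codes. The sketch below copies the 16-bit codes from the table; how such codes are generated (normally by hashing each term) is not covered by the slides, so the lookup table stands in for that step.

```python
# Term codes copied from the table above (n = 16 bits, k = 2 bits set).
TERM_CODES = {
    "data":       0b0000_0010_0000_1000,
    "base":       0b0100_0010_0000_0000,
    "management": 0b0000_0100_0001_0000,
    "system":     0b0000_0000_0101_0000,
}


def signature(terms, codes=TERM_CODES):
    """Superimpose (bitwise OR) the codes of all the terms of a document."""
    sig = 0
    for term in terms:
        sig |= codes[term]
    return sig


def as_bits(value, width=16):
    """Format an integer as groups of four bits, as in the table."""
    bits = format(value, f"0{width}b")
    return " ".join(bits[i:i + 4] for i in range(0, width, 4))


doc_signature = signature(["data", "base", "management", "system"])
print(as_bits(doc_signature))  # 0100 0110 0101 1000
```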

BITMAPS AND INVERTED FILES

• BITMAPS ARE PROFITABLY USED TO REPRESENT SHORT TEXTS WITH A MOSTLY HOMOGENEOUS VOCABULARY

• MEMORY OVERHEAD VERSUS THE NUMBER OF DOCUMENTS CONTAINING THE SAME KEY
  – BIT MAP: CONSTANT
  – INVERTED LISTS: LINEAR GROWTH

• WITH BITMAP ORGANIZATIONS, QUERY PROCESSING BECOMES A SIMPLE BINARY STRING MATCHING BETWEEN THE QUERY BITMAP AND THOSE OF THE DOCUMENTS
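A possible reading of the string matching step: a document is a candidate when every bit set in the query signature is also set in the document signature. The sketch below reuses the term codes of the superimposed coding example and adds one invented code ("file") purely to provoke a false drop; the document names are hypothetical as well.

```python
TERM_CODES = {
    "data":       0b0000_0010_0000_1000,
    "base":       0b0100_0010_0000_0000,
    "management": 0b0000_0100_0001_0000,
    "system":     0b0000_0000_0101_0000,
    "file":       0b0000_0000_0100_0001,  # hypothetical code, not from the slides
}


def signature(terms):
    """Superimpose (bitwise OR) the codes of the given terms."""
    sig = 0
    for term in terms:
        sig |= TERM_CODES[term]
    return sig


def may_contain(doc_signature, query_signature):
    """True when every bit set in the query is also set in the document."""
    return (doc_signature & query_signature) == query_signature


documents = {
    "doc1": ["data", "base", "management", "system"],
    "doc2": ["data", "base"],
    "doc3": ["management", "file"],
}
query = signature(["system"])
candidates = [name for name, terms in documents.items()
              if may_contain(signature(terms), query)]
# Candidates that do not really contain the term are false drops.
false_drops = [name for name in candidates if "system" not in documents[name]]
print(candidates, false_drops)
```

The codes of "management" and "file" together cover both bits of "system", so doc3 passes the bit test although it lacks the term. Candidates therefore still have to be checked against the text.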

AUTOMATIC TEXT ANALYSIS

ITS GOAL IS TO EXTRACT THE TERMS TO BE INCLUDED IN THE INDEXES AND THEIR MUTUAL RELATIONSHIPS
  – SINGLE TERMS (KWOC)
  – TERMS IN CONTEXT (KWIC)

  – EXHAUSTIVE INDEXING (HIGHER RECALL)
  – SPECIFIC INDEXING (HIGHER PRECISION)

  – DEEP INDEXING (HIGHER PERFORMANCE, HIGHER COST)
  – SHALLOW INDEXING (LOWER PERFORMANCE, LOWER COST)

AUTOMATIC TEXT ANALYSIS

ZIPF LAW (least effort principle)

ORDERING THE SET OF WORDS IN A TEXT IN DECREASING FREQUENCY ORDER (RANK), IT CAN BE OBSERVED THAT

\[ \mathrm{RANK}(i) \cdot \mathrm{FREQ}(i) = \mathrm{CONSTANT} \]

FOR THE ENGLISH LANGUAGE:
  – CONSTANT ≈ 0.1
  – 50% OF DISTINCT WORDS ARE FOUND ONLY ONCE
  – 80% OF DISTINCT WORDS DO NOT APPEAR MORE THAN 4 TIMES
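The rank-frequency product can be tabulated for any text with a few lines of code. The sketch below assumes that FREQ is the relative frequency of a word (count divided by the total number of words) and uses whitespace tokenization. The sample sentence is made up and far too short to follow the law; a large corpus is needed to observe it.

```python
from collections import Counter


def rank_frequency_table(text):
    """Return (rank, word, relative frequency, rank * frequency) rows."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        freq = count / total
        rows.append((rank, word, freq, rank * freq))
    return rows


def share_of_words_seen_once(text):
    """Fraction of the distinct words that occur exactly once."""
    counts = Counter(text.lower().split())
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


sample = "the cat and the dog and the bird"
rows = rank_frequency_table(sample)
for rank, word, freq, product in rows:
    print(f"{rank:>2} {word:<5} {freq:.3f} {product:.3f}")
print(share_of_words_seen_once(sample))
```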

OPERATIONS ON TEXT

• COMPRESSION
  – VARIABLE LENGTH CODES
    • MOST FREQUENT WORDS: SHORTER CODE
    • MOST FREQUENT LETTERS: SHORTER CODE
      HUFFMAN CODE: 3 BITS FOR E, 10 BITS FOR Z, AVERAGE LENGTH 4.12 BITS, 48% COMPRESSION
  – DIGRAM, TRIGRAM, ... CODING

• CRYPTOGRAPHY: REVERSIBLE TEXT TRANSFORMATION
    • INFORMATION PRIVACY
    • ACCESS RIGHTS AUTHENTICATION
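The variable length coding mentioned under COMPRESSION can be illustrated with a small Huffman coder over letters. This is a sketch of the textbook algorithm applied to a made-up string; it is not the code table behind the 3-bit/10-bit figures quoted above, which come from English letter statistics.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix code in which frequent symbols get shorter codewords."""
    counts = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: codeword so far}).
    heap = [(weight, i, {symbol: ""})
            for i, (symbol, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol text
        return {symbol: "0" for symbol in counts}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "information retrieval systems"
code = huffman_code(text)
encoded_bits = sum(len(code[ch]) for ch in text)
average_length = encoded_bits / len(text)
print(f"average codeword length: {average_length:.2f} bits per character")
```

Comparing the average codeword length with a fixed 8 bits per character gives the compression ratio for this particular string.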

AUTOMATIC INDEXING

• THE CHOICE OF INSERTING A TERM INTO AN INDEX IS TO BE MADE ON THE BASIS OF TWO PARAMETERS
  – ITS RELEVANCE FOR IDENTIFYING A DOCUMENT (RECALL)
  – ITS WEIGHT FOR SINGLING OUT A DOCUMENT FROM A COLLECTION OF SIMILAR DOCUMENTS (PRECISION)

• TERM OCCURRENCE PROPERTIES IN A WHOLE COLLECTION OF N DOCUMENTS MUST BE EXAMINED
  – THE MOST COMMON FUNCTIONAL TERMS ARE REMOVED (ARTICLES, PREPOSITIONS, ETC.) BY MEANS OF A STOP LIST
  – THE FREQUENCY tf_ij OF THE REMAINING TERMS T_j IN EACH DOCUMENT D_i IS COMPUTED
  – A THRESHOLD FREQUENCY T IS CHOSEN AND EACH DOCUMENT D_i IS ASSIGNED ALL THE TERMS T_j FOR WHICH tf_ij > T

AUTOMATIC INDEXING

– TERMS WHICH ALLOW A GOOD INDEXING BOTH FOR RECALL AND PRECISION APPEAR
  – OFTEN IN INDIVIDUAL DOCUMENTS
  – SELDOM IN THE REMAINING COLLECTION

• A GOOD PERFORMANCE INDEX IS THE WEIGHT

\[ w_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_j}\right) \]

WHERE THE DOCUMENT FREQUENCY df_j REPRESENTS THE NUMBER OF DOCUMENTS IN THE COLLECTION IN WHICH THE TERM T_j APPEARS
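The weight can be computed directly from term counts. The sketch below assumes whitespace tokenization, raw counts for the term frequency and the natural logarithm; the slides do not fix the base of the logarithm, and the three documents are made up.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {doc_id: {term: w_ij}} with w_ij = tf_ij * log(N / df_j)."""
    n_docs = len(documents)
    term_counts = {doc_id: Counter(text.lower().split())
                   for doc_id, text in documents.items()}
    # df_j: number of documents in which each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {term: tf * math.log(n_docs / doc_freq[term])
                 for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


documents = {
    "D1": "index index retrieval system",
    "D2": "database system",
    "D3": "database query system",
}
weights = tf_idf_weights(documents)
for term, weight in sorted(weights["D1"].items()):
    print(f"{term:<10} {weight:.3f}")
```

A term found in every document ("system") gets weight 0, while a term that is frequent in one document and absent elsewhere ("index") gets the largest weight.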

AUTOMATIC INDEXING

• ON
  – TITLE ONLY
  – TITLE AND ABSTRACT (best cost/performance)
  – FULL TEXT

• PROCESS STEPS
  – REMOVE STOP WORDS
  – CREATE WORD STEMS BY REMOVING PREFIXES AND SUFFIXES
  – COALESCE EQUIVALENT STEMS (THESAURI)
  – WEIGHT REMAINING TERMS
  – APPLY POSSIBLE THRESHOLDS
  – INSERT REMAINING TERMS INTO THE INDEX
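The process steps can be chained into a small pipeline. Everything concrete in this sketch is an assumption made for illustration: the stop list, the one-entry thesaurus, a stemmer that only strips a plural "s", and raw frequency used as the weight.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "for"}   # illustrative stop list
THESAURUS = {"retrieval": "search"}                   # hypothetical stem class


def stem(word):
    """Placeholder stemmer: strips a plural 's' only."""
    if word.endswith("s") and len(word) - 1 >= 3:
        return word[:-1]
    return word


def index_terms(text, threshold=1):
    """Return {term: weight} for the terms whose weight exceeds the threshold."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]  # stop words
    stems = [stem(w) for w in words]                                  # stemming
    classes = [THESAURUS.get(s, s) for s in stems]                    # coalescing
    weights = Counter(classes)                                        # weighting
    return {term: w for term, w in weights.items() if w > threshold}  # threshold


text = "the search of documents and the retrieval of documents in document systems"
terms = index_terms(text)
print(terms)
```

The returned dictionary holds the terms that would be inserted into the index for this document.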

THESAURI

• THESAURI ALLOW A LARGER RECALL BY SUBSTITUTING TOO SPECIFIC TERMS WITH MORE COMMON SYNONYMS

• STEM USAGE REQUIRES THAT CORRECT LEXICAL RULES ARE FOLLOWED FOR EACH LANGUAGE (e.g. SUBSTITUTION OF THE FINAL I WITH Y)

• STEMS MUST BE AT LEAST THREE CHARACTERS LONG IN ORDER TO BE SIGNIFICANT (the rule that strips the progressive ending -ING would otherwise truncate KING to K)
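The two stemming remarks translate into simple guards. The sketch below is a deliberately naive illustration of one suffix rule with a minimum stem length and of the I-to-Y substitution; the function names are made up and this is not a complete stemmer.

```python
MIN_STEM_LENGTH = 3


def strip_ing(word, min_length=MIN_STEM_LENGTH):
    """Remove a trailing 'ing' unless the stem would become too short."""
    if word.endswith("ing"):
        candidate = word[:-3]
        if len(candidate) >= min_length:
            return candidate
    return word


def restore_final_y(stem):
    """Replace a final 'i' with 'y', e.g. 'studi' (from 'studies') -> 'study'."""
    return stem[:-1] + "y" if stem.endswith("i") else stem


print(strip_ing("indexing"))     # index
print(strip_ing("king"))         # king: the guard prevents the stem 'k'
print(restore_final_y("studi"))  # study
```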

DOCUMENT SEARCH

• INTERACTIVITY
  – AFTER THE FIRST QUERY, THE SYSTEM SHOWS THE NUMBER OF RELEVANT DOCUMENTS
  – IN EACH FURTHER ITERATION, THE USER TRIES TO ENHANCE THE PRECISION UNTIL THE NUMBER OF RETRIEVED DOCUMENTS IS SMALL ENOUGH TO BE INSPECTED DIRECTLY

• RANKING
  – DOCUMENTS ARE PRESENTED IN RELEVANCE ORDER BASED ON WEIGHTS ASSIGNED TO THE DIFFERENT TERMS

• BROWSING
  – SIMILAR DOCUMENTS ARE GROUPED IN A SINGLE CLASS AND INSPECTED “BY PROXIMITY”

DOCUMENT SEARCH

• RELEVANCE FEEDBACK
  – THE SYSTEM INVITES THE USER TO EVALUATE THE RELEVANCE OF EACH RETRIEVED DOCUMENT
  – FROM THE ANSWERS, THE SYSTEM TUNES THE TERM WEIGHTS IN THE DOCUMENTS

• USER PROFILES
  – INFORMATION ABOUT
    • MOST CONSULTED DOCUMENTS
    • RELEVANCE ANALYSIS RESULTS
    • INFORMATION ABOUT THE WORK CONTEXT
  – DYNAMIC MANAGEMENT IS NEEDED
  – CAN BE USED IN WORKING ENVIRONMENTS WITH WELL DEFINED, REGULAR USERS

LANGUAGES FOR DOCUMENT SEARCHING

• QUERY LANGUAGES ARE MOSTLY BASED ON FUNDAMENTAL SET OPERATORS - AND, OR, NOT - AND THEIR COMBINATIONS

• SUPPLEMENTARY OPERATORS
  – TERMS ORDERING
  – TERMS CONTIGUITY
  – WILDCARDS (truncation or separation)
  – SEARCH FIELD (title, abstract, full text)

• OTHER COMMANDS
  – DOCUMENT DATA BANK CHOICE
  – THESAURUS INSPECTION
  – SAVING OF SEARCH RESULTS
  – ........
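On top of an inverted file, the set operators map directly onto set operations over posting lists. The sketch below uses a made-up four-document collection and hypothetical helper names; right truncation is the only wildcard shown.

```python
documents = {
    1: "database management system",
    2: "information retrieval system",
    3: "database index structures",
    4: "web information search",
}

# Inverted file: term -> set of document ids.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(documents)


def docs(term):
    """Posting set of a term (empty if the term is unknown)."""
    return index.get(term, set())


def docs_with_prefix(prefix):
    """Right truncation (wildcard), e.g. 'inde*'."""
    result = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            result |= ids
    return result


and_result = docs("database") & docs("system")      # database AND system
or_result = docs("database") | docs("retrieval")    # database OR retrieval
and_not_result = docs("information") - docs("web")  # information AND NOT web
not_result = all_docs - docs("web")                 # NOT web
wildcard_result = docs_with_prefix("inde")          # inde*
print(and_result, or_result, and_not_result, not_result, wildcard_result)
```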

NETWORK SEARCH

THE MAIN DIFFERENCES BETWEEN WEB SEARCHING AND TRADITIONAL INFORMATION RETRIEVAL ARE:
  – HIGHER HETEROGENEITY OF WEB INFORMATION
  – EXTREMELY LARGE DIMENSIONS OF THE SEARCH DOMAIN (year 2005)
    • 8×10^9 STATIC WEB PAGES AMOUNTING TO 10^2 TBYTE
    • 1 MILLION NEW PAGES/DAY (very high volatility)
    • 140×10^3 SEARCHES/MINUTE (Google 2004)
  – EVEN IF THE RECALL IS LARGE, ONLY THE VERY FIRST DOCUMENTS ARE EXAMINED
  – OWING TO THEIR COMMERCIAL VALUE TO ADVERTISERS, SORTING AND RANKING ALGORITHMS ARE AMONG THE BEST KEPT INDUSTRIAL SECRETS!

NETWORK SEARCH

• SEARCH ENGINES USE
  – CENTRALIZED SEARCH INDEXES WITH TREE CATEGORIZATION OF CONTENTS
  – BOTH CONTENT AND CONTEXT
  – EFFECTIVE DOCUMENT CLASSIFICATION
  – PORTALS (SUBJECT GATEWAYS)

• TRADITIONAL ENGINES INDEX INDIVIDUAL PAGES

• A PORTAL, AMONG OTHER FEATURES, RECOGNIZES A DOCUMENT AS SUCH, AND IT KEEPS INFORMATION COHERENCE

SEARCH ENGINES

• DIRECTORY BASED (Magellan, ... )
  KNOWLEDGE IS ORGANIZED INTO TREE STRUCTURES; WEB PAGES ARE CLASSIFIED ACCORDINGLY
    • CLASSIFICATION IS A HEAVY JOB
    • IF THE REQUIRED INFORMATION DOES NOT FALL INTO THE CLASSIFICATION, FINDING IT IS IMPOSSIBLE

• “SPIDER” BASED (Alta Vista, Lycos, Google, ... )
  SPECIFIC PROGRAMS LOOK FOR EVERYTHING AND ORGANIZE THE TOPICS IN ANY WAY
    • THE SPIDER EXPLORES THE WEB AND FINDS THE PAGES
    • A DATABASE STORES THE RETRIEVED INFORMATION AND THE RELEVANCE SORTING ALGORITHMS
    • A USER INTERFACE ALLOWS QUERY FORMULATION AND RESULT PRESENTATION

SEARCH ENGINES

GOOGLE

• BORN AS A RESEARCH PRODUCT AT STANFORD

• IT USES
  – AN INDEX WITH MORE THAN 10^9 PAGES
  – A SPIDER ADDING ABOUT 10^6 PAGES/DAY

• IT MANAGES 200 MILLION SEARCHES/DAY

• SEARCH RESULTS ARE EVALUATED BY MEANS OF PageRank™ TECHNOLOGY
  – RELEVANCE IS COMPUTED BY MEANS OF MATHEMATICAL FORMULAS WITH 500×10^6 VARIABLES AND 2×10^9 TERMS
  – IT TAKES INTO ACCOUNT BOTH PAGE CONTENT AND REFERENCES MADE FROM OTHER PAGES, CLASSIFIED BY RELEVANCE
  – IT TRIES TO AVOID USER INTERFERENCE IN RANKING
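The idea of ranking a page by the references it receives from other pages can be sketched with the simplified, published form of PageRank computed by power iteration. This is a textbook illustration on a made-up four-page graph, not the production algorithm: it ignores page content, and the damping factor 0.85 and the iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration; `links` maps each page to the pages it cites."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page passes an equal share of its rank to each cited page.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: C is cited by A, B and D.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best, {page: round(value, 3) for page, value in ranks.items()})
```

A page cited by many pages, or by highly ranked ones, ends up with a higher rank; here that is C, and A follows because it is the only page C cites.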