SYSTEMS FOR NON STRUCTURED INFORMATION
MANAGEMENT
Prof. Fabio A. Schreiber
Dipartimento di Elettronica e Informazione
Politecnico di Milano
Fabio A. Schreiber© Inf. retrieval 1
INFORMATION SEARCH AND RETRIEVAL
PRESENTATION SCHEMA
• GOALS AND ARCHITECTURES OF INFORMATION RETRIEVAL SYSTEMS
• PHYSICAL AND LOGICAL STORAGE STRUCTURES
• AUTOMATIC TEXT ANALYSIS AND INDEX BUILDING
• INTERNET SEARCHING
INFORMATION MANAGEMENT TECHNOLOGIES
• EMBEDDED SYSTEMS
• INFORMATION SYSTEMS ANALYSIS
• DATA INTEGRATION
• DISTRIBUTED HETEROGENEOUS DATA MANAGEMENT
• DATA WAREHOUSE
• DATA MINING
• WEB INFORMATION SYSTEMS
• INFORMATION RETRIEVAL SYSTEMS
• DECISION SUPPORT SYSTEMS
• NON STRUCTURED, SEMISTRUCTURED AND MULTIMEDIA INFORMATION
• MOBILE AND CONTEXT-AWARE COMPONENTS
• REAL-TIME, MAIN MEMORY AND TEMPORAL DATABASES
MANAGEMENT INFORMATION SYSTEMS
• INFORMATION
  – COMPLEX
  – HIGHLY STRUCTURED
• QUERIES
  – COMPLEX
  – MOSTLY RECURRENT
• UPDATES
  – FREQUENCY IS RANDOM, BUT HIGH
  – OFTEN ON-LINE
• TECHNOLOGY USED
  – DATABASE MANAGEMENT SYSTEMS
INFORMATION SEARCH
• INFORMATION
  – SIMPLE (authors, keywords, colours, patterns, ...)
  – POORLY STRUCTURED
• QUERIES
  – COMPLEX
    • CLAUSES ARE LOGICALLY CONNECTED
    • PARTIALLY SPECIFIED
    • ITERATIVE REFINEMENT
  – NOT FORESEEABLE
INFORMATION SEARCH
• UPDATES
  – MOSTLY PERIODIC, WITH LOW FREQUENCY
  – OFTEN OFF-LINE
• TECHNOLOGY USED
  – INDEXING AND SEARCHING BY KEYWORDS
  – DIRECT SEARCH ON TEXT
    • FULL TEXT
    • ABSTRACT
    • SIGNATURE
NON STRUCTURED INFORMATION
DOCUMENT
ANY COLLECTION OF INFORMATION SEARCHABLE BY ITS CONTENT
– TEXTS
– STATISTICAL DATA
– IMAGES
– SOUNDS
FUNCTIONAL ARCHITECTURE OF AN INFORMATION RETRIEVAL SYSTEM (IRS)
[Figure: block diagram with the labels QUERIES, DOCUMENTS, FORMAL LANGUAGE, INDEXED DOCUMENTS, SIMILARITY ASSESSMENT, SIMILAR ITEMS EXTRACTION, SEARCH FORMULATION PROCESS, DOCUMENTS STORAGE PROCESS]
DOCUMENT SPACE W.R.T. A QUERY RESULT
[Figure: within the set of ALL DOCUMENTS, the RELEVANT and RETRIEVED sets overlap, giving four regions]
• RETRIEVED AND RELEVANT (RITRIL)
• RETRIEVED, BUT NON RELEVANT (RITNRIL)
• NON RETRIEVED, BUT RELEVANT (NRITRIL)
• NON RETRIEVED AND NON RELEVANT (NRITNRIL)
INFORMATION RETRIEVAL SYSTEMS
GOAL OF AN IRS IS TO EFFECTIVELY RETRIEVE ALL THE DOCUMENTS WHICH ARE RELEVANT TO A GIVEN QUERY AND ONLY THEM
• PERFORMANCE INDEXES
  – RECALL: EFFECTIVENESS IN FINDING THE USEFUL MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RELEVANT DOCUMENTS)
    $$\mathrm{RECALL} = \frac{\mathrm{RITRIL}}{\mathrm{RITRIL} + \mathrm{NRITRIL}}$$
  – PRECISION: EFFECTIVENESS IN REMOVING THE USELESS MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RETRIEVED DOCUMENTS)
    $$\mathrm{PRECISION} = \frac{\mathrm{RITRIL}}{\mathrm{RITRIL} + \mathrm{RITNRIL}}$$
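The two performance indexes can be computed directly from the sets of retrieved and relevant documents. The sketch below is only an illustration: the function name and the document ids are made up, and the variable names follow the RITRIL / NRITRIL / RITNRIL abbreviations of the slides.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    ritril = len(retrieved & relevant)    # retrieved and relevant
    nritril = len(relevant - retrieved)   # not retrieved, but relevant
    ritnril = len(retrieved - relevant)   # retrieved, but not relevant
    recall = ritril / (ritril + nritril) if relevant else 0.0
    precision = ritril / (ritril + ritnril) if retrieved else 0.0
    return recall, precision


# Made-up example: 5 documents retrieved, 10 documents relevant, 2 in common.
retrieved = {1, 2, 3, 4, 5}
relevant = {4, 5, 6, 7, 8, 9, 10, 11, 12, 13}
recall, precision = recall_precision(retrieved, relevant)
print(recall, precision)  # 0.2 0.4
```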
INFORMATION RETRIEVAL SYSTEMS
EXPERIMENTAL FINDING: THE USER IS (PSYCHOLOGICALLY) HAPPY WITH LOW RECALL VALUES (~20%), BUT HIGH PRECISION (~80%) IS REQUIRED
[Figure: the document space diagram again, with the regions (NRITNRIL), RETRIEVED AND RELEVANT (RITRIL), (RITNRIL), and NON RETRIEVED, BUT RELEVANT (NRITRIL)]
STORAGE STRUCTURES
THEY DEPEND ON THE PHYSICAL NATURE OF THE DOCUMENT (text, image, ...) AND ON THE INTENDED USAGE
– TEXT
  • INVERTED FILES
    – FOR EACH TERM OR ATTRIBUTE VALUE A DENSE INDEX TO THE FILE IS BUILT
    – THE SET OF ALL THE INDEXES CONSTITUTES THE INVERTED FILE
  • BIT MAPS
– GRAPHICS
  • QUADTREES OF DIFFERENT TYPES
    – THE IMAGE SPACE IS RECURSIVELY DECOMPOSED INTO SQUARES UNTIL A SQUARE CONTAINS A SINGLE MEANINGFUL ELEMENT
    – THE RESULTING TREE IS CODED AND STORED IN A COMPACT FORMAT
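As a minimal sketch of the inverted file idea, the code below builds, for each term, the list of documents in which it occurs. The three tiny documents and the function name are invented for the illustration, and whitespace tokenization is an assumption, not something the slides prescribe.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up document repository.
documents = {
    1: "database management system",
    2: "information retrieval system",
    3: "database index",
}
inverted_file = build_inverted_file(documents)
print(inverted_file["database"])  # [1, 3]
print(inverted_file["system"])    # [1, 2]
```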
INVERTED FILES
[Figure labels: KEYWORDS (CONTROLLED VOCABULARY), INVERTED FILE, DOCUMENT REPOSITORY, FILE SYSTEM, INVERSION INDEX; the figure distinguishes the PHYSICAL ARCHITECTURE from the LOGICAL STRUCTURE]
THESAURUS
• SYNONYMS
• HOMONYMS
• DIFFERENT SPELLINGS
• SEMANTIC LINKS (CROSS REFERENCE, KWIC)
• HIERARCHICAL RELATIONS (GENERALIZATION-SPECIALIZATION)
STAIRS STORAGE STRUCTURE
[Figure: STAIRS storage structure, linking four files: DICTIONARY (TERMS), INVERSION FILE, INDEX TO TEXT, TEXT FILE. Field labels in the figure, in order: TERM; POINTER TO THE INVERSION FILE; POINTER TO SYNONYMS; # OF DOCUMENTS; # OF OCCURRENCES; OCCURR. 1, OCCURR. 2, ..., OCCURR. n; UPPER/LOWERCASE; N° OF THE DOCUMENT; SECTION CODE; N° OF THE SENTENCE; N° OF THE WORD; DOCUMENT ADDRESS; PRIVACY CODE; FORMATTED FIELDS; DOCUMENT HEADER; HEADER OF §1; HEADER OF §2; TEXT §1; TEXT §2; ...]
FROM: SALTON 89
REGION QUADTREE
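[Figure: an image region decomposed into labelled blocks (B, F, G, H, I, J, L, M, N, O, Q, 37-40, 57-60) and the corresponding quadtree rooted at A, with nodes B-E, F-J, K, L-O, P, Q and leaves 37-40, 57-60]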
FROM: SAMET 90
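A region quadtree can be sketched in a few lines for a square binary image whose side is a power of two. This is an illustrative sketch, not the coding used in the cited figure: the stopping rule "all pixels in the square are equal", the NW, NE, SW, SE child order, and the nested-list representation are assumptions made here.

```python
def build_quadtree(image, x=0, y=0, size=None):
    """Return a leaf value (0 or 1) or a list of four subtrees [NW, NE, SW, SE]."""
    if size is None:
        size = len(image)
    values = {image[y + dy][x + dx] for dy in range(size) for dx in range(size)}
    if len(values) == 1:
        # Homogeneous square: stop the decomposition here.
        return values.pop()
    half = size // 2
    return [
        build_quadtree(image, x, y, half),                # NW
        build_quadtree(image, x + half, y, half),         # NE
        build_quadtree(image, x, y + half, half),         # SW
        build_quadtree(image, x + half, y + half, half),  # SE
    ]


# Made-up 4x4 binary image.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
tree = build_quadtree(image)
print(tree)  # [0, 1, [0, 1, 1, 1], 0]
```

Only the mixed south-west quadrant is split further; the three homogeneous quadrants are stored as single leaves, which is where the compactness comes from.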
BITMAP SUPERIMPOSED CODING
• IN ITS BASIC FORM, EACH DOCUMENT IS REPRESENTED BY A ROW IN A BINARY ARRAY, THE COLUMNS OF WHICH REPRESENT THE b RELEVANT TERMS (very expensive)
• THE SUPERIMPOSED VARIANT CODES EACH DOCUMENT WITH A SHORTER (n<<b) BIT STRING
• EACH RELEVANT TERM IS CODED AS AN n-BIT STRING IN WHICH k (k<n) BITS ARE SET TO 1; THE TERM CODES ARE THEN OR-ed (false drops, i.e. coding synonyms, are generated)
• THE GENERATED TERM CODES ARE COMBINED TO PRODUCE THE SIGNATURE
BITMAP SUPERIMPOSED CODING
TERM         CODE
Data         0000 0010 0000 1000
base         0100 0010 0000 0000
management   0000 0100 0001 0000
system       0000 0000 0101 0000
SIGNATURE    0100 0110 0101 1000
IN LARGE DOCUMENT REPOSITORIES, DENSE INDEXES CAN BE BUILT ON THE MAIN TABLE
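The signature in the table can be reproduced by OR-ing the four term codes. In the sketch below the 16-bit codes are copied from the slide; how such codes are generated in practice (typically by hashing each term) is not stated in the slides, and the helper names are made up.

```python
# Term codes copied from the slide: n = 16 bits, k = 2 bits set per term.
TERM_CODES = {
    "data":       0b0000_0010_0000_1000,
    "base":       0b0100_0010_0000_0000,
    "management": 0b0000_0100_0001_0000,
    "system":     0b0000_0000_0101_0000,
}


def signature(terms, codes=TERM_CODES):
    """Superimpose (bitwise OR) the codes of all the terms of a document."""
    sig = 0
    for term in terms:
        sig |= codes[term]
    return sig


def as_bits(value, width=16):
    """Format an integer as groups of four bits, as on the slide."""
    bits = format(value, f"0{width}b")
    return " ".join(bits[i:i + 4] for i in range(0, width, 4))


doc_signature = signature(["data", "base", "management", "system"])
print(as_bits(doc_signature))  # 0100 0110 0101 1000
```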
BITMAPS AND INVERTED FILES
• BITMAPS ARE PROFITABLY USED TO REPRESENT SHORT TEXTS WITH A MOSTLY HOMOGENEOUS VOCABULARY
• MEMORY OVERHEAD VERSUS THE NUMBER OF DOCUMENTS CONTAINING THE SAME KEY
  – BIT MAP: CONSTANT
  – INVERTED LISTS: LINEAR GROWTH
• WITH BITMAP ORGANIZATIONS, QUERY PROCESSING BECOMES A SIMPLE BINARY STRING MATCHING BETWEEN THE QUERY BITMAP AND THOSE OF THE DOCUMENTS
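The matching step can be sketched as a bitwise test: a document is a candidate when every bit set in the query bitmap is also set in the document bitmap. The two document signatures below are built from the term codes of the superimposed coding example; the code for the term "file" is hypothetical and chosen only to show a false drop.

```python
# Document signatures (bitwise OR of the term codes from the slide example).
SIGNATURES = {
    "doc1": 0b0100_0110_0101_1000,  # data, base, management, system
    "doc2": 0b0000_0010_0101_1000,  # data, system
}
BASE = 0b0100_0010_0000_0000   # code of "base", from the slide
FILE = 0b0100_0100_0000_0000   # hypothetical code for "file" (not in any document)


def candidates(query_signature, signatures=SIGNATURES):
    """Return the documents whose signature covers all the query bits."""
    return [
        doc
        for doc, sig in signatures.items()
        if (sig & query_signature) == query_signature
    ]


print(candidates(BASE))  # ['doc1']
print(candidates(FILE))  # ['doc1'] is a false drop: "file" is not in doc1
```

Because of false drops, the candidates returned by the bitmap test still have to be checked against the actual text.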
AUTOMATIC TEXT ANALYSIS
ITS GOAL IS TO EXTRACT THE TERMS TO BE INCLUDED IN THE INDEXES AND THEIR MUTUAL RELATIONSHIPS
– SINGLE TERMS (KWOC)
– TERMS IN CONTEXT (KWIC)
– EXHAUSTIVE INDEXING (HIGHER RECALL)
– SPECIFIC INDEXING (HIGHER PRECISION)
– DEEP INDEXING (HIGHER PERFORMANCE, HIGHER COST)
– SHALLOW INDEXING (LOWER PERFORMANCE, LOWER COST)
AUTOMATIC TEXT ANALYSIS
ZIPF'S LAW (least effort principle)
ORDERING THE SET OF WORDS IN A TEXT BY DECREASING FREQUENCY (RANK), IT CAN BE OBSERVED THAT
$$\mathrm{RANK}(i) \cdot \mathrm{FREQ}(i) = \mathrm{CONSTANT}$$
FOR THE ENGLISH LANGUAGE:
– CONSTANT ≈ 0.1
– 50% OF DISTINCT WORDS ARE FOUND ONLY ONCE
– 80% OF DISTINCT WORDS DO NOT APPEAR MORE THAN 4 TIMES
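The rank-frequency product can be tabulated for any text with a word counter. The sketch below assumes that FREQ is the relative frequency (occurrences divided by the total number of words), which is consistent with a constant near 0.1 but is not stated on the slide; the sample sentence is made up and far too short to exhibit the law, so it only shows the mechanics.

```python
from collections import Counter


def rank_frequency_products(text):
    """Return (word, rank, relative frequency, rank * frequency) rows."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    rows = []
    # most_common() lists the words by decreasing number of occurrences.
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        freq = count / total
        rows.append((word, rank, freq, rank * freq))
    return rows


sample = "the cat and the dog and the bird"
rows = rank_frequency_products(sample)
for row in rows:
    print(row)
```

On a large English corpus one would expect the last column to stay roughly constant over the middle ranks.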
OPERATIONS ON TEXT
• COMPRESSION
  – VARIABLE LENGTH CODES
    • MOST FREQUENT WORDS: SHORTER CODE
    • MOST FREQUENT LETTERS: SHORTER CODE
    HUFFMAN CODE: 3 BITS FOR E, 10 BITS FOR Z, AVERAGE LENGTH 4.12 BITS, 48% COMPRESSION
  – DIGRAM, TRIGRAM, … CODING
• CRYPTOGRAPHY: REVERSIBLE TEXT TRANSFORMATION
  • INFORMATION PRIVACY
  • ACCESS RIGHTS AUTHENTICATION
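For the compression part, the variable-length idea behind the Huffman code can be sketched as follows: the two least frequent symbols are merged repeatedly, so that frequent symbols end up with short codes. The four-letter alphabet and its frequencies are made up for the illustration and are not the English letter statistics behind the 4.12-bit figure quoted above.

```python
import heapq


def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [
        (weight, i, {symbol: 0})
        for i, (symbol, weight) in enumerate(frequencies.items())
    ]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, first = heapq.heappop(heap)
        w2, _, second = heapq.heappop(heap)
        # Merging two subtrees pushes all their symbols one level deeper.
        merged = {s: depth + 1 for s, depth in {**first, **second}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


# Made-up symbol frequencies (they sum to 1).
frequencies = {"e": 0.5, "t": 0.25, "a": 0.125, "z": 0.125}
lengths = huffman_code_lengths(frequencies)
average = sum(frequencies[s] * lengths[s] for s in frequencies)
print(lengths)  # {'e': 1, 't': 2, 'a': 3, 'z': 3}
print(average)  # 1.75
```

With a fixed-length code these four symbols would need 2 bits each, so the average of 1.75 bits is the gain obtained from the skewed frequencies.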
AUTOMATIC INDEXING
• THE CHOICE OF INSERTING A TERM INTO AN INDEX IS TO BE MADE ON THE BASIS OF TWO PARAMETERS
  – ITS RELEVANCE FOR IDENTIFYING A DOCUMENT (RECALL)
  – ITS WEIGHT FOR SINGLING OUT A DOCUMENT FROM A COLLECTION OF SIMILAR DOCUMENTS (PRECISION)
• TERM OCCURRENCE PROPERTIES IN A WHOLE COLLECTION OF N DOCUMENTS MUST BE EXAMINED
  – THE MOST COMMON FUNCTION WORDS (ARTICLES, PREPOSITIONS, ETC.) ARE REMOVED BY MEANS OF A STOP LIST
  – THE FREQUENCY $tf_{ij}$ OF EACH REMAINING TERM $T_j$ IN EACH DOCUMENT $D_i$ IS COMPUTED
  – A THRESHOLD FREQUENCY $T$ IS CHOSEN, AND EACH DOCUMENT $D_i$ IS ASSIGNED ALL THE TERMS $T_j$ FOR WHICH $tf_{ij} > T$
AUTOMATIC INDEXING
– TERMS WHICH ALLOW A GOOD INDEXING BOTH FOR RECALL AND PRECISION APPEAR
  – OFTEN IN INDIVIDUAL DOCUMENTS
  – SELDOM IN THE REMAINING COLLECTION
• A GOOD PERFORMANCE INDEX IS THE WEIGHT
  $$w_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_j}\right)$$
  WHERE THE DOCUMENT FREQUENCY $df_j$ REPRESENTS THE NUMBER OF DOCUMENTS IN THE COLLECTION IN WHICH THE TERM $T_j$ APPEARS
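The weight can be computed for a small collection as in the sketch below. The three tokenized documents are made up, and the natural logarithm is an assumption, since the slide does not specify the base.

```python
import math


def tf_idf_weights(documents):
    """Return {doc_id: {term: tf * log(N / df)}} for tokenized documents."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term appears.
    df = {}
    for terms in documents.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for doc_id, terms in documents.items():
        weights[doc_id] = {
            term: terms.count(term) * math.log(n_docs / df[term])
            for term in set(terms)
        }
    return weights


# Made-up collection of N = 3 documents.
documents = {
    "D1": ["quadtree", "image", "quadtree", "index"],
    "D2": ["index", "text"],
    "D3": ["index", "thesaurus"],
}
weights = tf_idf_weights(documents)
print(weights["D1"]["index"])     # 0.0, the term appears in every document
print(weights["D1"]["quadtree"])  # 2 * log(3), frequent here and rare elsewhere
```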
AUTOMATIC INDEXING
• ON
  – TITLE ONLY
  – TITLE AND ABSTRACT (best cost/performance)
  – FULL TEXT
• PROCESS STEPS
  – REMOVE STOP WORDS
  – CREATE WORD STEMS BY REMOVING PREFIXES AND SUFFIXES
  – COALESCE EQUIVALENT STEMS (THESAURI)
  – WEIGHT REMAINING TERMS
  – APPLY POSSIBLE THRESHOLDS
  – INSERT REMAINING TERMS INTO THE INDEX
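The process steps can be chained as in the following sketch. Everything in it is a toy assumption: the stop list, the stem table (a lookup standing in for a real stemmer), the thesaurus entry, the use of the raw term frequency as weight, and the threshold value.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in"}   # toy stop list
STEMS = {                                      # toy stand-in for a stemmer
    "retrieval": "retriev",
    "documents": "document",
    "searching": "search",
}
THESAURUS = {"retriev": "search"}              # toy: stem -> preferred stem


def index_terms(text, threshold=1):
    """Return {term: weight} for the terms of one document that pass the threshold."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]  # stop words
    stems = [STEMS.get(w, w) for w in words]                          # stemming
    terms = [THESAURUS.get(s, s) for s in stems]                      # coalescing
    tf = Counter(terms)                                               # weighting
    return {term: count for term, count in tf.items() if count > threshold}


collection = {"D1": "the retrieval of documents and the searching of documents"}
index = {}
for doc_id, text in collection.items():
    for term, weight in index_terms(text).items():
        index.setdefault(term, {})[doc_id] = weight  # insertion into the index
print(index)  # {'search': {'D1': 2}, 'document': {'D1': 2}}
```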
THESAURI
• THESAURI ALLOW A LARGER RECALL BY SUBSTITUTING TOO SPECIFIC TERMS WITH MORE COMMON SYNONYMS
• STEM USAGE REQUIRES THAT CORRECT LEXICAL RULES ARE FOLLOWED FOR EACH LANGUAGE (e.g. SUBSTITUTION OF THE FINAL I WITH Y)
• STEMS MUST BE AT LEAST THREE CHARACTERS LONG IN ORDER TO BE SIGNIFICANT (otherwise the progressive tense rule would truncate KING to K)
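The two stemming rules mentioned above can be sketched with a deliberately naive suffix stripper. The suffix list is a made-up assumption and does not reproduce any published stemming algorithm; the point is only the minimum stem length and the final I to Y repair.

```python
MIN_STEM_LENGTH = 3
SUFFIXES = ("ing", "es", "s")  # toy suffix list, tried in this order


def stem(word):
    """Strip one suffix, unless the remaining stem would be too short."""
    word = word.lower()
    for suffix in SUFFIXES:
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and len(candidate) >= MIN_STEM_LENGTH:
            # Lexical repair: a stem left ending in "i" gets its "y" back.
            if candidate.endswith("i"):
                return candidate[:-1] + "y"
            return candidate
    return word


print(stem("king"))       # king (stripping "ing" would leave only "k")
print(stem("searching"))  # search
print(stem("libraries"))  # library
```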
DOCUMENT SEARCH
• INTERACTIVITY
  – AFTER THE FIRST QUERY, THE SYSTEM SHOWS THE NUMBER OF RELEVANT DOCUMENTS
  – IN EACH FURTHER ITERATION, THE USER TRIES TO ENHANCE THE PRECISION UNTIL THE NUMBER OF RETRIEVED DOCUMENTS IS SMALL ENOUGH TO BE DIRECTLY INSPECTED
• RANKING
  – DOCUMENTS ARE PRESENTED IN RELEVANCE ORDER BASED ON WEIGHTS ASSIGNED TO THE DIFFERENT TERMS
• BROWSING
  – SIMILAR DOCUMENTS ARE GROUPED IN A SINGLE CLASS AND INSPECTED “BY PROXIMITY”
DOCUMENT SEARCH
• RELEVANCE FEEDBACK
  – THE SYSTEM INVITES THE USER TO EVALUATE THE RELEVANCE OF EACH RETRIEVED DOCUMENT
  – FROM THE ANSWERS, THE SYSTEM TUNES THE TERM WEIGHTS IN THE DOCUMENTS
• USER PROFILES
  – INFORMATION ABOUT
    • MOST CONSULTED DOCUMENTS
    • RELEVANCE ANALYSIS RESULTS
    • INFORMATION ABOUT THE WORK CONTEXT
  – DYNAMIC MANAGEMENT IS NEEDED
  – CAN BE USED IN WORKING ENVIRONMENTS WITH WELL DEFINED, CUSTOMARY USERS
LANGUAGES FOR DOCUMENT SEARCHING
• QUERY LANGUAGES ARE MOSTLY BASED ON THE FUNDAMENTAL SET OPERATORS - AND, OR, NOT - AND THEIR COMBINATIONS
• SUPPLEMENTARY OPERATORS
  – TERM ORDERING
  – TERM CONTIGUITY
  – WILDCARDS (truncation or separation)
  – SEARCH FIELD (title, abstract, full text)
• OTHER COMMANDS
  – DOCUMENT DATA BANK CHOICE
  – THESAURUS INSPECTION
  – STORAGE OF SEARCH RESULTS
  – ........
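On top of an inverted file, the fundamental operators map directly onto set operations over the posting sets, and a truncation wildcard becomes a union over all the terms sharing a prefix. The postings below are made up, and the helper names are hypothetical.

```python
# Made-up inverted file: term -> set of ids of the documents containing it.
POSTINGS = {
    "database": {1, 3, 4},
    "data": {5},
    "retrieval": {2, 3},
    "image": {4, 5},
}
ALL_DOCS = {1, 2, 3, 4, 5}


def docs(term):
    """Posting set of a single term (empty if the term is unknown)."""
    return POSTINGS.get(term, set())


def docs_prefix(prefix):
    """Truncation wildcard: union of the postings of all terms with this prefix."""
    result = set()
    for term, ids in POSTINGS.items():
        if term.startswith(prefix):
            result |= ids
    return result


and_result = docs("database") & docs("retrieval")   # database AND retrieval
or_result = docs("database") | docs("image")        # database OR image
and_not_result = docs("database") - docs("image")   # database AND NOT image
not_result = ALL_DOCS - docs("retrieval")           # NOT retrieval
wildcard_result = docs_prefix("data")               # data*

print(sorted(and_result))       # [3]
print(sorted(or_result))        # [1, 3, 4, 5]
print(sorted(and_not_result))   # [1, 3]
print(sorted(not_result))       # [1, 4, 5]
print(sorted(wildcard_result))  # [1, 3, 4, 5]
```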
NETWORK SEARCH
THE MAIN DIFFERENCES BETWEEN WEB SEARCHING AND TRADITIONAL INFORMATION RETRIEVAL ARE:
– HIGHER HETEROGENEITY OF WEB INFORMATION
– EXTREMELY LARGE DIMENSIONS OF THE SEARCH DOMAIN (year 2005)
  • $8 \times 10^{9}$ STATIC WEB PAGES AMOUNTING TO $10^{2}$ TBYTE
  • 1 MILLION/DAY NEW PAGES (very high volatility)
  • $140 \times 10^{3}$ SEARCHES / MINUTE (Google 2004)
– EVEN IF THE RECALL IS LARGE, ONLY THE VERY FIRST DOCUMENTS ARE EXAMINED
– OWING TO THEIR COMMERCIAL VALUE TO ADVERTISERS, SORTING AND RANKING ALGORITHMS ARE AMONG THE BEST KEPT INDUSTRIAL SECRETS!
NETWORK SEARCH
• SEARCH ENGINES USE
  – CENTRALIZED SEARCH INDEXES WITH TREE CATEGORIZATION OF CONTENTS
  – BOTH CONTENT AND CONTEXT
  – EFFECTIVE DOCUMENT CLASSIFICATION
  – PORTALS (SUBJECT GATEWAYS)
• TRADITIONAL ENGINES INDEX INDIVIDUAL PAGES
• A PORTAL, AMONG OTHER FEATURES, RECOGNIZES A DOCUMENT AS SUCH, AND IT KEEPS INFORMATION COHERENCE
SEARCH ENGINES
• DIRECTORY BASED (Magellan, ... )
  KNOWLEDGE IS ORGANIZED INTO TREE STRUCTURES; WEB PAGES ARE CLASSIFIED ACCORDINGLY
  • CLASSIFICATION IS A HEAVY JOB
  • IF THE REQUIRED INFORMATION DOES NOT FALL INTO THE CLASSIFICATION, FINDING IT IS IMPOSSIBLE
• “SPIDER” BASED (Alta Vista, Lycos, Google, ... )
  SPECIFIC PROGRAMS LOOK FOR EVERYTHING AND ORGANIZE THE TOPICS IN ANY MODE
  • THE SPIDER EXPLORES THE WEB AND FINDS THE PAGES
  • A DATABASE STORES THE RETRIEVED INFORMATION AND THE RELEVANCE SORTING ALGORITHMS
  • A USER INTERFACE ALLOWS QUERY FORMULATION AND RESULT PRESENTATION
SEARCH ENGINES
GOOGLE
• BORN AS A RESEARCH PRODUCT AT STANFORD
• IT USES
  – AN INDEX WITH MORE THAN $10^{9}$ PAGES
  – A SPIDER ADDING MORE OR LESS $10^{6}$ PAGES/DAY
• IT MANAGES 200 MILLION SEARCHES/DAY
• SEARCH RESULTS ARE EVALUATED BY MEANS OF PageRank™ TECHNOLOGY
  – RELEVANCE IS COMPUTED BY MEANS OF MATHEMATICAL FORMULAS WITH $500 \times 10^{6}$ VARIABLES AND $2 \times 10^{9}$ TERMS
  – IT ACCOUNTS BOTH FOR PAGE CONTENT AND FOR REFERENCES MADE FROM OTHER PAGES, CLASSIFIED AS TO RELEVANCE
  – IT TRIES TO AVOID USER INTERFERENCE IN RANKING
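The idea of ranking a page by the references made to it from other pages can be illustrated with the simplified textbook form of PageRank, computed by power iteration. This is only a sketch and not Google's production algorithm, which the slides describe as an industrial secret; the four-page link graph, the damping factor of 0.85, and the number of iterations are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}; every target must also be a key."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page shares its rank equally among the pages it references.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Made-up web of four pages: C is referenced by A, B and D; nobody references D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the most referenced page, C, ends up with the highest rank, and the unreferenced page D with the lowest.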