![Page 1: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/1.jpg)
Information Retrieval (1955-1992)
• Primary Users
  – Law clerks
  – Reference librarians
  – (Some) news organizations, product research, congressional committees, medical/chemical abstract searches
• Primary Search Models
  – Boolean keyword searches on abstract, title, keyword
• Vendors
  – Mead Data Central (Lexis-Nexis)
  – Dialog
  – Westlaw
  – Total searchable online data: O(10 terabytes)
![Page 2: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/2.jpg)
Information Retrieval (1993+)
• Primary users
  – First-time computer users
  – Novices
• Primary search modes
  – Still Boolean keyword searches with limited probabilistic models
  – But FULL TEXT retrieval
• Vendors
  – Lycos, Infoseek, Yahoo, Excite, AltaVista, Google
  – Total online data: ???
![Page 3: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/3.jpg)
Growth of the Web
[Figure: exponential growth curve, 1992 to 1998; y-axis: # of web sites or volume of web traffic; annotations: Mosaic, Netscape, "Volume doubling every 6 months", and a "?"]
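As a rough worked example of the "doubling every 6 months" annotation: if volume doubles every half year, then after t years it has grown by a factor of 2^(t / 0.5). The function name and the sample time spans below are illustrative assumptions, not figures from the slide.

```python
def growth_factor(years: float, doubling_period_years: float = 0.5) -> float:
    """Multiplicative growth after `years` if volume doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Two doublings per year: x4 after 1 year, x16 after 2 years.
for years in (1, 2, 6):
    print(f"after {years} year(s): x{growth_factor(years):,.0f}")
```

Under this assumption, the six years shown on the axis (1992 to 1998) correspond to twelve doublings.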
![Page 4: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/4.jpg)
Observation
• Early IR systems basically extended library catalog systems, allowing
  – Keyword searches
  – Limited abstract searches
  in addition to Author/Title/Subject searches, and including Boolean combination functionality
• IR was seen as reference retrieval (full documents still had to be ordered/delivered by hand)
![Page 5: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/5.jpg)
In Contrast
Today, IR has a much wider role in the age of digital libraries
• Full document retrieval (hypertext, PostScript, or optical image (TIFF) representations)
• Question answering
![Page 6: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/6.jpg)
Old View
Function of IR: map queries to relevant documents
[Figure: a Boolean query of the form … AND … OR …]
New View
Satisfy user’s information need
Infer goals/information need from:
- query itself
- past user query history
- user profiling (aol.com vs. CS dept.)
- collective analysis of other user feedback on similar queries
![Page 7: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/7.jpg)
In addition, return information in a format useful/intelligible to the user
• weighted orderings
• clusterings of documents by different attributes
• visualization tools
** Text understanding techniques to extract the answer to a question, or at least the relevant subregion of text
Example: "Who is the current mayor of Columbus, Ohio?"
The user doesn’t need the full AP/CNN article on city scandals, just the answer (and an available source for proof)
![Page 8: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/8.jpg)
Boolean Systems
Function #1: Provide a fast, compact index into the database (of documents or references)
[Diagram: index terms Chihuahua and Nanny pointing into the database]
Index options (granularity)
- Doc number
- Page number in doc
- Actual word offset
Data structure: Inverted file
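A minimal sketch of an inverted file at the finest granularity listed (actual word offset), assuming whitespace tokenization and lowercasing. The function name `build_inverted_index` and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(offset)
    return index


# Hypothetical sample documents.
docs = {
    1: "chihuahua nanny wanted",
    2: "nanny available weekends",
    3: "my chihuahua needs a patient nanny",
}
index = build_inverted_index(docs)

print(sorted(index["chihuahua"]))  # doc-number granularity: which docs contain the term
print(index["nanny"][3])           # word-offset granularity: where in doc 3
```

Storing offsets costs more space than storing doc numbers alone, but it is what makes proximity searches possible.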
![Page 9: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/9.jpg)
Boolean Operations
Chihuahua AND Nanny: Join ( )
Chihuahua OR Nanny: Union ( )
Proximity searches
Chihuahua W/3 Nanny
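The three query types can be sketched over posting lists that store word offsets. This assumes `W/3` means "within three word positions, in either order"; the slide does not spell out the exact semantics. The index literal and function names are hypothetical.

```python
# Hypothetical positional index: term -> {doc_id: [word offsets]}
index = {
    "chihuahua": {1: [0], 3: [1]},
    "nanny": {1: [1], 2: [0], 3: [5]},
}


def boolean_and(index, a, b):
    """Docs containing both terms (intersection of posting lists)."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def boolean_or(index, a, b):
    """Docs containing either term (union of posting lists)."""
    return set(index.get(a, {})) | set(index.get(b, {}))


def within(index, a, b, k):
    """Docs where some occurrence of a is at most k positions from b."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        if any(abs(i - j) <= k
               for i in index[a][doc_id]
               for j in index[b][doc_id]):
            hits.add(doc_id)
    return hits


print(boolean_and(index, "chihuahua", "nanny"))
print(boolean_or(index, "chihuahua", "nanny"))
print(within(index, "chihuahua", "nanny", 3))
```

AND and OR need only doc numbers, while the proximity query needs the word-offset granularity.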
![Page 10: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/10.jpg)
Vector IR model
[Diagram: documents d1 and d2 are each mapped by f( ) to vectors V1 and V2; the query is mapped to a vector in the same way]
Find optimal f( ) such that
Sim (V1 , V2) ≈ Sim’ (d1 , d2)
Sim (Vi , VQ) = Sim’ (Di , Q)
Sim: cosine distance
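A minimal sketch of the cosine measure named on the slide. Note that the code computes cosine similarity (1.0 for identical direction), so the best match is the maximum, not the minimum. The vectors are made-up term-count examples.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical document and query vectors.
V1 = [1, 1, 0, 0]
V2 = [0, 0, 1, 1]
VQ = [1, 0, 0, 0]

best = max(("V1", V1), ("V2", V2),
           key=lambda item: cosine_similarity(item[1], VQ))
print(best[0])
```

Because the measure normalizes by vector length, a long document is not favored merely for containing more words.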
![Page 11: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/11.jpg)
Vector models
[Diagram: documents D1 and D2 are mapped to vectors V1 and V2, and the query to Q1; V1 is a bit vector capturing the essence/meaning of D1; each document vector is compared with the query, e.g. Sim (V1 , Q1)]
Find max Sim (Vi , Q1)
![Page 12: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/12.jpg)
Dimensionality Reduction
[Diagram: d1 → f( ) → V1 → Dimensionality Reduction (SVD/LSI) → V1^]
V1: initial (term) vector representation
V1^: more compact/reduced dimensionality model of d1
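For reference, the usual SVD/LSI formulation can be written as follows. The notation is an assumption borrowed from the standard LSI presentation, not from the slide: A is the term-by-document matrix, T_k and D_k hold the first k left and right singular vectors, and S_k is the diagonal matrix of the k largest singular values.

```latex
A \approx A_k = T_k \, S_k \, D_k^{\top},
\qquad
\hat{V}_1 = S_k^{-1} \, T_k^{\top} \, V_1
```

Here V_1 is the initial term vector of d1 and \hat{V}_1 is its k-dimensional reduced representation, with k much smaller than the number of terms. A query vector can be projected the same way before similarity is computed.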
![Page 13: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/13.jpg)
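Clustering words
Offset K of word w in the vector:
- hash(w)
- hash(cluster(w))
- hash(cluster(stem(w)))
Cluster: Japan, Japanese, Nippon, Nihon → Japan *
Stem: books → book; computer → comput; computation → comput
[Diagram: the raw term vector V1 of document D1, with separate entries for words such as The, Japan, Japanese, Nippon and Nihon, is reduced to a condensed vector with a single entry for the cluster Japan *]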
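A minimal sketch of the third option, offset K = hash(cluster(stem(w))). The cluster table, the crude suffix list, and the vector size are all hypothetical, and `zlib.crc32` stands in for the unspecified hash because it gives the same value on every run.

```python
import zlib

# Hypothetical cluster table and suffix list, for illustration only.
CLUSTERS = {"japan": "japan*", "japanese": "japan*",
            "nippon": "japan*", "nihon": "japan*"}
SUFFIXES = ("ation", "er", "s")


def stem(word):
    """Strip the first matching suffix (a toy stemmer, not Porter's)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def cluster(word):
    """Replace a word by its cluster label, if it has one."""
    return CLUSTERS.get(word, word)


def offset(word, size):
    """Offset K = hash(cluster(stem(w))) mod vector size."""
    key = cluster(stem(word.lower()))
    return zlib.crc32(key.encode("utf-8")) % size


def condensed_vector(text, size=64):
    """Count words at their hashed offsets; collisions share a slot."""
    vec = [0] * size
    for word in text.split():
        vec[offset(word, size)] += 1
    return vec


vec = condensed_vector("Japan Nippon the Japanese")
print(vec[offset("Nihon", 64)])
```

All four Japan variants land in one slot, so the condensed vector is shorter than the raw term vector at the price of occasional hash collisions between unrelated words.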
![Page 14: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/14.jpg)
Collocation (Phrasal Term)
d1: "The soap opera"
d2: "The soap residue", "an opera by Verdi"
Bit vectors over the terms (Soap, Opera, Soap opera):
d1 = 0 0 1
d2 = 1 1 0
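A minimal sketch of treating "soap opera" as a single phrasal term: when the two words occur adjacently they set only the phrase bit, and otherwise they set their individual bits. The function name and the whitespace tokenization are assumptions.

```python
def term_bits(text, phrase=("soap", "opera")):
    """Return bits for (first word, second word, phrase), in that order."""
    tokens = text.lower().split()
    in_phrase = set()  # token positions consumed by the collocation
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) == phrase:
            in_phrase.update((i, i + 1))
    first = any(t == phrase[0] and i not in in_phrase
                for i, t in enumerate(tokens))
    second = any(t == phrase[1] and i not in in_phrase
                 for i, t in enumerate(tokens))
    return (int(first), int(second), int(bool(in_phrase)))


d1 = term_bits("The soap opera")
d2 = term_bits("The soap residue an opera by Verdi")
print(d1, d2)
```

Without the phrase term, both documents would get the same bits for Soap and Opera and look identical to the similarity function.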
![Page 15: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/15.jpg)
A vector, abstractly, is a compressed document (meaning preserving)
[Diagram: document d1 → m1 = f(d1); document d2 → m2 = f(d2); m is a meaning or context vector representation]
Compression: m1 = m2 iff d1 = d2; f( ) must be invertible
Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing)
![Page 16: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/16.jpg)
What is the optimal method for meaning preserving compression?
Issues
• size of representation (ideally size(Vi) << size(Di))
• cost of computation of vectors
  – one-time cost at model creation
• cost of similarity function
  – must be computed for each query
  – crucial to speed that this be minimized
![Page 17: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/17.jpg)
– Header processing: retain/model cross references
– Function words:
  1. remove (most) function words, or
  2. downweight by frequency, or
  3. use text analysis to decide which function words (e.g. NOT) carry meaning
Vi = λ1 V1 + λ2 V(ref2) + λ3 V(ref3)
(the weight symbols were lost in extraction; λ1, λ2, λ3 are stand-ins)
![Page 18: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/18.jpg)
Supervised Learning/Training
Categories:
A: Project #1
B: Chihuahua Breeding Club
C: Personal
J: Junk mail
[Diagram: an input data stream, arriving in real time (ongoing), is fed to one trained recognizer per category; collective discrimination among the recognizers yields the labelled (routed) output, e.g. B A C J]
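One way to read "one recognizer per category plus collective discrimination" is sketched below: each recognizer is trained on labelled examples of its own category and scores an incoming message, and the message is routed to the highest-scoring category. The scoring rule (add-one smoothed word log-likelihood, with no class priors) and the tiny training set are assumptions chosen for brevity, not the method from the slides.

```python
import math
from collections import Counter


class Recognizer:
    """Scores how well a message matches one category."""

    def __init__(self, label, examples, vocabulary):
        self.label = label
        self.counts = Counter(
            w for text in examples for w in text.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocabulary)

    def score(self, text):
        # Add-one smoothed log-likelihood of the message's words.
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in text.lower().split())


def train(labelled):
    """Build one recognizer per category from {label: [example texts]}."""
    vocab = {w for texts in labelled.values()
             for t in texts for w in t.lower().split()}
    return [Recognizer(label, texts, vocab)
            for label, texts in labelled.items()]


def route(recognizers, text):
    """Collective discrimination: pick the best-scoring category."""
    return max(recognizers, key=lambda r: r.score(text)).label


# Hypothetical labelled training data for the four categories.
recognizers = train({
    "A": ["project schedule review", "project budget meeting"],
    "B": ["chihuahua breeding club newsletter", "chihuahua puppies show"],
    "C": ["dinner with mom saturday", "happy birthday dinner"],
    "J": ["win free money now", "free prize click now"],
})

for message in ("chihuahua show this weekend", "free money prize"):
    print(message, "->", route(recognizers, message))
```

In an ongoing stream, `route` would be called on each arriving message and user corrections would be added to the training examples.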
![Page 19: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/19.jpg)
Other related problems: Mail/News Routing and Filtering
[Diagram: a data stream is routed into prioritized inboxes: Project #1 at work, Project #2 at work, Chihuahua breeding, Scuba club, Personal, Junk mail; numeric labels in the diagram: 119, 121, 125, 131]
Typically model long-term information needs (people put effort into training and user feedback that they aren’t willing to invest for single query-based IR)
![Page 20: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/20.jpg)
Features for classification
• Subject line
• Source/Sender
• X-annotations
• Date/time
• Length
• Other recipients
• Message content (regions weighted differently)
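The features listed above can be pulled from a plain-text message with Python's standard `email` package, as in this minimal sketch. The dictionary keys, the sample message, and the choice to treat To and Cc as "other recipients" are assumptions. Region weighting of the content is not shown.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_features(raw_message):
    """Extract classification features from a non-multipart message."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()  # a string for non-multipart messages
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "subject": msg.get("Subject", ""),
        "sender": msg.get("From", ""),
        "x_annotations": {name: value for name, value in msg.items()
                          if name.lower().startswith("x-")},
        "date": msg.get("Date", ""),
        "length": len(body),
        "other_recipients": [addr for _name, addr in recipients],
        "content": body,
    }


# Hypothetical sample message.
raw = (
    "From: breeder@example.org\n"
    "To: me@example.org, club@example.org\n"
    "Subject: Chihuahua show\n"
    "Date: Mon, 02 Mar 1998 09:00:00 -0500\n"
    "X-Priority: 3\n"
    "\n"
    "The show starts at noon.\n"
)
features = extract_features(raw)
print(features["subject"], features["other_recipients"])
```

Each value can then be turned into terms or numeric features for the recognizers, for example by giving subject-line words a higher weight than body words.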
![Page 21: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,](https://reader036.vdocuments.mx/reader036/viewer/2022062306/5a4d1b3b7f8b9ab05999ebe3/html5/thumbnails/21.jpg)
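Probabilistic IR models – Intermediate Topic models/detectors
[Diagram: documents d1 and d2 are mapped by f( ) to vectors V1 and V2; a bank of topic detectors (topic models) TDA, TDB, … TDE produces a topic vector (TV1) with one bit per detector; the query Q is also shown, along with a node labelled S]
Bit vectors shown in the diagram:
0 1 0 1 0 0
1 0 0 0 0 0
0 0 0 1 0 0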