November 10, 2004 Dmitriy Fradkin, CIKM'04 1
A Design Space Approach to Analysis of Information Retrieval Adaptive Filtering Systems
Dmitriy Fradkin, Paul Kantor
DIMACS, Rutgers University
What Is This Work About?
• Small-scale view: We analyze differences between two implementations of the Rocchio method and discuss the choice of parameters.
• Large-scale view: The problem of constructing an IR/AF system can be seen as an optimization problem in a large design space. (Well-known methods are simply points in this space.)
Large-Scale View
• Use optimization methods to find optimal choices of parameters. These optimal choices do not have to correspond to well-known methods or standard practices.
• Design space optimization methods have been suggested for designing VLSI chips [Bahuman et al. 2002], airplanes [Schwabacher and Gelsey 1996; Zha et al. 1996] and HVAC systems [Szykman 1997].
What’s in a name?
• We find that even a single “name” involves an enormous number of design choices.
• TREC2002 Adaptive Filtering:
– DIMACS: Rocchio method
– Chinese Academy of Sciences: Rocchio method
• One method performs almost twice as well as the other.
For any system:
• Choose Data Representation
• Construct Initial Classifier
• Training Phase:
– Incorporate labeled examples
– Supplement with “pseudo positives” and “pseudo negatives”
– Set the threshold
• Filtering Phase: as new documents arrive
– Evaluate performance
– Update the classifier model
– Update threshold
All of these are usually:
• Characterized informally, as a choice, and the exclusion of alternatives.
• Seen as points on a map – but to understand the significance of these choices we need to explore the real territory.
• So: we must interpolate between the choices made in one method and those made in another.
Interpolation
• Identify the corresponding design decisions
• Develop a “path” between them, sometimes called a “homotopy” from the topological concept of smoothly distorting one shape (say, a coffee cup) into another (say, a doughnut).
• Study the effectiveness along various paths among design options.
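As a minimal sketch of such a path (not the authors' code), a numeric design decision can be moved from one system's value to the other's with a single parameter. The function name `homotopy` and the example values are hypothetical, and the convention that 0 selects the first system and 1 the second is an assumption of this sketch.

```python
def homotopy(value_a, value_b, lam):
    """Linear path between two settings: lam=0 gives value_a, lam=1 gives value_b."""
    return (1.0 - lam) * value_a + lam * value_b


# Walk along the path in five equal steps between two hypothetical settings.
path = [homotopy(2.0, 4.0, step / 4) for step in range(5)]
print(path)
```

Evaluating the system at each point of `path` is what "studying the effectiveness along a path" would amount to in practice.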
Interpolation Aspects for IR/AF
• Term Representation
• Term Weighting
• Computing Scores
• Setting Classifier Threshold
• Document Set Representation
• Pseudolabeled Documents in Training
Interpolation Aspects (cont.)
• Query Initialization
• Unjudged documents in test
• Query Update
• Quitting Strategy
Example: Term Representation
$$f(t,d) = \begin{cases} 1 + \log f'(t,d), & \text{if } f'(t,d) > 0 \\ 0, & \text{otherwise} \end{cases}$$
where $f'(t,d)$ is the number of times term $t$ occurs in document $d$.
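The term representation above, 1 + log of the raw count for terms that occur and 0 otherwise, can be sketched as follows. The function name is hypothetical and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def term_frequency_weight(raw_count):
    """Return 1 + log(raw_count) for a term that occurs, and 0 for an absent term."""
    if raw_count > 0:
        return 1.0 + math.log(raw_count)  # natural log assumed
    return 0.0


for count in (0, 1, 10):
    print(count, term_frequency_weight(count))
```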
Example: Term Weighting
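• DIMACS: $i_D(t)$, a logarithmic function of $|T|$ and $i'(t)$ [exact form not recoverable from this transcript]
• CAS: $i_C(t)$, a logarithmic function of $|T|$ and $i'(t)$, with a separate case depending on how $i'(t)$ compares with 6 [exact form not recoverable from this transcript]
• Homotopy: $i(t,\lambda_i) = (1-\lambda_i)\, i_D(t) + \lambda_i\, i_C(t)$
i’(t) is the number of documents, in training set T, containing term t.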
Example: Score Computation
• DIMACS: $s_D(d,q) = (d, W_D\, q)$, with $w_D(t) = i(t,\lambda_i)$
• CAS: $s_C(d,q) = \dfrac{(d, W_C\, q)}{\|d\|\,\|W_C\, q\|}$, with $w_C(t) = i(t,\lambda_i)^2$
• Homotopy: $w(t,\lambda_w) = \big(i(t,\lambda_i)\big)^{1+\lambda_w}$ on the diagonal, 0 elsewhere
i’(t) is the number of documents, in training set T, containing term t. W is a diagonal matrix of weights.
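A rough sketch of the two scoring rules as we read them from the slide: DIMACS uses an inner product with a diagonal weight matrix, CAS uses the same product normalized by vector lengths, and the diagonal weight is the term weight raised to the power 1 + λ_w. The sparse-dictionary representation, the function names and the example weights are all assumptions of this sketch.

```python
import math


def diagonal_weights(idf, lam_w):
    """w(t) = idf(t) ** (1 + lam_w): idf at lam_w=0, idf squared at lam_w=1."""
    return {term: value ** (1.0 + lam_w) for term, value in idf.items()}


def weighted_dot(d, q, w):
    """Inner product (d, W q) for a diagonal W, over sparse term->value dicts."""
    return sum(d[t] * w.get(t, 0.0) * q[t] for t in d if t in q)


def weighted_cosine(d, q, w):
    """(d, W q) / (||d|| * ||W q||); returns 0.0 if either norm is zero."""
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_wq = math.sqrt(sum((w.get(t, 0.0) * v) ** 2 for t, v in q.items()))
    if norm_d == 0.0 or norm_wq == 0.0:
        return 0.0
    return weighted_dot(d, q, w) / (norm_d * norm_wq)


idf = {"filter": 2.0, "rocchio": 3.0}  # hypothetical term weights
doc = {"filter": 1.0, "rocchio": 1.0}
query = {"rocchio": 2.0}
print(weighted_dot(doc, query, diagonal_weights(idf, 0.0)))
print(weighted_cosine(doc, query, diagonal_weights(idf, 1.0)))
```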
Example: Score Interpolation
Homotopy:
$$s(s_C, s_D, \lambda_S) = \lambda_S\,\varphi(s_C) + (1-\lambda_S)\, s_D$$
Same mapping for scores and for thresholds from CAS scale to DIMACS scale:
$$\varphi(s_C(d)) = \frac{m_D}{m_C}\, s_C(d)$$
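The score interpolation can be sketched as below, reading the slide as s = λ_S·φ(s_C) + (1 − λ_S)·s_D with φ a linear rescaling from the CAS scale to the DIMACS scale. The constants `m_d` and `m_c` stand for the two scale factors, which the slide does not define, so their values here are purely hypothetical.

```python
def rescale_cas(value, m_d, m_c):
    """phi: map a CAS-scale score or threshold onto the DIMACS scale."""
    return (m_d / m_c) * value


def interpolate_score(s_c, s_d, lam_s, m_d, m_c):
    """s = lam_s * phi(s_c) + (1 - lam_s) * s_d."""
    return lam_s * rescale_cas(s_c, m_d, m_c) + (1.0 - lam_s) * s_d


# Hypothetical scores: a cosine-like CAS score and an unnormalized DIMACS score.
print(interpolate_score(0.5, 8.0, 0.5, m_d=10.0, m_c=1.0))
```

The same `rescale_cas` mapping would be applied to CAS thresholds before mixing them with DIMACS thresholds.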
Example: Setting Thresholds
Threshold for query q after seeing document i: $\tau(q,i)$
• DIMACS: $\tau_D(q,i)$ is chosen to optimize utility
• CAS: $\tau_C(q,i)$ is adjusted from $\tau_C(q,i-1)$ by a step of 0.005: raised if utility is negative, lowered if more than 6000 documents have been seen since the last submission, and left unchanged otherwise
• Homotopy: $\tau(q,i,\lambda_S) = \lambda_S\,\varphi(\tau_C(q,i)) + (1-\lambda_S)\,\tau_D(q,i)$
Example: Set Representation
• DIMACS: $v(S) = \dfrac{1}{|S|} \sum_{x \in S} x$
• CAS: $v(S) = \sum_{x \in S} x$
• Homotopy: $v(S,\lambda_r) = \dfrac{1}{1 + (1-\lambda_r)(|S|-1)} \sum_{x \in S} x$
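A small sketch of the set representation homotopy, under our reading that DIMACS represents a document set by the mean of its vectors, CAS by their sum, and the path between them divides the sum by 1 + (1 − λ_r)(|S| − 1). Dense Python lists and the function name are assumptions made for illustration.

```python
def represent_set(vectors, lam_r):
    """v(S, lam_r) = sum(S) / (1 + (1 - lam_r) * (|S| - 1))."""
    if not vectors:
        return []
    scale = 1.0 / (1.0 + (1.0 - lam_r) * (len(vectors) - 1))
    return [scale * sum(column) for column in zip(*vectors)]


docs = [[2.0, 4.0], [4.0, 8.0]]
print(represent_set(docs, 0.0))  # mean of the vectors
print(represent_set(docs, 1.0))  # plain sum of the vectors
```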
Example: Pseudo-labeled Documents
• CAS method does not make use of pseudo-labeled documents in training stage
• DIMACS method: Given “density” parameters (d+ and d-) and “proportion” (p+ and p-), score unlabeled training documents and choose top and bottom sets according to “proportion”. Then pick documents out of these sets according to corresponding “density”.
• Interpolate between density and proportion parameters (DIMACS) and 0 (CAS).
Example: Query Initialization
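General Formula:
$$q_{init} = \alpha'\, q_{terms} + \beta'\, v(D_i^+) - \gamma'\, v(D_i^-) + x'\, v(D_{ip}^+) - y'\, v(D_{ip}^-)$$
DIMACS: $\alpha' = 1,\ \beta' = 1,\ \gamma' = 0,\ x' = 2/|D_{ip}^+|,\ y' = 5/|D_{ip}^-|$
CAS: $\alpha' = 3,\ \beta' = 1,\ \gamma' = 0,\ x' = 0,\ y' = 0$
Homotopy (with $x'$ and $y'$ at their DIMACS values):
$$q_{init}(\lambda_\alpha, \lambda_p) = \big(3\lambda_\alpha + (1-\lambda_\alpha)\big)\, q_{terms} + v(D_i^+) + (1-\lambda_p)\, x'\, v(D_{ip}^+) - (1-\lambda_p)\, y'\, v(D_{ip}^-)$$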
Example: Unjudged Documents
• A submitted document for which there is no label is “unjudged”. DIMACS ignores such documents. CAS treats such a document as pseudo-negative if its score is less than 0.6.
• Can view this as a threshold:
$$u(\lambda_u) = 0.6\,\lambda_u + 0\cdot(1-\lambda_u) = 0.6\,\lambda_u$$
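The threshold view can be sketched as follows: the cutoff moves from 0 (DIMACS, which ignores unjudged documents) to 0.6 (CAS). The function names are hypothetical, and the sketch assumes scores are non-negative, so that a cutoff of 0 marks no document as pseudo-negative.

```python
def unjudged_threshold(lam_u):
    """u(lam_u) = 0.6 * lam_u: 0 at the DIMACS end, 0.6 at the CAS end."""
    return 0.6 * lam_u


def is_pseudo_negative(score, lam_u):
    """An unjudged document counts as pseudo-negative if its score is below u."""
    return score < unjudged_threshold(lam_u)


print(is_pseudo_negative(0.3, 1.0), is_pseudo_negative(0.3, 0.0))
```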
Example: Query Update
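General Formula:
$$q = \alpha\, q_{init} + \beta\, v(D^+) - \gamma\, v(D^-) + x\, v(D_p^+) - y\, v(D_p^-)$$
DIMACS: $\alpha = 1,\ \beta = 1,\ \gamma = 0.125,\ x = 0,\ y = 0.0125$
CAS: $\alpha = 1,\ \beta = 1,\ \gamma = 1.8,\ x = 0,\ y = 1.3$
Homotopy:
$$q(\lambda_\gamma, \lambda_y) = q_{init} + v(D^+) - \big(1.8\,\lambda_\gamma + 0.125\,(1-\lambda_\gamma)\big)\, v(D^-) - \big(1.3\,\lambda_y + 0.0125\,(1-\lambda_y)\big)\, v(D_p^-)$$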
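A hedged sketch of the interpolated query update. It uses the coefficient values shown on the slide (γ between 0.125 and 1.8, y between 0.0125 and 1.3, with 0 taken as the DIMACS end and 1 as the CAS end) and assumes the usual Rocchio signs, with judged and pseudo-negative sets subtracted. Vectors are plain lists and all function names are hypothetical.

```python
def lerp(dimacs_value, cas_value, lam):
    """lam=0 gives the DIMACS value, lam=1 the CAS value."""
    return lam * cas_value + (1.0 - lam) * dimacs_value


def update_query(q_init, v_pos, v_neg, v_pseudo_neg, lam_gamma, lam_y):
    """q = q_init + v(D+) - gamma * v(D-) - y * v(Dp-), with interpolated gamma and y."""
    gamma = lerp(0.125, 1.8, lam_gamma)
    y = lerp(0.0125, 1.3, lam_y)
    return [
        qi + pos - gamma * neg - y * pneg
        for qi, pos, neg, pneg in zip(q_init, v_pos, v_neg, v_pseudo_neg)
    ]


# Hypothetical two-term vectors, evaluated at the CAS end of both paths.
print(update_query([1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [0.0, 0.0], 1.0, 1.0))
```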
Example: Quitting Strategy
• DIMACS: if after 50 submissions the utility is negative, stop submitting for this topic
• CAS: no quitting strategy
Alternatively: quit if, after submitting $1/\rho$ documents, utility is negative.
DIMACS: $\rho = 0.02$
CAS: $\rho = 0$
$$\rho(\lambda_q) = 0.02\,(1-\lambda_q)$$
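The quitting rule can be sketched with a single rate parameter, written `rho` here (an assumed name): 0.02 corresponds to the DIMACS rule of 50 submissions, 0 to never quitting as in CAS, and intermediate values follow 0.02·(1 − λ_q). Treating "after 1/rho submissions" as "at least 1/rho submissions" is an assumption of this sketch.

```python
def quitting_rate(lam_q):
    """rho(lam_q) = 0.02 * (1 - lam_q): 0.02 for DIMACS, 0 for CAS."""
    return 0.02 * (1.0 - lam_q)


def should_quit(num_submitted, utility, lam_q):
    """Quit once at least 1/rho documents were submitted and utility is negative."""
    rho = quitting_rate(lam_q)
    if rho <= 0.0:
        return False  # rho = 0 means the quitting point is never reached
    return num_submitted * rho >= 1.0 and utility < 0.0


print(should_quit(50, -1.0, 0.0), should_quit(50, -1.0, 1.0))
```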
Experimental Evaluation
• TREC11 Data - Reuters Corpus v1
• 23,000 training; 800,000 test
• 100 topics (50 assessor, 50 intersection)
• 3 positive and 0 negative examples per topic
$$\mathrm{T11NU} = \frac{2|D^+| - (|D^-| + |D^u|)}{2|T^+|}, \qquad \mathrm{T11SU} = \frac{\max(\mathrm{T11NU}, -0.5) + 0.5}{1.5}$$
$T^+$ - all positive documents; $D^+$ - submitted positive; $D^-$ - submitted negative; $D^u$ - submitted unlabelled
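The evaluation measure can be computed directly from the four counts defined on this slide: T11NU is (2|D+| − (|D−| + |Du|)) / (2|T+|), and T11SU clips it at −0.5 and rescales to the range 0 to 1. The function and argument names below are our own.

```python
def t11nu(submitted_pos, submitted_neg, submitted_unlabelled, total_pos):
    """Normalized utility: (2|D+| - (|D-| + |Du|)) / (2|T+|)."""
    return (2 * submitted_pos - (submitted_neg + submitted_unlabelled)) / (2 * total_pos)


def t11su(submitted_pos, submitted_neg, submitted_unlabelled, total_pos):
    """Scaled utility: (max(T11NU, -0.5) + 0.5) / 1.5."""
    nu = t11nu(submitted_pos, submitted_neg, submitted_unlabelled, total_pos)
    return (max(nu, -0.5) + 0.5) / 1.5


print(t11su(10, 0, 0, 10))   # every positive document submitted, nothing else
print(t11su(0, 100, 0, 10))  # only non-relevant documents submitted
```

Note that a system submitting nothing scores 1/3 under this measure, which is useful context when reading the averages in the results.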
Diagonal Interpolation
[Figure: average T11SU versus lambda, with and without quitting]

| Lambda | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 | CAS |
|---|---|---|---|---|---|---|---|
| Average T11SU, no quitting | 0.033 | 0.103 | 0.26 | 0.364 | 0.404 | 0.394 | 0.405 |
| Average T11SU, with quitting | 0.113 | 0.139 | 0.263 | 0.364 | 0.404 | 0.394 | 0.405 |
Documents Retrieved
Parameter Analysis
• It is possible to analyze the effect of individual parameters at each point in the space by taking “small steps” along the parameter axis.
• Requires a lot of computational effort
• Results may not be easy to interpret
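A sketch of what "small steps along a parameter axis" could look like in code: hold every interpolation parameter fixed at a point, move one of them a step down and a step up, and re-evaluate. The `evaluate` callback stands in for a full filtering run, and the toy function and parameter names used here are hypothetical.

```python
def axis_sensitivity(evaluate, point, name, step=0.1):
    """Evaluate one step below, at, and one step above `point` along one axis."""
    rows = []
    for delta in (-step, 0.0, step):
        moved = dict(point)  # copy, so the original point is left untouched
        moved[name] = round(point[name] + delta, 6)
        rows.append((moved[name], evaluate(moved)))
    return rows


def toy_evaluate(params):
    # Stand-in for running the system; real runs are far more expensive.
    return 1.0 - abs(params["lam_gamma"] - 0.9)


point = {"lam_gamma": 0.8, "lam_y": 0.8}
for value, score in axis_sensitivity(toy_evaluate, point, "lam_gamma"):
    print(value, score)
```

Each call costs three system evaluations per parameter, which illustrates why the analysis requires a lot of computational effort.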
Example of Parameter Analysis
| Parameter | λ = 0.7, relevant | λ = 0.7, nonrelevant | λ = 0.8, relevant | λ = 0.8, nonrelevant | λ = 0.9, relevant | λ = 0.9, nonrelevant |
|---|---|---|---|---|---|---|
| $\lambda_\alpha$ | 2086 | 975 | ... | ... | 2089 | 1010 |
| $\lambda_\gamma$ | 2273 | 1233 | ... | ... | 1923 | 830 |
| $\lambda_p$ | 2043 | 1129 | ... | ... | 2014 | 939 |
| $\lambda_y$ | 2106 | 1005 | ... | ... | 2029 | 948 |
| $\lambda_u$ | 2065 | 1037 | ... | ... | 2071 | 977 |
| $\lambda_i$ | 2062 | 977 | 2062 | 977 | 2062 | 977 |
| $\lambda_w$ | 2055 | 1000 | ... | ... | 2119 | 1007 |
| $\lambda_S$ | 2153 | 1021 | ... | ... | 2123 | 1031 |
| $\lambda_r$ | 2037 | 993 | ... | ... | 2149 | 1044 |
| $\lambda_q$ | 2062 | 977 | ... | ... | 2062 | 977 |

Effect of individual parameters on the number of relevant and nonrelevant documents retrieved around the 0.8 point
Results based on topic type
| | Assessor: # topics | Assessor: avg. T11SU difference | Intersection: # topics | Intersection: avg. T11SU difference |
|---|---|---|---|---|
| CAS better than 0.8 | 18 | -0.047 | 25 | -0.037 |
| 0.8 better than CAS | 25 | 0.062 | 13 | 0.011 |
| CAS and 0.8 equal | 7 | 0 | 12 | 0 |
| Total | 50 | 0.014 | 50 | -0.015 |

Comparison of CAS results and the 0.8 diagonal homotopy point
Additional Experiments
• Reordered TREC documents
• Experimented with 77 topics on OHSUMED dataset (1987-1988 as training data, 1989-1991 as test)
The results are similar to those on the original TREC task.
Result of Experiments with Reordering
Average results on 5 re-orderings of the TREC test set:

| Lambda | 0.0 | 0.8 | 1.0 |
|---|---|---|---|
| Average T11SU | 0.108 | 0.406 | 0.391 |
| Standard Deviation | 0.002 | 0.002 | 0.004 |
OHSUMED Results
| Lambda | 0 | 0.2 | 0.4 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| Mean T11SU, no quitting | 0.005 | 0.051 | 0.361 | 0.467 | 0.474 | 0.463 | 0.464 | 0.482 |
| Mean T11SU, with quitting | 0.138 | 0.132 | 0.319 | 0.467 | 0.474 | 0.463 | 0.464 | 0.482 |

[Figure: mean T11SU versus lambda on OHSUMED, with and without quitting]
Documents Retrieved: OHSUMED
[Figure: number of relevant and not relevant documents retrieved versus lambda]
Discussion
• We demonstrate the design complexity hidden under “Rocchio method”
• We provide specific models for interpolating between design choices
• These interpolation options can also work for methods that differ more substantially (for example, Rocchio and SVM).
Discussion (cont.)
• These models should help researchers explore their systems, and regions “between systems”
• Suggests a new approach to designing IR systems: finding a set of (interpolation) parameters optimizing performance
• This can be done with existing optimization methods.
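As one illustration of treating design as optimization over interpolation parameters, the sketch below runs a simple coordinate search over a grid of λ values. This is a generic technique chosen for brevity, not the method used in the experiments; `toy_performance` is a hypothetical stand-in for a full system run scored by a measure such as T11SU.

```python
def coordinate_search(evaluate, names, grid, start, sweeps=3):
    """Greedy search: improve one interpolation parameter at a time over a grid."""
    best = dict(start)
    best_score = evaluate(best)
    for _ in range(sweeps):
        for name in names:
            for value in grid:
                candidate = dict(best)
                candidate[name] = value
                score = evaluate(candidate)
                if score > best_score:
                    best, best_score = candidate, score
    return best, best_score


def toy_performance(params):
    # Stand-in objective with its peak at lam_a = 0.8, lam_b = 0.2.
    return 1.0 - (params["lam_a"] - 0.8) ** 2 - (params["lam_b"] - 0.2) ** 2


grid = [i / 5 for i in range(6)]  # 0, 0.2, ..., 1
best, score = coordinate_search(
    toy_performance, ["lam_a", "lam_b"], grid, {"lam_a": 0.0, "lam_b": 0.0}
)
print(best, score)
```

On a real system the objective is unlikely to be this well behaved, so a search of this kind can stall in a local optimum.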
A Note on Interpolation Limits
The need for two endpoint systems is not very restrictive:
• Some interpolation parameters can be moved beyond the [0,1] interval.
• The endpoints themselves can be moved.
Abstract Interpolation
• More abstractly: do not interpolate every single parameter; work at higher abstraction levels
• Ex: representation block, scoring block, thresholding block, etc.
• Can use this with several systems
• This is at a lower level than ensembles of classifiers.
Caveat
In moving to a large design space we still face two major problems:
• The range of parameters cannot be explored exhaustively, and non-smooth optimization is needed
• Requires a lot of labeled data that is usually produced manually and is in short supply.
Acknowledgments
• KD-D group via NSF grant EIA-0087022
• Andrei Anghelescu, Vladimir Menkov
• Jamie Callan
• Members of DIMACS MMS project
• CAS researchers
• Ian Soboroff
• Anonymous reviewers