A Study on Query Expansion Methods for Patent Retrieval
Walid Magdy, Gareth Jones
Centre for Next Generation Localisation
School of Computing
Dublin City University
24 October 2011
Outline
What is the Problem?
Why Patents?
Current Solutions
Testing Existing Approaches
New Approach
Results
Conclusion

Agenda
Motivation
Patent Characteristics
Prior Work
Applying Standard QE
Novel Method
Outcome
Findings
Why Patents?
Challenging wording
Using vague and general terms
Strange combination of terms
No defined query (what words to select for search?)
Low retrieval effectiveness
Recall-oriented IR task
Hypothesis: QE → better query/doc match → better results
Prior Work
Pseudo Relevance Feedback (PRF) (Kishida K, NTCIR-3; Itoh H, NTCIR-4)
QE using Rocchio formula: no significant improvement
QE using Taylor formula: no significant improvement
Reweighting query terms using PRF: no significant improvement
Inter Query Expansion (QE) for Patent Invalidity Search (Takeuchi H. et al., NTCIR-5)
QE for individual claims from same patent topic: significant improvement, but not applicable to other patent search tasks
Improving Retrievability for Patents (Bashir and Rauber, ECIR 2010)
Enrich queries to improve the retrievability of patents with low chance of retrieval, but not tested on a real patent search task
Testing QE for Prior-Art Patent Search
CLEF-IP 2010:
1.35M patents from the EPO
1.35K English patent topics
Collection contains EN/FR/DE patents, with translations of titles and claims in three languages
Expand query by: PRF vs. WordNet
Use (Magdy et al., 2011) as baseline (BL) without citation extraction (full patent description section as query)
MAP and PRES were used for evaluation
BL: 0.14 MAP, 0.486 PRES
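As a reminder of what the MAP figures on these slides measure, here is a minimal sketch of average precision and its mean over topics. The topic and document IDs are hypothetical toy data, not CLEF-IP data, and PRES is not sketched here.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list (assumes no duplicate IDs)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all topics in qrels."""
    per_topic = [average_precision(runs.get(topic, []), relevant)
                 for topic, relevant in qrels.items()]
    return sum(per_topic) / len(per_topic)


# Hypothetical toy data for illustration only.
qrels = {"topic-A": {"d1", "d3"}, "topic-B": {"d9"}}
runs = {"topic-A": ["d1", "d2", "d3", "d4"], "topic-B": ["d7", "d9"]}
map_score = mean_average_precision(runs, qrels)
```

MAP rewards relevant documents found near the top of the list, which is why it can move in a different direction from a recall-oriented score such as PRES.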
Applying Pseudo Relevance Feedback
PRF implemented in Indri was used
Different numbers of feedback (FB) terms and docs were tested

MAP (BL = 0.1399)
Docs \ Terms    10      20      30      50
5               0.037   0.053   0.062   0.072
10              0.031   0.046   0.053   0.061
20              0.026   0.036   0.042   0.049

PRES (BL = 0.486)
Docs \ Terms    10      20      30      50
5               0.196   0.234   0.247   0.265
10              0.190   0.222   0.235   0.251
20              0.178   0.205   0.216   0.232
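The two parameters varied above (number of feedback docs and number of feedback terms) can be illustrated with a deliberately simplified PRF sketch. It ranks candidate terms by raw frequency in the top-ranked documents; this is an assumption made for illustration and is not how Indri's PRF scores terms. The function name and the toy documents are hypothetical.

```python
from collections import Counter


def prf_expand(query_terms, ranked_docs, fb_docs=5, fb_terms=10):
    """Append the most frequent new terms from the top fb_docs documents.

    ranked_docs is a list of token lists, ordered by the initial retrieval.
    """
    counts = Counter()
    for doc_tokens in ranked_docs[:fb_docs]:
        counts.update(doc_tokens)
    original = set(query_terms)
    # Keep only terms that are not already in the query, most frequent first.
    candidates = [term for term, _ in counts.most_common() if term not in original]
    return list(query_terms) + candidates[:fb_terms]


# Hypothetical stemmed documents, in initial ranking order.
ranked_docs = [
    ["heat", "stream", "exchang", "exchang"],
    ["exchang", "pipe", "pipe"],
    ["unrel", "unrel", "unrel", "unrel"],
]
expanded = prf_expand(["wast", "heat", "stream"], ranked_docs, fb_docs=2, fb_terms=2)
```

Every added term comes from documents that are only assumed to be relevant, so when the initial ranking has low precision the expansion terms are mostly noise.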
Using WordNet for Expansion
Expand terms in query using synonyms and hyponyms for nouns and verbs (NS: noun synonyms, NH: noun hyponyms, VS: verb synonyms, VH: verb hyponyms)
Apply QE to a sample of 100 topics, then apply the best combination to the full set of 1.35K topics

Sample of 100 topics:
Run             MAP      %change   PRES     %change
Baseline        0.1668   NA        0.584    NA
NS              0.1680   +0.7%     0.562    -3.7%
NS+NH           0.1680   +0.7%     0.561    -3.8%
NS+VS           0.1677   +0.5%     0.551    -5.6%
NS+NH+VS+VH     0.1540   -7.6%     0.544    -6.8%

Full topic set:
Run             MAP      %change   PRES     %change
Baseline        0.1399   NA        0.486    NA
WordNet (NS)    0.1364   -2.5%     0.484    -1.0%
Standard QE Approaches
PRF:
Significant degradation in retrieval effectiveness.
This can be expected due to the low initial retrieval precision
WordNet:
Statistically significant degradation of results, but with some successful instances (31% of topics)
Large reduction in retrieval speed, since average query size is at least 5 times larger (34 times larger for the NS+NH+VS+VH)
A new effective and efficient QE method is required!
Automatically Generated SynSet
[Figure: English fields + French translations → Align Sentences → Remove Stopwords → Stem Words → Align Terms → EN→FR and FR→EN term dictionaries → Backoff Alignment → EN→EN term dictionary]

Example sentence pair:
EN: process for eliminating foreign matter from a waste heat stream
FR: procédé pour éliminer de la matière étrangère d'un courant de chaleur perdue

After stopword removal and stemming:
EN: process elimin foreign matter wast heat stream
FR: procéd élimin mati étrangèr cour chaleur perdu

Dictionary entries shown in the figure:
elimin: élimin 0.71, elimin 0.13
élimin: remov 0.71, elimin 0.14
elimin: remov 0.6, elimin 0.16
elimin: remov 0.85, elimin 0.15
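One plausible way to read the figure is that the EN→EN dictionary is obtained by pivoting through French: an English term is mapped to its French translations, and each of those is mapped back to English. The sketch below implements that reading under stated assumptions. The function name, the `min_prob` cut-off, and the renormalisation step are hypothetical, and the slides do not say what the "Backoff Alignment" step does exactly, so the output is not expected to reproduce the figure's numbers.

```python
from collections import defaultdict


def pivot_synsets(en_fr, fr_en, min_prob=0.05):
    """Compose EN->FR and FR->EN term dictionaries into EN->EN synonym sets.

    score(e2 | e1) = sum over French terms f of P(f | e1) * P(e2 | f)
    """
    synsets = {}
    for en_term, fr_probs in en_fr.items():
        scores = defaultdict(float)
        for fr_term, p_fr in fr_probs.items():
            # French terms with no entry in fr_en contribute nothing.
            for en_syn, p_en in fr_en.get(fr_term, {}).items():
                scores[en_syn] += p_fr * p_en
        # Assumed clean-up: drop weak candidates, then renormalise to sum to 1.
        kept = {term: s for term, s in scores.items() if s >= min_prob}
        total = sum(kept.values())
        if total > 0:
            synsets[en_term] = {term: s / total for term, s in kept.items()}
    return synsets


# Partial dictionary entries copied from the figure (stemmed terms).
en_fr = {"elimin": {"élimin": 0.71, "elimin": 0.13}}
fr_en = {"élimin": {"remov": 0.71, "elimin": 0.14}}
synsets = pivot_synsets(en_fr, fr_en)
```

Because only two partial entries are visible on the slide, the resulting weights are illustrative; the full dictionaries would add contributions from other French pivots.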
Samples of the Output
motor: motor 0.64, engin 0.36
weight: weight 0.86, wt 0.14
travel: travel 0.67, move 0.19, displac 0.14
color: color 0.56, colour 0.25, dye 0.19
link: link 0.4, connect 0.18, bond 0.17, crosslink 0.13, bind 0.12
cloth: fabric 0.36, cloth 0.3, garment 0.2, tissu 0.14
tube: tube 0.88, pipe 0.12
area: area 0.4, zone 0.23, region 0.2, surfac 0.17
game: set 0.6, game 0.4
play: set 0.3, play 0.24, read 0.2, game 0.16, reproduc 0.1
SynSet QE Results
8M parallel EN/FR sentences were extracted from the EPO patent collection to generate SynSets
Two runs were carried out:
Expanding query using SynSet without weights (Usynset)
Utilizing SynSet probabilities as weights to terms in query (Wsynset)

Run         MAP      %change   PRES     %change
Baseline    0.1399   NA        0.486    NA
Wsynset     0.1440   +2.9%     0.485    -0.7%
Usynset     0.1402   +0.2%     0.480    -1.7%
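The difference between the two runs can be sketched as follows, using SynSet entries from the samples slide. The function name is hypothetical, and keeping terms without a SynSet entry at weight 1.0 is an assumption; the slides do not describe how the weights were passed to the retrieval engine.

```python
def expand_query(terms, synsets, weighted=True):
    """Return (term, weight) pairs for a SynSet-expanded query.

    weighted=True  mimics the Wsynset run (SynSet probabilities as weights).
    weighted=False mimics the Usynset run (every expansion term has weight 1.0).
    """
    expanded = []
    for term in terms:
        # Assumption: terms with no SynSet entry are kept as-is with weight 1.0.
        candidates = synsets.get(term, {term: 1.0})
        for synonym, prob in candidates.items():
            expanded.append((synonym, prob if weighted else 1.0))
    return expanded


# Entries taken from the "Samples of the Output" slide (stemmed terms).
synsets = {
    "motor": {"motor": 0.64, "engin": 0.36},
    "tube": {"tube": 0.88, "pipe": 0.12},
}
query = ["motor", "tube", "heat"]
wsynset_query = expand_query(query, synsets, weighted=True)
usynset_query = expand_query(query, synsets, weighted=False)
```

In the weighted variant a low-probability synonym such as "pipe" contributes much less than the original term, whereas the unweighted variant treats both equally.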
SynSet Expansion
Significantly better MAP, but significantly worse PRES
i.e. better retrieval at very high ranks, but worse ranking of relevant results over all ranks and less recall
Some topics were improved (34% of topics), but some were degraded (39% of topics).
Significantly more efficient than PRF and WordNet (query size is only 60% larger)
Deeper Look on SynSet
No features with high correlation to SynSet QE success
Initial retrieval quality of BL does not relate to the performance of QE
Topic ID    Baseline   Wsynset   %change     Topic ID    Baseline   Wsynset   %change
PAC-1704    0.000      0.174     +∞          PAC-1510    0.030      0.012     -60%
PAC-195     0.000      0.215     +∞          PAC-210     0.160      0.000     -100%
PAC-1225    0.105      0.532     +408%       PAC-220     0.201      0.000     -100%
PAC-1670    0.124      0.637     +415%       PAC-56      0.263      0.040     -85%
PAC-954     0.514      0.763     +48%        PAC-784     0.323      0.027     -92%
PAC-122     0.590      0.944     +60%        PAC-42      0.459      0.216     -53%
PAC-579     0.630      0.902     +43%        PAC-906     0.571      0.214     -63%
PAC-1113    0.669      0.880     +32%        PAC-1498    0.662      0.307     -54%
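A per-topic comparison like the one above can be computed with a few lines. This is a minimal sketch with hypothetical function names; the four rows are copied from the table, and a baseline of zero is reported as an infinite relative change, matching the +∞ entries.

```python
import math


def percent_change(baseline, expanded):
    """Relative change in percent; infinite when improving on a zero baseline."""
    if baseline == 0.0:
        return math.inf if expanded > 0.0 else 0.0
    return 100.0 * (expanded - baseline) / baseline


def summarize(scores):
    """Fraction of topics improved, degraded, or unchanged by expansion."""
    counts = {"improved": 0, "degraded": 0, "unchanged": 0}
    for baseline, expanded in scores.values():
        if expanded > baseline:
            counts["improved"] += 1
        elif expanded < baseline:
            counts["degraded"] += 1
        else:
            counts["unchanged"] += 1
    return {label: n / len(scores) for label, n in counts.items()}


# (baseline, Wsynset) pairs for four topics from the table above.
scores = {
    "PAC-1704": (0.000, 0.174),
    "PAC-954": (0.514, 0.763),
    "PAC-210": (0.160, 0.000),
    "PAC-42": (0.459, 0.216),
}
summary = summarize(scores)
changes = {topic: percent_change(b, e) for topic, (b, e) in scores.items()}
```

Run over all topics, the same summary yields the kind of improved/degraded shares quoted on the SynSet Expansion slide.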
Conclusions
PRF is not effective with patent prior-art search
WordNet QE for patent search:
Leads to overall significant degradation of retrieval
Has some positive impact on the retrieval of some topics
High computational cost
SynSet QE for patent search:
The most effective and efficient QE technique among those tested
Significant improvement for very high ranks, but significant degradation of overall ranking and recall
No indication of when it fails/succeeds
SynSet can be used as a lexical resource for patent examiners
Future Work
More analysis to better understand when QE fails/succeeds
Applying SynSet on real patent examiners’ queries rather than automatically formulated queries
Combining different QE methods
Alternative methods for query modification, for example query reduction (QR)
Please Check in CIKM Poster Session
Magdy W. and G. J. F. Jones. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search.
Ganguly D., J. Leveling, W. Magdy, and G. J. F. Jones. Query Reduction based on Pseudo-Relevant Documents.
Thank you