a study on query expansion methods for patent retrieval

A Study on Query Expansion Methods for Patent Retrieval

Walid Magdy, Gareth Jones

Centre for Next Generation Localisation

School of Computing

Dublin City University

24 October 2011

Outline

What is the Problem?
Why Patents?
Current Solutions
Testing Existing Approaches
New Approach
Results
Conclusion

Agenda

Motivation
Patent Characteristics
Prior Work
Applying Standard QE
Novel Method
Outcome
Findings

Why Patents?

Challenging wording

Using vague and general terms

Strange combination of terms

No defined query (what words to select for search?)

Low retrieval effectiveness

Recall-oriented IR task

Hypothesis: QE gives a better query/doc match, and therefore better results

Prior Work

Pseudo Relevance Feedback (PRF) (Kishida K, NTCIR-3; Itoh H, NTCIR-4)

QE using Rocchio formula: no significant improvement
QE using Taylor formula: no significant improvement
Reweighting query terms using PRF: no significant improvement

Inter Query Expansion (QE) for Patent Invalidity Search (Takeuchi H. et al, NTCIR-5)

QE for individual claims from the same patent topic: significant improvement, but not applicable to other patent search tasks

Improving Retrievability for Patents (Bashir and Rauber, ECIR 2010)

Enrich queries to improve the retrievability of patents with a low chance of retrieval, but not tested on a real patent search task

Testing QE for Prior-Art Patent Search

CLEF-IP 2010:

1.35M patents from the EPO

1.35K English patent topics

Collection contains EN/FR/DE patents, with translations of titles and claims in the three languages

Expand query by: PRF vs. WordNet

Use (Magdy et al., 2011) as the baseline (BL), without citation extraction (full patent description section as query)

MAP and PRES were used for evaluation

BL: 0.14 MAP, 0.486 PRES
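As a rough illustration of the two evaluation measures, the sketch below computes average precision (whose mean over topics is MAP) and PRES for a single topic. The PRES part follows the commonly cited definition, PRES = 1 - (mean rank of relevant docs - (n + 1) / 2) / N_max, with non-retrieved relevant documents placed at the worst-case ranks after the cut-off; the function names and the toy ranking are assumptions for illustration, not the evaluation scripts used in this study.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list; MAP is its mean over topics."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def pres(ranking, relevant, n_max):
    """PRES for one topic, with the ranking cut off at n_max documents."""
    n = len(relevant)
    if n == 0:
        return 0.0
    found = [rank for rank, doc_id in enumerate(ranking[:n_max], start=1)
             if doc_id in relevant]
    # Relevant documents that were not retrieved are assumed to sit at the
    # worst-case ranks just after the cut-off (n_max + n, n_max + n - 1, ...).
    missing = n - len(found)
    worst = [n_max + n - i for i in range(missing)]
    mean_rank = (sum(found) + sum(worst)) / n
    return 1.0 - (mean_rank - (n + 1) / 2) / n_max


# Toy example (hypothetical document ids).
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap_score = average_precision(ranking, relevant)
pres_score = pres(ranking, relevant, n_max=4)
```

PRES rewards finding all relevant documents anywhere within the cut-off, whereas average precision is dominated by the top ranks, which is why the two measures can move in opposite directions.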

Applying Pseudo Relevance Feedback

PRF implemented in Indri was used

Different numbers of feedback (FB) terms and docs were tested

MAP (BL = 0.1399)

Docs \ Terms   10      20      30      50
5              0.037   0.053   0.062   0.072
10             0.031   0.046   0.053   0.061
20             0.026   0.036   0.042   0.049

PRES (BL = 0.486)

Docs \ Terms   10      20      30      50
5              0.196   0.234   0.247   0.265
10             0.190   0.222   0.235   0.251
20             0.178   0.205   0.216   0.232
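The two parameters varied in the PRF experiment, the number of feedback docs and the number of feedback terms, can be illustrated with a minimal sketch. It is not Indri's relevance-model implementation: raw term frequency stands in for the real term scoring, and the function name, parameter names, and toy documents are all assumptions.

```python
from collections import Counter


def prf_expand(query_terms, ranked_docs, fb_docs=5, fb_terms=10):
    """Add the fb_terms most frequent new terms from the top fb_docs documents.

    ranked_docs is a list of token lists, best-ranked first. Raw frequency is
    a stand-in for the term scoring a real system would use.
    """
    counts = Counter()
    for tokens in ranked_docs[:fb_docs]:
        counts.update(t for t in tokens if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(fb_terms)]
    return list(query_terms) + expansion


# Toy first-pass ranking (hypothetical documents, already tokenised).
first_pass = [
    ["heat", "exchanger", "pipe"],
    ["exchanger", "valve"],
    ["unrelated", "noise"],
]
expanded_query = prf_expand(["heat", "stream"], first_pass, fb_docs=2, fb_terms=1)
```

Because the expansion terms come from whatever the first pass ranked highest, a first pass with low precision feeds mostly non-relevant vocabulary into the query.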

Using WordNet for Expansion

Expand terms in the query using synonyms and hyponyms for nouns and verbs

Apply QE to a sample of 100 topics, then apply the best combination to the full set of 1.35K topics

Sample of 100 topics (NS = noun synonyms, NH = noun hyponyms, VS = verb synonyms, VH = verb hyponyms):

Run            MAP      %change   PRES    %change
Baseline       0.1668   NA        0.584   NA
NS             0.1680   +0.7%     0.562   -3.7%
NS+NH          0.1680   +0.7%     0.561   -3.8%
NS+VS          0.1677   +0.5%     0.551   -5.6%
NS+NH+VS+VH    0.1540   -7.6%     0.544   -6.8%

Full topic set:

Run            MAP      %change   PRES    %change
Baseline       0.1399   NA        0.486   NA
WordNet (NS)   0.1364   -2.5%     0.484   -1.0%
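Lexicon-based expansion of this kind can be sketched as follows. The lexicon here is a hypothetical hand-written dictionary standing in for WordNet lookups, and the function name and relation labels are assumptions; the point is only to show how adding more relation types makes the query grow.

```python
def expand_with_lexicon(query_terms, lexicon, relations=("synonyms",)):
    """Append related terms for each query term; return the expanded list.

    lexicon maps a term to {relation_name: [related terms]}.
    """
    expanded = list(query_terms)
    seen = set(query_terms)
    for term in query_terms:
        entry = lexicon.get(term, {})
        for relation in relations:
            for related in entry.get(relation, []):
                if related not in seen:
                    seen.add(related)
                    expanded.append(related)
    return expanded


# Hypothetical toy lexicon, not taken from WordNet.
TOY_LEXICON = {
    "motor": {"synonyms": ["engine"], "hyponyms": ["servomotor"]},
    "tube": {"synonyms": ["pipe"], "hyponyms": ["capillary"]},
}

syn_only = expand_with_lexicon(["motor", "tube"], TOY_LEXICON)
syn_and_hypo = expand_with_lexicon(
    ["motor", "tube"], TOY_LEXICON, relations=("synonyms", "hyponyms")
)
```

Every extra relation multiplies the number of query terms, which is where the retrieval-speed cost of this approach comes from.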

Standard QE Approaches

PRF:

Significant degradation in retrieval effectiveness.

This can be expected, given the low initial retrieval precision

WordNet:

Statistically significant degradation of results, but with some successful instances (31% of topics)

Large reduction in retrieval speed, since the average query is at least 5 times larger (34 times larger for NS+NH+VS+VH)

A new effective and efficient QE method is required!

Automatically Generated SynSet

Input: English fields and their French translations

Steps: Align Sentences, Remove Stopwords, Stem Words, Align Terms, Backoff Alignment

Output: EN→FR terms dictionary, FR→EN terms dictionary, EN→EN terms dictionary

Example sentence pair:

process for eliminating foreign matter from a waste heat stream
procédé pour éliminer de la matière étrangère d'un courant de chaleur perdue

After stopword removal and stemming:

process elimin foreign matter wast heat stream
procéd élimin mati étrangèr cour chaleur perdu

Dictionary entries shown in the diagram:

elimin: élimin 0.71, elimin 0.13
élimin: remov 0.71, elimin 0.14
elimin: remov 0.6, elimin 0.16
elimin: remov 0.85, elimin 0.15
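One plausible way to obtain an EN→EN dictionary from the two translation dictionaries is to pivot through French: score each English candidate by summing P(fr | en) × P(candidate | fr) over the French terms, then prune and renormalise. The slides do not spell out this computation or what the backoff step does, so the sketch below is an assumption, as are the function name, the pruning threshold, and the toy probabilities; it is not expected to reproduce the numbers in the diagram.

```python
def pivot_synset(en_fr, fr_en, min_prob=0.1):
    """Build EN->EN candidate synonyms by pivoting through French.

    en_fr and fr_en map a stem to {translation: probability}. The score of an
    English candidate is the sum over French pivots of
    P(fr | en) * P(candidate | fr), renormalised after pruning.
    """
    synsets = {}
    for en_term, fr_probs in en_fr.items():
        scores = {}
        for fr_term, p_fr in fr_probs.items():
            for candidate, p_en in fr_en.get(fr_term, {}).items():
                scores[candidate] = scores.get(candidate, 0.0) + p_fr * p_en
        total = sum(scores.values())
        if total == 0:
            continue  # no pivot path for this term
        # Drop low-probability candidates, then renormalise what is left.
        kept = {c: s / total for c, s in scores.items() if s / total >= min_prob}
        norm = sum(kept.values())
        synsets[en_term] = {c: s / norm for c, s in kept.items()}
    return synsets


# Hypothetical toy dictionaries (stems), not the values from the slides.
toy_en_fr = {"elimin": {"élimin": 1.0}}
toy_fr_en = {"élimin": {"remov": 0.75, "elimin": 0.25}}
toy_synsets = pivot_synset(toy_en_fr, toy_fr_en)
```

The resulting entry keeps the original stem alongside its candidate synonym, each with a probability that can later serve as a query-term weight.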

Samples of the Output

motor: motor 0.64, engin 0.36
weight: weight 0.86, wt 0.14
travel: travel 0.67, move 0.19, displac 0.14
color: color 0.56, colour 0.25, dye 0.19
link: link 0.4, connect 0.18, bond 0.17, crosslink 0.13, bind 0.12
cloth: fabric 0.36, cloth 0.3, garment 0.2, tissu 0.14
tube: tube 0.88, pipe 0.12
area: area 0.4, zone 0.23, region 0.2, surfac 0.17
game: set 0.6, game 0.4
play: set 0.3, play 0.24, read 0.2, game 0.16, reproduc 0.1

SynSet QE Results

8M parallel EN/FR sentences were extracted from the EPO patent collection to generate SynSets

Two runs were carried out:

Expanding the query using SynSet without weights (Usynset)
Using SynSet probabilities as weights for the terms in the query (Wsynset)

Run        MAP      %change   PRES    %change
Baseline   0.1399   NA        0.486   NA
Wsynset    0.1440   +2.9%     0.485   -0.7%
Usynset    0.1402   +0.2%     0.480   -1.7%
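The difference between the weighted and unweighted runs can be sketched by rendering an expanded query as a string. The sketch assumes the Indri query-language operators `#combine`, `#syn`, and `#wsyn`; whether the experiments used exactly these operators is not stated in the slides, and the function name and the sample query are hypothetical.

```python
def build_query(terms, synsets, weighted=True):
    """Render an expanded query as an Indri-style string.

    Terms with a SynSet entry become #wsyn(...) groups (weighted) or
    #syn(...) groups (unweighted); other terms are left unchanged.
    """
    parts = []
    for term in terms:
        entry = synsets.get(term)
        if not entry:
            parts.append(term)
        elif weighted:
            body = " ".join(f"{p} {t}" for t, p in entry.items())
            parts.append(f"#wsyn( {body} )")
        else:
            parts.append(f"#syn( {' '.join(entry)} )")
    return f"#combine( {' '.join(parts)} )"


# SynSet entry for "motor" as listed in the sample output; "heat" has no entry here.
sample_synsets = {"motor": {"motor": 0.64, "engin": 0.36}}
weighted_query = build_query(["motor", "heat"], sample_synsets, weighted=True)
unweighted_query = build_query(["motor", "heat"], sample_synsets, weighted=False)
```

In the weighted form the original term keeps most of the probability mass, so a weak synonym contributes less to the match than it would in the unweighted form.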

SynSet Expansion

Significantly better MAP, but significantly worse PRES, i.e. better retrieval at very high ranks, but worse ranking of relevant results over all ranks and less recall

Some topics were improved (34% of topics), but some were degraded (39% of topics).

Significantly more efficient than PRF and WordNet (query size is only 60% larger)

Deeper Look on SynSet

No features with high correlation to SynSet QE success

Initial retrieval quality of BL does not relate to the performance of QE

Topic ID   Baseline  Wsynset  %change     Topic ID   Baseline  Wsynset  %change
PAC-1704   0.000     0.174    +∞          PAC-1510   0.030     0.012    -60%
PAC-195    0.000     0.215    +∞          PAC-210    0.160     0.000    -100%
PAC-1225   0.105     0.532    +408%       PAC-220    0.201     0.000    -100%
PAC-1670   0.124     0.637    +415%       PAC-56     0.263     0.040    -85%
PAC-954    0.514     0.763    +48%        PAC-784    0.323     0.027    -92%
PAC-122    0.590     0.944    +60%        PAC-42     0.459     0.216    -53%
PAC-579    0.630     0.902    +43%        PAC-906    0.571     0.214    -63%
PAC-1113   0.669     0.880    +32%        PAC-1498   0.662     0.307    -54%
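A per-topic comparison like the one in this table can be computed with a few lines. The sketch below is only illustrative: the function names are assumptions, and the four topics are a subset of the rows above, so the resulting shares say nothing about the full topic set.

```python
import math


def percent_change(baseline, expanded):
    """Relative change of a per-topic score, in percent."""
    if baseline == 0:
        # A gain from a zero baseline has no finite relative change.
        return math.inf if expanded > 0 else 0.0
    return 100.0 * (expanded - baseline) / baseline


def summarise(scores):
    """scores maps topic id -> (baseline, expanded); returns share improved/degraded."""
    improved = sum(1 for b, e in scores.values() if e > b)
    degraded = sum(1 for b, e in scores.values() if e < b)
    n = len(scores)
    return improved / n, degraded / n


# Four rows copied from the table (baseline score, Wsynset score).
topic_scores = {
    "PAC-1704": (0.000, 0.174),
    "PAC-954": (0.514, 0.763),
    "PAC-210": (0.160, 0.000),
    "PAC-42": (0.459, 0.216),
}
changes = {topic: percent_change(b, e) for topic, (b, e) in topic_scores.items()}
share_improved, share_degraded = summarise(topic_scores)
```

Topics with a zero baseline are reported as an infinite gain, matching the +∞ entries in the table.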

Conclusions

PRF is not effective for patent prior-art search

WordNet QE for patent search:

Leads to overall significant degradation of retrieval
Has some positive impact on the retrieval of some topics
High computational cost

SynSet QE for patent search:

The most effective and efficient QE technique among those tested
Significant improvement for very high ranks, but significant degradation of overall ranking and recall
No indication of when it fails/succeeds
SynSet can be used as a lexical resource for patent examiners

Future Work

More analysis to better understand when QE fails/succeeds

Applying SynSet on real patent examiners’ queries rather than automatically formulated queries

Combining different QE methods

Alternative methods for query modification, for example query reduction (QR)

Please Check in CIKM Poster Session

Magdy W. and G. J. F. Jones. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search.

Ganguly D., J. Leveling, W. Magdy, and G. J. F. Jones. Query Reduction based on Pseudo-Relevant Documents.

Thank you
