

Page 1: A Study on Query Expansion Methods for Patent Retrieval

A Study on Query Expansion Methods for Patent Retrieval

Walid Magdy, Gareth Jones

Centre for Next Generation Localisation

School of Computing

Dublin City University

24 October 2011

Page 2: Outline

What is the Problem?
Why Patents?
Current Solutions
Testing Existing Approaches
New Approach
Results
Conclusion

Agenda

Motivation
Patent Characteristics
Prior Work
Applying Standard QE
Novel Method
Outcome
Findings

Page 3: Why Patents?

Challenging wording

Using vague and general terms

Strange combination of terms

No defined query (what words to select for search?)

Low retrieval effectiveness

Recall-oriented IR task

Hypothesis: QE leads to a better query/document match, and hence to better results

Page 4: Prior Work

Pseudo Relevance Feedback (PRF) (Kishida K, NTCIR-3; Itoh H, NTCIR-4)

QE using Rocchio formula: no significant improvement

QE using Taylor formula: no significant improvement

Reweighting query terms using PRF: no significant improvement

Inter Query Expansion (QE) for Patent Invalidity Search (Takeuchi H. et al, NTCIR-5)

QE for individual claims from same patent topic: significant improvement, but not applicable for other patent search tasks

Improving Retrievability for Patents (Bashir and Rauber, ECIR 2010)

Enrich queries to improve the retrievability of patents with low chance of retrieval, but not tested for real patent search task

Page 5: Testing QE for Prior-Art Patent Search

CLEF-IP 2010:

1.35M patents from the EPO

1.35K English patent topics

Collection contains EN/FR/DE patents, with translations of titles and claims in three languages

Expand query by: PRF vs. WordNet

Use (Magdy et al., 2011) as baseline (BL) without citation extraction (full patent description section as query)

MAP and PRES were used for evaluation

BL: 0.14 MAP, 0.486 PRES
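The two evaluation scores can be illustrated with a minimal sketch. It assumes the standard definition of average precision (MAP is its mean over topics) and the PRES formula 1 - (mean relevant rank - (n + 1) / 2) / N_max, with the i-th relevant document placed at rank N_max + i when it is not retrieved. The function names and the N_max parameter name are hypothetical, and the handling of unretrieved documents should be checked against the original PRES paper.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic; unretrieved relevant docs contribute 0."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over an iterable of (ranked_ids, relevant_ids) pairs, one per topic."""
    scores = [average_precision(ranked, rel) for ranked, rel in runs]
    return sum(scores) / len(scores) if scores else 0.0


def pres(ranked_ids, relevant_ids, n_max):
    """PRES over the top n_max results (assumes ranked_ids has no duplicates)."""
    relevant = set(relevant_ids)
    n = len(relevant)
    if n == 0:
        return 0.0
    found_ranks = [rank for rank, doc_id in enumerate(ranked_ids[:n_max], start=1)
                   if doc_id in relevant]
    # Assumption: the i-th relevant document, if missing, sits at rank n_max + i.
    missing_ranks = [n_max + i for i in range(len(found_ranks) + 1, n + 1)]
    mean_rank = (sum(found_ranks) + sum(missing_ranks)) / n
    return 1.0 - (mean_rank - (n + 1) / 2) / n_max
```

Under these assumptions PRES is 1 when all relevant documents occupy the top ranks and 0 when none is retrieved within N_max, which is why it rewards recall across the whole list rather than precision at the top.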

Page 6: Applying Pseudo Relevance Feedback

PRF implemented in Indri was used

Different numbers of feedback (FB) terms and documents were tested

MAP (BL = 0.1399)

Docs \ Terms   10     20     30     50
5              0.037  0.053  0.062  0.072
10             0.031  0.046  0.053  0.061
20             0.026  0.036  0.042  0.049

PRES (BL = 0.486)

Docs \ Terms   10     20     30     50
5              0.196  0.234  0.247  0.265
10             0.190  0.222  0.235  0.251
20             0.178  0.205  0.216  0.232
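The two parameters varied in the tables can be made concrete with a minimal sketch of a feedback loop. This is not Indri's implementation (which the slides do not describe); it is a simplified frequency-based variant in which `prf_expand`, `fb_docs` and `fb_terms` are hypothetical names, and documents are assumed to be already tokenised and stemmed.

```python
from collections import Counter


def prf_expand(query_terms, ranked_docs, fb_docs=5, fb_terms=10):
    """Append the fb_terms most frequent new terms found in the top fb_docs documents.

    ranked_docs is a list of token lists, ordered by the initial retrieval score.
    """
    counts = Counter()
    for doc_terms in ranked_docs[:fb_docs]:
        counts.update(doc_terms)
    seen = set(query_terms)
    candidates = [(term, count) for term, count in counts.items() if term not in seen]
    # Highest count first; ties broken alphabetically so the output is deterministic.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in candidates[:fb_terms]]


# Toy usage: only the top two documents feed the expansion.
docs = [["heat", "wast", "exchang"], ["wast", "pipe"], ["unrelated"]]
expanded = prf_expand(["heat", "stream"], docs, fb_docs=2, fb_terms=2)
```

In this simplified form, every feedback document is treated as relevant, so any non-relevant document in the top `fb_docs` contributes its terms to the expanded query.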

Page 7: Using WordNet for Expansion

Expand terms in query using synonyms and hyponyms for nouns and verbs (NS/NH: noun synonyms/hyponyms; VS/VH: verb synonyms/hyponyms)

Apply QE to a sample of 100 topics, then apply the best combination to the full set of 1.35K topics

Sample of 100 topics:

               MAP                PRES
               value   %change    value  %change
Baseline       0.1668  NA         0.584  NA
NS             0.1680  +0.7%      0.562  -3.7%
NS+NH          0.1680  +0.7%      0.561  -3.8%
NS+VS          0.1677  +0.5%      0.551  -5.6%
NS+NH+VS+VH    0.1540  -7.6%      0.544  -6.8%

Full topic set:

               MAP                PRES
               value   %change    value  %change
Baseline       0.1399  NA         0.486  NA
WordNet (NS)   0.1364  -2.5%      0.484  -1.0%

Page 8: Standard QE Approaches

PRF:

Significant degradation in retrieval effectiveness.

This can be expected due to the low initial retrieval precision

WordNet:

Statistically significant degradation of results, but with some successful instances (31% of topics)

Large reduction in retrieval speed, since average query size is at least 5 times larger (34 times larger for the NS+NH+VS+VH)

A new effective and efficient QE method is required!

Page 9: Automatically Generated SynSet

Inputs: English fields, French translations

Steps: Align Sentences, Remove Stopwords, Stem Words, Align Terms, Backoff Alignment

Outputs: EN→FR terms dictionary, FR→EN terms dictionary, EN→EN terms dictionary

Example sentence pair:

process for eliminating foreign matter from a waste heat stream

procédé pour éliminer de la matière étrangère d'un courant de chaleur perdue

After stopword removal and stemming:

process elimin foreign matter wast heat stream

procéd élimin mati étrangèr cour chaleur perdu

Example dictionary entries (term: candidate probability), in the order shown on the slide:

EN→FR  elimin: élimin 0.71, elimin 0.13

FR→EN  élimin: remov 0.71, elimin 0.14

EN→EN  elimin: remov 0.6, elimin 0.16

EN→EN  elimin: remov 0.85, elimin 0.15
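One plausible way to obtain an English-to-English dictionary from the two translation dictionaries is to pivot through French. The slides do not give the exact formula, so the following is a sketch under assumptions: the score of a candidate is the sum over French terms of P(fr | en) times P(en' | fr), low-scoring candidates are dropped, and the rest are renormalised. The function name `pivot_synsets` and the threshold `min_prob` are hypothetical, and the toy input uses only the two entries shown on the slide, so the output is not expected to reproduce the slide's values.

```python
def pivot_synsets(en_fr, fr_en, min_prob=0.05):
    """Build EN->EN synonym sets by pivoting through FR translation probabilities."""
    synsets = {}
    for en_term, fr_probs in en_fr.items():
        scores = {}
        for fr_term, p_fr in fr_probs.items():
            # Follow each French translation back to its English translations.
            for en_syn, p_en in fr_en.get(fr_term, {}).items():
                scores[en_syn] = scores.get(en_syn, 0.0) + p_fr * p_en
        # Drop weak candidates, then renormalise so the weights sum to 1.
        kept = {term: score for term, score in scores.items() if score >= min_prob}
        total = sum(kept.values())
        if total > 0:
            synsets[en_term] = {term: score / total for term, score in kept.items()}
    return synsets


# Toy dictionaries built from the entries on the slide (stemmed terms).
en_fr = {"elimin": {"élimin": 0.71, "elimin": 0.13}}
fr_en = {"élimin": {"remov": 0.71, "elimin": 0.14}}
synsets = pivot_synsets(en_fr, fr_en)
```

The original term can appear in its own synonym set, as in the slide's example, because a French term may translate back to it.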

Page 10: Samples of the Output

motor: motor 0.64, engin 0.36

weight: weight 0.86, wt 0.14

travel: travel 0.67, move 0.19, displac 0.14

color: color 0.56, colour 0.25, dye 0.19

link: link 0.4, connect 0.18, bond 0.17, crosslink 0.13, bind 0.12

cloth: fabric 0.36, cloth 0.3, garment 0.2, tissu 0.14

tube: tube 0.88, pipe 0.12

area: area 0.4, zone 0.23, region 0.2, surfac 0.17

game: set 0.6, game 0.4

play: set 0.3, play 0.24, read 0.2, game 0.16, reproduc 0.1

Page 11: SynSet QE Results

8M parallel EN/FR sentences were extracted from the EPO patent collection to generate SynSets

Two runs were adopted:

Expanding query using SynSet without weights (Usynset)

Utilizing SynSet probabilities as weights to terms in query (Wsynset)

            MAP                PRES
            value   %change    value  %change
Baseline    0.1399  NA         0.486  NA
Wsynset     0.1440  +2.9%      0.485  -0.7%
Usynset     0.1402  +0.2%      0.480  -1.7%
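The difference between the two runs can be sketched as follows, assuming a SynSet is a mapping from a stemmed term to its weighted alternatives. The function `expand_query` and its `weighted` flag are hypothetical names, and how the resulting weights are passed to the retrieval engine is not covered by the slides.

```python
def expand_query(query_terms, synsets, weighted=True):
    """Return {term: weight} for the expanded query.

    weighted=True uses SynSet probabilities as term weights;
    weighted=False adds every alternative with weight 1.
    Terms without a SynSet entry are kept unchanged with weight 1.
    """
    weights = {}
    for term in query_terms:
        for syn, prob in synsets.get(term, {term: 1.0}).items():
            weight = prob if weighted else 1.0
            weights[syn] = weights.get(syn, 0.0) + weight
    return weights


# SynSet entries taken from the sample output (stemmed terms).
sample_synsets = {
    "motor": {"motor": 0.64, "engin": 0.36},
    "tube": {"tube": 0.88, "pipe": 0.12},
}
weighted_query = expand_query(["motor", "tube", "valv"], sample_synsets)
unweighted_query = expand_query(["motor", "tube", "valv"], sample_synsets, weighted=False)
```

In the weighted variant a weak alternative such as "pipe" contributes little to the match, while in the unweighted variant it counts as much as the original term.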

Page 12: SynSet Expansion

Significantly better MAP, but significantly worse PRES

i.e. better retrieval at very high ranks, but worse ranking of relevant results over all ranks and less recall

Some topics were improved (34% of topics), but some were degraded (39% of topics).

Significantly more efficient than PRF and WordNet (query size is only 60% larger)

Page 13: Deeper Look at SynSet

No features with high correlation to SynSet QE success

Initial retrieval quality of BL does not relate to the performance of QE

Improved topics:

Topic ID   Baseline  Wsynset  %change
PAC-1704   0.000     0.174    +∞
PAC-195    0.000     0.215    +∞
PAC-1225   0.105     0.532    +408%
PAC-1670   0.124     0.637    +415%
PAC-954    0.514     0.763    +48%
PAC-122    0.590     0.944    +60%
PAC-579    0.630     0.902    +43%
PAC-1113   0.669     0.880    +32%

Degraded topics:

Topic ID   Baseline  Wsynset  %change
PAC-1510   0.030     0.012    -60%
PAC-210    0.160     0.000    -100%
PAC-220    0.201     0.000    -100%
PAC-56     0.263     0.040    -85%
PAC-784    0.323     0.027    -92%
PAC-42     0.459     0.216    -53%
PAC-906    0.571     0.214    -63%
PAC-1498   0.662     0.307    -54%
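A per-topic comparison of this kind can be computed with a short sketch. It assumes each topic has one baseline score and one expanded-query score; `percent_change` and `summarize` are hypothetical helper names. A zero baseline with a positive new score is reported as infinite, matching the +∞ rows. Because the published percentages were presumably computed from unrounded scores, recomputing them from the rounded values shown can differ by about a point.

```python
import math


def percent_change(baseline, new):
    """Relative change in percent; infinite when improving from a zero baseline."""
    if baseline == 0:
        return math.inf if new > 0 else 0.0
    return 100.0 * (new - baseline) / baseline


def summarize(topics):
    """Return the fractions of improved and degraded topics.

    topics maps a topic ID to a (baseline, new) score pair.
    """
    total = len(topics)
    improved = sum(1 for base, new in topics.values() if new > base)
    degraded = sum(1 for base, new in topics.values() if new < base)
    return improved / total, degraded / total


# A few rows from the tables above.
rows = {
    "PAC-1704": (0.000, 0.174),
    "PAC-122": (0.590, 0.944),
    "PAC-210": (0.160, 0.000),
    "PAC-42": (0.459, 0.216),
}
changes = {topic: percent_change(base, new) for topic, (base, new) in rows.items()}
improved_share, degraded_share = summarize(rows)
```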

Page 14: Conclusions

PRF is not effective with patent prior-art search

WordNet QE for patent search:

Leads to overall significant degradation of retrieval

Has some positive impact on the retrieval of some topics

High computational cost

SynSet QE for patent search:

The most effective and efficient QE technique among those tested

Significant improvement for very high ranks, but significant degradation of overall ranking and recall

No indication of when it fails/succeeds

SynSet can be used as a lexical resource for patent examiners

Page 15: Future Work

More analysis to better understand when QE fails/succeeds

Applying SynSet on real patent examiners’ queries rather than automatically formulated queries

Combining different QE methods

Alternative methods for query modification, for example query reduction (QR)

Page 16: Please Check in CIKM Poster Session

Magdy W. and G. J. F. Jones. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search.

Ganguly D., J. Leveling, W. Magdy, and G. J. F. Jones. Query Reduction based on Pseudo-Relevant Documents.

Thank you