HOW TO KILL INVENTORS: TESTING THE MASSACRATOR© 2.0 ALGORITHM FOR INVENTOR IDENTIFICATION

Francesco Lissoni (Francesco.Lissoni@unibocconi.it)

Michele Pezzoni ([email protected])

DIMI - Università di Brescia

KITES - Università Bocconi, Milano


Identifying inventors with an algorithm: what does it mean in practice?

Identifying inventors within a patent database consists of assigning a unique code to inventors listed on different patents who are believed to be the same person, to the extent that they are homonyms or quasi-homonyms and possibly share similar characteristics.

To identify inventors with an algorithm, we follow three steps (as described by Raffo and Lhuillery, 2009):
1. Cleaning & Parsing
2. Matching
3. Filtering


Databases & Benchmark databases produced by APE-INV

• PatStat-Kites database, which contains patent applications filed at the EPO since 1978. PatStat-Kites results from cleaning and parsing the original PatStat data by means of the Massacrator© 1.0 algorithm (by Gianluca Tarasconi)

• French Academic Benchmark

• EPFL Benchmark [Federal Polytechnic of Lausanne, Switzerland]

The French and EPFL benchmarks are used for testing and setting the Massacrator© algorithm


Massacrator© 2.0: Cleaning & Parsing

1. Punctuation characters are removed and text strings are converted to ASCII

2. Parsing: separate fields for inventor's name, address, city, region and state are created.

Inventor’s name includes surname, second, third or fourth names, and suffixes, such as “junior”, “senior”, “III”. Personal titles (“Prof.”, “Professor”) are discarded.

3. Further parsing: separate fields for each token (word) in inventor’s name, as resulting from 2.

Massacrator© then proceeds to matching on the tokens of the inventor's name (as produced by parsing step 3)
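The cleaning and tokenising steps above can be sketched as follows. This is only an illustration under stated assumptions, not the Massacrator© code: the `TITLES` set, the upper-casing, and the choice to replace punctuation with a space (rather than delete it outright) are assumptions made for the example, and address parsing is left out.

```python
import re
import unicodedata

# Hypothetical list of personal titles to discard; the slides name only
# "Prof." and "Professor".
TITLES = {"PROF", "PROFESSOR"}


def clean(text: str) -> str:
    """Convert to ASCII, drop punctuation, collapse whitespace, upper-case."""
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Assumption: punctuation becomes a space so "ROSSI,MARIO" stays two tokens.
    no_punct = re.sub(r"[^A-Za-z0-9\s]", " ", ascii_text)
    return re.sub(r"\s+", " ", no_punct).strip().upper()


def name_tokens(raw_name: str) -> list[str]:
    """Split a cleaned name into tokens, discarding personal titles."""
    return [tok for tok in clean(raw_name).split() if tok not in TITLES]


print(name_tokens("Prof. Pezzòni, Michele Jr."))
# ['PEZZONI', 'MICHELE', 'JR']
```

Suffixes such as "JR" are kept as tokens, in line with step 2, while titles are dropped.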


Massacrator© 2.0: Matching

1. List and sort all the tokens from the inventors’ names in alphabetical order

2. Compute the 2-GRAM distance for subsequent tokens (that is, token in row n and token in row n+1)

3. Define groups of tokens as follows:
– starting from the top of the sorted list, assign the token in row 1 to group 1;
– move to the token in row 2: if its 2-GRAM distance from the token in row 1 is less than or equal to a pre-determined value, assign it to group 1; otherwise create a separate group, in this case group 2;
– … and so on for row n and row n+1.
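Steps 1 and 2 can be illustrated with a short sketch. The slides do not define the 2-GRAM distance, so the formula below (one minus the Dice coefficient over sets of adjacent character pairs) is an assumption chosen for illustration; it will not necessarily reproduce the scores used by Massacrator©.

```python
def bigrams(token: str) -> set[str]:
    """Set of adjacent character pairs in a token."""
    return {token[i:i + 2] for i in range(len(token) - 1)}


def bigram_distance(a: str, b: str) -> float:
    """1 - Dice coefficient over bigram sets (an assumed definition)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Tokens shorter than two characters have no bigrams.
        return 0.0 if a == b else 1.0
    return 1.0 - 2.0 * len(ga & gb) / (len(ga) + len(gb))


# Step 1: sort the tokens alphabetically.
tokens = sorted(["PEZZOTTI", "PEZZOLA", "PEZZOTTA"])
# Step 2: distance between each token and the one that follows it.
consecutive = [
    bigram_distance(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)
]
print(list(zip(tokens[1:], consecutive)))
```

Only consecutive rows are compared, so the cost is linear in the number of distinct tokens rather than quadratic.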


Massacrator© 2.0: Matching

• Example:
– the tokens PEZZONI and PEZZOPANE belong to different groups (2, 3)
– PEZZOTI, PEZZOTTA and PEZZOTTI belong to the same group (4)

STRING       2G score   Group (thr. 0.15)   Freq. in PatStat
PEZZOLA                 1                   3
PEZZOLATO    0.10       1                   1
PEZZOLI      0.14       1                   20
PEZZOLO      0.12       1                   1
PEZZONI      0.17       2                   6
PEZZOPANE    0.17       3                   1
PEZZOTI      0.17       4                   1
PEZZOTTA     0.13       4                   3
PEZZOTTI     0.10       4                   5
PEZZULLI     0.20       5                   1
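The grouping rule can be checked against the example with a minimal sketch. It takes the 2G scores listed in the example as given and assumes that the score on each row is the distance from the token in the previous row; the function name `assign_groups` is hypothetical.

```python
def assign_groups(distances, threshold):
    """Group a sorted token list given distances between consecutive tokens.

    distances[i] is the distance between token i and token i + 1, so a list
    of n tokens has n - 1 distances. Returns one group number per token.
    """
    groups = [1]  # the first token always opens group 1
    for d in distances:
        # Stay in the current group if close enough, otherwise open a new one.
        groups.append(groups[-1] if d <= threshold else groups[-1] + 1)
    return groups


tokens = ["PEZZOLA", "PEZZOLATO", "PEZZOLI", "PEZZOLO", "PEZZONI",
          "PEZZOPANE", "PEZZOTI", "PEZZOTTA", "PEZZOTTI", "PEZZULLI"]
# 2G scores copied from the example (the first row has none).
scores = [0.10, 0.14, 0.12, 0.17, 0.17, 0.17, 0.13, 0.10, 0.20]

for token, group in zip(tokens, assign_groups(scores, threshold=0.15)):
    print(token, group)
```

With the threshold of 0.15 this reproduces the groups 1 to 5 shown in the example.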


Massacrator© 2.0: Matching

• We compute the number of tokens for each inventor in a pair, say n1 and n2, and take the minimum, min(n1,n2). E.g. KNIGHT DAVID JOHN (n1=3) and KNIGHT JOHN (n2=2), so min(n1,n2)=2

• We then match all inventors who have min(n1,n2) tokens belonging to the same groups, whatever the order of the tokens

• Example:

• This approach results in 10 million matched inventors. Depending on the number of groups, the number of matches grows very fast.
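The matching rule can be sketched as a comparison of the two inventors' token groups. The `group_of` mapping below is hypothetical (in practice it would come from the 2-GRAM grouping step), and treating repeated groups as a multiset is an assumption of this sketch.

```python
from collections import Counter


def is_match(tokens_a, tokens_b, group_of):
    """True if min(n1, n2) tokens fall in the same groups, in any order."""
    groups_a = Counter(group_of[t] for t in tokens_a)
    groups_b = Counter(group_of[t] for t in tokens_b)
    # Size of the multiset intersection: number of tokens that can be paired.
    shared = sum((groups_a & groups_b).values())
    return shared >= min(len(tokens_a), len(tokens_b))


# Hypothetical token-to-group mapping for illustration only.
group_of = {"KNIGHT": 1, "DAVID": 2, "JOHN": 3, "PAUL": 4}

print(is_match(["KNIGHT", "DAVID", "JOHN"], ["JOHN", "KNIGHT"], group_of))  # True
print(is_match(["KNIGHT", "DAVID", "JOHN"], ["KNIGHT", "PAUL"], group_of))  # False
```

Because only min(n1,n2) tokens need to agree, a short name matches every longer name that contains it, which is one reason the number of candidate matches is so large before filtering.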


Massacrator© 2.0: Filtering

For any pair m of matched inventors i and j, we consider 15 criteria in order to compute the similarity scores, namely:

1. Network criteria
   a. [Coinventor]
   b. [Three.Degrees] of separation
   c. ASE
2. Geographical criteria
   a. [City]
   b. [Province]
   c. [Region]
   d. [State]
   e. Street name and civic number [Street]
3. Applicant-related criteria
   a. [Applicant]
   b. [Small.Applicant] Applicant with fewer than 50 inventors
4. IPC class criteria
   a. 4 digits in common [IPC.4]
   b. 6 digits in common [IPC.6]
   c. 12 digits in common [IPC.12]
5. Other criteria
   a. Priority dates differ by less than 5 years [Five.Years]
   b. Citations [Citation]
   c. Rare surname [Rare.Surname]

NB: Each criterion is represented by a dummy variable. Ex.: the common coinventor dummy = 1 if the matched inventors share at least one coinventor
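To make the dummy-variable idea concrete, here is a minimal sketch for four of the criteria. The record layout (keys `coinventors`, `city`, `ipc`, `priority_year`) is hypothetical, and each record is simplified to a single patent with a single IPC code; the actual data model is not described in the slides.

```python
def criteria_dummies(inv_i: dict, inv_j: dict) -> dict:
    """Return 0/1 dummies for a few of the filtering criteria."""
    shared_coinventors = set(inv_i["coinventors"]) & set(inv_j["coinventors"])
    return {
        "Coinventor": int(bool(shared_coinventors)),
        "City": int(inv_i["city"] == inv_j["city"]),
        # First 4 characters of the IPC code in common.
        "IPC.4": int(inv_i["ipc"][:4] == inv_j["ipc"][:4]),
        # Priority dates differ by less than 5 years.
        "Five.Years": int(abs(inv_i["priority_year"] - inv_j["priority_year"]) < 5),
    }


# Made-up records for illustration only.
a = {"coinventors": ["ROSSI MARIO"], "city": "MILANO",
     "ipc": "A61K0031", "priority_year": 1995}
b = {"coinventors": ["ROSSI MARIO", "BIANCHI ANNA"], "city": "BRESCIA",
     "ipc": "A61K0009", "priority_year": 2003}

print(criteria_dummies(a, b))
# {'Coinventor': 1, 'City': 0, 'IPC.4': 1, 'Five.Years': 0}
```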


ASE (by Hsini Huang, Li Tang, John Walsh)


Testing methodology

What do we want to test?
• What is the impact of the criteria on Precision and Recall?

Setting the algorithm
• Do we have to consider all the criteria, or is it better to select a subset?
• Which criteria are most appropriate to maximize Precision? And Recall?


Testing methodology: Measures of Performance

Massacrator output: [ inventor i, patent pi , inventor j, patent pj , D_αm ]
where D_αm (which refers to the pair m comparing inventors i and j) is a binary variable that takes value 1 if matched inventors i and j are believed to be the same person (positive match) and 0 otherwise (negative match).

Benchmark outputs: [ inventor i, patent pi , inventor j, patent pj , D_γm ]
False/true positives/negatives are calculated by comparing Massacrator©'s results (D_αm) to the information in the benchmark databases (D_γm)
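The comparison of D_αm with D_γm can be sketched with the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The function name and the made-up vectors are illustrative; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(d_alpha, d_gamma):
    """Precision and recall of algorithm decisions against a benchmark.

    d_alpha[m] is the algorithm's 0/1 decision for pair m and d_gamma[m]
    is the benchmark's 0/1 truth for the same pair.
    """
    pairs = list(zip(d_alpha, d_gamma))
    tp = sum(1 for a, g in pairs if a == 1 and g == 1)  # true positives
    fp = sum(1 for a, g in pairs if a == 1 and g == 0)  # false positives
    fn = sum(1 for a, g in pairs if a == 0 and g == 1)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up decisions for five pairs.
d_alpha = [1, 1, 0, 0, 1]
d_gamma = [1, 0, 1, 0, 1]
print(precision_recall(d_alpha, d_gamma))  # 2 TP, 1 FP, 1 FN
```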


Testing methodology: How do we get D_αm?

5 steps:

1. Matching: inventors from the PatStat-Keins database are matched one to another. A set of dummy variables x_k is associated with each pair of inventors (pairs a, b, c in the example below), where each variable (x_city, x_IPC.4, ...) corresponds to one of the filtering criteria

2. Simulation_1: we randomly draw W (=3) vectors of weights (ω_w) from a multivariate Bernoulli distribution (success prob. = 0.5, K dimensions). [ω_city,w = 1 (0) means that the criterion is (not) selected]

3. For each matched pair (a, b, c) we can compute a similarity score (α_m,w), that is, the number of selected criteria the pair has in common:

X [3x2]
       city  IPC.4
a      1     1
b      1     0
c      0     1

Ω [2x3]
       ω1  ω2  ω3
city   0   1   1
IPC.4  1   1   0

A = X x Ω [3x3]
       ω1  ω2  ω3
a      1   2   1
b      0   1   1
c      1   1   0
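The worked example in step 3 can be reproduced with a plain matrix product. The sketch below uses only the standard library; `draw_weights` is a hypothetical helper showing how the random weights of step 2 might be drawn, and is not taken from the slides.

```python
import random

# Dummy variables from the example: one row per pair, columns city and IPC.4.
X = [[1, 1],   # pair a
     [1, 0],   # pair b
     [0, 1]]   # pair c

# Weights from the example: one row per criterion, one column per draw.
OMEGA = [[0, 1, 1],   # city
         [1, 1, 0]]   # IPC.4


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def draw_weights(n_criteria, n_draws, rng):
    """Draw independent Bernoulli(0.5) weights, one column per draw."""
    return [[int(rng.random() < 0.5) for _ in range(n_draws)]
            for _ in range(n_criteria)]


A = matmul(X, OMEGA)
print(A)  # [[1, 2, 1], [0, 1, 1], [1, 1, 0]]

random_omega = draw_weights(n_criteria=2, n_draws=3, rng=random.Random(0))
```

Each entry of A counts how many of the criteria selected in a given draw are satisfied by a given pair.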


Testing methodology: Exercises

4. Simulation_2: in each simulation run w we set a threshold value (threshold_w) for the similarity score (α_m,w), above which the two inventors in match m are considered the same person. The threshold is drawn randomly from a uniform distribution U(0,4).

5. For each run (ω_w), we identify the positive and negative matches and compare them with the information contained in the benchmark databases -> true/false positives/negatives -> precision/recall
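Step 4 amounts to turning similarity scores into 0/1 decisions. A minimal sketch, assuming a strict "above the threshold" comparison; the scores are made up and the names are hypothetical.

```python
import random


def classify(scores, threshold):
    """D_alpha for one run: 1 if the pair's score is above the threshold."""
    return [int(s > threshold) for s in scores]


# Made-up similarity scores for three matched pairs in one run.
scores = [2, 1, 1]
print(classify(scores, threshold=1.5))  # [1, 0, 0]

# In the simulation the threshold itself is random, drawn from U(0, 4).
rng = random.Random(0)
threshold = rng.uniform(0, 4)
decisions = classify(scores, threshold)
```

Since scores are integers, any threshold between two consecutive integers gives the same decisions; a threshold of 2.54, for example, is equivalent to requiring at least 3 criteria in common.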


Figure: Precision and Recall vs. threshold. Each point corresponds to the values of Precision and Recall obtained with a specific set of weights ω_w.


Figure: Precision and Recall vs. organization. Each point corresponds to the values of Precision and Recall obtained with a specific set of weights ω_w.


Regression (1/2)

Precision = β0 + β Ω + ε

Recall = β0 + β Ω + ε

• It is a regression of Precision and Recall on the matrix of weights Ω

• Weights are independent and identically distributed by definition

Ω [2x3] (rows: variables; columns: observations)
       ω1  ω2  ω3
city   0   1   1
IPC.4  1   1   0
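A regression of this form can be sketched with ordinary least squares solved through the normal equations. Everything below is illustrative: the estimation method actually used is not stated in the slides, the precision values are made up, and the runs are stored one per row (the transpose of Ω as displayed).

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(weights, outcome):
    """OLS coefficients [b0, b1, ...] of outcome on 0/1 weights plus intercept."""
    rows = [[1.0] + [float(w) for w in run] for run in weights]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    return solve_linear(xtx, xty)


# Made-up runs: (city weight, IPC.4 weight) and the precision observed in each.
runs = [(0, 0), (0, 1), (1, 0), (1, 1)]
precision = [0.5, 0.4, 0.7, 0.6]
beta = ols(runs, precision)
print([round(b, 3) for b in beta])  # intercept, city effect, IPC.4 effect
```

Each coefficient estimates how much selecting that criterion shifts the outcome, holding the other weights fixed.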


Regression (2/2)

• All criteria show a trade-off between Precision and Recall, except COINVENTOR, which always increases both Precision and Recall

• The interaction with the EPFL dummy measures the different impact of each variable in the two benchmarks (French academics or EPFL scientists)


Setting the algorithm: finding dominant solutions (the frontier)

Balanced: 9 obs.

High precision: 7 obs.

High recall: 5 obs.
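One way to read "dominant solutions" is as a Pareto frontier: a run is kept if no other run is at least as good on both Precision and Recall. The sketch below implements that reading, which is an assumption; the slides do not spell out the rule, and the points are made up.

```python
def dominant_solutions(points):
    """Return the (precision, recall) points not dominated by any other point."""
    def dominates(p, q):
        # p dominates q if it is at least as good on both measures and differs.
        return p[0] >= q[0] and p[1] >= q[1] and p != q

    return [q for q in points if not any(dominates(p, q) for p in points)]


# Made-up (precision, recall) results from five simulation runs.
points = [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9), (0.6, 0.6), (0.4, 0.8)]
print(dominant_solutions(points))
# [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9)]
```

In this toy example the three surviving points would correspond to a high-precision, a balanced and a high-recall solution respectively.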


We test for the over (under)-representation of each criterion among the dominant solutions

• Each weight (ω_city, ω_IPC.4, ...) is a random variable with a Bernoulli distribution (p=0.5), so we expect an average value of 0.5 (avg[ω_city] = 0.5 = p)

• We test for the over/under-representation of each criterion among the subsets of dominant solutions (i.e. balanced, high precision, high recall and all dominant solutions). This means testing whether a criterion is selected (ω_city,w = 1) more (less) frequently than expected in a subset of solutions.
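A test of this kind can be sketched with an exact one-sided binomial test against p = 0.5. This is an assumed choice of test; the slides do not say which test was used, and the counts below are made up.

```python
from math import comb


def binomial_p_values(selected, n, p=0.5):
    """One-sided exact binomial p-values under H0: selection probability = p.

    Returns (p_less, p_greater): the probability of seeing at most, and at
    least, `selected` selections out of n solutions if H0 holds.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    p_less = sum(pmf[: selected + 1])  # evidence of under-representation
    p_greater = sum(pmf[selected:])    # evidence of over-representation
    return p_less, p_greater


# Made-up example: a criterion selected in 8 of 9 solutions in a subset.
p_less, p_greater = binomial_p_values(selected=8, n=9)
print(p_less, p_greater)
```

A small `p_greater` suggests the criterion is over-represented in the subset, and a small `p_less` suggests it is under-represented.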


Average [p-value (H0: = 0.5, Ha: < 0.5), p-value (H0: = 0.5, Ha: > 0.5)]


Conclusion

• To get the balanced result, according to the benchmarks, we include the following characteristics in filtering: IPC.4, Citation, City, Street, IPC.12, Applicant, Small Applicant, Coinventor, Three Degrees, ASE

• ...and set a minimum threshold of 2.54 (i.e. at least 3 characteristics in common)

• We computed two other versions of the cleaned data, one that maximises Precision [High Precision] and another that maximises Recall [High Recall]


Results

• Balanced: 2806516 inv. -> 2197767 inv. (-22%)

• High Precision: 2806516 inv. -> 2481582 inv. (-12%)

• High Recall: 2806516 inv. -> 2032701 inv. (-28%)