Untangling Names: Lessons learned (so far) from the linking of IPNI and TROPICOS. Julius Welby, RBG Kew, j.welby@kew.org


Page 1: Untangling Names Lessons learned (so far) from the linking of IPNI and TROPICOS Julius Welby RBG Kew j.welby@kew.org

Untangling Names

Lessons learned (so far) from the linking of IPNI and TROPICOS

Julius Welby, RBG Kew

j.welby@kew.org

Page 2

TROPICOS + IPNI

Page 3

Why match?

Page 4

Why is this difficult?

Page 5

Variation

Calophyllum kiong K.Schum. & Lauterb.

Fl. Deutsch. Sudsee, 450.

Calophyllum kiong Lauterb. & K.Schum.

Die Flora der Deutschen Schutzgebiete in der Sudsee 1900

Page 6

Duplication

• Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
• Poa annua L. -- Species Plantarum 2 1753 (APNI)
• Poa annua L. -- Sp. Pl. 68. (IK)

Page 7

Duplication

• Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)
• Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
• Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)

Page 8

Matching

Page 9

Fields

1   Calophyllum            Calophyllum
2   kiong                  kiong
3   K.Schum. & Lauterb.    Lauterb. & K.Schum.
4   Fl. Deutsch. Sudsee    Die Flora der Deutschen…
5   450.                   1900

Page 10

Lesson 1

Speed matters

Page 11

Speed matters

2,500 by 2,000 by 4 fields

20,000,000 comparisons

~5.5 hours at 1ms per comparison
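The arithmetic behind these figures can be checked with a few lines of Python. This is only a worked example of the numbers on the slide; the variable names are illustrative and not taken from the framework.

```python
# Figures from the slide; the variable names are illustrative only.
records_a = 2_500               # names in one dataset
records_b = 2_000               # names in the other dataset
fields_per_pair = 4             # fields compared for each pair of names
seconds_per_comparison = 0.001  # 1 ms per field comparison

# Every name in one dataset is compared with every name in the other.
comparisons = records_a * records_b * fields_per_pair
hours = comparisons * seconds_per_comparison / 3600

print(f"{comparisons:,} field comparisons, roughly {hours:.2f} hours")
```

The total is 20,000,000 field comparisons and a little over five and a half hours, in line with the slide. The cost grows with the product of the two dataset sizes, which is why the later slides concentrate on avoiding comparisons altogether.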

Page 12

Be lazy

• Do as little as possible
• Do easy things if possible
• Do hard things only if necessary
• Only expend effort when it’s worth it

Page 13

Be lazy

• Do as little as possible
  – Specify fields as ‘must match’
  – If a ‘must match’ field fails
    • Mark the match as failed
    • Stop comparing fields
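The 'must match' idea might be sketched as below. This is a minimal illustration, not the framework's actual code: `FieldRule`, `exact` and `compare` are hypothetical names, and treating a failed optional field as a recorded "soft failure" rather than a failed match is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldRule:
    """Hypothetical per-field parameters for one comparison."""
    name: str
    matcher: Callable[[str, str], bool]
    must_match: bool = True


def exact(a: str, b: str) -> bool:
    return a == b


def compare(rec_a: dict, rec_b: dict, rules: list) -> tuple:
    """Return (matched, fields_compared, soft_failures).

    As soon as a 'must match' field fails, the match is marked as
    failed and no further fields are compared.
    """
    compared = 0
    soft_failures = []
    for rule in rules:
        compared += 1
        if rule.matcher(rec_a[rule.name], rec_b[rule.name]):
            continue
        if rule.must_match:
            return False, compared, soft_failures
        # Assumption: optional fields are noted but do not fail the match.
        soft_failures.append(rule.name)
    return True, compared, soft_failures


rules = [
    FieldRule("genus", exact),
    FieldRule("species", exact),
    FieldRule("authors", exact, must_match=False),
]
kiong = {"genus": "Calophyllum", "species": "kiong",
         "authors": "K.Schum. & Lauterb."}
annua = {"genus": "Poa", "species": "annua", "authors": "L."}

print(compare(kiong, annua, rules))
```

Comparing Calophyllum kiong with Poa annua stops after the genus field, so only one of the three field comparisons is paid for.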

Page 14

Parameterised matching

species, infragenus, infraspecies, authors, rank …

Page 15

How lazy?

Page 16

Optimising

• The order of field matching is important
  – Choose suitable fields to match first
  – Aim to fail matches early

• Significant speed-up
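The effect of field order can be seen by counting field comparisons on a toy example. The sketch below uses three names that appear earlier in the slides and a hypothetical helper, `count_field_comparisons`; it assumes exact matching and that every field is 'must match'.

```python
from itertools import product


def count_field_comparisons(left, right, field_order):
    """Count field comparisons when each pair stops at its first failure."""
    total = 0
    for a, b in product(left, right):
        for field in field_order:
            total += 1
            if a[field] != b[field]:
                break  # fail early: skip the remaining fields
    return total


names = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Poa", "species": "annua"},
]

genus_first = count_field_comparisons(names, names, ["genus", "species"])
species_first = count_field_comparisons(names, names, ["species", "genus"])
print(genus_first, species_first)
```

Here the species epithet is the more discriminating field, so testing it first fails more pairs on the first comparison. In this toy set the saving is small (12 comparisons instead of 14), but the same effect scales with the size of the datasets.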

Page 17

Also, for speed

• Do as little as possible
  – Do escaping or standardisation once
  – Done on import for each dataset
  – Keep field matching functions clean
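One way to read this slide: clean each value once, when the dataset is loaded, so that the matching functions never repeat the work. The sketch below assumes a particular clean-up (Unicode normalisation, whitespace collapsing, lower-casing) purely for illustration; the slides do not say which steps the framework applies, and the function names are hypothetical.

```python
import unicodedata


def standardise(value: str) -> str:
    """Illustrative clean-up; the actual steps are an assumption."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.split()).lower()


def import_dataset(rows, fields):
    """Standardise every field once, at import time."""
    return [{f: standardise(row[f]) for f in fields} for row in rows]


def match_field(a: str, b: str) -> bool:
    # Stays clean: inputs are assumed to be standardised already,
    # so there is no per-comparison stripping or case folding here.
    return a == b


raw = [{"genus": "  Calophyllum ", "species": "Kiong"}]
clean = import_dataset(raw, ["genus", "species"])
print(clean)
```

With the figures from the earlier slide, this cleans 4,500 records once instead of repeating the same work inside 20,000,000 field comparisons.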

Page 18

More speed optimisation

• Do easy things if possible
  – Define cascading tests
  – Do easy tests first, if practical
  – Length comparisons
  – Composition comparisons
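A cascade of this kind could look like the sketch below: a length test, then a character-composition test, and only then a more expensive similarity measure. The thresholds, the function names and the use of `difflib.SequenceMatcher` as the expensive step are all assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter
from difflib import SequenceMatcher


def length_test(a: str, b: str, max_diff: int = 2) -> bool:
    """Cheapest test: strings of very different length cannot be close."""
    return abs(len(a) - len(b)) <= max_diff


def composition_test(a: str, b: str, max_diff: int = 2) -> bool:
    """Cheap test: count characters that differ when order is ignored."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values()) <= max_diff


def similarity_test(a: str, b: str, threshold: float = 0.85) -> bool:
    """Expensive test, reached only when the cheap tests pass."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


CASCADE = [length_test, composition_test, similarity_test]


def fuzzy_equal(a: str, b: str) -> bool:
    if a == b:
        return True
    # all() stops at the first failing test, so later tests are skipped.
    return all(test(a, b) for test in CASCADE)


print(fuzzy_equal("microphyllum", "microphylum"))
print(fuzzy_equal("kiong", "microphyllum"))
```

The second call is rejected by the length test alone, so neither the composition count nor the similarity ratio is ever computed for it.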

Page 19

Speed Lessons

• Speed matters

• Minimise comparisons made
  – ‘Must match’ parameters
  – Match fields in an efficient order

• Do data cleaning once, up front

• Look for ways to fail matches cheaply

Page 20

Accuracy

Page 21

Accuracy

False +

False -

OK

Page 22

Strict match

F-

OK

Page 23

Fuzzy match

F+

OK

Page 24

Doughnut of uncertainty

Page 25

Lesson 2: Look at near misses

Page 26

Near misses are checkable

Page 27

One approach

• Currently, to get best results:
  – Tend towards strictness
  – Handle false negatives

Page 28

One approach

• Currently, best results from:
  – Tend towards strictness
  – Handle false negatives

• Failures on ‘rightmost’ fields can be written to a report

• Checked and fed back in as escapes

• Rerun
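The report, check and rerun cycle might be sketched as follows. This is an illustration under stated assumptions: 'rightmost' is taken to mean the last field in the matching order, escapes are modelled as a mapping from a (field, variant value) pair to an accepted value applied to one dataset before the rerun, and the record ids `A1` and `B1` and all function names are hypothetical.

```python
FIELD_ORDER = ["genus", "species", "authors"]  # hypothetical matching order


def strict_run(left, right):
    """Strict matching that reports pairs failing only on the last field."""
    matches, near_misses = [], []
    for a in left:
        for b in right:
            failed = next((f for f in FIELD_ORDER if a[f] != b[f]), None)
            if failed is None:
                matches.append((a["id"], b["id"]))
            elif failed == FIELD_ORDER[-1]:
                # Near miss: everything agreed except the rightmost field.
                near_misses.append((failed, a[failed], b[failed]))
    return matches, near_misses


def apply_escapes(records, escapes):
    """Replace checked variants with their accepted form."""
    return [{f: escapes.get((f, v), v) for f, v in rec.items()}
            for rec in records]


left = [{"id": "A1", "genus": "Calophyllum", "species": "kiong",
         "authors": "K.Schum. & Lauterb."}]
right = [{"id": "B1", "genus": "Calophyllum", "species": "kiong",
          "authors": "Lauterb. & K.Schum."}]

# First run: no match, but the pair lands in the near-miss report.
matches, report = strict_run(left, right)

# After a person has checked the report, the variant becomes an escape.
escapes = {("authors", "Lauterb. & K.Schum."): "K.Schum. & Lauterb."}
matches_after, report_after = strict_run(left, apply_escapes(right, escapes))
print(report, matches_after)
```

The strict first pass produces no false positive, and the one false negative is recovered on the rerun once the author-order variant has been confirmed by hand.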

Page 29

Lesson 3: Remove predictable variation

Page 30

Predictable variation

• Gendered endings

• Common alternatives
  – Endings:
    • ii, i
    • iae, ae

• Dataset specific quirks:
  – &amp;, &
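Removing this kind of variation before matching could be sketched as below. The rules are illustrative assumptions rather than the framework's own: gendered endings are limited to -us, -a and -um, the -ii/-i and -iae/-ae alternatives are collapsed to the shorter form, and the dataset quirk is assumed to be an HTML-escaped ampersand, which the transcript does not make explicit. The function names are hypothetical.

```python
import re


def normalise_epithet(epithet: str) -> str:
    """Reduce an epithet to a comparison key (not for display)."""
    key = epithet.lower()
    key = re.sub(r"iae$", "ae", key)  # -iae and -ae treated alike
    key = re.sub(r"ii$", "i", key)    # -ii and -i treated alike
    # Assumption: only the -us / -a / -um gendered endings are stripped.
    return re.sub(r"(us|um|a)$", "", key)


def normalise_authors(authors: str) -> str:
    # Assumed dataset quirk: ampersands stored as the HTML entity.
    return authors.replace("&amp;", "&")


for variant in ("annua", "annuus", "annuum"):
    print(variant, "->", normalise_epithet(variant))
print(normalise_authors("K.Schum. &amp; Lauterb."))
```

The normalised keys are meant only for comparison. The original spellings stay in the records, so nothing is lost if a rule later turns out to be too aggressive.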

Page 31

The framework

• Python
• Psyco
• Modular
• Extensible
• In progress
• More details will be available on the TDWG website
• Source code availability

Page 32

The framework

• Some results (HTML)

Page 33

Thanks to

• Bob Magill
• Sally Hinchcliffe
• The Moore Foundation

• Contact:
  – j.welby@kew.org
  – or after Jan 2007: [email protected]