1 Interpretation and fault-tolerant identification of relationship data Holger Wandt Colloquium Taal en Spraak KU Nijmegen Wednesday 3 March 2004

1

Interpretation and fault-tolerant identification of relationship data

Holger Wandt

Colloquium Taal en Spraak KU Nijmegen

Wednesday 3 March 2004

2

Overview

• The use of knowledge tables
  - Relationship data: segmentation, storage
  - Attributes
  - Statistics
  - Rules
  - A closer look

• How do we use the knowledge and the rules in interpretation?

• The Rolodex-demo

3

ANK Engineering Ltd. Appleford

4

Monsieur e/o Madame Durand

5

Int. Transp. Ond. Joh. Tilburg Hardinxv./Giessend. e/o

6

Fysiotherapeutisch Centrum
Arie en Jolanda Kruizenga
Intake Unit 1

7

Dr. John Park jr. BA, MR EconS, MKM

8

Siemens ElectroCom GmbH & Co.
Postdienstautomatisierung und Technologieentwicklung

9

DE POST
c/o mevrouw A. Vanderwalle-Van Damme
Industrieel Ingenieur Logistiek

10

RegTP, Regulierungsbehörde für Telekommunikation und Post

11

CQCS International Consulting

12

Chowhounds Delight
Restaurant & Bar
Attn: John Peter Arnold

13

Eerste Roelofarendveense Papierfabriek Anno 1931 NV
h.o.d.n. “Papier Hier”

14

NATIONALE SOCIALE VERZEKINGSKAS VOOR MIDDENSTAND EN BEROEPEN SUKKURSALE BRUGGE V.Z.W. / A.S.B.L.

15

Suomen Posti OY
Tuotteet/ Mediapalvelut/ Osoitepalvelut

16

Let’s summarize…

• Surnames
• Given names
• Forms of address
• Titles
• Prefixes/infixes and prepositions/articles
• Additions
• Professions
• Geographical items
• Legal forms
• Company words
• Divisions
• Company names
• Ordinals

17

Relationship data

• LCR manages and maintains 3 knowledge databases for each country:
  • 1stbase
  • Fambase
  • DicMan

• LCR manages and maintains country-specific synonym tables

18

Storage of relationship data

• Segmentation (define groups of data)

• Attributes of groups

• Attributes of particular items

• Link between items (abbreviation, plural, etc.)

19

STATISTICS                                    BE        DE        NL
Surnames                                  337410   1006097    277312
Given names                                20618     22425     25569
FoA                                          269       131       136
Titles                                       284      1739       279
Prefix/Infix & articles/prepositions         654       664       498
Additions                                    324       192       143
Professions                                  968      2792       355
Geogr. items                               12416     32248     18611
Legal forms                                  236      1835       138
Company words                              20467      8121      5920
Divisions                                    172       160        90
Company names                               1967      1504       684
Ordinals                                     421       293        71

20

General and country specific rules

• Capitalization

• Punctuation

• Word break

• Abbreviation

21

Capitalization

• Belgium:
  • Flemish: Karin Van der Ploeg
  • Walloon: Henri de La Censerie

• Germany:
  • E.v. Buskirk KG
  • Verband der Chemischen Industrie e.V.

• Netherlands:
  • Puffelen r.a., Victor van
  • Puffelen RA, de heer Van
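The Dutch and Flemish examples can be read as a small rendering rule: in the Netherlands the prefix is written in lower case when a given name or initials directly precede it and capitalized otherwise, while the Flemish example keeps the capital in both cases. The sketch below assumes exactly that reading; the function names `render_nl` and `render_be_flemish` are hypothetical and not taken from the slides.

```python
def render_nl(prefix, surname, preceding=""):
    """Render a Dutch (NL) family name.

    `preceding` is a given name or initials; leave it empty when the name
    follows a form of address or stands alone.
    """
    if preceding:
        # Given name or initials in front: prefix stays lower case.
        return f"{preceding} {prefix.lower()} {surname}"
    # Nothing name-like in front: first letter of the prefix is capitalized.
    return f"{prefix.capitalize()} {surname}"


def render_be_flemish(prefix, surname, preceding=""):
    """Assumption: Flemish names keep the capitalized prefix in every position."""
    name = f"{prefix.capitalize()} {surname}"
    return f"{preceding} {name}" if preceding else name


print(render_nl("van", "Puffelen", "Victor"))
print("de heer " + render_nl("van", "Puffelen"))
print(render_be_flemish("van der", "Ploeg", "Karin"))
```

A production rule base would also have to cover Walloon and German conventions, which the slide only illustrates by example.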

22

Punctuation

• Mr Theodor St.John
• mr. Olaf Oudendijk
• Martin Klaus Lehmann
• Martin, Klaus & Lehmann
• HA.DI.WE. Inh: Hans-Dieter Weber
• Don Quichotte N.V./S.A.
• Don Quichotte NV/SA

23

Epitaph

Here lies

my beloved wife Christine

In heaven she is not

in hell I know

It’s written for

everyone to be seen

24

Word break

J.P.L. den He-

yer Groepsex-

cursies

General and country-specific rules:
- In NL: ma-chi-nes
- In GB: ma-chines

NEVER: mac-hines

25

Abbreviation

General rule for BE, DE and NL: No word may be abbreviated further than its first Vowel-Consonant (VC) group or its first Consonant-Vowel-Consonant (CVC) group.

Abbreviation – abbrev. – abbr.

Consonant – conson. – cons.
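One way to read the rule, consistent with the two examples above, is: keep any leading consonants, the first run of vowels, and the run of consonants that follows. The Python sketch below implements that reading only; the names `shortest_abbreviation` and `is_allowed` and the plain five-letter vowel set are assumptions made for illustration.

```python
import re

# Assumption: only a, e, i, o, u count as vowels; 'y' and accented vowels
# would need country-specific handling that the slides do not describe.
_FIRST_GROUP = re.compile(r"[^aeiou]*[aeiou]+[^aeiou]+", re.IGNORECASE)


def shortest_abbreviation(word):
    """Return the shortest abbreviation permitted under the VC/CVC reading."""
    match = _FIRST_GROUP.match(word)
    if match is None or match.end() >= len(word):
        return word  # nothing left to cut, so the word stays whole
    return match.group(0) + "."


def is_allowed(abbreviation, word):
    """True if `abbreviation` is a prefix of `word` and not shorter than the minimum."""
    stem = abbreviation.rstrip(".")
    minimum = shortest_abbreviation(word).rstrip(".")
    return word.lower().startswith(stem.lower()) and len(stem) >= len(minimum)


print(shortest_abbreviation("Abbreviation"))
print(shortest_abbreviation("Consonant"))
print(is_allowed("abbrev.", "Abbreviation"), is_allowed("ab.", "Abbreviation"))
```

Lexicalized abbreviations that drop vowels would fail this check and need their own table entries.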

There are country specific abbreviations: Ges.m. beschränkt. Haft. / Handelsmij./ Stnrs. / R.P. and RR.PP.

But beware of the

Hotel Association Française

26

A closer look: Family names

• Prefixes
• Names consisting of several parts
• Names with a foreign language attribute
• Diacritic symbols

27

Prefixes

• In NL separation of prefix and family name is necessary for sorting purposes

• In the Human Inference databases:
  • 22,000 family names with prefix in BE
  • 15,000 family names with prefix in DE
  • 30,000 family names with prefix in NL

• Validation of names: Le Galloudec, but not Galloudec
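To illustrate why separating the prefix matters for sorting, here is a minimal Python sketch that splits a family name into prefix and main part and sorts on the main part first. The prefix list is a tiny hand-picked assumption, and `split_prefix` and `dutch_sort_key` are hypothetical names. The prefix is kept, not discarded, since it remains part of the valid name.

```python
# Assumption: a handful of prefixes for illustration; a real table is
# country-specific and far larger.
PREFIXES = ("van der", "van den", "van de", "van", "de", "den", "ter", "le")


def split_prefix(family_name):
    """Split 'van der Ploeg' into ('van der', 'Ploeg'); the prefix may be ''."""
    lowered = family_name.lower()
    # Try longer prefixes first so that 'van der' wins over 'van'.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if lowered.startswith(prefix + " "):
            return family_name[:len(prefix)], family_name[len(prefix) + 1:]
    return "", family_name


def dutch_sort_key(family_name):
    """Sort on the main part of the name, then on the prefix."""
    prefix, main = split_prefix(family_name)
    return (main.lower(), prefix.lower())


names = ["van Puffelen", "de Boer", "Oudendijk", "van der Ploeg"]
sorted_names = sorted(names, key=dutch_sort_key)
print(sorted_names)
```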

28

Names consisting of several parts

• Double-barrelled names with and without hyphen:
  Adelheid de Boer-van Buiten
  Dirk Segaert vanden Bussche
• Double-barrelled name with infix:
  Arie Gansneb genaamd Tengnagel tot den Bonckenhave
• Double-barrelled name without infix:
  Martina Galloux Wittevrouw

29

Names with a foreign language attribute

• Three categories:

Arabic: el Bahlaoui Husseini al Fharid

Chinese/Vietnamese: Cuong Buo Chan

Spanish/Portuguese: Fonseca Aranda de Pereira Rodriguez

30

Diacritic symbols

• All diacritics have to be recorded in the database.
  - Preferences in capital conversion
  - Validation of names

• Examples:
  • Büch
  • Hällström
  • Özgüleç
  • Güçlütürk
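Storing the diacritics does not rule out tolerant matching against input that lacks them. The sketch below keeps the stored spelling intact and builds a separate diacritic-free key with Python's standard `unicodedata` module; `fold_diacritics` and `lookup` are hypothetical names, and this folding strategy is an illustrative assumption, not the method used in the databases described here.

```python
import unicodedata


def fold_diacritics(name):
    """Return a diacritic-free, case-insensitive key; the stored name is untouched."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# The database keeps the full spelling, including diacritics.
stored = ["Büch", "Hällström", "Özgüleç", "Güçlütürk"]
index = {fold_diacritics(name): name for name in stored}


def lookup(raw):
    """Find the stored spelling for input with or without diacritics."""
    return index.get(fold_diacritics(raw))


print(lookup("OZGULEC"))
print(lookup("Hallstrom"))
print("Büch".upper())  # capital conversion keeps the umlaut
```

Letters that have no decomposition are left unchanged by this folding and would need explicit mapping rules.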

31

32

Interpretation of relationship data

• Different kinds of relationship data
• Different attributes
• General and country-specific rules (capitalization, abbreviation, etc.)
• Signification differs due to context
• Due to the ambiguity of relationship data, correct interpretation is no picnic

33

Different kinds of relationship data with different attributes

• Betonmortelfabriek BEMOTI Tilburg bv
• Tilburgse Betonmortelfabriek BEMOTI bv

• RegTP, Regulierungsbehörde für Telekommunikation und Post

• CQCS International Consulting

• Servicebureau Jansen/ Jansen Elektroservice
• De Boer Landbouwmachines/ De Boer Machinebouw

34

Signification can differ as a consequence of context and of the rules for abbreviation, capitalization and punctuation

• Art Gallery Wandt & Wandt
• Wandt Fachhandel für Kunstart.
• Art. Wandt Kunsthandel

• van Walbeek, M.B.A.
• Van Walbeek, MBA

35

Significations: How can they be determined?

• Does the item exist in the particular knowledge universe?

• Can the significations be resolved or deduced (acronyms and compounds)?

• If the item does not exist in the knowledge universe, what is the most probable signification, considering the context?

36

Can the item be deduced or resolved?

• NeVoBo: Nederlandse Volleybalbond
• KLM: Koninklijke Nederlandse Luchtvaartmaatschappij
• AAAA

• Maschinenfabrik Mertens
• Carburateurbinnenverlichtingsfabriek Mertens
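For the acronym case, a simple plausibility check is to split the acronym at its capital letters and test whether the pieces occur, in order, in a candidate full name. The Python sketch below does only that; `acronym_chunks` and `could_expand_to` are hypothetical names, and the in-order substring test is an assumed heuristic that accepts candidates without proving them.

```python
import re


def acronym_chunks(acronym):
    """Split 'NeVoBo' into ['ne', 'vo', 'bo'] and 'KLM' into ['k', 'l', 'm']."""
    return [chunk.lower() for chunk in re.findall(r"[A-Z][a-z]*", acronym)]


def could_expand_to(acronym, expansion):
    """True if every chunk of the acronym occurs, in order, in the expansion."""
    chunks = acronym_chunks(acronym)
    if not chunks:
        return False  # no capital letters, so nothing to anchor on
    text = expansion.lower()
    position = 0
    for chunk in chunks:
        found = text.find(chunk, position)
        if found == -1:
            return False
        position = found + len(chunk)
    return True


print(could_expand_to("NeVoBo", "Nederlandse Volleybalbond"))
print(could_expand_to("KLM", "Nederlandse Volleybalbond"))
```

An item such as AAAA gives this check almost nothing to work with, so it has to be either found in the knowledge tables or classified from its context.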

37

The item is not found in the knowledge universe

• Harry Edward Johnson
• Harry Edward Ireallygotaweirdsurname

• IBM Computing
• HAL Computing

• Hermans Groente & Fruit, A’dam
• Johnson Sarvice & Cnosult, Chelsee
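Misspelled items such as those in the last example can often still be identified by approximate matching against the knowledge tables. The sketch below uses `difflib.get_close_matches` from the Python standard library as a stand-in for whatever matching technique the real system uses; the toy `KNOWN` table, the `identify` function and the 0.8 cutoff are all assumptions.

```python
from difflib import get_close_matches

# Assumption: a toy knowledge table; entries and labels are illustrative only.
KNOWN = {
    "service": "company word",
    "consult": "company word",
    "chelsea": "geographical item",
    "johnson": "surname",
}


def identify(token, cutoff=0.8):
    """Return (known item, signification) for the closest entry, or None."""
    key = token.lower().strip(",.")
    if key in KNOWN:
        return key, KNOWN[key]
    close = get_close_matches(key, list(KNOWN), n=1, cutoff=cutoff)
    if close:
        return close[0], KNOWN[close[0]]
    return None  # not found: fall back on context to guess the signification


for token in ["Johnson", "Sarvice", "Cnosult", "Chelsee", "Ireallygotaweirdsurname"]:
    print(token, "->", identify(token))
```

When no entry is close enough, as with the invented surname, the position between given names is the remaining clue.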

38

Context

Metzgerei Theo Frankfurt
given name/surname?

Metzgerei Theo Frankfurt
given name/geographical item?

Karin Jansen – Bloemen
given name/surname/company word?

Karin Jansen – Bloemen
given name/surname – surname (maiden name)?

39

Patterns

Restaurant Die Vier Jahreszeiten

Café Het Nerveuze Schaap

Jasmijn Bloemen en Planten

Helena Catering & Imbiß

Consultingservice QCS Amsterdam

Aardappelhandel ABC Paterswolde

40

Patterns?

chr. bond v. ambtenaren

chr. bond van zomers

KARL OTTO GRAF LAMBSDORFF

EVA MARIA BARON POTOCKI

Hi-Fi Johanson & Gruber GmbH

Em-Lo Emmerich und Lohmeier GmbH

41

Multiple occurrences

An item must be stored in all its significations:

• Beh.: Behandlung, Behälter, Behörde, Behinderte

• Ond.: Onderzoek, Onderhoud, Onderneming, Onderwijs, Onderling

42

Interpretation step by step

• Read the appellation
• Divide the appellation into relevant sections and ascribe all possible significations to the sections
• Apply context and grouping rules and choose the most probable combination of significations
• Score the found items, the small context, the large context and the corrections for special cases
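The steps above can be sketched in a few lines of Python: look up every signification of each section, enumerate the combinations, and keep the one with the highest context score. The knowledge table, the pair scores and the name `interpret` are invented for illustration; real scoring would also weigh the larger context and special-case corrections.

```python
from itertools import product

# Assumption: toy knowledge table; each item keeps all of its significations.
KNOWLEDGE = {
    "metzgerei": ["company word"],
    "theo": ["given name"],
    "frankfurt": ["surname", "geographical item"],
}

# Assumption: invented context scores for adjacent signification pairs.
PAIR_SCORES = {
    ("company word", "given name"): 2,
    ("given name", "surname"): 3,
    ("given name", "geographical item"): 1,
}


def interpret(appellation):
    """Return (labelled sections, score) for the best-scoring combination."""
    tokens = appellation.split()  # read and divide into sections
    options = [KNOWLEDGE.get(token.lower(), ["unknown"]) for token in tokens]
    best, best_score = (), float("-inf")
    for combo in product(*options):  # every combination of significations
        score = sum(PAIR_SCORES.get(pair, 0) for pair in zip(combo, combo[1:]))
        if score > best_score:
            best, best_score = combo, score
    return list(zip(tokens, best)), best_score


labels, score = interpret("Metzgerei Theo Frankfurt")
print(labels, score)
```

With these invented scores the surname reading of "Frankfurt" wins; different weights, or a larger context such as an address line, could tip the result the other way.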

43

[Diagram labels: Interpretation, Signification, <WORD>, Knowledge Universe, Appearance, Context]

44

The Rolodex demo

45

46

For more information:

[email protected]