1
Interpretation and fault-tolerant identification of relationship data
Holger Wandt
Colloquium Taal en Spraak KU Nijmegen
Wednesday 3 March 2004
2
Overview
• The use of knowledge tables
  Relationship data: segmentation, storage
  Attributes
  Statistics
  Rules
  A closer look
• How do we use the knowledge and the rules in interpretation?
• The Rolodex-demo
16
Let’s summarize…
• Surnames
• Given names
• Forms of address
• Titles
• Prefixes/infixes and prepositions/articles
• Additions
• Professions
• Geographical items
• Legal forms
• Company words
• Divisions
• Company names
• Ordinals
17
Relationship data
• LCR manages and maintains 3 knowledge databases for each country:
• 1stbase
• Fambase
• DicMan
• LCR manages and maintains country specific synonym tables
18
Storage of relationship data
• Segmentation (define groups of data)
• Attributes of groups
• Attributes of particular items
• Link between items (abbreviation, plural, etc.)
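The four storage aspects above can be pictured as a small record type. This is only a sketch: the class, the field names and the grouping of "Onderzoek" under company words are assumptions made for illustration, not the layout of the actual knowledge databases.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """One entry in a knowledge table (all field names are hypothetical)."""
    text: str
    group: str        # segmentation: the group of data the item belongs to
    country: str      # "BE", "DE" or "NL"
    attributes: dict = field(default_factory=dict)  # attributes of this item
    links: dict = field(default_factory=dict)       # relation -> linked forms

# Attributes that hold for a whole group rather than for a single item
# (the attribute shown here is invented for the example).
GROUP_ATTRIBUTES = {
    "surname": {"may_carry_prefix": True},
}

# An item-level attribute: a surname flagged with a foreign language.
bahlaoui = Item("el Bahlaoui", group="surname", country="NL",
                attributes={"foreign_language": "Arabic"})

# A link between items: a word and its abbreviation.
onderzoek = Item("Onderzoek", group="company_word", country="NL",
                 links={"abbreviation": ["Ond."]})
```

Keeping links separate from attributes lets an abbreviation point at several full forms, which matters once one abbreviation has more than one signification.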
19
Statistics
                                             BE        DE        NL
Surnames                                 337410   1006097    277312
Given names                               20618     22425     25569
FoA                                         269       131       136
Titles                                      284      1739       279
Prefix/Infix & articles/prepositions        654       664       498
Additions                                   324       192       143
Professions                                 968      2792       355
Geogr. items                              12416     32248     18611
Legal forms                                 236      1835       138
Company words                             20467      8121      5920
Divisions                                   172       160        90
Company names                              1967      1504       684
Ordinals                                    421       293        71
21
Capitalization
• Belgium:
• Flemish: Karin Van der Ploeg
• Walloon: Henri de La Censerie
• Germany:
• E.v. Buskirk KG
• Verband der Chemischen Industrie e.V.
• Netherlands:
• Puffelen r.a., Victor van
• Puffelen RA, de heer Van
22
Punctuation
• Mr Theodor St.John
• mr. Olaf Oudendijk
• Martin Klaus Lehmann
• Martin, Klaus & Lehmann
• HA.DI.WE. Inh: Hans-Dieter Weber
• Don Quichotte N.V./S.A.
• Don Quichotte NV/SA
23
Epitaph
Here lies
my beloved wife Christine
In heaven she is not
in hell I know
It’s written for
everyone to be seen
24
Word break
J.P.L. den He-
yer Groepsex-
cursies
General and country specific rules:
- In NL: ma-chi-nes
- In GB: ma-chines
NEVER: mac-hines
25
Abbreviation
General rule for BE, DE and NL: No word may be abbreviated further than its first Vowel-Consonant (VC) group or its first Consonant-Vowel-Consonant (CVC) group.
Abbreviation – abbrev. – abbr.
Consonant – conson. – cons.
There are country specific abbreviations: Ges.m. beschränkt. Haft. / Handelsmij./ Stnrs. / R.P. and RR.PP.
But beware of the
Hotel Association Française
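The general VC/CVC rule on this slide can be checked mechanically. The sketch below uses one possible reading of the rule: the shortest allowed stem runs through any leading consonants, the first run of vowels and one following consonant. The vowel set (including "y") and both function names are assumptions.

```python
VOWELS = set("aeiouyäöüéèëï")  # assumed vowel inventory for BE, DE and NL

def min_abbreviation_length(word):
    """Length of the leading VC or CVC group under the reading described above."""
    w = word.lower()
    i = 0
    while i < len(w) and w[i] not in VOWELS:  # skip leading consonants
        i += 1
    while i < len(w) and w[i] in VOWELS:      # skip the first run of vowels
        i += 1
    return min(i + 1, len(w))                 # include one closing consonant

def is_allowed_abbreviation(abbr, word):
    """True if abbr is a prefix of word that keeps the first VC/CVC group."""
    stem = abbr.rstrip(".").lower()
    return (word.lower().startswith(stem)
            and len(stem) >= min_abbreviation_length(word))

print(is_allowed_abbreviation("abbr.", "Abbreviation"))  # True
print(is_allowed_abbreviation("cons.", "Consonant"))     # True
print(is_allowed_abbreviation("co.", "Consonant"))       # False
```

The check only covers abbreviations formed by cutting a word short. Country specific forms such as "Handelsmij." or "Stnrs." are not simple prefixes and would have to come from a table instead.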
26
A closer look: Family names
• Prefixes
• Names consisting of several parts
• Names with a foreign language attribute
• Diacritic symbols
27
Prefixes
• In NL separation of prefix and family name is necessary for sorting purposes
• In the Human Inference databases:
22,000 family names with prefix in BE
15,000 family names with prefix in DE
30,000 family names with prefix in NL
• Validation of names: Le Galloudec, but not Galloudec
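Separating the prefix for Dutch sorting could look roughly like this. It is a minimal sketch: the prefix list is a tiny illustrative subset and the function names are hypothetical.

```python
# Illustrative subset only; a real prefix table is far larger.
PREFIXES = ["van der", "van den", "van de", "van", "den", "der",
            "de", "le", "la", "ter", "ten"]

def split_prefix(surname):
    """Return (prefix, main name); the longest matching prefix wins."""
    lowered = surname.lower()
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if lowered.startswith(prefix + " "):
            return surname[:len(prefix)], surname[len(prefix) + 1:]
    return "", surname

def nl_sort_key(surname):
    """Sort on the main name first, then on the prefix."""
    prefix, main = split_prefix(surname)
    return (main.lower(), prefix.lower())

names = ["van der Ploeg", "de Boer", "Oudendijk", "den Heyer"]
sorted_names = sorted(names, key=nl_sort_key)
print(sorted_names)
# ['de Boer', 'den Heyer', 'Oudendijk', 'van der Ploeg']
```

Splitting for sorting is a separate concern from validation: a name such as Le Galloudec would still be validated as a whole, since Galloudec on its own is not a known name.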
28
Names consisting of several parts
• Double-barrelled names with and without hyphen:
Adelheid de Boer-van Buiten
Dirk Segaert vanden Bussche
• Double-barrelled name with infix:
Arie Gansneb genaamd Tengnagel tot den Bonckenhave
• Double-barrelled name without infix:
Martina Galloux Wittevrouw
29
Names with a foreign language attribute
• Three categories:
Arabic: el Bahlaoui Husseini al Fharid
Chinese/Vietnamese: Cuong Buo Chan
Spanish/Portuguese: Fonseca Aranda de Pereira Rodriguez
30
Diacritic symbols
• All diacritics have to be recorded in the database.
Preferences in capital conversion
Validation of names
• Examples:
• Büch
• Hällström
• Özgüleç
• Güçlütürk
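One way to keep every diacritic in the database and still identify input that arrives without them is to store the original spelling and compare on a folded key. The sketch below uses Python's standard unicodedata module; the helper names and the idea of a separate match key are assumptions, not a description of the actual product.

```python
import unicodedata

def fold_diacritics(name):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_key(name):
    """Case-insensitive, diacritic-insensitive comparison key."""
    return fold_diacritics(name).casefold()

# The stored form keeps all diacritics; only the key is folded.
stored = ["Büch", "Hällström", "Özgüleç", "Güçlütürk"]
index = {match_key(n): n for n in stored}

def lookup(raw):
    """Return the stored spelling for a raw input, or None."""
    return index.get(match_key(raw))

print(lookup("OZGULEC"))    # Özgüleç
print(lookup("Hallstrom"))  # Hällström
```

This folding maps "ü" to "u" only. Transcriptions such as "Buech" for "Büch" would need an extra, language specific rule.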
32
Interpretation of relationship data
• Different kinds of relationship data
• Different attributes
• General and country specific rules (capitalization, abbreviation, etc.)
• Signification differs due to context
• Due to the ambiguity of relationship data, correct interpretation is no picnic
33
Different kinds of relationship data with different attributes
• Betonmortelfabriek BEMOTI Tilburg bv
• Tilburgse Betonmortelfabriek BEMOTI bv
• RegTP, Regulierungsbehörde für Telekommunikation und Post
• CQCS International Consulting
• Servicebureau Jansen/ Jansen Elektroservice
• De Boer Landbouwmachines/ De Boer Machinebouw
34
Signification can differ as a consequence of context and of the rules for abbreviation, capitalization and punctuation
• Art Gallery Wandt & Wandt
• Wandt Fachhandel für Kunstart.
• Art. Wandt Kunsthandel
• van Walbeek, M.B.A.
• Van Walbeek, MBA
35
Significations: How can they be determined?
• Does the item exist in the particular knowledge universe?
• Can the significations be resolved or deduced (acronyms and compounds)?
• If the item does not exist in the knowledge universe, what is the most probable signification, considering the context?
36
Can the item be deduced or resolved?
• NeVoBo Nederlandse Volleybalbond
• KLM Koninklijke Nederlandse Luchtvaartmaatschappij
• AAAA
• Maschinenfabrik Mertens
• Carburateurbinnenverlichtingsfabriek Mertens
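Deducing a compound that is not itself in the knowledge tables can be sketched as a dictionary-driven split. The lexicon below is a tiny made-up sample, and the single linking element "s" is an assumption. Real Dutch and German compounding needs more linking elements and a far larger word list.

```python
# Tiny illustrative lexicon; the real tables are assumed to be far larger.
COMPANY_WORDS = {"maschinen", "fabrik", "carburateur", "binnen",
                 "verlichting", "fabriek"}
LINKING = ("", "s")  # optional linking element between parts (an assumption)

def split_compound(word, lexicon=COMPANY_WORDS):
    """Return a list of known parts covering the whole word, or None."""
    w = word.lower()
    if not w:
        return []
    for end in range(len(w), 0, -1):      # prefer the longest known head
        head = w[:end]
        if head not in lexicon:
            continue
        rest = w[end:]
        for link in LINKING:
            if not rest.startswith(link):
                continue
            remainder = rest[len(link):]
            if link and not remainder:    # a linking element cannot end the word
                continue
            tail = split_compound(remainder, lexicon)
            if tail is not None:
                return [head] + tail
    return None

print(split_compound("Maschinenfabrik"))
# ['maschinen', 'fabrik']
print(split_compound("Carburateurbinnenverlichtingsfabriek"))
# ['carburateur', 'binnen', 'verlichting', 'fabriek']
print(split_compound("AAAA"))
# None
```

A word that cannot be covered by known parts, such as "AAAA", stays unresolved and has to be judged from its context instead.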
37
The item is not found in the knowledge universe
• Harry Edward Johnson
• Harry Edward Ireallygotaweirdsurname
• IBM Computing
• HAL Computing
• Hermans Groente & Fruit, A’dam
• Johnson Sarvice & Cnosult, Chelsee
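For misspelled items such as "Sarvice", "Cnosult" and "Chelsee", a fault-tolerant lookup can fall back on approximate string matching. The sketch below uses difflib from the Python standard library as a stand-in. The actual matching technique is not described on the slide, and the word lists and the cutoff of 0.8 are assumptions.

```python
import difflib

# Hypothetical samples from a company-word table and a geographical table.
KNOWN_COMPANY_WORDS = ["service", "consult", "computing", "groente", "fruit"]
KNOWN_PLACES = ["chelsea", "amsterdam", "tilburg"]

def closest(token, table, cutoff=0.8):
    """Return the best approximate match for token in table, or None."""
    matches = difflib.get_close_matches(token.lower(), table, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest("Sarvice", KNOWN_COMPANY_WORDS))  # service
print(closest("Cnosult", KNOWN_COMPANY_WORDS))  # consult
print(closest("Chelsee", KNOWN_PLACES))         # chelsea
print(closest("Ireallygotaweirdsurname", KNOWN_COMPANY_WORDS))  # None
```

An item with no close match, like the weird surname, is not corrected. Its most probable signification then has to follow from its position next to two given names.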
38
Context
Metzgerei Theo Frankfurt
given name/surname?
Metzgerei Theo Frankfurt
given name/geographical item?
Karin Jansen – Bloemen
given name/surname/company word?
Karin Jansen – Bloemen
given name/surname – surname (maiden name)?
39
Patterns
Restaurant Die Vier Jahreszeiten
Café Het Nerveuze Schaap
Jasmijn Bloemen en Planten
Helena Catering & Imbiß
Consultingservice QCS Amsterdam
Aardappelhandel ABC Paterswolde
40
Patterns?
chr. bond v. ambtenaren
chr. bond van zomers
KARL OTTO GRAF LAMBSDORFF
EVA MARIA BARON POTOCKI
Hi-Fi Johanson & Gruber GmbH
Em-Lo Emmerich und Lohmeier GmbH
41
Multiple occurrences
An item must be stored in all its significations
• Beh. Behandlung, Behälter, Behörde, Behinderte
• Ond. Onderzoek, Onderhoud, Onderneming, Onderwijs, Onderling
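Storing an item in all its significations amounts to a one-to-many mapping. In this minimal sketch the country codes attached to each abbreviation are inferred from the language of the examples, and the function name is hypothetical.

```python
# One abbreviation maps to every full form it can stand for.
SIGNIFICATIONS = {
    ("DE", "beh."): ["Behandlung", "Behälter", "Behörde", "Behinderte"],
    ("NL", "ond."): ["Onderzoek", "Onderhoud", "Onderneming",
                     "Onderwijs", "Onderling"],
}

def expansions(country, abbreviation):
    """Return all stored full forms; an empty list if none are known."""
    return SIGNIFICATIONS.get((country, abbreviation.lower()), [])

print(len(expansions("NL", "Ond.")))  # 5
print(expansions("DE", "Ond."))       # []
```

The lookup deliberately returns every candidate. Choosing among them is left to the context rules.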
42
Interpretation step by step
• Read the appellation
• Divide the appellation into relevant sections and ascribe all possible significations to the sections
• Apply context and grouping rules and choose the most probable combination of significations
• Score the found items, the small context, the large context and the corrections for special cases.
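The steps above can be sketched as a small pipeline: split the appellation, attach every possible signification to each section, then rank the combinations. Everything in the sketch is hypothetical: the mini knowledge table, the label names, and above all the pattern scores, which are arbitrary numbers chosen only to show the mechanism and say nothing about which reading is actually correct.

```python
from itertools import product

# Hypothetical mini knowledge table: token -> possible significations.
KNOWLEDGE = {
    "metzgerei": ["company_word"],
    "theo": ["given_name"],
    "frankfurt": ["surname", "geo_item"],
    "karin": ["given_name"],
    "jansen": ["surname"],
    "bloemen": ["surname", "company_word"],
}

# Hypothetical scores for whole label patterns (arbitrary values).
PATTERN_SCORES = {
    ("company_word", "given_name", "surname"): 3,
    ("company_word", "given_name", "geo_item"): 2,
    ("given_name", "surname", "company_word"): 3,
    ("given_name", "surname", "surname"): 2,
}

def interpret(appellation):
    """Return (tokens, ranked) where ranked lists (score, labels), best first."""
    # Step 1 and 2: read the appellation and divide it into sections.
    tokens = appellation.replace("–", " ").split()
    # Ascribe all possible significations to each section.
    candidates = [KNOWLEDGE.get(t.lower(), ["unknown"]) for t in tokens]
    # Step 3 and 4: score every combination and rank them.
    ranked = [(PATTERN_SCORES.get(labels, 0), labels)
              for labels in product(*candidates)]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return tokens, ranked

tokens, ranked = interpret("Metzgerei Theo Frankfurt")
for score, labels in ranked:
    print(score, list(zip(tokens, labels)))
```

The sketch scores only the pattern as a whole. Separate scores for the found items, the small context, the large context and special-case corrections, as named in the last step, would each add a term to the ranking.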