TRANSCRIPT
Learning semantic relations using very large corpora
Uwe Quasthoff
Institut für Informatik
Universität Leipzig
www.wortschatz.uni-leipzig.de
U. Quasthoff 2
Contents
Part 1: Introduction to the Wortschatz lexicon
Part 2: Collocations and the collocation measure
Part 3: Applications of collocations
Part 4: Clustering Collocations
Part 5: The Vector Space Model
Part 6: Combining Simple Methods
Part 7: Temporal Analysis: Words of the Day
Part 8: Document Similarity
Language Data

Type of data                    German (# of entries)   English (# of entries)
Word forms                      6.539.214               1.240.002
Sentences                       24.788.212              13.037.997
Grammar                         2.772.369               -
Pragmatics                      33.948                  -
Description                     135.914                 -
Morphology                      3.189.365               -
Subject areas                   1.415.752               -
Relations                       449.619                 -
Collocations (sentence-based)   31.776.212              10.106.066
Collocations (NN)               4.701.917               1.382.286
Index                           36.743.996              17.994.102
Full query for Leipzig
Count: 9967
Description: Stadt in Deutschland (über 250000 E) [city in Germany, over 250,000 inhabitants]; Stadt in Sachsen [city in Saxony]
Grammar: part of speech: proper noun
Form(s): Leipzig [9967], Leipzigs [276]; -er adjective / inhabitant of the city: Leipziger [3553]
Part of: VfB Leipzig [403], SC Leipzig [183], ..., Erste Baugesellschaft Leipzig AG [8], ...
Example: Auch Debütanten aus angrenzenden Sprachräumen, die sich über Leipzig den deutschen Buchmarkt erschließen möchten, bietet die Buchmesse ein geeignetes Forum. (Quelle: OTS-Newsticker) [The book fair also offers a suitable forum for debut authors from neighboring language regions who want to reach the German book market via Leipzig.]
Sentence collocations: Dresden (1488), Berlin (694), Halle (470), Universität (266), Sachsen (265), ..., DDR-Bürger (5), DDR-Innenminister (5), DTSB (5), ...
Left neighbors: Universität (392), Stadt (201), Reclam (102), Handelshochschule (51), Oper (50), Karl-Marx-Universität (48), Raum (36), ...
Empirical Analysis of Associations

Stimulus: Butter

Human response   # of persons     Collocation   Significance
Brot             60               Brot          51
weich            40               Käse          49
Milch            32               Zucker        29
Margarine        27               Milch         23
Käse             20               Margarine     22
Fett(e)          16               Mehl          18
gelb             14               Eier          16
Butterbrot       8                Pfund         14
Dose             6                zerlassener   13
essen            6                Fleisch       13
Collocations for Schweine
On the right side we find a collection of similar animals (all in the plural): Rinder, Hühner, Kühe, Schafe.
On the left side we find words describing the aspect of slaughtering.
Collocations for Stich
Two groups for different meanings:
• Tennis (Michael Stich, Boris Becker, etc.)
• The card game Skat, with the three players Vorhand (lead), Mittelhand, and Hinterhand.
The thin connection between Becker and Vorhand represents Becker's strong forehand.
Funny collocation sets
Identifying English words in German text: Collocations for the:
of, and, to, The, on, for, is, from, you, with, that, it, world, are, be, not, We, at, World, we, have, this, by, they, when, You, can, When, into, what, your, or, But, time, And, like, over, Breaking, only, one, but, shall, which, has, What, road, as, On, same, people, out, our, This, It, way, best, who, no, my, more, his, up, their, ...
In the same way we find dialect words. The Berlin dialect is identified using the collocations for ick:
det, nich, Ick, Det, hab, is, ne, ooch, keene, wat, weeß, uff, de, ma, nu, keen, dat, aba, och, jing, jut, Nee, meen, Jöre, een, mach, inne, watt, wa, jenuch, kieke, janze, kumm, janz, tau, Mutta, janzen, hätt, sag, wieda, kleene, ha, hör, imma, un, habense, kriejen, ejal, zwee, nischt, nee, Wetta, jedacht, hebb, heff, ...
Analysis of other languages
The above procedure was (without changes) applied to
• English
• French
• Dutch
• Sorbian
• Italian
Dutch and French
Sorbian and Italian
Part 2: Collocations and the collocation measure
The Common Birthday Problem
The common birthday problem: What is the probability p that (at least) two of n randomly chosen people share a birthday?
Modification: What is the probability p that there are k couples with the same birthday (different couples may have different birthdays), given a randomly chosen boys and b randomly chosen girls?
Reformulation as collocation problem:
Common birthday problem                      Collocation problem
Number a of boys                             Number a of sentences containing A
Number b of girls                            Number b of sentences containing B
Number n of days in a year (n = 365)         Total number n of sentences
Number k of couples with the same birthday   Number k of sentences containing both A and B
Introduction to Poisson Distribution
We can calculate the probability of multiple joint occurrences of independent events as follows:
Given two independent events observed in an experiment with probabilities p_a and p_b, respectively, the probability of their joint occurrence is p_a·p_b.
Next we repeat the experiment n times and ask for exactly k joint occurrences. Using λ = n·p_a·p_b we get the probability

  P(k) = e^(−λ) · λ^k / k!
For at least k joint occurrences we get the probability

  P(≥k) = 1 − Σ_{l=0}^{k−1} e^(−λ) · λ^l / l!

To measure the surprise for the joint occurrence of non-independent events we just calculate the probability as if they were independent. If the joint occurrence is nevertheless observed, we are surprised to see such a rare event.
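As a sketch of this quantity, the tail probability can be computed directly in a few lines (Python; `poisson_at_least` is our own name, not part of the Wortschatz system):

```python
import math

def poisson_at_least(lam, k):
    """P(at least k joint occurrences) = 1 - sum_{l=0}^{k-1} e^(-lam) lam^l / l!"""
    return 1.0 - sum(math.exp(-lam) * lam**l / math.factorial(l)
                     for l in range(k))
```

For rare events (small λ) even k = 2 joint occurrences are already a very improbable, hence surprising, outcome.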
The Collocation Measure
The collocation measure of the two words A and B is defined as the negative logarithm of the above probability, divided by log n. For λ = ab/n we get

  sig(A,B) = −log( 1 − Σ_{i=0}^{k−1} e^(−λ) · λ^i / i! ) / log n
Approximations: If (k+1)/λ > 10 (this is typically true) we get

  sig(A,B) ≈ ( λ − k·log λ + log k! ) / log n

and if, moreover, k > 10:

  sig(A,B) ≈ k·( log k − log λ − 1 ) / log n
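A minimal sketch of the measure and both approximations (Python; the function names are ours, and summing the upper tail directly is our own numerical device to avoid the cancellation that 1 − Σ suffers when the tail probability is tiny):

```python
import math

def sig(n, k, a, b):
    """Poisson collocation measure: sig(A,B) = -log P(>=k) / log n, lam = a*b/n.
    The upper tail is summed term by term starting at l = k, which stays
    accurate even when P(>=k) is far too small for 1 - sum(...) to resolve."""
    lam = a * b / n
    # leading tail term e^(-lam) * lam^k / k!, computed in log space
    t = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    tail, l = t, k
    while True:
        l += 1
        t *= lam / l          # next Poisson term
        if tail + t == tail:  # converged to machine precision
            break
        tail += t
    return -math.log(tail) / math.log(n)

def sig_approx(n, k, a, b):
    """Approximation valid for (k+1)/lam > 10."""
    lam = a * b / n
    return (lam - k * math.log(lam) + math.lgamma(k + 1)) / math.log(n)

def sig_stirling(n, k, a, b):
    """Cruder approximation, additionally requiring k > 10."""
    lam = a * b / n
    return k * (math.log(k) - math.log(lam) - 1) / math.log(n)
```

For a = b = 1000 and k = 50 in a corpus of a million sentences (λ = 0.1·1000 = 1), the first approximation agrees with the exact value to well below a percent, and the Stirling variant to a few percent.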
Comparison to log-Likelihood
Comparison of the formulae: Translating the log-likelihood formula into our notation and ignoring small terms we get

  Lgl(A,B) ≈ k·( log k − log λ )

compared to

  sig(A,B) ≈ k·( log k − log λ − 1 ) / log n
Note: This may only apply to the typical case (k+1) / λ > 10.
Comparing Results: The sources
                      IDS Cosmas I (W-PUB)      Wortschatz (German)
Corpus size           374 million               255 million
Sources               mainly newspapers         mainly newspapers
Window size           fixed (here: ±5 words)    sentence
Collocation measure   log-likelihood            Poisson distribution
Comparing Results: Collocations for Bier

Rank   IDS Cosmas I   Cosmas rating   Wortschatz (German)   Sig rating
 1     Wein           4351            trinken               1234
 2     trinken        2745            Wein                  648
 3     getrunken      1715            getrunken             478
 4     kühles         1627            Liter                 460
 5     Glas           1379            trinkt                428
 6     Liter          1318            Glas                  348
 7     Faß            1236            Schnaps               318
 8     Fass           1139            Hektoliter            300
 9     Flasche        1071            Flaschen              272
10     Hektoliter     899             gebraut               269
11     Trinkt         881             Kaffee                242
12     Flaschen       873             Sekt                  239
Properties of sig(n,k,a,b) I
Simple co-occurrence: A and B occur only once, and they occur together:

  sig(n,1,1,1) → 1

This should ensure that the minimum significance threshold is independent of the corpus size.

Independence: A and B occur statistically independently with probabilities p and q:

  sig(n, npq, np, nq) → 0

Enlarging the corpus by a factor m:

  sig(mn, mk, ma, mb) = m · sig(n, k, a, b)
This is useful for comparing corpora of different size.
Properties of sig(n,k,a,b) II
Additivity: The unification of the words B and B′ just adds the corresponding significances. For k/b ≈ k′/b′ we have

  sig(n,k,a,b) + sig(n,k′,a,b′) ≈ sig(n, k+k′, a, b+b′)

This is used for grouping words by various methods.

Maximum:

  max_B sig(A, B) ≈ a

It might be useful to know how strong a collocation is compared to the possible maximum.
Part 3: Applications of collocations
Applications
• Collocations of inflected forms or basic forms?
• A numeric measure for polysemy
• Identification of proper names and phrases
• Compound analysis
Collocations of inflected forms or basic forms?
• Collocations of basic forms will give more results because of higher frequency.
• But: Collocations of basic forms and inflected forms may differ strongly.
Example:
Collocations for As:
Karo (488), K (393), Pik (391), Treff (307), Coeur (296), D (258), Herz (190), Karte (189), Kreuz (178), As (166), Süd (145), Matchball (113), Hinterhand (110), West (110), Karo-Bube (101), Ärmel (95), Ost (94), Vorhand (86), …
Collocations for Asse:
Ivanisevic (72), Becker (63), schlug (62), Aufschlag (56), servierte (47), Sampras (40), Goran Ivanisevic (32), Spiel (25), Stich (24), gewann (24), Wolfenbüttel (23), Kroate (22), schlagen (22), Asse (21), Match (21), Satz (21), …
Application of additivity
First calculate collocations for inflected forms, then use additivity to calculate the measure for basic forms, if you want.
Example:
Collocations for Bundeskanzler: ..., betonte (46), ..., betont (21), ..., betonten (7), ...
Additivity gives significance 74 for the pair Bundeskanzler, {betonen, betont, betonte, betonten}.
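The additivity property itself can be checked numerically with the k > 10 approximation from Part 2 (Python; all frequencies and joint counts below are invented for illustration, not corpus data):

```python
import math

def sig(n, k, a, b):
    # approximation sig(A,B) ~ k (log k - log lam - 1) / log n
    lam = a * b / n
    return k * (math.log(k) - math.log(lam) - 1) / math.log(n)

n = 10_000_000           # corpus size (hypothetical)
a = 500                  # frequency of the word A (hypothetical)
# (joint count k, frequency b) for three inflected forms with similar k/b
forms = [(46, 800), (21, 400), (7, 150)]

separate = sum(sig(n, k, a, b) for k, b in forms)
combined = sig(n, sum(k for k, _ in forms), a, sum(b for _, b in forms))
# separate and combined agree closely because the ratios k/b are similar
```

With these numbers the summed significances and the significance of the unified word differ by well under one percent.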
A numeric measure for polysemy: Space
The collocations of space taken from our general language corpus of English fall mainly into three classes: The subject areas computer, real estate and outer space. The corresponding senses of space are denoted with space1, space2, and space3.
Assigning the top 30 collocations of space (disk, shuttle, square, station, NASA, feet, …) to these three senses, we get a quantitative estimate of the weight of each sense:
space1 (28.2%): disk (2629), memory (718), storage (479), program (308), RAM (307), free (300), hard (336)
space2 (53.2%): shuttle (2618), station (991), NASA (920), Space (602), launch (505), astronauts (473), Challenger (420), manned (406), NASA's (297), flight (293), Atlantis (291), Mir (335), rocket (329), orbit (326), Discovery (341), mission (385)
space3 (18.6%): square (1163), feet (822), leased (567), office (382), lessor (390)
Proper Names and Phrases
A large relative collocation measure sig_C(A) indicates that a substantial part of all occurrences of the word C appear together with A. Hence, C might be the head with respect to A.

Left word      Right word   "Head"
Alzheimersche  Krankheit    left
AQA            total        left
Anorexia       nervosa      left and right
Algighiero     Boetti       left and right
30jährige      US-Bond      right
André          Lussi        right
Compound analysis using multi-word collocations
Assume we know that Geschwindigkeitsüberschreitung has the parts Geschwindigkeit and Überschreitung. If a multi-word collocation (here: Überschreitung der Geschwindigkeit) is of some predefined form we accept this collocation as a semantic description.
Pattern       Word A     Word B          Compound
A aus B       Orgie      Farben          Farbenorgie
A der B       Bebauung   Insel           Inselbebauung
A mit B       Feld       Getreide        Getreidefeld
A in der B    Feldbau    Regenzeit       Regenzeitfeldbau
A für B       Übung      Anfänger        Anfängerübung
A für die B   Gebäude    Flugsicherung   Flugsicherungsgebäude
A von B       Anbau      Kaffee          Kaffeeanbau
A zur B       Andrang    Eröffnung       Eröffnungsandrang
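A minimal sketch of the pattern check (Python; the function and its prefix/suffix test are our own simplification — a linking element in the middle of the compound, as in Geschwindigkeit-s-überschreitung, is tolerated because only the ends of the compound are compared):

```python
import re

# prepositional/article patterns "A <p> B" that license compound = B + A
PATTERNS = ["aus", "der", "mit", "in der", "für", "für die", "von", "zur"]

def explains_compound(multiword, compound):
    """Return (A, pattern, B) if the multi-word collocation has the form
    'A <pattern> B' and the compound looks like B + A; otherwise None."""
    c = compound.lower()
    for p in PATTERNS:
        m = re.fullmatch(r"(\w+) %s (\w+)" % p, multiword)
        if m:
            a, b = m.groups()
            # prefix/suffix test: linking elements in the middle are allowed
            if c.startswith(b.lower()) and c.endswith(a.lower()):
                return (a, p, b)
    return None
```

A collocation such as Woche auf dem Tisch is rejected because "auf dem" is not one of the predefined patterns.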
Part 4: Clustering Collocations
Clustering
So far, we are able to find relations between words. They are still of unknown type. Moreover, different types are mixed.
Problem 1: How to construct sets of relations of a fixed type?
Problem 2: How to identify the type of a relation using background information?
Collocations of collocation sets
The production of collocations is now applied to sets of (next-neighbor or sentence) collocations instead of sentences.
The collection of 500,000 sentence collocations yields the following 'sentence' for Hemd:
Hemd Krawatte Hose weißes Anzug weißem Jeans trägt trug bekleidet weißen Jacke schwarze Jackett schwarzen Weste kariertes Schlips Mann
The collection of 250,000 next-neighbor collocations yields the following two 'sentences' for Hemd:
weißes weißem weißen blaues kariertes kariertem offenem aufs karierten gestreiftes letztes ...
näher bekleidet ausgezogen spannt trägt aufknöpft ausgeplündert auszieht wechseln aufgeknöpft ausziehen ...
Erklärte (declared): using sentence collocations
Sprecher (2581), werde (2302), gestern (1696), seien (1440), Wir (1187), bereit (929), wolle (839), Vorsitzende (807), Anfrage (775), Präsident (721)
Erklärte (declared): using next-neighbor (NN) collocations
sagte (137), betonte (59), sprach (55), kündigte (44), wies (37), nannte (36), warnte (27), bekräftigte (24), meinte (24),kritisierte (23)
Collocation set for Leipzig (other cities in black)
in, Dresden, Berlin, Halle, Leipzig, Leipzig, und, Universität, Sachsen, Erfurt, Chemnitz, UM, Frankfurt, Hamburg, Rostock, Magdeburg, München, Leipziger, Hannover, Messe, Zwickau, studierte, nach, aus, Stadt, Stuttgart, Jena, Düsseldorf, Nürnberg, Reclam, Messestadt, sächsischen, DDR, Kischko, am, Köln, Däbritz, Karl-Marx-Universität, In, Rische, ostdeutschen, geboren, sächsische, bewölkt, Völkerschlacht, Bredow, Taucha, VEB, Edmond, Verlag, Buchmesse, Gewandhausorchester, Städten, Strombörse, Deutschen, Institut, GmbH, Lindner, Wurzen, GV, Verbundnetz, Ampler, Frankfurt am Main, Potsdam, Reclam Verlag, Städte, Cottbus, Versandzentrum, Handelshochschule, Hinrich Lehmann-Grube, Gera, Kirchentag, Völkerschlachtdenkmal, Buchstadt, Thomanerchor, Unterhaching, Lübeck, Oper, Dessau, Meppen, Studium, MDR, Philosophie, eröffnet, wurde, Anke Huber, Jens Lehmann, Turowski, Uwe Ampler, Weimar, ostdeutsche, Hecking, IAT, Boomtown, Buchkunst, Engelmann, Freistaat, Liebers, Dortmund, Mai, Mannheim, Schwerin, neuen Bundesländern, Grischin, VNG, Wende, bei, AG, Auto Mobil International, Cindy Klemrath, Gewandhaus, Messegelände, Parteitag, Bremen, Montagsdemonstrationen, Neubrandenburg, Gustav Kiepenheuer Verlag, Karl-Marx-Stadt, Journalistik, Ostdeutschland, Thomas Liese, Essen, Heidenreich, Udo Zimmermann, Umweltforschungszentrum, DHFK, Hochschule, Mainz, Oktober, Wolfgang Engel, Deutschen Hochschule für Körperkultur, Frankfurt/Main, Heldenstadt, Trommer, Wolfsburg, EBL, Universitäten, Wien, Bautzen, ...
First Iteration for Leipzig (other cities in black)
Frankfurt, Berlin, München, Stuttgart, Köln, Dresden, Hamburg, Hannover, Düsseldorf, Bremen, Karlsruhe, Potsdam, Wien, Paris, Magdeburg, Halle, Tübingen, Bonn, Freiburg, New York, Chemnitz, Darmstadt, Augsburg, Erfurt, Mannheim, Schweiz, Ulm, Bochum, Wiesbaden, Hanau, Braunschweig, Schwerin, Münster, Frankfurt am Main, London, USA, Regensburg, Cottbus, Göttingen, Kassel, Moskau, Passau, Rostock, Straßburg, Deutschland, Konstanz, Ausland, Dortmund, Heidelberg, Mainz, Würzburg, Zürich, Aachen, Offenbach, Weimar, Gießen, Koblenz, Italien, Chicago, Mailand, Osnabrück, Prag, Rom, Saarbrücken, Wuppertal, Niederlanden, Gera, Basel, Lyon, Nürnberg, Holland, Marburg, St. Petersburg, Amerika, Genf, Kaiserslautern, Tel Aviv, Woche, September, Tiergarten, dort, eröffnet, Budapest, Essen, Jena, Jerusalem, Neubrandenburg, Athen, Frankreich, Vereinigten Staaten, Amsterdam, Baden-Württemberg, Februar, Tempelhof, Trier, Venedig, Bayreuth, England, Erlangen, Indien, Belgrad, Duisburg, Heilbronn, Kairo, Ludwigsburg, Oldenburg, Oxford, Stockholm, Washington, Großbritannien, Görlitz, Kreuzberg, Lausanne, Lübeck, Mitte, Wochenende, April, Australien, Griechenland, Singapur, Florenz, Kanada, Kiel, Kopenhagen, Madrid, Mai, Südafrika, Tegel, Türkei, soeben, Bad Homburg, Bundesrepublik, Göppingen, Heute, Hongkong, Ingolstadt, Japan, Lande, Miami, Mittwoch, Oder, Sarajewo, Afghanistan, Argentinien, Baden-Baden, Bayern, Deutschlands, Europa, Haus, Iran, Istanbul, Peking, Rußland, neu, ...
Second Iteration for Leipzig (other cities in black)
Stuttgart, München, Frankfurt, Hamburg, Hannover, Köln, Berlin, Dresden, Bremen, Darmstadt, Karlsruhe, Freiburg, Potsdam, Mannheim, Wiesbaden, Düsseldorf, Tübingen, Magdeburg, Gießen, Augsburg, Rostock, Kassel, Halle, Ulm, Hanau, Heidelberg, Ludwigsburg, Konstanz, Nürnberg, Bonn, Schwerin, Münster, Wien, Dortmund, Würzburg, Chemnitz, Passau, Göttingen, Erfurt, Mitte, Aachen, Mainz, Friedberg, Nord, Regensburg, Braunschweig, Cottbus, New York, Kreuzberg, Frankfurt am Main, Göppingen, Tiergarten, Esslingen, Ravensburg, II, Hessen, Ost, Lübeck, Charlottenburg, Böblingen, Offenbach, Oldenburg, Osnabrück, Traunstein, Paris, Bad Homburg, London, Prenzlauer Berg, Neukölln, Tempelhof, Hellersdorf, Koblenz, Essen, Fulda, Trier, Lüneburg, Prag, Chicago, Landshut, Reinickendorf, USA, Wilmersdorf, Kiel, Bochum, Deutschland, Mittelfranken, Schöneberg, Marzahn, Oberbayern, Eimsbüttel, Niederrhein, Unterfranken, Wuppertal, Friedrichshain, Spandau, Oberfranken, Lichtenberg, Moskau, Oberpfalz, Bielefeld, Schweiz, Kaiserslautern, Kempten, Bayreuth, Schwaben, Zürich, Bamberg, Ingolstadt, Mailand, Oder, Heilbronn, Altona, Sarajewo, Marburg, Ansbach, Harburg, Berlin-Mitte, Jena, Steglitz, Suhl, Görlitz, Baden-Württemberg, Hessen-Süd, dort, Italien, Weimar, West, Saarbrücken, Ausland, Bayern, Ostwestfalen-Lippe, Moabit, Offenburg, Main, Polen, Amsterdam, Westliches, Mittlerer, eröffnet, ...
Part 5: The Vector Space Model
Feature Vectors Given by Collocations
• If two words A and B have similar contexts, that is, they are alike in their use, this indicates that there is a semantic relation between A and B of some kind.
• A kind of average context for every word A is formed by all collocations for A above a certain significance threshold.
• This average context of A is transferred into a feature vector of A of dimension n (the total number of words) as usual. The feature vector of word A is a description of the meaning of A, because the most important words of the contexts of A are included.
• Clustering of feature vectors can be used to investigate the relations between a group of similar words and to figure out whether or not all the relations are of the same kind.
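A sketch of the feature-vector construction and comparison (Python; the sparse-dict representation and all weight values are our own illustration):

```python
import math

def feature_vector(collocations, threshold):
    """Keep only collocates whose significance reaches the threshold."""
    return {w: s for w, s in collocations.items() if s >= threshold}

def cosine(u, v):
    """Cosine similarity of two sparse feature vectors (dicts word -> weight)."""
    dot = sum(s * v.get(w, 0.0) for w, s in u.items())
    norm = lambda x: math.sqrt(sum(s * s for s in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```

Two words whose significant collocates overlap (e.g. Dienstag and Montag sharing Uhr and abend) then receive a high cosine similarity and end up in the same cluster.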
Example (1): Clustering Months and Days
Jahres _____________________ Uhr, Ende, abend, vergangenen, Anfang, Jahres, Samstag, Freitag, Mitte, Sonntag
Donnerstag _ | Uhr, abend, heutigen, Nacht, teilte, Mittwoch, Freitag, worden, mitteilte, sagte
Dienstag _|_ | Uhr, abend, heutigen, teilte, Freitag, worden, kommenden, sagte, mitteilte, Nacht
Montag _ | | Uhr, abend, heutigen, Dienstag, kommenden, teilte, Freitag, worden, sagte, morgen
Mittwoch _|_|_ | Uhr, abend, heutigen, Nacht, Samstag, Freitag, Sonntag, kommenden, nachmittag
Samstag ___ | | Uhr, abend, Samstag, Nacht, Sonntag, Freitag, Montag, nachmittag, heutigen
Sonntag _ | | | Uhr, abend, Samstag, Nacht, Montag, kommenden, morgen, nachmittag, vergangenen
Freitag _|_|_|_____________ | Uhr, abend, Ende, Jahres, Samstag, Anfang, Freitag, Sonntag, heutigen, worden
Januar _________________ | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, März, Januar
August _______________ | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, Januar, März
Juli _____________ | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Samstag, August, Januar, März
März ___________ | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, Januar, März, April
Mai _________ | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, März, Januar, Mai, vergangenen
September _______ | | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
Februar _ | | | | | | | | Uhr, Januar, Jahres, Anfang, Mitte, Ende, März, November, Samstag, vergangenen
Dezember _|___ | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
November _ | | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, September, vergangenen, Dezember, Samstag
Oktober _|_ | | | | | | | | | Uhr, Ende, Jahres, Anfang, Mai, Mitte, Samstag, September, März, vergangenen
April _ | | | | | | | | | | Uhr, Ende, Jahres, Mai, Anfang, März, Mitte, Prozent, Samstag, Hauptversammlung
Juni _|_|_|_|_|_|_|_|_|_|_|_
Clustering Leaders and Verbs of Utterance
Example (2): Clustering Leaders

Präsident _________ sagte, Boris Jelzin, erklärte, stellvertretende, Bill Clinton, stellvertretender, Richter
Vorsitzender _______ | sagte, erklärte, stellvertretende, stellvertretender, Richter, Abteilung, bestätigte
Vorsitzende ___ | | sagte, erklärte, stellvertretende, Richter, bestätigte, Außenministeriums, teilte, gestern
Sprecher _ | | | sagte, erklärte, Außenministeriums, bestätigte, teilte, gestern, mitteilte, Anfrage
Sprecherin _|_|_ | | sagte, erklärte, stellvertretende, Richter, Abteilung, bestätigte, Außenministeriums, sagt
Chef _ | | | Abteilung, Instituts, sagte, sagt, stellvertretender, Professor, Staatskanzlei, Dr.
Leiter _|___|_|_|_
Example (3): Clustering Verbs of Utterance

verwies _____________ Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, gebe
mitteilte ___________ | Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, Montag
meinte _______ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
bestätigte_____ | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
betonte ___ | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden, Bonn
sagte _ | | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden
erklärte _|_|_|_|_ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, Anfrage, gebe, Interview
warnte _ | | | Präsident, Vorsitzende, SPD, eindringlich, Ministerpräsident, CDU, Außenminister, Zugleich
sprach _|_______|_|_|_
The Clustering Algorithm
The Single Link Hierarchical Agglomerative Clustering Method (HACM) works bottom up like this:
• All words are treated as (basic) items. Each item has a description (feature vector).
• In each step of the clustering algorithm the two items A and B with the most similar descriptions are searched and fitted together to create a new complex item C combining the words in A and B. Each step of the clustering algorithm reduces the number of items by one.
• The feature vector for C is constructed from the feature vectors of A and B by "identifying" the words A and B and calculating their joint collocations.
• The algorithm stops if only one item is left or if all remaining feature vectors are orthogonal. This usually results in a very natural clustering if the threshold for constructing the feature vectors is suitably chosen.
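The procedure can be sketched generically (Python; merging clusters by adding their feature vectors component-wise is our own simplification of the "identify the words and recompute their joint collocations" step described above):

```python
import math

def cosine(u, v):
    """Cosine similarity of sparse feature vectors (dicts word -> weight)."""
    dot = sum(s * v.get(w, 0.0) for w, s in u.items())
    norm = lambda x: math.sqrt(sum(s * s for s in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def hacm(items, threshold):
    """Bottom-up clustering: repeatedly merge the two most similar items;
    stop when one item is left or every pair falls below the threshold
    (near-orthogonal feature vectors)."""
    clusters = [([name], vec) for name, vec in items.items()]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][1], clusters[j][1])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        names = clusters[i][0] + clusters[j][0]
        # simplified merge: add the feature vectors component-wise
        merged = {w: clusters[i][1].get(w, 0.0) + clusters[j][1].get(w, 0.0)
                  for w in set(clusters[i][1]) | set(clusters[j][1])}
        clusters.pop(j)          # j > i, so pop j first
        clusters[i] = (names, merged)
    return clusters
```

With toy vectors for two weekdays and two months, the weekdays merge with each other and the months merge with each other, while the two resulting clusters stay apart.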
Part 6: Combining Simple Methods
Combining Non-contradictory Partial Results
These combinations give more and/or better results.
Identical Results
Two or more of the above algorithms may suggest a certain relation between two words, for instance, cohyponymy.
Example: If both the second order collocations and clustering by feature vectors independently yield similar sets of words as a result, this may be taken as an indication of cohyponymy between the words, e. g. sagte, betonte, kündigte, wies, nannte, warnte, bekräftigte, meinte [...] (German verbs of utterance).
Types of Relations
Symmetric relations: A relation r is called symmetric if r(A, B) always implies r(B, A). Examples of symmetric relations are
– synonymy,
– cohyponymy (or similarity),
– elements of a certain subject area, and
– relations of unknown type.
Usually, sentence collocations express symmetric relations.

Anti-symmetric relations: Let us call a relation r anti-symmetric if r(A, B) never implies r(B, A). Examples of anti-symmetric relations are hyponymy and relations between a property and its owner, like action and actor, or class and instance.

Usually, next-neighbor collocations of two words express anti-symmetric relations. In the case of next-neighbor collocations consisting of more than two words (like A prep/det/conj B, e.g. Samson and Delilah), the relation might be symmetric, for instance in the case of conjunctions like and or or.

Transitivity: Transitivity of a relation means that r(A, B) and r(B, C) together always imply r(A, C). In general, a relation found experimentally will not be transitive. But there may be a part where transitivity holds.
Some of the most prominent transitive relations are cohyponymy, hyponymy, and synonymy.
Supporting Second Results
In the second combination type, a known relation given by one method of extraction is supported by an identical but unnamed second result as follows:
Result 1: There is a certain relation r between A and B.
Result 2: There is some strong (but unknown) relation between A and B (e. g. given by a collocation set)
Conclusion: Result 1 holds with more evidence.
Example
Result 2: The German compound Entschädigungsgesetz can be divided into Gesetz and Entschädigung with an unknown relation.
Result 1 is given by the four word next neighbor collocation Gesetz über die Entschädigung. Similarly Stundenkilometer is analyzed as Kilometer pro Stunde.
In these examples, result 1 is not enough because there are collocations like Woche auf dem Tisch which do not describe a meaningful semantic relation.
Combining Three Results
Result 1: There is relation r between A and B
Result 2: B is similar to B' (cohyponymy)
Result 3: There is some strong but unknown relation between A and B'
Conclusion: There is a relation r between A and B'.
Example: As result 1 we might know that Schwanz (tail) is part of Pferd (horse). Similar terms to Pferd are both Kuh (cow) and Hund (dog) (result 2). Both of them have the term Schwanz in their set of significant collocations (result 3). Hence we might correctly conjecture that both Kuh and Hund have a tail (Schwanz) as part of their body.
In contrast, Reiter (rider) is a strong collocation of Pferd and might (incorrectly) be conjectured to be another similar concept, but Reiter does not appear among the collocations of Schwanz. Hence, the absence of result 3 prevents us from making an incorrect conclusion.
Similarity Used to Infer a Strong Property
Let us call a property p important if similarity respects this property. Such a property can be inferred as follows:
Result 1: A has a certain important property p
Result 2: B is similar to A (i. e., B is a cohyponym of A)
Conclusion: B has the same property p
Example: We consider A and B as similar if they are in the set of right-neighbor collocations of Hafenstadt (port town) (result 2). If we know that Hafenstadt is a property of its typical right neighbors (result 1), we may infer this property for more than 200 cities like Split, Sidon, Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, [...].
Subject Area Inferred from Collocation Sets
Result 1: A, B, C, ... are collocates of a certain term.
Result 2: Some of them belong to a certain subject area.
Conclusion: All of them belong to this subject area.
Example: Consider the following top entries in the collocation set of carcinoma: patients, cell, squamous, radiotherapy, lung, thyroid, treated, hepatocellular, metastases, adenocarcinoma, cervix, irradiation, breast, treatment, CT, therapy, renal, cases, bladder, cervical, tumor, cancer, metastatic, radiation, uterine, ovarian, chemotherapy, [...]
If we know that some of them belong to the subject area Medicine, we can add this subject area to the other members of the collocation set as well.
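A sketch of this inference (Python; the dictionary of already-labelled words stands in for the real subject-area relation, and taking a majority vote among labelled collocates is our own choice):

```python
def propagate_subject_areas(collocates, subject_area):
    """If some collocates already carry a subject area, assign the
    majority area to the remaining, unlabelled collocates as well."""
    known = [subject_area[w] for w in collocates if w in subject_area]
    if not known:
        return {}
    area = max(set(known), key=known.count)   # majority area
    return {w: area for w in collocates if w not in subject_area}
```

With radiotherapy and chemotherapy labelled as Medicine, the remaining collocates of carcinoma inherit that subject area.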
Part 7: Temporal Analysis: Words of the Day
Overview

Input: today's news text
• ca. 20,000 sentences
• relative size compared to the large corpus: factor 1000
• number of sentence collocations: ca. 100,000
• number of next-neighbor collocations: ca. 7,000
• number of next-neighbor collocations with both words capitalized: ca. 300
• size on Sundays and Mondays: ca. 50% compared to weekdays
Problem: Find important terms.
• Total of 100-150 words
• Frequency data available:
  – total frequency today
  – relative frequency compared to our large corpus
  – total frequency in our large corpus
• Morphosyntactic criteria:
  – words and multiwords should be capitalized
  – no inflected forms
Frequency Measures

• Total frequency today
  – A minimum frequency is needed, otherwise there are too many words (cf. Zipf's law).
  – Today: minimum frequency of 12
  – Today: maximum frequency of 100 for relevant words
• Relative frequency compared to our large corpus
  – A large factor implies importance.
  – Small variance appears by chance.
  – Threshold for importance: factor > 6. May be lowered for larger daily corpora.
• Total frequency in our large corpus
  – Words should be familiar.
  – Today: Wortschatz frequency > 20
  – What about totally new words? Today: minimum frequency of 12, as above.

Question: Which measure is closest to importance as felt by humans?
Answer: Total frequency today.
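A sketch of the resulting filter, wired up with the thresholds from this slide and the corpus-size factor of 1000 from the overview slide (Python; the function and the word frequencies used in the usage example are our own invention):

```python
def words_of_the_day(today, corpus, size_factor=1000,
                     min_today=12, max_today=100,
                     min_corpus=20, min_ratio=6):
    """today, corpus: dicts mapping word -> frequency."""
    selected = []
    for w, f in today.items():
        if not w[:1].isupper():            # words and multiwords should be capitalized
            continue
        if not (min_today <= f <= max_today):
            continue
        cf = corpus.get(w, 0)
        if cf == 0:                        # totally new word: today's frequency suffices
            selected.append(w)
        elif cf > min_corpus and f * size_factor / cf > min_ratio:
            selected.append(w)             # large relative factor implies importance
    return selected
```

A word like Buchmesse with 15 occurrences today against 800 in the large corpus (expected daily frequency 0.8, factor ≈ 19) passes, while an everyday word with a factor near 1 does not.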
Words of the Day (without human inspection)
Words of the Day (after 5 minutes of inspection)
Problem: Find the Message
We automatically find:
• Jürgen Hart is rarely mentioned.
• We notice the words gestorben (died) and tot (dead) and the phrase hörte sein Herz auf zu schlagen (his heart stopped beating).
• Conclusion: We have an obituary.
Relations between Words
Today's collocation graph: connected words represent a strong relation.
Temporal Relations
We see whether collocations repeatedly appear together during the last 30 days.
Part 8: Document Similarity
Document Similarity
The description of a document consists of all its terms that have been a Word of the Day at any time.
Hence we use only approx. 5,000 terms for indexing.
Documents are compared simply by counting their common terms, weighted by their frequencies.
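A sketch of the comparison (Python; weighting shared terms by the product of their document frequencies is one plausible reading of "weighted by their frequencies", not necessarily the exact scheme used):

```python
def doc_similarity(terms_a, terms_b):
    """terms_*: dict mapping an index term (a former Word of the Day)
    to its frequency in the document.  The similarity is the count of
    shared terms, weighted by the product of their frequencies."""
    return sum(terms_a[t] * terms_b[t] for t in terms_a.keys() & terms_b.keys())
```

Documents with no former Word of the Day in common get similarity 0.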
Sample Similar Documents

Doc no. 1   Doc no. 2     Similarity
60910293    51923690      9558.00
  Weltmeister Südkorea Michael_Ballack Oliver_Kahn Brasilien Yokohama Ballack DFB Rudi_Völler
60910293    552389133     7946.00
  Weltmeister Südkorea Dietmar_Hamann Thomas_Linke Weltmeisterschaft Carsten_Ramelow Rudi_Völler
60910293    588749685     7278.00
  Südkorea Michael_Ballack Weltmeisterschaft Ballack Elf Seoul
734389933   1313082725    11073.00
  Israel Arafat Hebron Jericho Palästinenser Terror Bush Frieden US-Präsident_George_W._Bush
734389933   1598295465    7344.00
  Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
242550748   1598295465    7344.00
  Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
242550748   734389933     12691.00
  Israel Autonomiebehörde Arafat Hebron Palästinenser Terror Bush Frieden Jassir_Arafat US-Präsident_George_W._Bush
U. Quasthoff 62
Topics of the Day
If we have sets of similar documents, we can use clustering. The terms describing a cluster can be viewed as a Topic of the Day.

The clustering algorithm:
Consider all documents (approx. 200 each day).
For each pair of similar documents, consider their set of common Words of the Day.
Next we cluster these word sets using HACM (hierarchical agglomerative clustering):
• In each step, the most similar sets are combined.
• As similarity measure we use sim(A, B) = |A ∩ B| / |B| (which part of B is contained in A?).
• If sim(A, B) > 0.4, then B is replaced by A ∪ B and A is dropped.
• The algorithm stops if there are no sets with similarity > 0.4.
U. Quasthoff 63
Clusters of 25/6/2002 (titles are made by hand)

NAHOST 1: Gaza-Streifen Arafat Außenminister_Schimon_Peres Jerusalem Terroristen Westjordanland Hebron Israel Panzer Attentäter Palästinenser Ramallah Selbstmordanschläge Israelis Ausgangssperre Autonomiebehörde Tulkarem Bethlehem Dschenin Gazastreifen Terror US-Präsident_George_W._Bush Anschlägen Nablus
FORMEL1 2: Grand_Prix Rubens_Barrichello Ferrari Ralf_Schumacher McLaren-Mercedes Coulthard Barrichello Montoya Großbritannien Schumacher Michael_Schumacher Nürburgring Großen_Preis Stallorder Weltmeister Brasilien Fußball-WM
STOLPE 3: Potsdam SPD-Generalsekretär_Franz_Müntefering Lothar_Späth Stolpe Bundestagswahlkampf Matthias_Platzeck Müntefering Schönbohm Bundesrat Platzeck Brandenburg Manfred_Stolpe Cottbus Zuwanderungsgesetz PDS Wittenberge Ministerpräsident_Manfred_Stolpe Jörg_Schönbohm Bundestagswahl Schröder
WM 4: Korea Südkorea Skibbe Seoul Oliver_Kahn Südkoreaner Michael_Ballack Koreaner Spanier Hitze Nationalmannschaft Elfmeterschießen Viertelfinale Paraguay WM-Halbfinale Miroslav_Klose Völler Jens_Jeremies Karl-Heinz_Rummenigge Klose Golden_Goal Weltmeister Türken Senegal Fußball Verlängerung Brasilien Elf Weltmeisterschaft Entschuldigung Rudi_Völler Portugal Ronaldo Rivaldo Achtelfinale Argentinien Fifa Dietmar_Hamann
PISA 5: Nordrhein-Westfalen Gymnasien Pisa-E Naturwissenschaften Brandenburg Rheinland-Pfalz Sachsen-Anhalt
BÖRSE 6: T-Aktie Allzeittief Neuen_Markt DAX France_Télécom Moody's Tarifrunde
HARTZ 7: Hartz SPD-Generalsekretär_Franz_Müntefering Bundeswirtschaftsminister_Werner_Müller Florian_Gerster Arbeitslosenzahl FDP-Chef_Guido_Westerwelle Hartz-Kommission
U. Quasthoff 64
Clusters of 26/6/2002

WM 1: Fußball Ilhan_Mansiz Golden_Goal WM-Halbfinale Senegal Türken Schröder Weltmeister Bundesinnenminister_Otto_Schily Völler Bundespräsident_Johannes_Rau Brasilien Bundeskanzler_Gerhard_Schröder Ballack Frings Neuville Bierhoff Jeremies Klose Ramelow Korea Südkorea Michael_Ballack Oliver_Kahn Beckham Weltmeisterschaft Zidane Pelé Ronaldo Rivaldo Miroslav_Klose Viertelfinale Paraguay Jens_Jeremies FC_Liverpool Seoul Christian_Ziege Spanier Sebastian_Kehl Elf Saudi-Arabien Thomas_Linke Nationalmannschaft Rudi_Völler Seo Carsten_Ramelow Christoph_Metzelder Foul WM-Finale Koreaner Südkoreaner Oliver_Bierhoff Dietmar_Hamann Yokohama Schiedsrichter Franz_Beckenbauer Portugal Guus_Hiddink DFB Oliver_Neuville Marco_Bode Gelbe_Karte Fifa Franzosen Yoo
NAHOST 2: Ariel_Scharon Israel Arafat Palästinenser Bush Nahen_Osten Autonomiebehörde Hebron Terror Frieden Jassir_Arafat US-Präsident_George_W._Bush Jericho Ramallah Israelis Scharon Weiße_Haus George_Bush Palästina Westjordanland Jerusalem Ministerpräsident_Ariel_Scharon Panzer Großbritannien Gewalt Palästinenserpräsident_Jassir_Arafat US-Regierung Anschläge Waffen Intifada
ERFURT 3: Schule Massaker Erfurt Lehrer Steinhäuser Rainer_Heise Robert_Steinhäuser
STOLPE 4: Brandenburg Bundesrat Stolpe Bundespräsident_Johannes_Rau PDS Jörg_Schönbohm Schönbohm Lothar_Späth Platzeck Matthias_Platzeck Manfred_Stolpe
FORMEL1 5: Weltmeister Rubens_Barrichello Nürburgring Großen_Preis McLaren-Mercedes Barrichello Ralf_Schumacher Ferrari Michael_Schumacher Schumacher Jean_Todt
BÖRSE 6: Moody's Neuen_Markt DAX ABN_Amro Goldman_Sachs France_Télécom
BABCOCK 7: Babcock Nordrhein-Westfalen Oberhausen Bürgschaft Babcock_Borsig IG_Metall Stellenabbau Indien
HOLZMANN 8: Philipp_Holzmann_AG Ottmar_Hermann Baukonzern Philipp_Holzmann Holzmann Insolvenz Niederländer
U. Quasthoff 65
Comparison
25.6.02:
WM
NAHOST
FORMEL1
STOLPE
BÖRSE
PISA
HARTZ
26.6.02:
WM
NAHOST
FORMEL1
STOLPE
BÖRSE
ERFURT
BABCOCK
HOLZMANN
Some topics appear repeatedly, either on consecutive days or after some interval.
Once a topic's title has been assigned by hand, the topic can be recognized automatically when it reappears later.
U. Quasthoff 66
Thank you.