TRANSCRIPT
Learning semantic relations using very large corpora
Uwe Quasthoff
Institut für Informatik
Universität Leipzig
www.wortschatz.uni-leipzig.de
U. Quasthoff 2
Contents
Part 1: Introduction to the Wortschatz lexicon
Part 2: Collocations and the collocation measure
Part 3: Applications of collocations
Part 4: Clustering Collocations
Part 5: The Vector Space Model
Part 6: Combining Simple Methods
Part 7: Temporal Analysis: Words of the Day
Part 8: Document Similarity
Language Data

Type of data                    German (# of entries)   English (# of entries)
Word forms                      6.539.214               1.240.002
Sentences                       24.788.212              13.037.997
Grammar                         2.772.369               -
Pragmatics                      33.948                  -
Description                     135.914                 -
Morphology                      3.189.365               -
Subject areas                   1.415.752               -
Relations                       449.619                 -
Collocations (sentence-based)   31.776.212              10.106.066
Collocations (NN)               4.701.917               1.382.286
Index                           36.743.996              17.994.102
Full query for Leipzig
Count: 9967
Description: Stadt in Deutschland (über 250000 E) [city in Germany, over 250,000 inhabitants]; Stadt in Sachsen [city in Saxony]
Grammar: part of speech: proper noun
Form(s): Leipzig [9967], Leipzigs [276]; -er adjective / inhabitant of the city: Leipziger [3553]
Part of: VfB Leipzig [403], SC Leipzig [183], ..., Erste Baugesellschaft Leipzig AG [8], ...
Example: Auch Debütanten aus angrenzenden Sprachräumen, die sich über Leipzig den deutschen Buchmarkt erschließen möchten, bietet die Buchmesse ein geeignetes Forum. (Quelle: OTS-Newsticker) [The book fair also offers a suitable forum for debut authors from neighboring language regions who want to reach the German book market via Leipzig.]
Sentence collocations: Dresden (1488), Berlin (694), Halle (470), Universität (266), Sachsen (265), ..., DDR-Bürger (5), DDR-Innenminister (5), DTSB (5), ...
Left neighbors: Universität (392), Stadt (201), Reclam (102), Handelshochschule (51), Oper (50), Karl-Marx-Universität (48), Raum (36), ...
Empirical Analysis of Associations

Stimulus: Butter

Human response   # of persons     Collocation   Significance
Brot             60               Brot          51
weich            40               Käse          49
Milch            32               Zucker        29
Margarine        27               Milch         23
Käse             20               Margarine     22
Fett(e)          16               Mehl          18
gelb             14               Eier          16
Butterbrot       8                Pfund         14
Dose             6                zerlassener   13
essen            6                Fleisch       13
Collocations for Schweine
On the right side we find a collection of similar animals (all in the plural): Rinder, Hühner, Kühe, Schafe.
On the left side we find words describing the aspect of slaughtering.
Collocations for Stich
Two groups for different meanings:
• Tennis (Michael Stich, Boris Becker, etc.)
• The card game Skat, with the three players Vorhand (lead), Mittelhand, and Hinterhand.
The thin connection between Becker and Vorhand represents Becker's strong forehand.
Funny collocation sets
Identifying English words in German text: Collocations for the:
of, and, to, The, on, for, is, from, you, with, that, it, world, are, be, not, We, at, World, we, have, this, by, they, when, You, can, When, into, what, your, or, But, time, And, like, over, Breaking, only, one, but, shall, which, has, What, road, as, On, same, people, out, our, This, It, way, best, who, no, my, more, his, up, their, ...
In the same way we find dialect words. The Berlin dialect is identified using the collocations for ick:
det, nich, Ick, Det, hab, is, ne, ooch, keene, wat, weeß, uff, de, ma, nu, keen, dat, aba, och, jing, jut, Nee, meen, Jöre, een, mach, inne, watt, wa, jenuch, kieke, janze, kumm, janz, tau, Mutta, janzen, hätt, sag, wieda, kleene, ha, hör, imma, un, habense, kriejen, ejal, zwee, nischt, nee, Wetta, jedacht, hebb, heff, ...
Analysis of other languages
The above procedure was (without changes) applied to
• English
• French
• Dutch
• Sorbian
• Italian
Dutch and French
Sorbian and Italian
Part 2: Collocations and the collocation measure
The Common Birthday Problem
The common birthday problem: What is the probability p that (at least) two of n randomly chosen people share a birthday?
Modification: What is the probability p that there are k couples with the same birthday (different couples may have different birthdays), given a randomly chosen boys and b randomly chosen girls?
Reformulation as collocation problem:
Common birthday problem                      Collocation problem
Number a of boys                             Number a of sentences containing A
Number b of girls                            Number b of sentences containing B
Number n of days in a year (n = 365)         Total number n of sentences
Number k of couples with the same birthday   Number k of sentences containing both A and B
Introduction to Poisson Distribution
We can calculate the probability of multiple joint occurrences of independent events as follows:
Given two independent events observed in an experiment with probabilities p_a and p_b, respectively, the probability of their joint occurrence is p_a·p_b.
Next we repeat the experiment n times and ask for exactly k joint occurrences. Using λ = n·p_a·p_b we get the probability

  P(k) = e^(−λ) · λ^k / k!
For at least k joint occurrences we get the probability

  P(≥k) = 1 − Σ_{l=0}^{k−1} e^(−λ) · λ^l / l!

To measure the surprise for the joint occurrence of non-independent events we just calculate the probability as if they were independent. If the joint occurrence is nevertheless observed, we are surprised to see such a rare event.
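As a sketch of this quantity, the tail probability can be computed directly in a few lines (Python; `poisson_at_least` is our own name, not part of the Wortschatz system):

```python
import math

def poisson_at_least(lam, k):
    """P(at least k joint occurrences) = 1 - sum_{l=0}^{k-1} e^(-lam) lam^l / l!"""
    return 1.0 - sum(math.exp(-lam) * lam**l / math.factorial(l)
                     for l in range(k))
```

For rare events (small λ) even k = 2 joint occurrences are already a very improbable, hence surprising, outcome.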
The Collocation Measure
The collocation measure of the two words A and B is defined as the negative logarithm of the above probability, divided by log n. For λ = ab/n we get

  sig(A,B) = −log( 1 − Σ_{i=0}^{k−1} e^(−λ) · λ^i / i! ) / log n
Approximations: If (k+1)/λ > 10 (this is typically true) we get

  sig(A,B) ≈ ( λ − k·log λ + log k! ) / log n

and if, moreover, k > 10:

  sig(A,B) ≈ k·( log k − log λ − 1 ) / log n
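A minimal sketch of the measure and both approximations (Python; the function names are ours, and summing the upper tail directly is our own numerical device to avoid the cancellation that 1 − Σ suffers when the tail probability is tiny):

```python
import math

def sig(n, k, a, b):
    """Poisson collocation measure: sig(A,B) = -log P(>=k) / log n, lam = a*b/n.
    The upper tail is summed term by term starting at l = k, which stays
    accurate even when P(>=k) is far too small for 1 - sum(...) to resolve."""
    lam = a * b / n
    # leading tail term e^(-lam) * lam^k / k!, computed in log space
    t = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    tail, l = t, k
    while True:
        l += 1
        t *= lam / l          # next Poisson term
        if tail + t == tail:  # converged to machine precision
            break
        tail += t
    return -math.log(tail) / math.log(n)

def sig_approx(n, k, a, b):
    """Approximation valid for (k+1)/lam > 10."""
    lam = a * b / n
    return (lam - k * math.log(lam) + math.lgamma(k + 1)) / math.log(n)

def sig_stirling(n, k, a, b):
    """Cruder approximation, additionally requiring k > 10."""
    lam = a * b / n
    return k * (math.log(k) - math.log(lam) - 1) / math.log(n)
```

For a = b = 1000 and k = 50 in a corpus of a million sentences (λ = 0.1·1000 = 1), the first approximation agrees with the exact value to well below a percent, and the Stirling variant to a few percent.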
Comparison to log-Likelihood
Comparison of the formulae: Translating the log-likelihood formula into our notation and ignoring small terms we get

  Lgl(A,B) ≈ k·( log k − log λ )

compared to

  sig(A,B) ≈ k·( log k − log λ − 1 ) / log n
Note: This may only apply to the typical case (k+1) / λ > 10.
Comparing Results: The sources
                      IDS Cosmas I (W-PUB)      Wortschatz (German)
Corpus size           374 million               255 million
Sources               mainly newspapers         mainly newspapers
Window size           fixed (here: ±5 words)    sentence
Collocation measure   log-likelihood            Poisson distribution
Comparing Results: Collocations for Bier

Rank   IDS Cosmas I   Cosmas rating   Wortschatz (German)   Sig rating
 1     Wein           4351            trinken               1234
 2     trinken        2745            Wein                  648
 3     getrunken      1715            getrunken             478
 4     kühles         1627            Liter                 460
 5     Glas           1379            trinkt                428
 6     Liter          1318            Glas                  348
 7     Faß            1236            Schnaps               318
 8     Fass           1139            Hektoliter            300
 9     Flasche        1071            Flaschen              272
10     Hektoliter     899             gebraut               269
11     Trinkt         881             Kaffee                242
12     Flaschen       873             Sekt                  239
Properties of sig(n,k,a,b) I
Simple co-occurrence: A and B occur only once, and they occur together:

  sig(n,1,1,1) → 1

This should ensure that the minimum significance threshold is independent of the corpus size.

Independence: A and B occur statistically independently with probabilities p and q:

  sig(n, npq, np, nq) → 0

Enlarging the corpus by a factor m:

  sig(mn, mk, ma, mb) = m · sig(n, k, a, b)
This is useful for comparing corpora of different size.
Properties of sig(n,k,a,b) II
Additivity: The unification of the words B and B′ just adds the corresponding significances. For k/b ≈ k′/b′ we have

  sig(n,k,a,b) + sig(n,k′,a,b′) ≈ sig(n, k+k′, a, b+b′)

This is used for grouping words by various methods.

Maximum:

  max_B sig(A, B) ≈ a

It might be useful to know how strong a collocation is compared to the possible maximum.
Part 3: Applications of collocations
Applications
• Collocations of inflected forms or basic forms?
• A numeric measure for polysemy
• Identification of proper names and phrases
• Compound analysis
Collocations of inflected forms or basic forms?
• Collocations of basic forms will give more results because of higher frequency.
• But: Collocations of basic forms and inflected forms may differ strongly.
Example:
Collocations for As:
Karo (488), K (393), Pik (391), Treff (307), Coeur (296), D (258), Herz (190), Karte (189), Kreuz (178), As (166), Süd (145), Matchball (113), Hinterhand (110), West (110), Karo-Bube (101), Ärmel (95), Ost (94), Vorhand (86), …
Collocations for Asse:
Ivanisevic (72), Becker (63), schlug (62), Aufschlag (56), servierte (47), Sampras (40), Goran Ivanisevic (32), Spiel (25), Stich (24), gewann (24), Wolfenbüttel (23), Kroate (22), schlagen (22), Asse (21), Match (21), Satz (21), …
Application of additivity
First calculate collocations for inflected forms, then use additivity to calculate the measure for basic forms, if you want.
Example:
Collocations for Bundeskanzler: ..., betonte (46), ..., betont (21), ..., betonten (7), ...
Additivity gives significance 74 for the pair Bundeskanzler, {betonen, betont, betonte, betonten}.
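The additivity property itself can be checked numerically with the k > 10 approximation from Part 2 (Python; all frequencies and joint counts below are invented for illustration, not corpus data):

```python
import math

def sig(n, k, a, b):
    # approximation sig(A,B) ~ k (log k - log lam - 1) / log n
    lam = a * b / n
    return k * (math.log(k) - math.log(lam) - 1) / math.log(n)

n = 10_000_000           # corpus size (hypothetical)
a = 500                  # frequency of the word A (hypothetical)
# (joint count k, frequency b) for three inflected forms with similar k/b
forms = [(46, 800), (21, 400), (7, 150)]

separate = sum(sig(n, k, a, b) for k, b in forms)
combined = sig(n, sum(k for k, _ in forms), a, sum(b for _, b in forms))
# separate and combined agree closely because the ratios k/b are similar
```

With these numbers the summed significances and the significance of the unified word differ by well under one percent.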
A numeric measure for polysemy: Space
The collocations of space taken from our general language corpus of English fall mainly into three classes: The subject areas computer, real estate and outer space. The corresponding senses of space are denoted with space1, space2, and space3.
Assigning the top 30 collocations of space (disk, shuttle, square, station, NASA, feet, …) to these three senses, we get a quantitative estimate of the weight of each sense:
space1 (28.2%): disk (2629), memory (718), storage (479), program (308), RAM (307), free (300), hard (336)
space2 (53.2%): shuttle (2618), station (991), NASA (920), Space (602), launch (505), astronauts (473), Challenger (420), manned (406), NASA's (297), flight (293), Atlantis (291), Mir (335), rocket (329), orbit (326), Discovery (341), mission (385)
space3 (18.6%): square (1163), feet (822), leased (567), office (382), lessor (390)
Proper Names and Phrases
A large relative collocation measure sig_C(A) indicates that a substantial part of all occurrences of the word C appear together with A. Hence, C might be the head with respect to A.

Left word      Right word   "Head"
Alzheimersche  Krankheit    left
AQA            total        left
Anorexia       nervosa      left and right
Algighiero     Boetti       left and right
30jährige      US-Bond      right
André          Lussi        right
Compound analysis using multi-word collocations
Assume we know that Geschwindigkeitsüberschreitung has the parts Geschwindigkeit and Überschreitung. If a multi-word collocation (here: Überschreitung der Geschwindigkeit) is of some predefined form we accept this collocation as a semantic description.
Pattern       Word A     Word B          Compound
A aus B       Orgie      Farben          Farbenorgie
A der B       Bebauung   Insel           Inselbebauung
A mit B       Feld       Getreide        Getreidefeld
A in der B    Feldbau    Regenzeit       Regenzeitfeldbau
A für B       Übung      Anfänger        Anfängerübung
A für die B   Gebäude    Flugsicherung   Flugsicherungsgebäude
A von B       Anbau      Kaffee          Kaffeeanbau
A zur B       Andrang    Eröffnung       Eröffnungsandrang
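A minimal sketch of the pattern check (Python; the function and its prefix/suffix test are our own simplification — a linking element in the middle of the compound, as in Geschwindigkeit-s-überschreitung, is tolerated because only the ends of the compound are compared):

```python
import re

# prepositional/article patterns "A <p> B" that license compound = B + A
PATTERNS = ["aus", "der", "mit", "in der", "für", "für die", "von", "zur"]

def explains_compound(multiword, compound):
    """Return (A, pattern, B) if the multi-word collocation has the form
    'A <pattern> B' and the compound looks like B + A; otherwise None."""
    c = compound.lower()
    for p in PATTERNS:
        m = re.fullmatch(r"(\w+) %s (\w+)" % p, multiword)
        if m:
            a, b = m.groups()
            # prefix/suffix test: linking elements in the middle are allowed
            if c.startswith(b.lower()) and c.endswith(a.lower()):
                return (a, p, b)
    return None
```

A collocation such as Woche auf dem Tisch is rejected because "auf dem" is not one of the predefined patterns.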
Part 4: Clustering Collocations
Clustering
So far, we are able to find relations between words. They are still of unknown type. Moreover, different types are mixed.
Problem 1: How to construct sets of relations of a fixed type?
Problem 2: How to identify the type of a relation using background information?
Collocations of collocation sets
The production of collocations is now applied to sets of (next-neighbor or sentence) collocations instead of sentences.
The collection of 500,000 sentence collocations yields the following 'sentence' for Hemd:
Hemd Krawatte Hose weißes Anzug weißem Jeans trägt trug bekleidet weißen Jacke schwarze Jackett schwarzen Weste kariertes Schlips Mann
The collection of 250,000 next-neighbor collocations yields the following two 'sentences' for Hemd:
weißes weißem weißen blaues kariertes kariertem offenem aufs karierten gestreiftes letztes ...
näher bekleidet ausgezogen spannt trägt aufknöpft ausgeplündert auszieht wechseln aufgeknöpft ausziehen ...
Erklärte (declared): using sentence collocations
Sprecher (2581), werde (2302), gestern (1696), seien (1440), Wir (1187), bereit (929), wolle (839), Vorsitzende (807), Anfrage (775), Präsident (721)
Erklärte (declared): using next-neighbor (NN) collocations
sagte (137), betonte (59), sprach (55), kündigte (44), wies (37), nannte (36), warnte (27), bekräftigte (24), meinte (24),kritisierte (23)
Collocation set for Leipzig (other cities in black)
in, Dresden, Berlin, Halle, Leipzig, Leipzig, und, Universität, Sachsen, Erfurt, Chemnitz, UM, Frankfurt, Hamburg, Rostock, Magdeburg, München, Leipziger, Hannover, Messe, Zwickau, studierte, nach, aus, Stadt, Stuttgart, Jena, Düsseldorf, Nürnberg, Reclam, Messestadt, sächsischen, DDR, Kischko, am, Köln, Däbritz, Karl-Marx-Universität, In, Rische, ostdeutschen, geboren, sächsische, bewölkt, Völkerschlacht, Bredow, Taucha, VEB, Edmond, Verlag, Buchmesse, Gewandhausorchester, Städten, Strombörse, Deutschen, Institut, GmbH, Lindner, Wurzen, GV, Verbundnetz, Ampler, Frankfurt am Main, Potsdam, Reclam Verlag, Städte, Cottbus, Versandzentrum, Handelshochschule, Hinrich Lehmann-Grube, Gera, Kirchentag, Völkerschlachtdenkmal, Buchstadt, Thomanerchor, Unterhaching, Lübeck, Oper, Dessau, Meppen, Studium, MDR, Philosophie, eröffnet, wurde, Anke Huber, Jens Lehmann, Turowski, Uwe Ampler, Weimar, ostdeutsche, Hecking, IAT, Boomtown, Buchkunst, Engelmann, Freistaat, Liebers, Dortmund, Mai, Mannheim, Schwerin, neuen Bundesländern, Grischin, VNG, Wende, bei, AG, Auto Mobil International, Cindy Klemrath, Gewandhaus, Messegelände, Parteitag, Bremen, Montagsdemonstrationen, Neubrandenburg, Gustav Kiepenheuer Verlag, Karl-Marx-Stadt, Journalistik, Ostdeutschland, Thomas Liese, Essen, Heidenreich, Udo Zimmermann, Umweltforschungszentrum, DHFK, Hochschule, Mainz, Oktober, Wolfgang Engel, Deutschen Hochschule für Körperkultur, Frankfurt/Main, Heldenstadt, Trommer, Wolfsburg, EBL, Universitäten, Wien, Bautzen, ...
First Iteration for Leipzig (other cities in black)
Frankfurt, Berlin, München, Stuttgart, Köln, Dresden, Hamburg, Hannover, Düsseldorf, Bremen, Karlsruhe, Potsdam, Wien, Paris, Magdeburg, Halle, Tübingen, Bonn, Freiburg, New York, Chemnitz, Darmstadt, Augsburg, Erfurt, Mannheim, Schweiz, Ulm, Bochum, Wiesbaden, Hanau, Braunschweig, Schwerin, Münster, Frankfurt am Main, London, USA, Regensburg, Cottbus, Göttingen, Kassel, Moskau, Passau, Rostock, Straßburg, Deutschland, Konstanz, Ausland, Dortmund, Heidelberg, Mainz, Würzburg, Zürich, Aachen, Offenbach, Weimar, Gießen, Koblenz, Italien, Chicago, Mailand, Osnabrück, Prag, Rom, Saarbrücken, Wuppertal, Niederlanden, Gera, Basel, Lyon, Nürnberg, Holland, Marburg, St. Petersburg, Amerika, Genf, Kaiserslautern, Tel Aviv, Woche, September, Tiergarten, dort, eröffnet, Budapest, Essen, Jena, Jerusalem, Neubrandenburg, Athen, Frankreich, Vereinigten Staaten, Amsterdam, Baden-Württemberg, Februar, Tempelhof, Trier, Venedig, Bayreuth, England, Erlangen, Indien, Belgrad, Duisburg, Heilbronn, Kairo, Ludwigsburg, Oldenburg, Oxford, Stockholm, Washington, Großbritannien, Görlitz, Kreuzberg, Lausanne, Lübeck, Mitte, Wochenende, April, Australien, Griechenland, Singapur, Florenz, Kanada, Kiel, Kopenhagen, Madrid, Mai, Südafrika, Tegel, Türkei, soeben, Bad Homburg, Bundesrepublik, Göppingen, Heute, Hongkong, Ingolstadt, Japan, Lande, Miami, Mittwoch, Oder, Sarajewo, Afghanistan, Argentinien, Baden-Baden, Bayern, Deutschlands, Europa, Haus, Iran, Istanbul, Peking, Rußland, neu, ...
Second Iteration for Leipzig (other cities in black)
Stuttgart, München, Frankfurt, Hamburg, Hannover, Köln, Berlin, Dresden, Bremen, Darmstadt, Karlsruhe, Freiburg, Potsdam, Mannheim, Wiesbaden, Düsseldorf, Tübingen, Magdeburg, Gießen, Augsburg, Rostock, Kassel, Halle, Ulm, Hanau, Heidelberg, Ludwigsburg, Konstanz, Nürnberg, Bonn, Schwerin, Münster, Wien, Dortmund, Würzburg, Chemnitz, Passau, Göttingen, Erfurt, Mitte, Aachen, Mainz, Friedberg, Nord, Regensburg, Braunschweig, Cottbus, New York, Kreuzberg, Frankfurt am Main, Göppingen, Tiergarten, Esslingen, Ravensburg, II, Hessen, Ost, Lübeck, Charlottenburg, Böblingen, Offenbach, Oldenburg, Osnabrück, Traunstein, Paris, Bad Homburg, London, Prenzlauer Berg, Neukölln, Tempelhof, Hellersdorf, Koblenz, Essen, Fulda, Trier, Lüneburg, Prag, Chicago, Landshut, Reinickendorf, USA, Wilmersdorf, Kiel, Bochum, Deutschland, Mittelfranken, Schöneberg, Marzahn, Oberbayern, Eimsbüttel, Niederrhein, Unterfranken, Wuppertal, Friedrichshain, Spandau, Oberfranken, Lichtenberg, Moskau, Oberpfalz, Bielefeld, Schweiz, Kaiserslautern, Kempten, Bayreuth, Schwaben, Zürich, Bamberg, Ingolstadt, Mailand, Oder, Heilbronn, Altona, Sarajewo, Marburg, Ansbach, Harburg, Berlin-Mitte, Jena, Steglitz, Suhl, Görlitz, Baden-Württemberg, Hessen-Süd, dort, Italien, Weimar, West, Saarbrücken, Ausland, Bayern, Ostwestfalen-Lippe, Moabit, Offenburg, Main, Polen, Amsterdam, Westliches, Mittlerer, eröffnet, ...
Part 5: The Vector Space Model
Feature Vectors Given by Collocations
• If two words A and B have similar contexts, that is, they are alike in their use, this indicates that there is a semantic relation between A and B of some kind.
• A kind of average context for every word A is formed by all collocations for A above a certain significance threshold.
• This average context of A is transferred into a feature vector of A of dimension n (the total number of words) as usual. The feature vector of word A is a description of the meaning of A, because the most important words of the contexts of A are included.
• Clustering of feature vectors can be used to investigate the relations between a group of similar words and to figure out whether or not all the relations are of the same kind.
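A sketch of the feature-vector construction and comparison (Python; the sparse-dict representation and all weight values are our own illustration):

```python
import math

def feature_vector(collocations, threshold):
    """Keep only collocates whose significance reaches the threshold."""
    return {w: s for w, s in collocations.items() if s >= threshold}

def cosine(u, v):
    """Cosine similarity of two sparse feature vectors (dicts word -> weight)."""
    dot = sum(s * v.get(w, 0.0) for w, s in u.items())
    norm = lambda x: math.sqrt(sum(s * s for s in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```

Two words whose significant collocates overlap (e.g. Dienstag and Montag sharing Uhr and abend) then receive a high cosine similarity and end up in the same cluster.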
Example (1): Clustering Months and Days
Jahres _____________________ Uhr, Ende, abend, vergangenen, Anfang, Jahres, Samstag, Freitag, Mitte, Sonntag
Donnerstag _ | Uhr, abend, heutigen, Nacht, teilte, Mittwoch, Freitag, worden, mitteilte, sagte
Dienstag _|_ | Uhr, abend, heutigen, teilte, Freitag, worden, kommenden, sagte, mitteilte, Nacht
Montag _ | | Uhr, abend, heutigen, Dienstag, kommenden, teilte, Freitag, worden, sagte, morgen
Mittwoch _|_|_ | Uhr, abend, heutigen, Nacht, Samstag, Freitag, Sonntag, kommenden, nachmittag
Samstag ___ | | Uhr, abend, Samstag, Nacht, Sonntag, Freitag, Montag, nachmittag, heutigen
Sonntag _ | | | Uhr, abend, Samstag, Nacht, Montag, kommenden, morgen, nachmittag, vergangenen
Freitag _|_|_|_____________ | Uhr, abend, Ende, Jahres, Samstag, Anfang, Freitag, Sonntag, heutigen, worden
Januar _________________ | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, März, Januar
August _______________ | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, Januar, März
Juli _____________ | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Samstag, August, Januar, März
März ___________ | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, Januar, März, April
Mai _________ | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, März, Januar, Mai, vergangenen
September _______ | | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
Februar _ | | | | | | | | Uhr, Januar, Jahres, Anfang, Mitte, Ende, März, November, Samstag, vergangenen
Dezember _|___ | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
November _ | | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, September, vergangenen, Dezember, Samstag
Oktober _|_ | | | | | | | | | Uhr, Ende, Jahres, Anfang, Mai, Mitte, Samstag, September, März, vergangenen
April _ | | | | | | | | | | Uhr, Ende, Jahres, Mai, Anfang, März, Mitte, Prozent, Samstag, Hauptversammlung
Juni _|_|_|_|_|_|_|_|_|_|_|_
Clustering Leaders and Verbs of Utterance
Example (2): Clustering Leaders

Präsident _________ sagte, Boris Jelzin, erklärte, stellvertretende, Bill Clinton, stellvertretender, Richter
Vorsitzender _______ | sagte, erklärte, stellvertretende, stellvertretender, Richter, Abteilung, bestätigte
Vorsitzende ___ | | sagte, erklärte, stellvertretende, Richter, bestätigte, Außenministeriums, teilte, gestern
Sprecher _ | | | sagte, erklärte, Außenministeriums, bestätigte, teilte, gestern, mitteilte, Anfrage
Sprecherin _|_|_ | | sagte, erklärte, stellvertretende, Richter, Abteilung, bestätigte, Außenministeriums, sagt
Chef _ | | | Abteilung, Instituts, sagte, sagt, stellvertretender, Professor, Staatskanzlei, Dr.
Leiter _|___|_|_|_
Example (3): Clustering Verbs of Utterance

verwies _____________ Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, gebe
mitteilte ___________ | Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, Montag
meinte _______ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
bestätigte_____ | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
betonte ___ | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden, Bonn
sagte _ | | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden
erklärte _|_|_|_|_ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, Anfrage, gebe, Interview
warnte _ | | | Präsident, Vorsitzende, SPD, eindringlich, Ministerpräsident, CDU, Außenminister, Zugleich
sprach _|_______|_|_|_
The Clustering Algorithm
The Single Link Hierarchical Agglomerative Clustering Method (HACM) works bottom up like this:
• All words are treated as (basic) items. Each item has a description (feature vector).
• In each step of the clustering algorithm the two items A and B with the most similar descriptions are searched and fitted together to create a new complex item C combining the words in A and B. Each step of the clustering algorithm reduces the number of items by one.
• The feature vector for C is constructed from the feature vectors of A and B by "identifying" the words A and B and calculating their joint collocations.
• The algorithm stops if only one item is left or if all remaining feature vectors are orthogonal. This usually results in a very natural clustering if the threshold for constructing the feature vectors is suitably chosen.
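The procedure can be sketched generically (Python; merging clusters by adding their feature vectors component-wise is our own simplification of the "identify the words and recompute their joint collocations" step described above):

```python
import math

def cosine(u, v):
    """Cosine similarity of sparse feature vectors (dicts word -> weight)."""
    dot = sum(s * v.get(w, 0.0) for w, s in u.items())
    norm = lambda x: math.sqrt(sum(s * s for s in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def hacm(items, threshold):
    """Bottom-up clustering: repeatedly merge the two most similar items;
    stop when one item is left or every pair falls below the threshold
    (near-orthogonal feature vectors)."""
    clusters = [([name], vec) for name, vec in items.items()]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][1], clusters[j][1])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        names = clusters[i][0] + clusters[j][0]
        # simplified merge: add the feature vectors component-wise
        merged = {w: clusters[i][1].get(w, 0.0) + clusters[j][1].get(w, 0.0)
                  for w in set(clusters[i][1]) | set(clusters[j][1])}
        clusters.pop(j)          # j > i, so pop j first
        clusters[i] = (names, merged)
    return clusters
```

With toy vectors for two weekdays and two months, the weekdays merge with each other and the months merge with each other, while the two resulting clusters stay apart.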
Part 6: Combining Simple Methods
Combining Non-contradictory Partial Results
These combinations give more and/or better results.
Identical Results
Two or more of the above algorithms may suggest a certain relation between two words, for instance, cohyponymy.
Example: If both the second order collocations and clustering by feature vectors independently yield similar sets of words as a result, this may be taken as an indication of cohyponymy between the words, e. g. sagte, betonte, kündigte, wies, nannte, warnte, bekräftigte, meinte [...] (German verbs of utterance).
Types of Relations
Symmetric relations: A relation r is called symmetric if r(A, B) always implies r(B, A). Examples of symmetric relations are
– synonymy,
– cohyponymy (or similarity),
– elements of a certain subject area, and
– relations of unknown type.
Usually, sentence collocations express symmetric relations.

Anti-symmetric relations: Let us call a relation r anti-symmetric if r(A, B) never implies r(B, A). Examples of anti-symmetric relations are hyponymy and relations between a property and its owner, like action and actor, or class and instance.

Usually, next-neighbor collocations of two words express anti-symmetric relations. In the case of next-neighbor collocations consisting of more than two words (like A prep/det/conj B, e.g. Samson and Delilah), the relation might be symmetric, for instance in the case of conjunctions like and or or.

Transitivity: Transitivity of a relation means that r(A, B) and r(B, C) together always imply r(A, C). In general, a relation found experimentally will not be transitive. But there may be a part where transitivity holds.
Some of the most prominent transitive relations are cohyponymy, hyponymy, and synonymy.
Supporting Second Results
In the second combination type, a known relation given by one method of extraction is supported by an identical but unnamed second result as follows:
Result 1: There is a certain relation r between A and B.
Result 2: There is some strong (but unknown) relation between A and B (e. g. given by a collocation set)
Conclusion: Result 1 holds with more evidence.
Example
Result 2: The German compound Entschädigungsgesetz can be divided into Gesetz and Entschädigung with an unknown relation.
Result 1 is given by the four word next neighbor collocation Gesetz über die Entschädigung. Similarly Stundenkilometer is analyzed as Kilometer pro Stunde.
In these examples, result 1 is not enough because there are collocations like Woche auf dem Tisch which do not describe a meaningful semantic relation.
Combining Three Results
Result 1: There is relation r between A and B
Result 2: B is similar to B' (cohyponymy)
Result 3: There is some strong but unknown relation between A and B'
Conclusion: There is a relation r between A and B'.
Example: As result 1 we might know that Schwanz (tail) is part of Pferd (horse). Similar terms to Pferd are both Kuh (cow) and Hund (dog) (result 2). Both of them have the term Schwanz in their set of significant collocations (result 3). Hence we might correctly conjecture that both Kuh and Hund have a tail (Schwanz) as part of their body.
In contrast, Reiter (rider) is a strong collocation of Pferd and might (incorrectly) be conjectured to be another similar concept, but Reiter does not appear among the collocations of Schwanz. Hence, the absence of result 3 prevents us from making an incorrect conclusion.
Similarity Used to Infer a Strong Property
Let us call a property p important if similarity respects this property. Such a property can be inferred as follows:
Result 1: A has a certain important property p
Result 2: B is similar to A (i. e., B is a cohyponym of A)
Conclusion: B has the same property p
Example: We consider A and B as similar if they are in the set of right-neighbor collocations of Hafenstadt (port town) (result 2). If we know that Hafenstadt is a property of its typical right neighbors (result 1), we may infer this property for more than 200 cities like Split, Sidon, Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, [...].
Subject Area Inferred from Collocation Sets
Result 1: A, B, C, ... are collocates of a certain term.
Result 2: Some of them belong to a certain subject area.
Conclusion: All of them belong to this subject area.
Example: Consider the following top entries in the collocation set of carcinoma: patients, cell, squamous, radiotherapy, lung, thyroid, treated, hepatocellular, metastases, adenocarcinoma, cervix, irradiation, breast, treatment, CT, therapy, renal, cases, bladder, cervical, tumor, cancer, metastatic, radiation, uterine, ovarian, chemotherapy, [...]
If we know that some of them belong to the subject area Medicine, we can add this subject area to the other members of the collocation set as well.
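A sketch of this inference (Python; the dictionary of already-labelled words stands in for the real subject-area relation, and taking a majority vote among labelled collocates is our own choice):

```python
def propagate_subject_areas(collocates, subject_area):
    """If some collocates already carry a subject area, assign the
    majority area to the remaining, unlabelled collocates as well."""
    known = [subject_area[w] for w in collocates if w in subject_area]
    if not known:
        return {}
    area = max(set(known), key=known.count)   # majority area
    return {w: area for w in collocates if w not in subject_area}
```

With radiotherapy and chemotherapy labelled as Medicine, the remaining collocates of carcinoma inherit that subject area.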
Part 7: Temporal Analysis: Words of the Day
Overview

Input: today's news text
• ca. 20,000 sentences
• relative size compared to the large corpus: factor 1000
• number of sentence collocations: ca. 100,000
• number of next-neighbor collocations: ca. 7,000
• number of next-neighbor collocations with both words capitalized: ca. 300
• size on Sundays and Mondays: ca. 50% compared to weekdays
Problem: Find important terms.
• Total of 100-150 words
• Frequency data available:
  – total frequency today
  – relative frequency compared to our large corpus
  – total frequency in our large corpus
• Morphosyntactic criteria:
  – words and multiwords should be capitalized
  – no inflected forms
Frequency Measures

• Total frequency today
  – A minimum frequency is needed, otherwise there are too many words (cf. Zipf's law).
  – Today: minimum frequency of 12
  – Today: maximum frequency of 100 for relevant words
• Relative frequency compared to our large corpus
  – A large factor implies importance.
  – Small variance appears by chance.
  – Threshold for importance: factor > 6. May be lowered for larger daily corpora.
• Total frequency in our large corpus
  – Words should be familiar.
  – Today: Wortschatz frequency > 20
  – What about totally new words? Today: minimum frequency of 12, as above.

Question: Which measure is closest to importance as felt by humans?
Answer: Total frequency today.
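A sketch of the resulting filter, wired up with the thresholds from this slide and the corpus-size factor of 1000 from the overview slide (Python; the function and the word frequencies used in the usage example are our own invention):

```python
def words_of_the_day(today, corpus, size_factor=1000,
                     min_today=12, max_today=100,
                     min_corpus=20, min_ratio=6):
    """today, corpus: dicts mapping word -> frequency."""
    selected = []
    for w, f in today.items():
        if not w[:1].isupper():            # words and multiwords should be capitalized
            continue
        if not (min_today <= f <= max_today):
            continue
        cf = corpus.get(w, 0)
        if cf == 0:                        # totally new word: today's frequency suffices
            selected.append(w)
        elif cf > min_corpus and f * size_factor / cf > min_ratio:
            selected.append(w)             # large relative factor implies importance
    return selected
```

A word like Buchmesse with 15 occurrences today against 800 in the large corpus (expected daily frequency 0.8, factor ≈ 19) passes, while an everyday word with a factor near 1 does not.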
Words of the Day (without human inspection)
Words of the Day (after 5 minutes of inspection)
Problem: Find the Message
We automatically find:
• Jürgen Hart is rarely mentioned.
• We notice the words gestorben (died) and tot (dead) and the phrase hörte sein Herz auf zu schlagen (his heart stopped beating).
• Conclusion: We have an obituary.
Relations between Words
Today's collocation graph: connected words represent a strong relation.
Temporal Relations
We see whether collocations repeatedly appear together during the last 30 days.
Part 8: Document Similarity
Document Similarity
The description of a document consists of all its terms that have been a Word of the Day at any time.
Hence we use only approx. 5,000 terms for indexing.
Documents are compared simply by counting their common terms, weighted by their frequencies.
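A sketch of the comparison (Python; weighting shared terms by the product of their document frequencies is one plausible reading of "weighted by their frequencies", not necessarily the exact scheme used):

```python
def doc_similarity(terms_a, terms_b):
    """terms_*: dict mapping an index term (a former Word of the Day)
    to its frequency in the document.  The similarity is the count of
    shared terms, weighted by the product of their frequencies."""
    return sum(terms_a[t] * terms_b[t] for t in terms_a.keys() & terms_b.keys())
```

Documents with no former Word of the Day in common get similarity 0.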
Sample Similar Documents

Doc no. 1   Doc no. 2     Similarity
60910293    51923690      9558.00
  Weltmeister Südkorea Michael_Ballack Oliver_Kahn Brasilien Yokohama Ballack DFB Rudi_Völler
60910293    552389133     7946.00
  Weltmeister Südkorea Dietmar_Hamann Thomas_Linke Weltmeisterschaft Carsten_Ramelow Rudi_Völler
60910293    588749685     7278.00
  Südkorea Michael_Ballack Weltmeisterschaft Ballack Elf Seoul
734389933   1313082725    11073.00
  Israel Arafat Hebron Jericho Palästinenser Terror Bush Frieden US-Präsident_George_W._Bush
734389933   1598295465    7344.00
  Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
242550748   1598295465    7344.00
  Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
242550748   734389933     12691.00
  Israel Autonomiebehörde Arafat Hebron Palästinenser Terror Bush Frieden Jassir_Arafat US-Präsident_George_W._Bush
U. Quasthoff 62
Topics of the Day
If we have sets of similar documents, we can use clustering. The terms describing a cluster can be viewed as a Topic of the Day.

The clustering algorithm:
Consider all documents (approx. 200 each day).
For each pair of similar documents, consider their set of common Words of the Day.
Next we cluster these word sets using HACM (hierarchical agglomerative clustering):
• In each step, the most similar sets are combined.
• As similarity measure we use sim(A, B) = |A ∩ B| / |B| (which part of B is contained in A?).
• If sim(A, B) > 0.4, then B is replaced by A ∪ B and A is dropped.
• The algorithm stops if there are no sets with similarity > 0.4.
U. Quasthoff 63
Clusters of 25/6/2002 (titles are made by hand)

NAHOST 1: Gaza-Streifen Arafat Außenminister_Schimon_Peres Jerusalem Terroristen Westjordanland Hebron Israel Panzer Attentäter Palästinenser Ramallah Selbstmordanschläge Israelis Ausgangssperre Autonomiebehörde Tulkarem Bethlehem Dschenin Gazastreifen Terror US-Präsident_George_W._Bush Anschlägen Nablus
FORMEL1 2: Grand_Prix Rubens_Barrichello Ferrari Ralf_Schumacher McLaren-Mercedes Coulthard Barrichello Montoya Großbritannien Schumacher Michael_Schumacher Nürburgring Großen_Preis Stallorder Weltmeister Brasilien Fußball-WM
STOLPE 3: Potsdam SPD-Generalsekretär_Franz_Müntefering Lothar_Späth Stolpe Bundestagswahlkampf Matthias_Platzeck Müntefering Schönbohm Bundesrat Platzeck Brandenburg Manfred_Stolpe Cottbus Zuwanderungsgesetz PDS Wittenberge Ministerpräsident_Manfred_Stolpe Jörg_Schönbohm Bundestagswahl Schröder
WM 4: Korea Südkorea Skibbe Seoul Oliver_Kahn Südkoreaner Michael_Ballack Koreaner Spanier Hitze Nationalmannschaft Elfmeterschießen Viertelfinale Paraguay WM-Halbfinale Miroslav_Klose Völler Jens_Jeremies Karl-Heinz_Rummenigge Klose Golden_Goal Weltmeister Türken Senegal Fußball Verlängerung Brasilien Elf Weltmeisterschaft Entschuldigung Rudi_Völler Portugal Ronaldo Rivaldo Achtelfinale Argentinien Fifa Dietmar_Hamann
PISA 5: Nordrhein-Westfalen Gymnasien Pisa-E Naturwissenschaften Brandenburg Rheinland-Pfalz Sachsen-Anhalt
BÖRSE 6: T-Aktie Allzeittief Neuen_Markt DAX France_Télécom Moody's Tarifrunde
HARTZ 7: Hartz SPD-Generalsekretär_Franz_Müntefering Bundeswirtschaftsminister_Werner_Müller Florian_Gerster Arbeitslosenzahl FDP-Chef_Guido_Westerwelle Hartz-Kommission
U. Quasthoff 64
Clusters of 26/6/2002

WM 1: Fußball Ilhan_Mansiz Golden_Goal WM-Halbfinale Senegal Türken Schröder Weltmeister Bundesinnenminister_Otto_Schily Völler Bundespräsident_Johannes_Rau Brasilien Bundeskanzler_Gerhard_Schröder Ballack Frings Neuville Bierhoff Jeremies Klose Ramelow Korea Südkorea Michael_Ballack Oliver_Kahn Beckham Weltmeisterschaft Zidane Pelé Ronaldo Rivaldo Miroslav_Klose Viertelfinale Paraguay Jens_Jeremies FC_Liverpool Seoul Christian_Ziege Spanier Sebastian_Kehl Elf Saudi-Arabien Thomas_Linke Nationalmannschaft Rudi_Völler Seo Carsten_Ramelow Christoph_Metzelder Foul WM-Finale Koreaner Südkoreaner Oliver_Bierhoff Dietmar_Hamann Yokohama Schiedsrichter Franz_Beckenbauer Portugal Guus_Hiddink DFB Oliver_Neuville Marco_Bode Gelbe_Karte Fifa Franzosen Yoo
NAHOST 2: Ariel_Scharon Israel Arafat Palästinenser Bush Nahen_Osten Autonomiebehörde Hebron Terror Frieden Jassir_Arafat US-Präsident_George_W._Bush Jericho Ramallah Israelis Scharon Weiße_Haus George_Bush Palästina Westjordanland Jerusalem Ministerpräsident_Ariel_Scharon Panzer Großbritannien Gewalt Palästinenserpräsident_Jassir_Arafat US-Regierung Anschläge Waffen Intifada
ERFURT 3: Schule Massaker Erfurt Lehrer Steinhäuser Rainer_Heise Robert_Steinhäuser
STOLPE 4: Brandenburg Bundesrat Stolpe Bundespräsident_Johannes_Rau PDS Jörg_Schönbohm Schönbohm Lothar_Späth Platzeck Matthias_Platzeck Manfred_Stolpe
FORMEL1 5: Weltmeister Rubens_Barrichello Nürburgring Großen_Preis McLaren-Mercedes Barrichello Ralf_Schumacher Ferrari Michael_Schumacher Schumacher Jean_Todt
BÖRSE 6: Moody's Neuen_Markt DAX ABN_Amro Goldman_Sachs France_Télécom
BABCOCK 7: Babcock Nordrhein-Westfalen Oberhausen Bürgschaft Babcock_Borsig IG_Metall Stellenabbau Indien
HOLZMANN 8: Philipp_Holzmann_AG Ottmar_Hermann Baukonzern Philipp_Holzmann Holzmann Insolvenz Niederländer
U. Quasthoff 65
Comparison
25.6.02:
WM
NAHOST
FORMEL1
STOLPE
BÖRSE
PISA
HARTZ
26.6.02:
WM
NAHOST
FORMEL1
STOLPE
BÖRSE
ERFURT
BABCOCK
HOLZMANN
Some topics appear repeatedly, either on consecutive days or after some interval.
Once a topic's title has been assigned by hand, the topic can be recognized automatically when it reappears later.
U. Quasthoff 66
Thank you.