using diacollo for historical research · 2019. 10. 1. · j. r. firth. papers in linguistics...

Using DIACOLLO for Historical Research

Bryan Jurish Maret NielanderBerlin-Brandenburg Academy of

Sciences and Humanities, Berlin

Georg Eckert Institute for International

Textbook Research, Braunschweig

[email protected] [email protected]

CLARIN Annual Conference 2019

Leipzig, Germany

1st October, 2019

mailto:[email protected]

mailto:[email protected]

2019-10-01 / CAC-2019 / Jurish & Nielander / 1

Overview

p

Collaborative software development

p

Corpora & collocations

p

DiaCollo: diachronic collocation profiling

p

Use case: Education policy in Die Grenzboten

p

Summary & conclusion


Software Development Cycle

?? Planning‣ identify desiderata & bugs

‣ sketch next steps

Implementation‣ coding & documentation

‣ release & deployment


Corpora & Collocations

Diachronic Text Corpora

p

heterogeneous with respect to to date of origin

p

should expose temporal effects of e.g. semantic shift, discourse trends

p

problematic for conventional NLP tools (which assume homogeneity)

Collocation Profiling (Church & Hanks 1990; Manning & Schutze 1999; Evert 2005)

“You shall know a word by the company it keeps” — J. R. Firth

p

prompt user for target collocant term(s) of interest (w1)

p

lookup all candidate collocates (w2) co-occurring with w1

p

rank candidates by association score

t score function ϕ(f1, f2, f12, N) approximates relevance of w2 to w1

t “chance” co-occurrences with high-frequency w2 should be filtered out!

t statistical method requires large data sample


Diachronic Collocation Profiling

The Problem: (temporal) heterogeneity

p

conventional collocation extractors assume corpus homogeneity

p

co-occurrence frequencies are computed only for word-pairs (w1, w2)

p

influence of occurrence date (and other document properties) is irrevocably lost

A Solution (sketch)

p

represent terms as n-tuples of independent attributes, including occurrence date

p

partition corpus on-the-fly into user-specified intervals (“date slices”, “epochs”)

p

collect independent epoch-wise profiles into final result set

Advantages Drawbacks

t full support for diachronic axis t sparse data requires larger corpora

t variable query-level granularity t computationally expensive

t flexible attribute selection t large index size

t multiple association scores t no syntactic relations (yet)


DiaCollo: Development

Planning & Evaluation

p

in collaboration with DWDS lexicographers & CLARIN-D historians

Implementation

p

Perl+PDL API, CLI, client/server

t RESTful D* web-service + GUI

p

various output & visualization formats, e.g.

t TSV, JSON , HTML, Highcharts, d3-cloud, . . .

p

batteries not included

t tokenization, annotation, full-text search, . . .

p

garbage in garbage out

t “messy” corpora unsatisfying results

Deployment

p

successfully applied to 70 distinct curated corpora at the BBAW, including:

t Royal Society Philosophical Transactions (1665–1869, 9.8K documents, 35M tokens)

t Deutsches Textarchiv (1600–1900, 3.6K documents, 205M tokens)

t DWDS Zeitungen (1946–2019, 16M documents, 6.3G tokens)

https://metacpan.org/release/DiaColloDB

http://en.wikipedia.org/wiki/Tab-separated_values

http://json.org/

http://www.w3.org/html/

http://www.highcharts.com/

http://github.com/jasondavies/d3-cloud

https://en.wikipedia.org/wiki/Garbage_in,_garbage_out

http://fedora.clarin-d.uni-saarland.de/rsc/

http://www.deutschestextarchiv.de

https://www.dwds.de/d/k-zeitung#zeitungen


DiaCollo: Scoring & Comparison Functions

Selected Score Functions

p

f raw collocation frequency = f12

p

lf collocation log-frequency = log2(f12 + ε)

p

mi pointwise MI × log-frequency ≈ log2f12×N

f1×f2× log2 f12

p

ll log-likelihood (Dunning 1993) ≈ sgn(f12|f1, f2) × log L(H0)L(H1)

p

ld log-Dice coefficient (Rychly 2008) ≈ 14 + log22×f12

f1+f2

Selected Diff Operations

p

diff raw score difference = sa − sb

p

adiff absolute score difference = |sa − sb|

p

avg arithmetic average = sa+sb

2

p

max maximum = max{sa, sb}

p

min minimum = min{sa, sb}

p

havg harmonic average ≈ 2sasb

sa+sb

Use Case: Education Policy in Die Grenzboten


‘Schule’: DiaCollo Query (DTA)


‘Schule’: DiaCollo Collocates (DTA: HTML)

1560–1569

p

association with religious institutions

t Kloster (“cloister”)

t Pfarrherr (“pastor”)

t Kirche (“church”)

http://kaskade.dwds.de/dstar/grenzboten/diacollo/?q=Schule&p=2&d=1560%3A1719&f=html#1560


‘Schule’: DiaCollo Collocates (DTA: HTML)

1560–1569 1710–1719

p

association with religious institutions

t Kloster (“cloister”)

t Pfarrherr (“pastor”)

t Kirche (“church”)

p

stronger secular associations

t Inspektor (“inspector”)

t preußisch (“prussian”)

t Universitat (“university”)

p

trend continues as time progresses




‘Schule’: DiaCollo Collocates (DTA: lemma-cloud)

1560s:

1560 1570 1580 1590 1600 1610 1620 1630 1640 1650 1660 1670 1680 1690 1700 1710

1560

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

Kirche

Schule

Kloster

partikular

Schulmeister

Pfarrherr

OrdnungFlecken

Knabe

Fleiß

1710s:

1560 1570 1580 1590 1600 1610 1620 1630 1640 1650 1660 1670 1680 1690 1700 1710

1710

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

Kirche

Schule

Jugend

Universität

Lehrer

mechanisch

preußisch

Inspektor

Besserung

Besuchung

http://kaskade.dwds.de/dstar/grenzboten/diacollo/?q=Schule&p=2&d=1560%3A1719&f=cld#1560

http://kaskade.dwds.de/dstar/grenzboten/diacollo/?q=Schule&p=2&d=1560%3A1719&f=cld#1710


Die Grenzboten Corpus

http://brema.suub.uni-bremen.de/grenzboten

http://www.deutschestextarchiv.de/doku/textquellen#grenzbotenImage: SuUB Bremen

p

Die Grenzboten (“the messengers from the border(s)”) was a bi-weekly

national-liberal German language periodical published 1841–1922

p

covered a wide range of politics, literature, and the arts throughout the

‘long’ nineteenth Century

p

270 volumes (ca. 187,000 pages) digitized, OCR’ed, and structured by the

SuUB Bremen in the context of a DFG-Project

t integrated into the corpus research infrastructure of the Deutsches Textarchiv

at the BBAW CLARIN Service Center



http://www.deutschestextarchiv.de/doku/textquellen#grenzboten

http://www.deutschestextarchiv.de/doku/textquellen#grenzboten


http://brema.suub.uni-bremen.de

http://www.suub.uni-bremen.de/ueber-uns/projekte/grenzboten/

http://www.deutschestextarchiv.de

http://clarin.bbaw.de


Are Die Grenzboten concerned with education?

Step 1: query corpus vocabulary database (LexDB)

p

identify relevant terms in the corpus, e.g. Schule (“school”), 1840–1899

t . . . in the Deutsches Textarchiv : 101.52 per million tokens

t . . . in Die Grenzboten : 237.29 per million tokens

Step 2: query DiaCollo

p

identify strong collocates for Schule (“school”)

p

identify possible debates in the corpus via query results

p

close reading in the texts via “keyword-in-context” (KWIC) hyperlinks

http://kaskade.dwds.de/dstar/grenzboten/lexdb

http://kaskade.dwds.de/dstar/grenzboten/diacollo


Education Policy & Religion

Collocate ‘Kirche’ (“church”)

p

persistently prominent throughout the entire Grenzboten corpus

p

1850s–1880s: konfessionell (“confessional”)

p

1890s–1910s: Religionsunterricht (“religious education”)

Refining the Search

p

restrict to attributive adjective collocates (groupby: l,p=ADJA)

t protestantisch (“protestant”) 1860s

t katholisch (“Catholic”) 1860s-1870s

t evangelisch (“Protestant, Evangelical”) 1860s-1870s

t konfessionell (“confessional”) 1860s-1880s

t kirchlich (“churchly”) 1870s

p

collocates related to church & religious confession peak in the 1860s–1870s

p

also prominent: offentlich (“public”; 1840s, 1870s–1900s)

t KWIC stance of publicly funded schools w.r.t. church influence in education

http://kaskade.dwds.de/dstar/grenzboten/diacollo/?q=Schule&d=1840%3A1899&ds=10&gb=l%2Cp%3DADJA


Education Policy: Kulturkampf

Kulturkampf (“cultural struggle”)

p

rights & influences of state (Prussia) vs. church (Pope Pius IX)

p

ultramontan (“ultramontane”) staunch supporters of the Catholic ChurchR

aw

Fre

quency

ultramontan

Kulturkampf

0

100

25

50

75

Date (Year)

1845 1850 1855 1860 1865 1870 1875 1880 1885 1890 1895 1900 1905 1910 1915 1920

Refining the Search: GermaNet thesaurus + paragraph search window(Hamp & Feldweg 1997; Henrich & Hinrichs 2010)

p

corpus hits show evidence for anti-Catholic opinions in debates on education

t who should be in charge of education and curricula?

t how to deal with different religious denominations in schools?

Upshot

p

some important aspects of debate are not apparent from initial naıve DiaCollo queries

p

informed curiosity & focused investigation leads to very satisfying results

http://kaskade.dwds.de/dstar/grenzboten/diacollo/?q=%7Bultramontan%2CKulturkampf%7D%3D2+%23fmin+1&ds=1&sf=f&p=ddc&f=hichart&gb=l

http://www.sfs.uni-tuebingen.de/GermaNet/

http://kaskade.dwds.de/dstar/grenzboten/?q=%2F%5Eultramontan%2Fi+%28%2F%5E%28%3F%3ASchul%28%3F%21%5Bdtz%5D%29%7CSch%C3%BCler%7CGymnasi*%29%2F+%26%3D+%24p%3DNN%29+%7C%3D+%28Bildungseinrichtung%7Cgn-sub+%21%3D+Hochschule%7Cgn-sub%29+%23in+p&fmt=html


Summary & Conclusion

Collaborative Development

p

cyclic process feedback loop

p

elusive common ground terminology, research methodology

DiaCollo

p

diachronic text corpora semantic shift, discourse trends

p

conventional tools implicit assumptions of homogeneity

p

diachronic profiling date-dependent lexemes

. . . as a tool for historical research

p

fluent “blended”/“scalable” reading distant ↔ close reading

p

digital corpora (sources) quantity, quality, legal issues

— The End —

Thank you for listening!

http://kaskade.dwds.de/˜jurish/diacollo

http://metacpan.org/release/DiaColloDB

http://kaskade.dwds.de/~jurish/diacollo

http://metacpan.org/release/DiaColloDB

References


References

K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational

Linguistics, 16(1):22–29, 1990.

S. Evert. The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, Institut fur

maschinelle Sprachverarbeitung, Universitat Stuttgart, 2005. URL

http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/.

J. R. Firth. Papers in Linguistics 1934–1951. Oxford University Press, London, 1957.

B. Hamp and H. Feldweg. GermaNet – a lexical-semantic net for German. In Proceedings of the ACL

workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP

Applications, 1997.

V. Henrich and E. Hinrichs. GernEdiT – the GermaNet editing tool. In Proceedings LREC 2010, pages

2228–2235, 2010. URL http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf.

B. Jurish. DiaCollo: On the trail of diachronic collocations. In K. De Smedt, editor, CLARIN Annual

Conference 2015 (Wroc law, Poland, October 14–16 2015), pages 28–31, 2015. URL

http://www.clarin.eu/sites/default/files/book%20of%20abstracts%202015.pdf.

B. Jurish, A. Geyken, and T. Werneke. DiaCollo: diachronen Kollokationen auf der Spur. In Proceedings

DHd 2016: Modellierung – Vernetzung – Visualisierung, pages 172–175, March 2016. URL

http://dhd2016.de/boa.pdf#page=172.

H. Kermes, S. Degaetano, A. Khamis, J. Knappen, and E. Teich. The Royal Society corpus: From uncharted

data to corpus. In Proceedings of LREC 2016, Portoroz, Slovenia, 2016. URL

http://www.lrec-conf.org/proceedings/lrec2016/summaries/792.html.

http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/

http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf

http://www.clarin.eu/sites/default/files/book%20of%20abstracts%202015.pdf

http://dhd2016.de/boa.pdf#page=172

http://www.lrec-conf.org/proceedings/lrec2016/summaries/792.html


ReferencesC. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press,

Cambridge, MA, 1999.

A. Stulpe and M. Lemke. Blended reading. In M. Lemke and G. Wiedemann, editors, Text Mining in den

Sozialwissenschaften: Grundlagen und Anwendungen zwischen qualitativer und quantitativer

Diskursanalyse, pages 17–61. Springer, 2016. doi:10.1007/978-3-658-07224-7 2.

http://dx.doi.org/10.1007/978-3-658-07224-7_2

using diacollo for historical research · 2019. 10. 1. · j. r. firth. papers in linguistics...

Documents