
Post on 13-Feb-2017


Page 1: Exploring archives with TM

Exploring archives with probabilistic models:

Topic modelling for the European Commission Archives

Simon Hengchen♣, Mathias Coeckelbergs♣, Seth van Hooland♣, Ruben Verborgh♡ & Thomas Steiner♠

♣ Université libre de Bruxelles - ReSIC

♡ Ghent University - iMinds

♠ Google Germany

{shengche;mcoeckel;svhoolan}@ulb.ac.be

hengchen.net

Page 2: Exploring archives with TM

Exploring archives with TM

- Digitisation initiatives for archives have created huge textual corpora

Page 3: Exploring archives with TM

- Those corpora are often of poor quality (OCR errors) and of unknown content (no metadata)

Page 4: Exploring archives with TM

- As such, they are of little practical use and serve only for data preservation

Page 5: Exploring archives with TM

- We postulate that it is possible to use LDA and an existing controlled vocabulary to (a) create metadata and (b) retrieve documents

Page 6: Exploring archives with TM

Topic Modelling

Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022.
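As a rough illustration of what LDA produces, the sketch below fits a tiny topic model and lists the most frequent tokens per topic. It is not the authors' implementation: it uses collapsed Gibbs sampling, a common alternative to the variational inference of the paper cited above, and the function names (`lda_gibbs`, `top_tokens`), the hyperparameter values and the four-document corpus are all assumptions made up for this example.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one token Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # z[d][i] is the topic currently assigned to the i-th token of document d
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [[0] * k for _ in docs]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Conditional probability of each topic for this token
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word


def top_tokens(topic_word, n=5):
    """Return up to n most frequent tokens for each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Made-up, already tokenised documents (illustration only)
docs = [
    "fisheries quota fishing fleet quota".split(),
    "fishing fleet fisheries catch".split(),
    "coal steel tariff steel production".split(),
    "tariff coal production steel".split(),
]
topic_word = lda_gibbs(docs, k=2)
for tokens in top_tokens(topic_word):
    print(tokens)
```

The parameter `k` is the number of topics requested; it has to be chosen before fitting, and the sampler will always return exactly `k` clusters of tokens, whether or not the corpus contains that many distinct themes.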

Page 7: Exploring archives with TM

We postulate that it is possible to use LDA and an existing controlled vocabulary to (a) create metadata and (b) retrieve documents. How?

- Use LDA to generate “representative tokens”

Page 8: Exploring archives with TM

- Deduce, by human interpretation of the tokens, the topics present in the corpus

Page 9: Exploring archives with TM

- Match the topics with EuroVoc

Page 10: Exploring archives with TM

http://eurovoc.europa.eu/852

Page 11: Exploring archives with TM

- Manually inspect the documents
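The matching step above was done by hand in this work. Purely as an illustration of how candidate terms could be suggested to an annotator, the sketch below ranks vocabulary labels by token overlap with a topic. The function `suggest_labels`, the overlap heuristic, and the three labels with their keywords are assumptions invented for this example; they are not verified EuroVoc descriptors and not the authors' method.

```python
def suggest_labels(topic_tokens, vocabulary, min_overlap=1):
    """Rank vocabulary labels by how many topic tokens they share."""
    tokens = {t.lower() for t in topic_tokens}
    scored = []
    for label, keywords in vocabulary.items():
        overlap = tokens & {kw.lower() for kw in keywords}
        if len(overlap) >= min_overlap:
            scored.append((len(overlap), label))
    # Highest overlap first; ties broken alphabetically for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored]


# Hypothetical mini-vocabulary (labels and keywords are made up)
vocabulary = {
    "fisheries policy": ["fisheries", "fishing", "fleet", "quota"],
    "iron and steel industry": ["steel", "coal", "production"],
    "customs tariff": ["tariff", "customs", "duty"],
}
topic = ["steel", "coal", "tariff", "production"]
print(suggest_labels(topic, vocabulary))
# prints ['iron and steel industry', 'customs tariff']
```

A ranked list like this would only narrow the choice; the final decision on which term describes a topic, and the inspection of the matched documents, would still rest with a human annotator.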

Page 12: Exploring archives with TM

Results:

- 100% agreement between non-expert annotators

Page 13: Exploring archives with TM

- All documents matched to a topic are correctly matched

Page 14: Exploring archives with TM

- No specific terms could be attributed to 30% of the clusters of salient tokens
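The slides do not say how the figures above were computed. A minimal sketch, assuming simple percent agreement between two annotators and a plain count of clusters left without a term, could look as follows. The helper names and the ten labels are made up for illustration and are not the study's data.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items (in %) to which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * same / len(labels_a)


def unlabelled_share(labels):
    """Share of clusters (in %) to which no term was attributed (None)."""
    return 100.0 * sum(label is None for label in labels) / len(labels)


# Made-up labels for ten clusters; None means "no specific term attributed"
annotator_1 = ["fisheries", "steel", None, "tariff", "energy",
               None, "transport", "budget", None, "trade"]
annotator_2 = list(annotator_1)

print(percent_agreement(annotator_1, annotator_2))  # prints 100.0
print(unlabelled_share(annotator_1))                # prints 30.0
```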

Page 15: Exploring archives with TM

Discussion and improvement:

- No specific terms could be attributed to 30% of the clusters of salient tokens, because:
  - OCR noise ✖
  - Too large k-parameter in LDA ✓
  - Non-expert knowledge of EU-related matters ✓

Page 16: Exploring archives with TM

Future work:

- Experiment with smaller k-parameters
- Expert annotation
- Harvesting the multilingual component
- … implementation

Page 17: Exploring archives with TM

Acknowledgments

Simon Hengchen is supported by Belgian Science Policy (BELSPO) grant n° BR/121/A3/TIC-BELGIUM.