exploring archives with tm

Exploring archives with probabilistic models:

Topic modelling for the European Commission Archives

Simon Hengchen♣, Mathias Coeckelbergs♣, Seth van Hooland♣, Ruben Verborgh♡ & Thomas Steiner♠

♣ Université libre de Bruxelles - ReSIC

♡ Ghent University - iMinds

♠ Google Germany

{shengche;mcoeckel;svhoolan}@ulb.ac.be

[email protected];[email protected]

hengchen.net

Exploring archives with TM

- Digitisation initiatives for archives have created huge textual corpora
- Those corpora are often of poor quality (OCR errors) and of unknown content (no metadata)
- As such, they are of little use beyond data preservation
- We postulate it is possible to use LDA and an existing controlled vocabulary to a/ create metadata and b/ retrieve documents

Topic Modelling

Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022.


We postulate it is possible to use LDA and an existing controlled vocabulary to a/ create metadata and b/ retrieve documents. How?

- Use LDA to generate “representative tokens”
- Intellectually deduce the topics present in the corpus
- Match the topics with EuroVoc, e.g. http://eurovoc.europa.eu/852
- Manually inspect the documents
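The first step, using LDA to generate representative tokens, can be sketched as follows. This is a minimal illustration only: it assumes a collapsed Gibbs sampler written in plain Python, whereas the slides do not say which LDA implementation or settings were used. The function names, the hyperparameters `alpha` and `beta`, the iteration count and the toy corpus are all hypothetical.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenised documents with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(k) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def representative_tokens(topic_word, n=5):
    """Return up to n most frequent tokens per topic."""
    return [
        [w for w, c in counts.most_common(n) if c > 0]
        for counts in topic_word
    ]


# Hypothetical toy corpus, already tokenised and lower-cased.
docs = [
    "milk butter farm subsidy milk farm".split(),
    "farm subsidy butter milk price".split(),
    "coal steel tariff steel coal".split(),
    "steel tariff coal import steel".split(),
]
doc_topic, topic_word = lda_gibbs(docs, k=2)
tokens = representative_tokens(topic_word)
for topic_id, words in enumerate(tokens):
    print(topic_id, words)
```

Each printed list is one cluster of tokens that an annotator would then read, name, and try to match with a EuroVoc term. The number of topics `k` is fixed by hand before fitting.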


Results:

- 100% agreement between non-expert annotators
- All documents matched to a topic are correctly matched
- No specific terms could be attributed to 30% of the clusters of salient tokens
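To make "documents matched to a topic" concrete, the sketch below shows one way labelled topics could become metadata and support retrieval. It assumes each document is matched to its dominant topic, which the slides do not specify. The document ids, topic weights and label strings are hypothetical, and the labels are illustrative rather than verified EuroVoc descriptors.

```python
def dominant_topic(topic_weights):
    """Index of the topic with the largest weight for one document."""
    return max(range(len(topic_weights)), key=lambda t: topic_weights[t])


def create_metadata(doc_topics, topic_labels):
    """Map each document id to the label of its dominant topic.

    Documents whose dominant topic has no label (None) get no metadata.
    """
    metadata = {}
    for doc_id, weights in doc_topics.items():
        label = topic_labels[dominant_topic(weights)]
        if label is not None:
            metadata[doc_id] = label
    return metadata


def retrieve(metadata, label):
    """Return the sorted ids of all documents carrying the given label."""
    return sorted(doc_id for doc_id, l in metadata.items() if l == label)


# Hypothetical per-document topic weights, e.g. from an LDA model with k=3.
doc_topics = {
    "doc-a": [0.8, 0.1, 0.1],
    "doc-b": [0.2, 0.7, 0.1],
    "doc-c": [0.1, 0.2, 0.7],
    "doc-d": [0.6, 0.3, 0.1],
}
# Hypothetical labels chosen by annotators; None marks a cluster of
# tokens to which no specific term could be attributed.
topic_labels = ["agricultural policy", "iron and steel industry", None]

metadata = create_metadata(doc_topics, topic_labels)
print(retrieve(metadata, "agricultural policy"))  # ['doc-a', 'doc-d']
```

Under this assumption, documents dominated by an unlabelled cluster stay without metadata and cannot be retrieved by term.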


Discussion and improvement:

No specific terms could be attributed to 30% of the clusters of salient tokens, because:

- OCR noise ✖
- Too large a k-parameter in LDA ✓
- Non-expert knowledge of EU-related matters ✓


Future work:

- Experiment with smaller k-parameters
- Expert annotation
- Harvesting the multilingual component
- … implementation

Acknowledgments

Simon Hengchen is supported by Belgian Science Policy (BELSPO) grant n° BR/121/A3/TIC-BELGIUM.
