Exploring archives with probabilistic models:
Topic modelling for the European Commission Archives
Simon Hengchen♣, Mathias Coeckelbergs ♣, Seth van Hooland ♣, Ruben Verborgh♡ & Thomas Steiner ♠
♣ Université libre de Bruxelles - ReSIC
♡ Ghent University - iMinds
♠ Google Germany
{shengche;mcoeckel;svhoolan}@ulb.ac.be
[email protected];[email protected]
hengchen.net
Exploring archives with TM
- Digitisation initiatives for archives have created huge textual corpora
- These corpora are often of poor quality (OCR errors) and of unknown content (no metadata)
- As such, they are of little practical use and serve only for data preservation
- We postulate it is possible to use LDA and an existing controlled vocabulary to (a) create metadata and (b) retrieve documents
Topic Modelling
Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022.
Exploring archives with TM
We postulate it is possible to use LDA and an existing controlled vocabulary to (a) create metadata and (b) retrieve documents. How?
- Use LDA to generate “representative tokens”
- Intellectually deduce the topics present in the corpus
- Match the topics with EuroVoc (example concept: http://eurovoc.europa.eu/852)
- Manually inspect the documents
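The first step above, using LDA to obtain representative tokens per topic, could be sketched as follows. This is a minimal pure-Python collapsed Gibbs sampler, not the toolkit used for the experiments (the slides do not name one). The function names `lda_gibbs` and `top_tokens`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one word Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]          # tokens of doc d assigned to topic t
    topic_word = [Counter() for _ in range(k)]   # occurrences of word w in topic t
    topic_total = [0] * k                        # tokens assigned to topic t
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(k)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word


def top_tokens(topic_word, n=5):
    """Return up to n most frequent tokens per topic (the 'representative tokens')."""
    return [
        [w for w, c in counts.most_common(n) if c > 0]
        for counts in topic_word
    ]


# Hypothetical toy corpus; real input would be tokenised OCR text.
docs = [
    "fisheries quota fishing fleet quota".split(),
    "fishing fleet fisheries catch".split(),
    "coal steel tariff steel production".split(),
    "steel coal production tariff".split(),
]
topic_word = lda_gibbs(docs, k=2)
for topic_id, tokens in enumerate(top_tokens(topic_word, n=3)):
    print(topic_id, tokens)
```

The number of topics `k` is the parameter the discussion refers to: each topic yields one cluster of tokens that an annotator then has to label, so a larger `k` means more, and finer, clusters to interpret.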
Results:
- 100% agreement between non-expert annotators
- All documents matched to a topic are correctly matched
- No specific terms could be attributed to 30% of the clusters of salient tokens
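Figures of this kind can be computed with a few lines of code. The sketch below shows one plausible way to do it, assuming each annotator assigns one vocabulary term (or `None` when no specific term fits) to each cluster of tokens. The labels are hypothetical placeholders and not the study's data, and the helper names are illustrative.

```python
def observed_agreement(labels_a, labels_b):
    """Share of items on which two annotators chose the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def unlabelled_share(labels):
    """Share of clusters for which no specific term could be attributed."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(label is None for label in labels) / len(labels)


# Hypothetical annotations for four clusters (None = no specific term found).
annotator_a = ["term A", "term B", None, "term A"]
annotator_b = ["term A", "term B", None, "term A"]

print(observed_agreement(annotator_a, annotator_b))  # 1.0
print(unlabelled_share(annotator_a))                 # 0.25
```

Raw observed agreement does not correct for chance; with more annotators or more varied labels, a chance-corrected coefficient may be the better choice.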
Discussion and improvement:
- No specific terms could be attributed to 30% of the clusters of salient tokens, because:
  - OCR noise ✖
  - Too large k-parameter in LDA ✓
  - Non-expert knowledge of EU-related matters ✓
Future work:
- Experiment with smaller k-parameters
- Expert annotation
- Harvesting the multilingual component
- … implementation
Acknowledgments
Simon Hengchen is supported by Belgian Science Policy (BELSPO) grant n° BR/121/A3/TIC-BELGIUM.