Masterarbeit
zur Erlangung des akademischen Grades
Master of Arts
der Philosophischen Fakultät der Universität Zürich
Topic Modeling and Visualisation of Diachronic Trends in Biomedical
Academic Articles
Verfasserin: Parijat Ghoshal
Matrikel-Nr: 09-716-010
Referent: Prof. Dr. Martin Volk
Betreuer: Dr. Fabio Rinaldi
Institut für Computerlinguistik
Abgabedatum: 24.06.2017
Abstract
In the biomedical domain, the sheer abundance of texts makes gaining a thematic
overview of them a challenging endeavour. This is partly because many of these
texts are unlabelled and cannot always be assigned to a single thematic domain:
some remain thematically ambiguous, and sorting them neatly into thematic
categories is impossible. It can therefore be helpful to use an unsupervised
algorithm to sort a corpus of unlabelled data into topics. In this Master’s thesis,
latent Dirichlet allocation will be used on the corpus
to automatically generate topics. Throughout the course of this work, I will create
topic models based on articles from PubMed Central’s Open Access Subset. Then I
will observe diachronic trends in them on three different levels with the help of the
topic model. On the first level, I will observe diachronic changes in the popularity
of the topics themselves. Then I will examine how the popularity of the topic words
within a topic evolves throughout the corpus. On the third level, I will observe
the popularity of common words that belong to documents about a certain topic.
Moreover, a companion website and a topic modeling pipeline are also created as
outputs of this project.
Acknowledgement
I would like to thank Dr. Fabio Rinaldi for his guidance, patience and understanding,
and my parents for their unrelenting support. Finally, I would also like to thank Jo
who put up with everything else.
Contents
Abstract i
Acknowledgement ii
Contents iii
List of Figures vii
List of Tables ix
List of Acronyms x
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Theoretical background 3
2.1 Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Precursors of Latent Dirichlet Allocation . . . . . . . . . . . . . 3
2.1.2 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . 4
2.2 Machine Learning Problem . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Issues with Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1 Categories of Bad Topics . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1.1 General and Specific Words . . . . . . . . . . . . . . . . . 6
2.3.1.2 Mixed and Chained Topics . . . . . . . . . . . . . . . . . . 6
2.3.1.3 Identical Topics . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1.4 Incomplete Stopword List . . . . . . . . . . . . . . . . . . 7
2.3.1.5 Nonsensical Topics . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Topic Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 Topic Quality Evaluation . . . . . . . . . . . . . . . . . . . . . 8
2.3.3.1 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3.2 Topic Size . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3.3 Topic Word Length . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Improving Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Automatic Topic Model Labelling . . . . . . . . . . . . . . . . . 10
2.4.1.1 Information Retrieval . . . . . . . . . . . . . . . . . . . . . 10
2.4.1.2 Neural Embeddings . . . . . . . . . . . . . . . . . . . . . . 10
2.4.2 Text Preprocessing to Acquire Meaningful Topics . . . . . . . . 11
3 Previous work 12
3.1 Biomedical Topic Modeling . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Ontology Term Mapping . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Enriching LDA Output with External Data . . . . . . . . . . . 12
3.1.3 Discover Relationships between Diseases and Genes with Topic
Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.4 Comprehensive Biomedical LDA Topics Example Source . . . . 13
3.2 Diachronic Topic Modeling . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Early Work in Modern Non-biomedical Domain . . . . . . . . . 14
3.2.2 Observe Diachronic Changes and Author Influence in Biomed-
ical Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Brief Overview of Available Tools . . . . . . . . . . . . . . . . . . . . 15
3.3.1 MAchine Learning for LanguagE Toolkit (MALLET) . . . . . . 15
3.3.2 Stanford Topic Modeling Toolbox . . . . . . . . . . . . . . . . . 15
3.3.3 Spark Machine Learning Library (MLlib) . . . . . . . . . . . . . 16
3.3.4 R Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.5 Gensim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Methodology 17
4.1 Source of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Extracting Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.1 Experiment 1: Exploring the Topics in the Corpus . . . . . . . 18
4.2.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.2 LDA Model Creation Parameters . . . . . . . . . . . . . . . . . 19
4.2.2.1 Evaluation: 10 Topics Model . . . . . . . . . . . . . . . . . 19
4.2.2.2 Evaluation: 20 Topics Model . . . . . . . . . . . . . . . . . 20
4.2.2.3 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2.4 Evaluation: 50 Topics Model . . . . . . . . . . . . . . . . . 21
4.2.2.5 Evaluation: 100 Topics Model . . . . . . . . . . . . . . . . 21
4.2.2.6 Evaluation Experiment 1: All Models . . . . . . . . . . . . 22
4.2.3 Experiment 2: Edited Corpus and Modified Model Update
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.3.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.4 Experiment 3: Online Learning with Different Batch Sizes . . . 24
4.2.4.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.5 Experiment 4: Reduced Vocabulary . . . . . . . . . . . . . . . . 26
4.2.5.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.5.2 10 Topics Model . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.5.3 20 Topics Model . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.5.4 50 Topics Model . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.5.5 100 Topics Model . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.6 Experiment 5: Influence of POS Tags . . . . . . . . . . . . . . . 29
4.2.6.1 Noun-Verb Corpus . . . . . . . . . . . . . . . . . . . . . . 30
4.2.6.2 Noun-Adjective Corpus . . . . . . . . . . . . . . . . . . . . 31
4.2.6.3 Noun-Verb-Adjective Corpus . . . . . . . . . . . . . . . . . 32
4.2.7 Experiment 6: Extracting Models with Distinct Topics Using
Topic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.8 Experiment 7: Extracting Stable Models . . . . . . . . . . . . . 35
4.2.9 Topic Labelling . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 Data and Topic Exploration 39
5.1 Document Topics Distribution . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2.1 Topic Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2.2 Average Topic Probability . . . . . . . . . . . . . . . . . . . . . 41
5.3 Observing Diachronic Trends Using Topic Models . . . . . . . . . . . 45
5.4 Topic Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4.1 Frequency of Topic Words in the Corpus . . . . . . . . . . . . . 46
5.4.2 Diachronic Shifts within a Topic . . . . . . . . . . . . . . . . . . 48
5.5 Frequency of Popular Words within a Topic . . . . . . . . . . . . . . 49
5.5.1 Diachronic Popularity of Non-topic Word Related Terms . . . . 51
6 Results and Discussion 52
6.1 Research question Nr. 1 . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 Research question Nr. 2 . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.3 Research question Nr. 3 . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7 Website 55
7.1 Generating the charts . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2 Website sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2.1 Observing diachronic trends in topics . . . . . . . . . . . . . . . 56
7.2.2 Generate frequency of topic words in the corpus . . . . . . . . . 57
7.2.3 Frequency of popular words within a topic . . . . . . . . . . . . 57
8 Diachronic topic modeling pipeline 59
8.1 Data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.1.1 Extract metadata . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.1.2 Extract text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.1.2.1 Text preprocessing . . . . . . . . . . . . . . . . . . . . . . 59
8.1.2.1.1 POS tagging of the corpus . . . . . . . . . . 60
8.1.2.2 Token filtering . . . . . . . . . . . . . . . . . . . . . . . . . 60
8.1.3 Corpus creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
8.2 LDA topic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
8.2.1 Dictionary creation . . . . . . . . . . . . . . . . . . . . . . . . . 61
8.2.2 Editing the original dictionary . . . . . . . . . . . . . . . . . . . 61
8.2.3 LDA corpus creation . . . . . . . . . . . . . . . . . . . . . . . . 62
8.2.4 LDA model creation . . . . . . . . . . . . . . . . . . . . . . . . 62
8.3 Data mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.3.1 Mapping, creating yearly average topic probability . . . . . . . 62
8.3.2 Mapping: generating relative frequencies for topic words . . . . . 63
8.3.3 Mapping: generating relative frequencies for popular words in
topic subcorpora . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.4 Other functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
9 Conclusion 65
9.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
References 67
A Tables 71
Curriculum Vitae 72
List of Figures
1.1 Plate notation representing the LDA model (from Blei et al. [2003]) . 4
4.1 Number of articles published per year from 1950-2016 in the corpus
of 150 thousand articles . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Percentage of articles from the original corpus (1.5 million articles)
per year from 1950-2016 that are in the corpus of 150 thousand articles 26
4.3 Average inter-topic similarity . . . . . . . . . . . . . . . . . . . . . . 35
5.1 Topic words in article (green: topic 34) . . . . . . . . . . . . . . 40
5.2 Topic words in article (green: topic 19, yellow: topic 28) . . . . . . . 40
5.3 Topic probability distribution of documents of topic 19 . . . . . . 41
5.4 Topic probability distribution of documents of topic 19, where topic
probability >0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.5 Topic probability distribution of documents of topics 6,10, 39, and
41, where topic probability >0.1 . . . . . . . . . . . . . . . . . . . . 42
5.6 Topic probability distribution of documents of topic 25, where topic
probability >0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.7 Average topic probability of documents from 2000 to 2015 of topics
10, 23, and 33 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.8 Average topic probability of documents from 2000 to 2015 of topics
11,12,17,21,28, and 43 . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.9 Average topic probability of documents from 2000 to 2015 of topics
2,5, and 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.10 Average topic probability of documents from 1980 to 2005 of topic 50 45
5.11 Average topic probability of documents from 2000 to 2015 of topics
13,22,34, and 38 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.12 Relative frequency of topic words for topic 13-woman-heart-pregnancy
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.13 Relative frequency of pregnancy related words from topic 13 . . . . . 48
5.14 Relative frequency of heart disease related words from topic 13 . . . . 48
5.15 Relative frequency of topic words for topic 22-infection-virus-vaccine 49
5.16 Relative frequency of immunology related words from topic 22 (group
2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.17 Relative frequency of immunology related words from topic 22 (group
3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.18 Relative frequency of words from topic 13 (top 1-5 words) . . . . . . 50
5.19 Relative frequency of words from topic 13 (top 6-10 words) . . . . . . 50
5.20 Relative frequency of words from topic 22 (top 1-5 words) . . . . . . 50
5.21 Relative frequency of words from topic 22 (top 6-9 words) . . . . . . 50
7.1 Website: Part 1 User options . . . . . . . . . . . . . . . . . . . . . . . 56
7.2 Website: Part 1 Example output for topics 2,3,4,5 . . . . . . . . . . . 56
7.3 Website: Part 2 Example output for topic 13 (topics shown partially) 57
7.4 Website: Part 3 Example output for topic 13 (top 2-5 words shown) . 58
8.1 Diachronic topic modeling pipeline . . . . . . . . . . . . . . . . . . . 60
List of Tables
4.1 10 topics generated from a corpus of 1.5 million articles . . . . . . . . 19
4.2 A selection from the 20 topics generated from a corpus of 1.5 million
articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 A selection from the 50 topics generated from a corpus of 1.5 million
articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 A selection from the 100 topics generated from a corpus of 1.5 million
articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.5 Identical topics generated from multiple topic models with different
topic sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.6 Topics from 10 topic model from noun corpus . . . . . . . . . . . . . 27
4.7 Topics from 20 topic model from noun corpus . . . . . . . . . . . . . 28
4.8 A selection from the 50 topics generated from noun corpus . . . . . . 28
4.9 Focus on breast-cancer related topics from 100 topics models from
noun corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.10 15 topics from 50 topic model from Noun-Corpus . . . . . . . . . . . 30
4.11 15 topics from 50 topic model from Noun-Verb corpus . . . . . . . . . 31
4.12 15 topics from 50 topic model from Noun-Adjective corpus . . . . . . 32
4.13 15 topics from 50 topic model from Noun-Verb-Adjective corpus . . . 32
4.14 Percentage of identical terms between the models . . . . . . . . . . 33
4.15 Number of unique words found in all the topics . . . . . . . . . . 33
4.16 Topic similarity based on the number of similar words over multiple passes 37
5.1 Topics 19, 28, 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Fictitious topic probability distribution over multiple topics and doc-
uments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 Four topics selected for data exploration . . . . . . . . . . . . . . . . 46
A.1 All 50 topics from final model . . . . . . . . . . . . . . . . . . . . . . 71
List of Acronyms
HTML HyperText Markup Language
LDA latent Dirichlet allocation
NLP Natural Language Processing
NLTK Natural Language Toolkit
OCR Optical Character Recognition
POS Part-Of-Speech
XML eXtensible Markup Language
1 Introduction
In this chapter, I present the motivation for writing this Master’s thesis and the
research questions that will be tackled in this work. Finally, I give an overview
of the work, stating the themes of the upcoming chapters.
1.1 Motivation
In the field of biomedical literature, thousands of articles are published every day.
This is by no means an exaggeration; between 2012 and 2015, approximately
800’000 articles were published annually1. Research and discoveries in the biomedical
field are primarily found in scholarly publications; however, given the
amount of literature being published, the task of finding trends in the
biomedical domain can be a challenge. Natural language processing (NLP), as a
consequence, can be of great use because these academic publications are often
published in machine-readable text formats. In their article about community
challenges in the biomedical field, Huang and Lu [2016] mention that collaborations
between biomedical and NLP researchers have become commonplace, forming the
field of research known as biomedical natural language processing (BioNLP). They
also mention that NLP methods and text mining can be used for a multitude of
tasks, such as constructing ontologies and curating databases.
For those interested in machine-learning approaches, the field of biomedical research
publication is well suited for finding patterns in large amounts of data, due to the
abundance of publications available.
PubMed offers full articles, as well as abstracts of scientific articles written in the
biomedical domain. The data provided on PubMed is quite useful for machine-
learning approaches. Firstly, the data is machine-readable, and secondly, there are
metadata including authors and date of publication in addition to the scientific
texts.
1Based on MEDLINE citation counts by year of publication. https://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html
Exploring diachronic trends in large corpora has been a particular interest of mine
for quite a while and I have worked on related projects within the constraints of
academia as well as for professional projects. Furthermore, working with biomedical
data is undeniably fascinating in its own regard. For these reasons, I decided to
embark on this project with the aim of discovering diachronic trends in a large
corpus of biomedical publications.
1.2 Research Questions
The research questions that shall be answered in this Master’s thesis are:
1. Is it possible to detect temporal trends in a corpus using topics generated from
a topic modeling algorithm?
2. Using topic modeling, can one detect diachronic changes within the words of
a given topic throughout the entire corpus?
3. Can one use topic modeling to detect diachronic changes in term/word usage
within documents that fall into a specific topic?
1.3 Thesis Structure
This Master’s thesis is structured as follows: at first I will provide a brief overview
of the theoretical background for topic modeling in Chapter 2 and explain the topic
modeling algorithm that I will use to create the models. Furthermore, I will mention
the issues of this algorithm and possible strategies for circumventing them. In
Chapter 3, I will mention the previous work that has been done in the domain of
diachronic and biomedical topic modeling. At the same time, I will give a brief
overview of the off-the-shelf available tools for topic modeling. In Chapter 4, the
entire methodology to create the topic model from the initial corpus is provided.
Then in Chapter 5, which is about the data and topic exploration, I look into the
topic model and answer the research questions. In Chapter 6, I discuss the results of
the research questions. In Chapter 7, I introduce the companion website and explain
its functionalities. In Chapter 8, I introduce the diachronic topic modeling pipeline
that has been used to create the topic model. Finally, in Chapter 9, I summarise
the findings of my Master’s thesis and discuss possibilities for future work in this
field.
2 Theoretical background
In this chapter, we look at the theoretical framework behind topic modeling. I give a
brief overview of the precursors of latent Dirichlet allocation (LDA). Then I provide
a brief theoretical overview of LDA, and the issues that one could face when using
topic models. The problems of topic models are explained in detail, as I will be
referring to them to justify my methodology.
2.1 Topic Models
One can define topic models as statistical models that are used to learn about the
latent structures that exist within a corpus of documents. These models can have
many uses; however, discovering patterns is one of the key reasons for building topic
models (Boyd-Graber et al. [2014]).
2.1.1 Precursors of Latent Dirichlet Allocation
There are many different kinds of statistical models currently in use to discover
topics within documents. In this section, I will briefly list a few of these methods
and then explain in detail the one I use. Latent Semantic Analysis (LSA) uses
vector-based models for finding coherence between texts. The main methods used
in this model are term frequency – inverse document frequency (tf-idf) and
singular value decomposition (SVD)1 (Deerwester et al. [1990]). There are certain
advantages to using LSA, namely that one can find the latent topics that exist
within the corpus. However, due to SVD, the computational complexity of this
model is extremely high.
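As an illustration of the tf-idf weighting on which LSA builds, the following sketch computes tf-idf scores in plain Python. The toy documents are invented for illustration, and the SVD step that LSA would subsequently apply to the resulting term-document matrix is omitted:

```python
# Minimal tf-idf sketch over a hypothetical toy corpus (not the thesis data).
import math
from collections import Counter

docs = [
    ["cell", "gene", "expression", "gene"],
    ["patient", "treatment", "cell"],
    ["gene", "mutation", "cancer"],
]

def tf_idf(docs):
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

weights = tf_idf(docs)
```

With these documents, "patient" (appearing in only one document) receives a higher weight in its document than the corpus-wide "gene" does in the first, which is exactly the behaviour that lets LSA discount uninformative common terms.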
Another topic modeling method is Probabilistic Latent Semantic Analysis
(PLSA). It uses a simple two-level generative model that calculates a probability
model for the documents, the topics and the words (Hofmann [1999]). It has certain
advantages, such as easily interpretable topics and the fact that the model is based on a
1https://en.wikipedia.org/wiki/Singular_value_decomposition
solid statistical foundation (Holzinger et al. [2014]). However, this model also has a
disadvantage: the expectation–maximization algorithm1 used by PLSA tends to
find a solution that is not always the global optimum (Leopold [2007]).
2.1.2 Latent Dirichlet Allocation
One of the key concepts for the models used for this Master’s thesis is latent Dirichlet
allocation (LDA) by Blei et al. [2003]. They explain LDA as follows:
“Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus.
The basic idea is that documents are represented as random mixtures over latent
topics, where each topic is characterized by a distribution over words.”
The underlying logic behind LDA is that similar groups of words will occur in
documents with similar topics. Thus, latent topics are groups of words that
frequently occur together in documents. Documents, in this case, are simply
probability distributions over latent topics, and the topics can be defined as
probability distributions over words. A key point is that this model works with
probability distributions and not raw word frequencies. Hence, the syntax of the
text within the document does not matter; only the distribution of the words is
of importance.
Figure 1.1: Plate notation representing the LDA model (from Blei et al. [2003])
The plate notation (see Figure 1.1 ) represents the overall architecture of the LDA
model from Blei et al. [2003].
M: total number of documents in the corpus (1...M)
N: number of words in a document (1...N)
α: the Dirichlet prior parameter for the per-document topic distributions
β: the Dirichlet prior parameter for the per-topic word distributions
θ_m: distribution of topics in document m
1https://en.wikipedia.org/wiki/Expectation-maximization_algorithm
z_mn: the topic assigned to the n-th word in document m
w_mn: the n-th word in document m
The generative process gives an insight into how the LDA model assumes a
document is created. As the first step, the model determines the number of words
for a given document. Then it models the document as a mixture of a given set
of topics. For example, if the number of topics is set to four, it might decide
that document m consists of 10% topic 1, 20% topic 2, 40% topic 3, and 30%
topic 4. The model then generates the words of the document. This is done
by first choosing a topic for each word, drawn from the multinomial distribution
of topics assigned to the document (40% topic 3, 30% topic 4, etc.). In the next
step, it chooses the word itself, drawn from that topic’s
multinomial distribution over words.
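The generative story above can be sketched in a few lines of standard-library Python. The vocabulary, the two hand-made topic–word distributions, and the prior α below are invented toy values for illustration only, not parameters from the thesis models:

```python
# Sketch of the LDA generative process with toy parameters.
import random

random.seed(0)
vocab = ["gene", "cell", "patient", "virus", "vaccine", "heart"]
# beta: one word distribution per topic (two hand-made topics)
beta = [
    [0.4, 0.4, 0.1, 0.05, 0.025, 0.025],  # topic 0: cell biology flavour
    [0.05, 0.05, 0.1, 0.4, 0.3, 0.1],     # topic 1: infection flavour
]
alpha = [0.5, 0.5]  # symmetric Dirichlet prior over topics

def sample_dirichlet(alpha):
    # A Dirichlet draw is a normalised vector of Gamma samples.
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(n_words):
    theta = sample_dirichlet(alpha)    # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta)  # choose a topic for this word
        w = sample_categorical(beta[z])  # choose a word from that topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
```

Inference in LDA runs this story in reverse: given only the observed documents, it estimates the hidden θ and β that most plausibly generated them.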
2.2 Machine Learning Problem
The techniques of machine learning can be broadly divided into three categories,
namely: supervised, semi-supervised and unsupervised learning. A machine
learning problem falls into one of these categories depending on whether full,
partial, or no ground truth is available during the training procedure. One uses
unsupervised machine learning when ground truth is completely absent. The aim
of unsupervised learning methods is to find structures and patterns in the input
data, based on the type of machine learning algorithm being implemented
(Bonissone [2015]). LDA falls into the category of unsupervised machine learning
algorithms, as the input data is not labelled and the model tries to infer structures
within the data based on predefined parameters. This can be problematic at a
later stage, as it is challenging to judge the quality of the generated topic model
when there is no reference against which to compare it.
2.3 Issues with Topic Models
Boyd-Graber et al. [2014] give a comprehensive overview of the issues that could
occur with topics generated by an LDA model. They mention five categories that
could be used to judge if a topic is of good quality. As I will be using these metrics to
judge the quality of the topics generated by the model (see Chapter 4), I will present
in the following sections the aspects of good and bad topics as proposed by them.
2.3.1 Categories of Bad Topics
2.3.1.1 General and Specific Words
Even though most words in natural language convey some meaning, the model
can generate topics made of words that are not useful. These are topics containing
words that are frequent in the corpus but not specific. Such topics can be perceived
as general, not belonging to any specific subdivision of the corpus. They may
include stop words that were not removed during the preprocessing step. However,
it can also be the case that these are high-frequency words specific to the corpus,
which should be removed in order to yield meaningful topics.
Boyd-Graber et al. [2014] also state that low-frequency words can cause problems.
According to them, topics containing a multitude of specific words can also be
bad, because there is a chance that these topics are not representative of a specific
subdivision within the corpus, but were generated due to mere chance, as the model
generates topics based on word frequencies. The authors do not mention how to
avoid the creation of such topics.
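One common remedy for such corpus-specific high-frequency words is to filter out terms that occur in more than a given fraction of the documents before training; gensim's Dictionary.filter_extremes offers this idea off the shelf. Below is a minimal pure-Python sketch, with invented toy documents and an illustrative threshold:

```python
# Drop terms that occur in more than max_doc_frac of the documents.
from collections import Counter

docs = [
    ["study", "gene", "expression"],
    ["study", "patient", "treatment"],
    ["study", "virus", "vaccine"],
]

def drop_frequent(docs, max_doc_frac=0.6):
    # document frequency of each term
    df = Counter(t for doc in docs for t in set(doc))
    limit = max_doc_frac * len(docs)
    return [[t for t in doc if df[t] <= limit] for doc in docs]

filtered = drop_frequent(docs)
# "study" occurs in all three documents and is removed everywhere
```

The same idea applies symmetrically to the rare words discussed above, by additionally dropping terms below a minimum document frequency.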
2.3.1.2 Mixed and Chained Topics
Boyd-Graber et al. [2014] define ‘mixed topics’ as topics made of a set of
words that make no sense in combination, although they contain subsets of words
that do make sense. Example 2.1 is a case of a ‘mixed’ topic, as it consists of two
subsets, namely names of flowers (in emphasis) and tools.
(2.1) rose, daffodil, daisy, hammer, screwdriver, pliers ...
‘Chained’ topics are related to mixed topics, as here too one has different subsets
of words within the topic; the issue here is that at least one word from one of
the subsets could also belong to the other subset. As shown in Example 2.2, there are
two subsets within the topic, but one word, namely ‘apple’, from the first subset,
which is about names of fruits, could also belong to the other subset, which is about
products made by the company Apple.
(2.2) apple, banana, grape, iphone, smartphone, ipad...
2.3.1.3 Identical Topics
Another issue that can occur while generating topics is that two topics are mostly
or completely identical, possibly exhibiting only a different word order (see Examples
2.3 and 2.4).
(2.3) apple, banana, grape, orange, pineapple
(2.4) grape, apple, pear, pineapple, banana
Boyd-Graber et al. [2014] mention some solutions to avoid such topics. They suggest
that one should check if there are empty documents in the dataset, and if the number
of topics is excessive for the given dataset.
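Near-identical topics of this kind can also be flagged automatically by comparing top-word sets while ignoring word order, for instance with the Jaccard coefficient. The topic contents and the threshold below are illustrative, not taken from the thesis models:

```python
# Flag topic pairs whose top-word sets overlap heavily (order-insensitive).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

topics = {
    3: ["apple", "banana", "grape", "orange", "pineapple"],
    4: ["grape", "apple", "pear", "pineapple", "banana"],
    5: ["rose", "daffodil", "daisy", "tulip", "lily"],
}

def near_identical_pairs(topics, threshold=0.6):
    ids = sorted(topics)
    return [
        (i, j)
        for k, i in enumerate(ids)
        for j in ids[k + 1:]
        if jaccard(topics[i], topics[j]) >= threshold
    ]

pairs = near_identical_pairs(topics)
```

Here topics 3 and 4 share four of their six distinct words (Jaccard ≈ 0.67) and are flagged, while the flower topic is not.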
2.3.1.4 Incomplete Stopword List
These are topics generated as a result of an incomplete stop word list. This issue
is somewhat similar to that of topics containing general words. The difference
here is that the topics are not vague but make sense; for example, they could be
a list of first names, or Roman numerals. The authors suggest that this problem
can be circumvented by updating the list of stop words and running the model
again (Boyd-Graber et al. [2014]).
2.3.1.5 Nonsensical Topics
These are topics that do not make any sense. Boyd-Graber et al. [2014] mention
that asking the model to generate an excessive number of topics may cause
nonsensical topics: the model tries to generate the given number of topics even if
that many do not exist in the corpus, inferring topics from whatever patterns it
finds. The authors mention, for example, that OCR errors can form an artificial
topic that a topic modeling algorithm might pick up.
2.3.2 Topic Alignment
Another issue with LDA models is that, even with the same corpus and
parameters, the model generates the topics in a different order on each run. Yang
et al. [2016] mention that in an LDA model the words are generated from fixed
conditional distributions, which has the side-effect that the topics in the model are
exchangeable. For this reason, in statistical topic modeling algorithms such as LDA,
even with identical model generation parameters, the topic indices of two topic
models may not match. Hence, the authors suggest an alignment measure in order
to calculate the similarities between multiple models. Yang et al. [2016] implement
the Hungarian algorithm1 in their work. However, for my experiments I use a
different measure to calculate the similarities between multiple models (see
Chapter 4.2.7).
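To make the alignment problem concrete, the sketch below greedily matches topics from two hypothetical runs by counting shared top words. This greedy matching is only a simplified stand-in for the optimal assignment that Yang et al. [2016] obtain with the Hungarian algorithm, and the topic lists are invented:

```python
# Greedy one-to-one alignment of topics between two runs by word overlap.
def overlap(a, b):
    return len(set(a) & set(b))

run_a = {
    0: ["gene", "expression", "cell", "protein"],
    1: ["virus", "vaccine", "infection", "antibody"],
}
run_b = {
    0: ["vaccine", "virus", "antibody", "immune"],
    1: ["gene", "protein", "cell", "dna"],
}

def align(run_a, run_b):
    mapping, taken = {}, set()
    # order candidate pairs by descending overlap, then pick greedily
    pairs = sorted(
        ((overlap(wa, run_b[j]), i, j)
         for i, wa in run_a.items() for j in run_b),
        reverse=True,
    )
    for score, i, j in pairs:
        if i not in mapping and j not in taken:
            mapping[i] = j
            taken.add(j)
    return mapping

mapping = align(run_a, run_b)
```

In this toy case, topic 0 of run A (genes) is matched to topic 1 of run B and vice versa, despite the shuffled indices.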
2.3.3 Topic Quality Evaluation
Boyd-Graber et al. [2014] mention that there are different ways of judging the
quality of the topics retrieved from a model. Furthermore, they indicate that a
weakness of most topic modeling papers is that the researchers perform only a
qualitative assessment after generating the topics. They also state that in many
cases the quality of the topics is judged based on some NLP-related task that is
not directly related to the topics themselves, for example using the inferred topics
for an information retrieval task (e.g. Wei and Croft [2006]) or for a sentiment
detection task (e.g. Titov and McDonald [2008]). A second approach is to keep a
set of held-out articles (i.e. a test set) and compute the probability of these
held-out articles under the trained model. The paper by Wallach et al. [2009]
provides a comprehensive overview of evaluating topics based on the probability
of the observations on the test set.
At this point I would like to mention that I will not be using a test set to evaluate
my results, as I do not have a reference for comparison. Hence, I will be using a
different metric to judge the quality of my model (see 4.2.7).
2.3.3.1 Human Evaluation
Chang et al. [2009] propose a method for human evaluation of topics, where the
participants are given a set of words from a topic that also includes an intruder
topic word (see Example 2.5)2. They mention that the participants are able to identify
the intruder word if the other words in the set belong to the same semantic group.
However, if the set contains words that do not belong together (see Example 2.6),
then the task is much more difficult for the participants, who then seem to start
choosing the intruder word at random. They conclude that the quality of a topic
can be evaluated based on the consistency of the human judgement.
1https://en.wikipedia.org/wiki/Hungarian_algorithm
2The intruder word is in emphasis.
(2.5) apple, orange, banana, pineapple, tiger
(2.6) book, computer, teacher, student, weekend
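Such intrusion test items can be generated automatically from a fitted model's top words. The sketch below is a hypothetical illustration of the task setup described by Chang et al. [2009], not their actual materials; all function names are my own.

```python
import random

def make_intruder_set(topics, topic_idx, n_words=5, seed=0):
    """Build one word-intrusion test item: the top words of one topic
    plus a single 'intruder' drawn from the top words of another topic.
    `topics` is a list of top-word lists, one per topic."""
    rng = random.Random(seed)
    own = topics[topic_idx][:n_words]
    # pick a different topic to supply the intruder word
    other_idx = rng.choice([i for i in range(len(topics)) if i != topic_idx])
    # take the first word of that topic not already in the item
    intruder = next(w for w in topics[other_idx] if w not in own)
    item = own + [intruder]
    rng.shuffle(item)  # hide the intruder's position
    return item, intruder
```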
2.3.3.2 Topic Size
The topic size plays a key role in the quality of the model. As Mimno et al. [2011]
state, there is a relationship between the number of topics and the quality of the
topics themselves. They point out that, on the one hand, models with a large number
of topics provide the user with a more detailed view of the themes present in the
corpus. On the other hand, having a multitude of topics comes with disadvantages,
because certain topic modeling algorithms tend to create topics even if there are
none to be found (see 2.3.1.5).
2.3.3.3 Topic Word Length
Boyd-Graber et al. [2014] refer to the length of the words in a topic as an indicator
of topic quality. Their intuition is as follows: if a word has a specific meaning, then
it is quite likely to be longer, and vice versa. Thus, according to Boyd-Graber
et al. [2014], topics with a short average word length could be an indication
of anomalous topic clustering (e.g. acronyms). However, they mention that a short
topic word length is not necessarily an indication of a nonsensical topic that cannot
be interpreted by the user. Topics with short words in them probably indicate words
that have a tendency to co-occur. The authors allude to a topic that contains the
word ‘legislator’ and acronyms for the names of US states (e.g. ‘ca’, ‘pa’, ‘nc’,
‘fl’, etc.). In this case, the topic shows that names of states tend to co-occur with
tokens such as ‘legislator’. The authors do not provide a solution for avoiding such
topics. Nonetheless, this criterion of topic word length can be applied to my future
topic models to evaluate whether the output is of adequate quality.
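This heuristic is easy to operationalise. The sketch below is my own illustration (the helper names and the threshold are hypothetical, not from Boyd-Graber et al.): it computes the average character length of a topic's top words and flags suspiciously short averages.

```python
def mean_word_length(topic_words):
    """Average character length of a topic's top words."""
    return sum(len(w) for w in topic_words) / len(topic_words)

def looks_acronym_heavy(topic_words, threshold=4.0):
    """Flag topics whose average word length falls below a threshold;
    such topics may consist largely of acronyms (e.g. US state codes)."""
    return mean_word_length(topic_words) < threshold
```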
2.4 Improving Topic Models
Boyd-Graber et al. [2014] also mention multiple ways of improving topic models.
Their suggestions include merging topics that are similar (see 2.3.1.3) or separating
topics that conflate multiple concepts (see 2.3.1.2). Amongst the approaches that can
be implemented, they recommend measures that calculate the co-occurrence of words
(e.g. point-wise mutual information (PMI)) and the use of expert knowledge. They
also state that automatic topic labelling is a way to interpret topics without the
help of domain experts. This method also provides a summary of what is being
presented by the topics.
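Document-level co-occurrence statistics such as PMI can be computed directly from a tokenized corpus. The following sketch is my own illustration (the function names are hypothetical): it scores a topic by the mean PMI of its word pairs, so that higher scores suggest a more coherent topic.

```python
import math
from itertools import combinations

def topic_pmi(topic_words, documents):
    """Mean PMI over all pairs of topic words, using document-level
    co-occurrence counts. `documents` is a list of token lists."""
    n = len(documents)
    doc_sets = [set(d) for d in documents]

    def df(*words):
        # number of documents containing all the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for a, b in combinations(topic_words, 2):
        joint = df(a, b)
        if joint == 0:
            continue  # skip pairs that never co-occur
        scores.append(math.log((joint * n) / (df(a) * df(b))))
    return sum(scores) / len(scores) if scores else 0.0
```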
2.4.1 Automatic Topic Model Labelling
2.4.1.1 Information Retrieval
Lau et al. [2011] attempt to resolve the issue of topic labelling by generating new
labels for the topics. The methods they implement include using terms that
are found in the topics. They search the English Wikipedia for the top-ranking
topic terms and use the article titles returned by their query to generate more
topic labels. Then they rank and process the Wikipedia article titles and extract
label keywords from them. These keywords are further processed using a combination
of association measures (PMI) and lexical features. Of all the data sets evaluated
in their work, the topic labels of the PubMed abstracts perform less well than
labels generated on the other datasets. This approach is certainly interesting;
however, in order to implement their methodology, it is necessary to use an API to
get the relevant information from Wikipedia, which goes beyond the scope of this
Master's thesis.
2.4.1.2 Neural Embeddings
An improvement on the model proposed by Lau et al. [2011] is the one by Bhatia
et al. [2016]. Even though their methodology has some similarities to the former
approach, Bhatia et al. [2016] forego the information retrieval aspect of Lau et al.
[2011] and replace it with word2vec and doc2vec. The word2vec model generates
more abstract labels, whereas doc2vec returns fine-grained labels for a given topic.
The strength of the system lies in combining the outputs of both. They also
make use of different learning-to-rank approaches to improve the quality of the
top-ranking topic labels. Unlike with the approach by Lau et al. [2011], the
researchers claim that this method is much easier to implement, as one is not
required to use search APIs. Moreover, Bhatia et al. [2016] report better results
than Lau et al. [2011].
As their method relies on doc2vec to generate multi-word topic labels, I chose
not to implement it for my work. Doc2vec requires a significant amount of
computational resources (notably RAM) to run, and it is not recommended for the
corpus that I intend to use.
2.4.2 Text Preprocessing to Acquire Meaningful Topics
A prevalent issue with LDA topic modeling is finding meaningful topics in a given
corpus. Zhu et al. [2014] suggest that when using LDA, this problem can be
significantly reduced by using multiple levels of text preprocessing, with methods
such as POS tagging, base noun phrase chunking, and K-means clustering. Part
of their preprocessing approach includes the reduction of plural tokens to their
lemma form. They mention that with these preprocessing steps their method
outperforms a baseline, and that its output is ranked higher by human annotators.
This paper is of particular interest, as many of the approaches mentioned by the
authors can be implemented using off-the-shelf NLP tools such as the Natural
Language Toolkit (NLTK) for Python (Bird et al. [2009]).
3 Previous work
In this chapter I give a brief overview of the work that has been done in the fields
of biomedical topic modeling and diachronic topic modeling. I will focus only on the
articles that are of interest to my work. Finally, I give a brief overview of the
tools that can be used for topic modeling purposes.
3.1 Biomedical Topic Modeling
3.1.1 Ontology Term Mapping
Zheng et al. [2006] used topic modeling on titles and abstracts of protein-related
MEDLINE articles. They used LDA and extracted 300 topics from their corpus.
They found that the majority of the extracted topics were not only semantically
coherent, but also featured biological terms. As an added feature, they mapped
the topics to the Gene Ontology (GO) controlled vocabulary. They did this by
associating the common terms between the topic words and the GO terms. Thus, this
paper exhibits a practical usage of topic modeling in the domain of biomedical
publications. Furthermore, it also contains multiple examples of biomedical
topics which were created using LDA.
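The mapping idea can be illustrated with a toy sketch. The `ontology` structure below is a hypothetical simplification (a dict of term names to descriptive words), not the actual GO interface used by Zheng et al. [2006].

```python
def map_topic_to_terms(topic_words, ontology):
    """Associate a topic with every ontology term that shares at least
    one word with it, in the spirit of Zheng et al.'s GO mapping.
    `ontology` maps a term name to a list of descriptive words."""
    topic = set(topic_words)
    return {term for term, words in ontology.items() if topic & set(words)}
```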
3.1.2 Enriching LDA Output with External Data
The output of the model can also be improved by enhancing it with information from
an external knowledge base. Wang et al. [2011] do this by applying multiple levels
of complex preprocessing steps, which include NER, retrieving information about the
tokens from an external database, and recategorising the extracted information into
a relational database for further usage. Moreover, they apply other semantic
association features to improve the topics. The researchers create Bio-LDA, an
algorithm that performs LDA on a given corpus and, moreover, enriches the results
using datasets from the
life science domain. It then identifies relationships between the topics using the
aforementioned methods.
This article is of interest to me, as the authors incorporate external information
to enhance their method. It has also served as an inspiration for the pipeline that
I created to easily process PubMed articles (see Chapter 8).
3.1.3 Discovering Relationships between Diseases and Genes with Topic Modeling
ElShal et al. [2016] aim to find relationships between diseases and genes.
They used LDA to extract topics from the abstracts of biomedical research
articles, which constitute their corpus. They converted the documents in their
corpus into vectors using the bag-of-words approach. From the LDA model they used
the topics and the topic word distribution, and from the corpus they used the
mentions of genes and diseases. They combined this information to find links
between genes and diseases. In most cases, they calculated similarities between
the documents, topics, genes, and diseases using cosine similarity.
Using this approach, they found many correct links between genes and diseases;
however, they also realised that the vocabulary plays an important role, as the
extracted topics rely heavily on it. This article is interesting as it focusses
on the role of vocabulary when creating LDA models.
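Cosine similarity over bag-of-words count vectors is straightforward to implement; a minimal stdlib sketch (my own, not ElShal et al.'s code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```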
3.1.4 Comprehensive Biomedical LDA Topics Example Source
Examples of topics created from models with biomedical texts as input
are a useful resource, because they serve as a reference for what such topics could
look like. This is one of the reasons why the article by van Altena et al. [2016] is
helpful. The article focuses on the usage of big data themes in scientific texts, with
an emphasis on biomedical literature. As their corpus, they take 1308 documents
from PubMed and PMC. These abstracts were selected based on the occurrence of
big data related keywords in them. As for the topic modeling methodology, they
implemented LDA using R. The preprocessing methods included stopword filtering
and stemming of tokens. An interesting feature implemented in their corpus
creation was to transform common bigrams into a single token using an underscore
(e.g. ‘health care’ to ‘health_care’). To extract the best parameters for their
model, they calculated
the Akaike information criterion (AIC)1 using the following variables: the number
of topics, the likelihood of the model, and the number of unique words in the corpus.
For their corpus, the best number of topics was 25. Finally, they used manual
annotators with domain knowledge to ascribe labels to the topics. The article
further describes the usage of big data terminology in biomedical texts.
Despite the fact that their corpus is about big data, this article provides a highly
valuable resource, as it gives an insight into what topics can look like when they
are generated from a biomedical corpus. Moreover, their method for finding the best
parameters for the LDA model can potentially be applied to finding the parameters
for the models in my experiments.
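van Altena et al. [2016] implemented their pipeline in R; the sketch below reproduces the bigram-merging idea in Python using simple frequency counting (Gensim's Phrases class offers a more sophisticated, collocation-scored variant). All names here are my own illustration.

```python
from collections import Counter

def merge_common_bigrams(tokenized_docs, min_count=2):
    """Replace frequent adjacent word pairs with a single
    underscore-joined token (e.g. 'health care' -> 'health_care')."""
    pair_counts = Counter()
    for doc in tokenized_docs:
        pair_counts.update(zip(doc, doc[1:]))
    common = {p for p, c in pair_counts.items() if c >= min_count}
    merged_docs = []
    for doc in tokenized_docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in common:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # consume both halves of the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs
```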
3.2 Diachronic Topic Modeling
3.2.1 Early Work in Modern Non-biomedical Domain
The paper by Wang and McCallum [2006] performs diachronic topic modeling using
LDA. The authors developed a tool called Topics over Time (TOT), which combines
topic modeling with co-occurrence patterns of words. Although they do not use
biomedical data in their training corpus, they were able to show trends in their
data. This article is mentioned here as an interesting early example of diachronic
topic modeling and trend analysis.
3.2.2 Observing Diachronic Changes and Author Influence in the Biomedical Domain
The issue of diachronic changes in topics and the influence of authors is tackled by
Song et al. [2014]. They analyse 20,869 articles from PubMed from 2000 to 2011.
They extract bibliographical and other relevant information (e.g. abstract, full
text, etc.) from the XML files. Using the extracted bibliographical information,
they create a relational citation database. Furthermore, the texts are divided
into three temporal categories, namely 2000-2003, 2004-2007, and 2008-2011. They
run separate LDA models for each of these time periods. The authors state that
these temporal categories are based on in-domain trends, the number of publications
per year, and having enough data for diachronic topic modeling. They use topic
modeling to find the changes in topics that have occurred in the aforementioned
time periods. Moreover, they implement a mixture of information
1https://en.wikipedia.org/wiki/Akaike_information_criterion
extraction, to find the most influential authors, and topic modeling, to find the
topics associated with them. The researchers use Dirichlet multinomial regression
(DMR) as their topic modeling algorithm and the tool MALLET. Using these methods,
they find not only the most productive authors, countries, and institutions, but
also patterns of larger collaboration networks based on topics. This article
provides an insight into diachronic changes in topics in the domain of biomedical
literature, and into what such topics could look like. It also provides a method
for finding influential authors and the topics used by them. Moreover, it provides
information about finding collaboration networks and interrelated fields. Finally,
similar to the article by van Altena et al. [2016], this paper is very useful to
me, as it also provides a comprehensive list of topics that the researchers found
during the topic modeling process.
3.3 Brief Overview of Available Tools
There are numerous topic modeling tools available for free, and the choice between
them depends on one's knowledge of programming languages, the amount of data to be
processed, and the level of customizability required for the task.
3.3.1 MAchine Learning for LanguagE Toolkit (MALLET)
MALLET is a cross-platform, Java-based NLP package (McCallum [2002]). It has
tools for document classification, sequence tagging, and topic modeling. The topic
modeling toolkit contains models such as LDA, hierarchical LDA, etc. It is easy
to use, does not require any prior knowledge of Java, and contains tutorials for
users who do not have any prior programming knowledge.
3.3.2 Stanford Topic Modeling Toolbox
This toolbox is part of the Stanford NLP suite of software (Ramage and Rosen
[2009]). Written in Scala, it takes as input data in spreadsheet (Excel, CSV)
format. The topic models available in this toolkit are LDA, labelled LDA, and
PLDA. The advantage of using this tool is that it generates the output in Excel
format, which eases the data analysis process; moreover, there is a Java-based
user interface. A disadvantage is that knowledge of Scala is required to fine-tune
the model parameters.
3.3.3 Spark Machine Learning Library (MLlib)
The machine learning libraries in Spark offer a big data solution for LDA models1.
Although most of the core code and documentation for Spark MLlib is written in Java
(and Scala), there is a Python wrapper (the PySpark API) which enables the user to
write the required code in Python. The advantage of this library is that it offers
a big data solution that could be useful for handling large amounts of data, which
is usually the case when working with biomedical literature. A disadvantage is that
it requires an architecture for big data analysis, and the user documentation is
somewhat complicated and takes time to get accustomed to.
3.3.4 R Libraries
There are also topic modeling libraries for R; a notable one is called ‘topicmodels’
(Hornik and Grun [2011]). It requires data in document-term matrix format, so that
it can be processed by the package. Moreover, it requires programming knowledge of
R, and one should be skilled in data manipulation with other R libraries, so that
both the input and the output can be processed in R.
3.3.5 Gensim
Gensim is a Python library (compatible with Python 2 and 3) that contains a
multitude of tools for extracting semantic relations from documents (Rehurek and
Sojka [2010]). As for topic modeling algorithms, it offers LSI and LDA. It takes
as input either raw text or text in Python-readable list formats. Gensim has its
own text processing tools, such as simple tokenizers. A key feature is that Gensim
also allows parallel processing, which can reduce the time it takes to run the
models. Moreover, this library has very detailed documentation, which is easy to
follow, and allows a high level of customizability.
In addition, as it works in Python, NLP tools such as NLTK can be used for
preprocessing before feeding the data into the topic modeling algorithm. I will
be using Gensim for this project because of the level of customizability of this
library and my current programming knowledge of Python.
1https://spark.apache.org/mllib/
4 Methodology
As mentioned before, LDA is an unsupervised machine learning algorithm. Thus,
there is the issue of ground truth, as there is no pre-existing data against which
I can compare my results. Therefore, my aim in the following sections will be to
remain aware of the issues of topic modeling and to circumvent them, based on what
is possible and what is recommended in the literature on building topic models.
4.1 Source of Data
In this section, I document how the topics were extracted from a corpus of around
1.5 million open-access PubMed articles. I downloaded the open-access bulk article
packages from PubMed Central. The corpus contains articles from the PMC Open
Access Subset1. I used the articles from this subset because the data made
available there falls under Creative Commons or similar licences, which enables
me to avoid any issues regarding copyright law. The corpus was downloaded at the
beginning of March 2017.
4.2 Extracting Topic Models
In the following sections I implement the LDA algorithm on the corpus to create
topic models. I tried different parameters, such as chunk size, dictionary
trimming, the number of extracted topics, and corpus filtering by POS tags, to
obtain topics that appear coherent and meaningful to a human reader.
1https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
4.2.1 Experiment 1: Exploring the Topics in the Corpus
4.2.1.1 Preprocessing
The articles downloaded from the PMC Open Access Subset are in XML format.
Hence, with the help of Python libraries, I extracted the text from each article
to create my corpus. As a next step, I tokenized the text and then reduced the
corpus by removing English stopwords1 and punctuation marks that do not occur
within a token. In order to further reduce the vocabulary, I lowercased the entire
corpus. This approach has its advantages and disadvantages: by lowercasing, one can
consolidate multiple tokens with different casing into a single lowercased token
(e.g. ‘CELL’ and ‘Cell’ to ‘cell’), which in turn reduces the number of vocabulary
items. A disadvantage of this method is that lowercasing can create homonyms with
different meanings (e.g. ‘AIDS’ vs. ‘aids’) that should not be consolidated into
a single vocabulary item. Despite these issues, I decided to lowercase the corpus,
as in this context the positive aspects of lowercasing outweigh the disadvantages.
Then I lemmatized the text to further reduce the number of vocabulary items
(e.g. ‘cell’ and ‘cells’ to ‘cell’).
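In the thesis these steps are carried out with NLTK's tokenizer and lemmatizer; the self-contained sketch below only approximates the pipeline with stdlib tools (regex tokenization, an abbreviated stopword list, and naive plural stripping instead of real lemmatization), so it is an illustration of the steps rather than the actual implementation.

```python
import re

# A deliberately abbreviated stopword list; NLTK provides a full one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to"}

def preprocess(text, stopwords=STOPWORDS):
    """Tokenize, drop punctuation-only tokens and stopwords, lowercase,
    and crudely reduce plurals to a singular form."""
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    out = []
    for tok in tokens:
        tok = tok.lower()                      # 'CELL', 'Cell' -> 'cell'
        if tok in stopwords:
            continue
        if tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]                     # naive 'cells' -> 'cell'
        out.append(tok)
    return out
```

Note how the last assertion below also reproduces the homonym problem discussed above: ‘aids’ is reduced just like any other plural.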
I used the tokenizing and lemmatizing functions provided by NLTK in the text
preprocessing stage (Bird et al. [2009]). The decision to use NLTK was mainly based
on the fact that it is freely available and can easily be combined with Python.
Secondly, the dataset I am using consists of text written in English, and the tools
provided by NLTK are trained on English datasets. It should be noted that
biomedical texts pose a difficulty for most text processing tools, due to the
specific biomedical terminology used by the authors of these texts. Considering
the scope of this Master's thesis, I decided against using tools that have been
trained on biomedical data. This decision was based on the examples of topic
models that I saw in the work done by Song et al. [2014]2 and van Altena et al.
[2016]3, where the topic models were generated from biomedical texts.
In both papers, examples of topic words tend to be common nouns such as ‘cell’,
‘disease’, ‘cancer’, ‘virus’ etc. (from Song et al. [2014]), or ‘brain’, ‘disorder’,
‘biology’ etc. (from van Altena et al. [2016]). In both works, the researchers did
not implement any tools for recognising biomedical entities. Nonetheless, terms
such as
1These were taken from the set of English stop words provided in the NLTK package (Bird et al. [2009])
2Song et al. [2014] have a table of topics generated by their model in their paper on pages 357-359
3van Altena et al. [2016] also show the topics generated by their model, in tables that can be found in pages 357-359 of their paper.
‘dna’ were recognised as topic words by the models in both papers. Therefore, based
on the previous work done in this domain, I decided against implementing a tool
trained on biomedical data.
4.2.2 LDA Model Creation Parameters
Model Parameters
After the preprocessing step, I created an LDA model using Gensim. In my first
tests, I ran the LDA model with different configurations for extracting the topics
from the corpus. I set the number of topics to 10, 20, 50, and 100 in successive
runs. Moreover, I set the chunk size to 5000 and the number of passes to 1.
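The four runs differ only in the number of topics. As a sketch (the parameter names follow Gensim's LdaModel keyword arguments; the model call itself is shown only as a comment, since it assumes a prepared bag-of-words corpus and dictionary):

```python
# Fixed parameters shared by all runs of Experiment 1.
BASE = {"chunksize": 5000, "passes": 1}

def experiment_configs(topic_counts=(10, 20, 50, 100)):
    """One configuration dict per run, varying only num_topics."""
    return [dict(BASE, num_topics=k) for k in topic_counts]

# Hypothetical invocation (assumes `corpus` and `dictionary` exist):
# for cfg in experiment_configs():
#     model = gensim.models.LdaModel(corpus, id2word=dictionary, **cfg)
```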
Dictionary Trimming
I also edited the dictionary used by the model, in order to make the model run
faster and to discard redundant vocabulary items that could behave like spam in
the output. Hence, I removed not only words that occurred fewer than 10 times in
the entire corpus, but also the 100 most common words. Other omissions from the
dictionary included numbers, and tokens that are three characters long or shorter.
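These trimming rules can be sketched as a stdlib stand-in (in Gensim, the first two steps correspond to the Dictionary methods filter_extremes and filter_n_most_frequent); the function name and exact rules below are my own illustration.

```python
from collections import Counter

def trim_dictionary(tokenized_docs, min_count=10, drop_most_common=100):
    """Keep only vocabulary items that occur at least `min_count` times,
    are not among the `drop_most_common` most frequent words, are not
    numbers, and are longer than three characters."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    top = {w for w, _ in counts.most_common(drop_most_common)}
    vocab = set()
    for word, c in counts.items():
        if c < min_count or word in top:
            continue
        if word.isdigit() or len(word) <= 3:
            continue
        vocab.add(word)
    return vocab
```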
Topic number Topic words
1 response participant task trial stimulus subject experiment fig activity day
2 fig structure solution protein patient surface size compound acid reaction
3 patient health age risk participant care score woman clinical outcome
4 snp gene patient population dna expression allele fig association protein
5 protein fig gene activity dna mutant binding expression strain acid
6 expression mouse fig protein gene tissue antibody day tumor activity
7 fig specie water plant concentration area temperature surface site population
8 patient disease blood clinical day infection response tumor serum month
9 gene sequence document expression protein genome fig set region minimal
10 patient fig image parameter region activity network area structure response
Table 4.1: 10 topics generated from a corpus of 1.5 million articles
4.2.2.1 Evaluation: 10 Topics Model
The output of the model returned 10 topics which are based on the entire corpus.
The result of this topic model can be found in Table 4.1, where the topic words
and the topics numbers are displayed. As it can be seen from the results, the topic
can be guessed by using the top five most relevant topic words from the list. For
example, other topics 4, 5, and 9 are about ‘gene related’ themes, whereas topic 3
is about patient information
Issues
Within the topics, there are certain terms that can be classified as noise, namely
‘fig’, which probably refers to the shortened form of ‘figure’. These terms should
be removed in future experiments. Moreover, as seen in Table 4.1, at least three of
the topics can be labelled as gene-related. It can be concluded that there is a
lack of thematic variance in this model. As previously discussed, these are issues
common to bad topic models. Three issues that can be mentioned here are: mixed and
chained topics (2.3.1.2), identical topics (2.3.1.3), and an incomplete stopword
list (2.3.1.4).
Model Evaluation
It can be seen that the 10 topics given as output are vague and somewhat generic.
Moreover, they tend to have common themes. Nonetheless, by looking at them, it is
possible to gain a general idea of what the corpus is about. However, the topics
themselves are far too vague, and ultimately it is necessary to have more
fine-grained topics. Finally, it can be said that this is a poor-quality model,
and that 10 topics are not enough for a corpus of this size.
4.2.2.2 Evaluation: 20 Topics Model
Table 4.2 shows some of the results of the 20 topics model. This model was run
with the same set of parameters, with the exception of the number of topics in the
output. There are still some spam words in the output, such as ‘fig’. One topic
was found that contained LaTeX-related markup terms (topic 7). This could be
because the markup terms had not been filtered out of the corpus during the
preprocessing steps. Unlike in the previous iteration, there are more distinct
topics, as new themes emerge in the output. There are topics about insulin and
mice (topic 15), about mice and brain experiments (topic 20), and about cancer
(topic 10), which were not in the previous model.
4.2.2.3 Model Evaluation
We can see that more topics are extracted, as new themes emerged in the 20 topics
model. Those themes are much more detailed than those in the 10 topics model.
These topics contain more words that are frequent in the corpus, but they are
still not specific. The topics can be perceived as general and as not belonging
to a specific subdivision of the corpus (see 2.3.1.1). Moreover, the issues of
mixed and chained topics (2.3.1.2), identical topics (2.3.1.3), and an incomplete
stopword list (2.3.1.4) are still prevalent in this model. Finally, it can be said
that 20 topics are not adequate for a model of this corpus.
Topic number Topic words
7 document amsbsy minimal amsfonts mathrsfs wasysym upgreek amssymb amsmath 12pt
10 cancer tumor patient expression breast gene protein tissue mutation line
15 concentration glucose insulin mouse acid fig weight diet rat protein
20 rat neuron mouse animal brain response day patient experiment trial
Table 4.2: A selection from the 20 topics generated from a corpus of 1.5 million articles
4.2.2.4 Evaluation: 50 Topics Model
In this iteration, the same issues as with the 10 and 20 topics models persist,
because this model was also created with the same parameters, corpus, and
dictionary. However, compared to the previous ones, this model resulted in even
more detailed topics. Some of the topics in this model were also found in the
previous models; hence, I will only mention the new topics that emerged. These new
topics are about mental health/depression (topic 44), antibiotic resistance
(topic 6), vaccines (topic 12), diabetes (topic 17), eyes/surgery (topic 32),
bone/tissue (topic 18), neurons (topic 48), and plants (topic 8).
It appears that in this 50 topics model, the generated topics are less general and
tend to be more specific.
Model Evaluation
Although the 50 topics model yielded much more detailed results than the previous
iterations, it contains some of the topics that were found before. Furthermore,
some of the criteria that make a model bad were found in this iteration as well.
Nonetheless, with 50 topics, some nuanced themes mentioned in the corpus become
apparent, as the topics are less vague than before. If I manage to solve some of
the issues that place the topics generated by this model in the category of bad
topics, then for future iterations 50 topics is an adequate, yet manageable,
number to generate from this corpus.
4.2.2.5 Evaluation: 100 Topics Model
Unlike the 50 topics model, this iteration contained very few new topics, namely
about heart disease (topic 22), aneurysms (topic 63), malaria (topic 91), pregnancy
Topic number Topic words
6 antibiotic resistance patient isolates strain antimicrobial infection resistant culture day
8 plant leaf soil specie fig water seed site day root
12 virus vaccine influenza antibody vaccination response patient infection protein day
17 risk patient age diabetes blood subject bmi disease association weight
18 bone fracture patient fig tissue implant cartilage week day protein
32 patient eye nerve left surgery right clinical disease month image
44 patient pain score symptom depression sleep protein disorder fig anxiety
48 neuron fig response channel current receptor expression activity synaptic protein
Table 4.3: A selection from the 50 topics generated from a corpus of 1.5 million articles
(topic 74), and reproduction (topic 95). The lack of topic diversity is due to the
fact that this model contained topics similar to those found in the previous models,
with slight variations in the topic words and their order. Thus, it can be said
that this model has a high number of identical topics. Perhaps this is because 100
topics are too many for the given corpus (2.3.1.3). As this model was made with
the same parameters as the aforementioned ones, the weaknesses of those models are
also prevalent in this one.
Topic number Topic words
22 patient heart cardiac pressure vitamin ventricular left blood artery volume
63 aneurysm patient fig artery blood protein activity min concentration rbc
74 birth pregnancy infant maternal patient woman asthma age child risk
91 malaria parasite infection hpv falciparum patient mosquito blood day expression
95 oocyte sperm expression embryo patient human chromosome mtdna ovarian stage
Table 4.4: A selection from the 100 topics generated from a corpus of 1.5 million articles
Model Evaluation
The 100 topics model exhibits a rather comprehensive array of topics.
Unfortunately, several of them are similar. It would be possible to go into
further detail with a 200 topics model; however, as very few new topics were
generated by the current model, branching into a 200 topics model at this point is
not advisable. As discussed before, LDA has a tendency to create topics even when
there are none thematically present in the corpus (see 2.3.1.5). Hence, at this
point it is no longer necessary to branch further into a more detailed topic model,
but rather to fix the issues that are currently present.
4.2.2.6 Evaluation Experiment 1: All Models
The results of the different iterations show that extracting only a small number of
topics, namely 10 to 20, from the corpus results in vague topics. This is because
the LDA model tries to cluster together high-frequency words in the corpus, which
leads to generic topics. With an increasing number of topics extracted from the
corpus, the topics become more specific. It should be noted, however, that with an
increasing number of topics the model may not extract genuinely different topics,
but rather different versions of topics belonging to the same theme. As discussed
before, the generation of identical topics can sometimes be due to a number of
topics that is excessive for the data set.
At the end of this experiment, the results give an overview of the types of topics
that can be found in this corpus. As mentioned before, there are many topics and
topic words in the output that should be removed, which will be done by updating
the stopword list. Hence, the corpus needs to be cleaned again, and further
experiments with different parameters should be run on the new corpus.
4.2.3 Experiment 2: Edited Corpus and Modified Model Update
Parameters
In the previous experiment, I ran the model with the chunk size parameter set to
5000. This means that the model performs online training: it is updated continuously
as new chunks of documents arrive. The update interval of the model depends on the
number of workers (parallel processes) and the chunk size (see 4.1).
update = number of workers × chunk size (4.1)
In this batch of experiments, I decided against online training, as I have no control
over how the initial chunk is chosen and how representative it is of the entire
corpus. Thus, I set the batch parameter to True, which makes the LDA model
calculate the topics from the entire corpus at once. The goal of this approach is
to find the topics that are representative of the whole corpus rather than those
driven by individual chunks.
4.2.3.1 Preprocessing
The corpus from the previous model needed to be modified as it contained many
spam elements such as LaTeX markup terms. Hence, a new corpus was created with
the spam elements removed from them. The other preprocessing elements remain
the same as before (see 4.2.1.1).
Topic words
cell gene patient model mouse cancer activity rate expression population
patient protein cell model gene treatment rate activity expression site
patient cell treatment gene protein model expression risk activity rate
cell protein patient treatment expression activity mouse concentration line model
gene cell patient expression treatment model network activity disease function
cell protein gene mouse activity patient antibody sequence expression treatment
Table 4.5: Identical topics generated from multiple topic models with different topic sizes
4.2.3.2 Results
The results from this iteration are disappointing: for all the different types of
models that I ran, all the topics were variations of the ones shown in Table 4.5.
Hence, I can only assume that the model picked up the most common elements in
the corpus, which concern experiments on mice and the analysis of proteins and
cells. Unlike in the previous experiment, the topics show little variance. The number
of topics generated by the model is irrelevant, as even in the 100 topics model all
the topics contain the word ‘cell’. I can assume that online training is important,
and in the following experiments I will experiment with different batch sizes to see
how the topics differ. Nonetheless, the results from this experiment seem illogical,
as I do not see any direct link between online learning and the types of topics
generated. Perhaps the problem lies in the dictionary, because a new corpus was
generated (as mentioned in the preprocessing step 4.2.3.1) which has different
word frequencies.
4.2.4 Experiment 3: Online Learning with Different Batch Sizes
4.2.4.1 Preprocessing
Working with a corpus of 1.5 million documents is not recommended when using
Gensim, because Gensim loads a significant amount of the data into RAM during
training. Therefore, one cannot run multiple batches with different parameters
simultaneously; doing so consumed almost 250 GB of RAM and brought the server
to a standstill. For this reason, in order to reduce the training time and run multiple
models at the same time, I chose a smaller subset of the corpus. This subset contains
150 thousand randomly selected articles from the entire corpus, regardless of year
of publication. In Figure 4.1 one
Figure 4.1: Number of articles published per year from 1950-2016 in the corpus of 150 thousand articles
can see the number of articles published per year in the smaller corpus. The graph
shows the number of publications from 1950 to 2016. As is visible in Figure 4.1,
the number of publications increases exponentially. In order to see whether the
randomly chosen sample is representative of the corpus, I calculated for each year
the percentage of articles from the original corpus that are contained in the new
corpus of 150 thousand articles. As one can see in Figure 4.2, the distribution of
the articles oscillates between 8 and 11 percent of the articles that were published
in a given year and are available in the open access subset. Drastically reducing
the number of articles reduced the training time as well as the strain on the RAM
caused by running the topic models. The other preprocessing steps remain the same
as before (see 4.2.1.1).
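The subsampling step can be sketched in a few lines of Python; the article identifiers and sizes below are placeholders for the actual corpus:

```python
import random

# Stand-in for the list of 1.5 million article identifiers.
all_articles = [f"doc{i}" for i in range(1_500_000)]

random.seed(42)  # fixed seed so the subset is reproducible
subset = random.sample(all_articles, k=150_000)  # sampling without replacement

print(len(subset))       # size of the reduced corpus
print(len(set(subset)))  # same value: no article is drawn twice
```

Because `random.sample` draws uniformly regardless of metadata, the yearly distribution of the subset should track the original corpus, which is exactly what the comparison in Figure 4.2 checks.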
4.2.4.2 Results
The results of this batch were nearly identical to the output of the previous
experiment, thus confirming my suspicion that online training did not influence the
topics generated by the model. Hence, I looked into the differences between the first
experiment and the latter two and realised that the key issue is the dictionary. In
experiments 2 and 3, after removing the LaTeX markup words from the corpus, the
Figure 4.2: Percentage of articles from the original corpus (1.5 million articles) per year from 1950-2016 that are in the corpus of 150 thousand articles
token ‘cell’ was not removed when I deleted the 100 most common tokens from
the corpus. Evidently, certain vocabulary items have a tremendous influence on the
output. In later experiments, the influence of these words should be noted, and they
should be treated as a form of stopword, as they are high-frequency vocabulary
items. This raises the issue of which new tokens should be considered disruptive
when it comes to finding the underlying topics in the corpus. Should other tokens
such as ‘protein’ or ‘mice’ be removed as well? Although the results from this
experiment can be discarded, the underlying cause for not finding varied topics has
possibly been found.
4.2.5 Experiment 4: Reduced Vocabulary
4.2.5.1 Preprocessing
As seen before, the dictionary plays a critical role regarding the type of tokens
chosen by the model (see 4.2.4.2). In this preprocessing step, I decided to reduce
the vocabulary drastically.
I created a new, reduced corpus which consists only of lemmatized common and
proper nouns1. This was done by running NLTK’s English POS tagger on the
corpus and selecting only the noun-related tags for further processing. The other
1 The POS tagger in NLTK uses the Penn Treebank POS tags, which in this case are NN, NNS, NNP and NNPS.
preprocessing methods remain the same as before (see 4.2.1.1).
4.2.5.2 10 Topics Model
As seen in Table 4.6, the topics generated by this model are quite generic and only
partially nonsensical. However, none of them are identical. The token ‘mouse’
appears in several topics, namely topics 3, 7, 8 and 9. Perhaps ‘mouse’ is an
important token in the subcorpus I am currently using. Other common topic words
are ‘blood’, ‘virus’, and ‘tumour’. The words in these topics are very generic, and
the topics themselves also fall into the category of mixed topics.
Topic number Topic words
1 month intervention therapy blood trial hospital outcome pressure infection event
2 sequence mutation specie mouse domain receptor antibody virus genome family
3 infection mouse strain blood antibody sequence virus culture serum isolates
4 child score trial month mortality death outcome infection antibody sequence
5 medium growth solution strain membrane culture surface antibody temperature reaction
6 participant woman intervention score child community service family practice problem
7 mouse plant growth infection antibody macrophage production tumor activation cancer
8 tumor water lesion stage image diagnosis mouse therapy brain field
9 cancer mouse tumor antibody blood brain muscle animal breast receptor
10 sequence parameter image network length target frequency position performance error
Table 4.6: Topics from 10 topic model from noun corpus
Model Evaluation
The topics generated in this iteration fall under some of the criteria of poor-quality
topics (see 2.3.1). Despite being somewhat vague, the topics show the overarching
themes in the subcorpus. Reducing the vocabulary does not have an adverse
influence on the generated topics; nonetheless, more fine-grained topics are required.
Overall, one can say that 10 topics are not enough to demonstrate the thematic
diversity within this corpus.
4.2.5.3 20 Topics Model
Unlike in the previous iteration, the topics of this model are less nonsensical.
However, they are still very vague and exhibit signs of mixed topics. Here we also
observe that the tokens ‘cancer’ and ‘mouse’ occur in many of the topics. Perhaps
these tokens are important in the subcorpus I am using. Expanding to 20 topics
brings forth more detailed topics that were not present in the smaller model.
However, one can also observe that some of the topics are partially identical.
Topic number Topic words
1 infection virus blood mouse woman mutation antibody pregnancy growth phase
2 sequence specie genome length variation selection distance frequency receptor family
3 cluster sequence strain plant family specie isolates genome annotation transcription
4 network image parameter channel exercise frequency performance solution measurement surface
5 brain image neuron surgery month lesion diagnosis muscle tumor nerve
6 muscle cancer mouse trial blood month specie medium score event
7 participant woman child service intervention question people community practice score
8 temperature mouse hospital medium culture frequency production image lesion cancer
9 cancer blood antibody association exposure tumor bladder smoker brain status
10 trial intervention score outcome month child therapy participant review criterion
11 feature brain frequency image sequence association event error performance position
12 water strain growth infection energy surface temperature culture density production
13 sequence mouse strain domain culture mutation mutant primer antibody promoter
14 mouse antibody infection brain sequence image marker mirnas cancer culture
15 vaccine sequence virus mouse specie infection trial death cancer mortality
16 cancer tumor mutation breast stage survival metastasis therapy methylation sequence
17 plant stress mouse sequence growth water specie reaction temperature promoter
18 compound solution reaction product medium membrane water blood mouse plant
19 blood woman association therapy month parameter pressure serum fracture criterion
20 antibody mouse tumor cancer activation receptor medium growth culture inhibitor
Table 4.7: Topics from 20 topic model from noun corpus
Model Evaluation
This iteration exhibits the same issues as the 10 topics model. However, the topics
are more specific than before, as new themes emerge within them. Nonetheless,
they are still too vague, which shows that 20 topics are not adequate for this smaller
subcorpus.
4.2.5.4 50 Topics Model
There are some overlaps with the previous model; however, in this case we can see
that ‘cancer’ is again a common topic word. Thus, I would assume that there are
texts about cancer in this subcorpus. Some of the topics are about chromosomes
(topic 32), surgery complications (topic 7), patient mental health (topic 23), viral
and bacterial infections (topics 30 and 35), pregnancy (topic 40), microRNA (topic
26), and some social aspects of hospital/clinical procedures (topic 12) (see Table
4.8).
Topic number Topic words
7 surgery injury complication image technique lesion month blood strain brain
12 network hospital family staff mouse program score participant member intervention
23 depression anxiety movement participant score disorder scale symptom stress month
26 mirnas sequence cancer growth target fraction mirna culture infection medium
30 virus insulin mouse blood trial vitamin score child infection cancer
32 chromosome embryo stage clone exposure receptor mutation locus phenotype marker
35 strain bacteria production culture medium activation growth infection inhibition product
40 pregnancy birth woman mother antibody child outcome parent muscle infection
Table 4.8: A selection from the 50 topics generated from noun corpus
Model Evaluation
This model has more detailed topics, and cancer appears to be a significant topic
word. Overall, the model shows variety in the types of topics it contains, although
some topics are identical in the themes they represent and instances of mixed topics
are present as well.
4.2.5.5 100 Topics Model
The topics found by this model are variations of the topics found in the 10, 20 and
50 topics models. However, the advantage of this model is that one can see many
topic words that are associated with the same topic. A good example of this are
the topics about breast cancer, depicted in Table 4.9. The three topics all fall under
the theme of breast cancer, yet the topic words in them have a different thematic
focus: some focus on the growth of the carcinoma (topic 1), others on diagnosis
and mortality (topic 23), whereas the last one focuses on tumours and mutations
(topic 80).
Topic words Labels
cancer tumor breast proliferation antibody growth carcinoma woman medium invasion breast-cancer (1)
cancer association death mortality cohort breast diagnosis exposure survival incidence breast-cancer (23)
cancer tumor breast sequence therapy mutation plant stage mirnas association breast-cancer (80)
Table 4.9: Focus on breast-cancer related topics from the 100 topics model from noun corpus
Model Evaluation
Overall, the 100 topics model has the issue of identical topics. As shown in the
previous example (see Table 4.9), it generates fairly good topics, but it lacks
thematic diversity. Perhaps 100 topics are too many to be extracted from this
subcorpus. Hence, if one were to work with this corpus for the scope of this Master’s
thesis, one would have to rely on a better version of the previously generated 50
topics model.
4.2.6 Experiment 5: Influence of POS Tags
In order to gauge the influence of different POS tags, I ran multiple models with
different corpora. As the base corpus, I am using the results from 4.2.5.4, where I
have a 50 topics model built from noun-related words; let us denote it the Noun-corpus.
I created three new corpora, each with a different group of tokens in them. I used
Topic number Topic words
1 plant strain sequence medium growth culture primer cancer water mutation
2 stimulus stimulation vaccine mouse neuron target layer latency trial sequence
3 growth insulin image neuron domain field score strain parameter reaction
4 participant trial event image performance block stage activation stimulus frequency
5 measurement temperature blood image water sensor phase device field surface
6 mouse antibody infection blood animal cytokine culture tumor macrophage brain
7 surgery injury complication image technique lesion month blood strain brain
8 domain sequence mutation amino molecule peptide position virus compound family
9 intervention trial score child outcome participant month disorder symptom criterion
10 infant child score infection trial month season image correlation participant
11 brain seizure mouse frequency sequence woman blood trial error stimulation
12 network hospital family staff mouse program score participant member intervention
13 woman participant infection blood child prevalence brain status adult volume
14 antibody culture mouse membrane neuron medium image growth mutation fluorescence
15 student service country community people practice program question school survey
Table 4.10: 15 topics from 50 topic model from Noun-Corpus
the following criteria to create the corpora: a corpus with noun and verb type
tokens (Noun-Verb-corpus)1, a corpus with noun and adjective type tokens
(Noun-Adjective-corpus)2, and a corpus with noun, verb, and adjective type tokens
(Noun-Verb-Adjective-corpus)3. Also, as a vocabulary reduction measure that is
comparable across all three corpora, I removed the top 100 most common words
from each newly created corpus. This was done by removing the 100 most frequent
words from the dictionary of vocabulary frequencies created by Gensim. I then
analysed the topics returned by the models.
4.2.6.1 Noun-Verb Corpus
For the Noun-Verb corpus, I will only look at the first 15 topics returned by the
model (see Table 4.11). I chose 15 topics because topic diversity in this model is
extremely low, and it is possible to judge its quality based on the first 15 topics it
returns. I could also have randomly chosen, e.g., 20 topics from this model, and it
would not have made a difference for my evaluation. I will compare it with the top
15 topics from the Noun-corpus (see Table 4.10).
When observing the topics returned by both corpora, one can see that the topics
from the Noun-corpus demonstrate great diversity. In contrast, the Noun-Verb-corpus
exhibits an issue with its dictionary: most of the observed topics contain the words
‘cell’, ‘gene’, ‘mouse’ and ‘data’. These topics are also very poor, as most of them
are identical, generic and nonsensical. Furthermore, most of the 15 topics show, to
a certain extent, cases of mixed topics.
1 Tokens with POS tags: NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, and VBZ.
2 Tokens with POS tags: NN, NNS, NNP, NNPS, JJ, JJR, and JJS.
3 Tokens with POS tags: NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, VBZ, JJ, JJR, and JJS.
Model Evaluation
This model has severe issues with its vocabulary and cannot be used for any practical
purposes. I also observed that although the corpus contains words that fall under
the category of verbs, none of the topic words appear explicitly to be verbs. It
could be the case that some of the tokens in the topics are homonyms of lemmatized
verbs. For example, the tokens ‘effect’, ‘test’, ‘study’ and ‘result’ could possibly
be verbs. Unfortunately, from the topic model it is not possible to say whether the
tokens are nouns or verbs. This is due to how LDA works: it ignores the syntax of
the tokens and focusses on word frequencies in documents. Moreover, during the
text preprocessing stage I may have conflated tokens which are homonyms. It is
highly likely that the homonyms appearing in the topics refer both to their verb
and their noun forms.
Topic number Topic words
1 cell protein figure antibody activity expression analysis study level membrane
2 cell study group data patient analysis difference result time level
3 cell protein data treatment expression study analysis level gene receptor
4 cell study patient data analysis effect group gene level disease
5 cell population region site bone analysis gene study number data
6 sequence analysis study gene data population time group result rate
7 patient study risk data group time year case result rate
8 cell figure analysis condition activity time result data study effect
9 study patient analysis data group result model treatment effect test
10 gene study patient expression cell group sequence data protein disease
11 cell expression mouse control level protein group analysis figure gene
12 specie population model effect time study data number group size
13 data cell study trial analysis time gene result effect treatment
14 study student patient group time level effect rate treatment analysis
15 cell patient study infection control mouse gene result expression level
Table 4.11: 15 topics from 50 topic model from Noun-Verb corpus
4.2.6.2 Noun-Adjective Corpus
For this corpus, I will compare the topics returned by the model using the
Noun-Adjective-corpus with the results from the base Noun-corpus. Here, too, I
observed that the model has issues with the dictionary, as the tokens ‘cell’, ‘study’,
‘group’ and ‘data’ are highly prevalent in all of the 15 topics of the Noun-Adjective
model. In this case, too, the topics are very generic and, in some cases, nonsensical.
Model Evaluation
Despite the fact that the model has issues with its dictionary, which should be
reduced in order to yield sensible results, another aspect I noticed is that the topics
do not contain any adjective-related tokens. I assume that the adjective tokens have
a low frequency compared to their noun counterparts and because
of this reason they do not appear in the topic model. In summary, this model
cannot be used for the aforementioned reasons.
Topic number Topic words
1 cell study analysis patient expression level gene data group figure
2 plant cell study control analysis group leaf sample level gene
3 cell response expression study data effect macrophage level infection group
4 cell expression tumor control mouse figure protein antibody level treatment
5 gene data study analysis figure level effect group control genotype
6 study risk population year group prevalence case cancer analysis woman
7 study level cell data group analysis concentration year child time
8 group study bone case result time difference patient data analysis
9 strain sample sequence group number vaccine study data analysis isolates
10 study group analysis protein data region activity patient gene level
11 treatment data study group analysis response sample time cell model
12 cell study figure analysis data expression group gene number control
13 compound study activity concentration reaction effect result acid group data
14 mutation study analysis cell nerve data sample group patient case
15 group system process study data change time research community level
Table 4.12: 15 topics from 50 topic model from Noun-Adjective corpus
4.2.6.3 Noun-Verb-Adjective Corpus
In this case, too, the model generated from this corpus will be compared to the base
Noun-corpus. Here it is also observed that the tokens ‘cell’, ‘study’, ‘group’ and
‘data’ are quite prevalent in the generated topics. In this iteration, the topics are
nonsensical and partially mixed in nature.
Model Evaluation
In theory, this model should exhibit topic words that are verbs and adjectives.
Unfortunately, the tokens observed here are mostly nouns. It could be the case that
some of these nouns are homonyms of verbs, as discussed before (see 4.2.6.1).
Adjectives are lacking in this topic model as well. The dictionary of the
Noun-Verb-Adjective-corpus is also the cause of this model’s lack of topic diversity.
Topic number Topic words
1 patient study cell treatment group level trial result data therapy
2 model data number value effect study result time parameter analysis
3 cell patient data analysis study effect mouse model figure result
4 cell data group study patient time figure protein analysis result
5 cell response study result gene expression effect analysis number receptor
6 sequence gene specie population data number analysis region genome site
7 patient study disease year data risk rate treatment analysis factor
8 data time sample analysis temperature solution particle figure result surface
9 study treatment result gene time patient group cell analysis effect
10 cell protein figure antibody control result data expression membrane time
11 group study effect analysis time data response control difference patient
12 health study care woman research data intervention group time service
13 cell concentration study activity effect data control figure analysis treatment
14 gene study cell level analysis data activity expression sample treatment
15 cell protein study level activity gene expression effect control result
Table 4.13: 15 topics from 50 topic model from Noun-Verb-Adjective corpus
Term Similarity between the Models
Noun-Verb Noun-Adjective Noun-Verb-Adjective
Noun            45.77%  44.09%  37.93%
Noun-Verb       -       61.8%   54.59%
Noun-Adjective  -       -       60.4%
Table 4.14: Percentage of identical terms between the models
As a next step, I calculated the percentage of identical words between the topic
words of the corpora. As seen in Table 4.14, the Noun-corpus has the fewest
common terms with the other three corpora. As a further measure, I also looked
at the number of unique topic words in each case (see Table 4.15).
Noun Noun-Verb Noun-Adjective Noun-Verb-Adjective
Unique terms 201 178 165 159
Table 4.15: Number of unique words found in all the topics
It can be seen that the Noun-corpus has the highest number of unique words in the
topics. Even though the other models presumably contain tokens from different
groups of POS tags, the number of unique tokens decreases as the number of token
types increases.
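Both measures reduce to plain set arithmetic over the topic words of two models; the tiny topic lists below are invented for illustration, and the percentage is one possible reading of the "identical terms" measure in Table 4.14:

```python
def topic_vocab(topics):
    """Union of all topic words of a model (its unique terms)."""
    return {word for topic in topics for word in topic}

def shared_term_percentage(topics_a, topics_b):
    """Percentage of model A's unique terms also found in model B."""
    vocab_a, vocab_b = topic_vocab(topics_a), topic_vocab(topics_b)
    return 100 * len(vocab_a & vocab_b) / len(vocab_a)

# Invented miniature models with two topics of three words each.
noun_model = [["cell", "gene", "protein"], ["patient", "trial", "risk"]]
noun_verb_model = [["cell", "gene", "study"], ["patient", "data", "group"]]

print(len(topic_vocab(noun_model)))                         # 6 unique terms
print(shared_term_percentage(noun_model, noun_verb_model))  # 50.0
```

Counting `len(topic_vocab(...))` for each of the four models yields the unique-term counts of Table 4.15.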
The explanation for this can be found by looking at the topic models and the
dictionary of each corpus. Even though the number of tokens in the corpus increases
with each new group of included token types, the topic words are, across all corpora
created thus far, different kinds of nouns. Hence, the issue of dictionary trimming
becomes important: with increasing numbers of tokens, removing only the top 100
tokens is not the best method for obtaining good topic models for the three corpora.
For the verb- and adjective-based corpora, it has been mentioned that some
vocabulary items may have been conflated during lemmatization, so the word
frequencies within the dictionaries are no longer identical. Therefore, for larger
corpora based on the Noun-Verb-, Noun-Adjective- and Noun-Verb-Adjective-corpora,
it would be advisable to remove a larger number of the most frequent terms than
in corpora with fewer types of lemmas. Unfortunately, there are no guidelines that
state how many high-frequency dictionary items should be removed. This can only
be determined via trial and error with multiple experiments.
Hence, I decided to discard the verbs and the adjectives from my future experiments.
This decision is comprehensible if one considers that, with the current
subcorpus, a reasonably good cut-off point for reducing the dictionary has been
found. In the following experiments, I will only work with the corpus of
noun-related lemmas.
4.2.7 Experiment 6: Extracting Models with Distinct Topics Using
Topic Similarity
Similar topics are an issue for topic models. In this section, I introduce a measure
for detecting similar topics and check whether, with the current model parameters,
my models contain them.
In order to find the common topics, I ran the model with the Noun-corpus and the
previously mentioned parameters 25 times and checked whether the topics within
a model are similar. Let us consider a topic model A, which consists of topics At1
to At50, each with 10 topic words. If I want to check whether At1 is similar to
At2, At3 ... At50, I can check whether the topic words in At1 match the topic
words in the other topics.
Here I use a similarity measure, the inter-topic-similarity measure, which is the
amount of overlap required between two topics. If the inter-topic-similarity measure
is 100%, then all the topic words in At1 are identical to all the topic words in At2;
if the measure is 60%, then only 6 out of 10 words, in any order, need to be identical
between the two topics.
To check the inter-topic similarity between all the topics of a given model, one
should first calculate the number of pairwise combinations of topics (k = 2) in a
model of 50 topics (n = 50):
n! / (k! (n − k)!) = 50! / (2! (50 − 2)!) = 1225 (4.2)
The inter-topic similarity of a model can then be calculated as the number of cases
in which two topics are the same (for a given similarity measure) divided by the
number of possible topic pairs. For example, if model A has 1000 cases in which
two topics match (for an inter-topic-similarity measure of 20%), then the inter-topic
similarity for all the topics in model A is (1000/1225) × 100 = 81.63%. Given
multiple topic models, one can calculate the average of their inter-topic similarity
scores.
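This count can be implemented directly from the definitions above; the three-topic model below is invented for illustration:

```python
from itertools import combinations
from math import comb

def inter_topic_similarity(model, threshold):
    """Percentage of topic pairs sharing at least `threshold` of their words.

    `model` is a list of topics, each a list of topic words;
    `threshold` is the required overlap, e.g. 0.4 for 4 out of 10 words.
    """
    n_words = len(model[0])
    required = threshold * n_words
    hits = sum(1 for t1, t2 in combinations(model, 2)
               if len(set(t1) & set(t2)) >= required)
    return 100 * hits / comb(len(model), 2)

# Pairwise combinations of 50 topics, as in equation (4.2).
assert comb(50, 2) == 1225

# Invented 3-topic model containing one near-duplicate pair.
toy = [["cell", "gene", "protein", "mouse"],
       ["cell", "gene", "protein", "antibody"],
       ["patient", "trial", "risk", "outcome"]]
print(inter_topic_similarity(toy, 0.75))  # 1 of 3 pairs meets the threshold
```

Averaging this value over the 25 runs gives the per-threshold scores plotted in Figure 4.3.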
Figure 4.3: Average inter-topic similarity
Figure 4.3 shows the average inter-topic similarity of the models. With an
inter-topic-similarity measure of 40%, the average inter-topic similarity of the
models drops to 2.65%. This means that, on average, only in 32.56 out of 1225
possible topic pairs could one expect topics with 4 out of 10 words in common.
For 50%, the average inter-topic similarity drops to 0.43%. A low average
inter-topic similarity score at a high inter-topic-similarity threshold indicates that
a model does not have many identical topics. This shows that, with the current set
of parameters, the models do not have many identical topics (see Figure 4.3).
4.2.8 Experiment 7: Extracting Stable Models
As discussed before, LDA models tend to be unstable, as each run returns a different
set of topics in a different order (see 2.3.2). Hence, the task is to find a model that
is stable enough to be used for further testing and experiments. In the following
section, I will try to find a way to judge the stability of my topic models.
As mentioned before, topic model A consists of topics At1, At2... At50 and similarly
another model B consists of topics Bt1 ... Bt50. In order to check whether there are
any similarities between the models, one can try to match the topics from model
A with those of model B. This can be done by checking whether the topic words
of At1 match those of any topic in model B. It is highly likely that no match is
perfect. Hence, we can have results such as: At1 matches Bt3 (6 out of 10 words)
and Bt6 (7 out of 10 words). In that case, I say that At1 matches Bt6, remove At1
and Bt6 from the comparison, and continue in the same manner with At2 up to
At50. For each topic, I go through all remaining candidates and choose the one
with the highest score; if the scores are tied, I choose the first one. For example, if
At7 matches both Bt27 and Bt35 with 5 common words, I match At7 with Bt27.
To calculate the similarity between two models, the inter-model similarity, I sum
the similarities of the matched topic pairs; e.g. At1 and Bt6 contribute a similarity
of 7. Two models can reach a maximum inter-model-similarity score of the number
of topic words per topic multiplied by the total number of topics, in our case
10 × 50 = 500. Thus, if two models are identical (modulo a permutation of their
topics), their inter-model-similarity score is 500. I calculated the average similarity
score for the aforementioned models, choosing one model as a constant and
comparing it with the others (model A with model B, then model A with model C,
etc.). This resulted in an average inter-model-similarity score of 171.25; in other
words, the models are 34.25% similar to each other.
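A sketch of this greedy matching procedure, reconstructed from the steps described above on invented two-word toy models:

```python
def greedy_inter_model_similarity(model_a, model_b):
    """Sum of word overlaps of greedily matched topic pairs.

    Each topic of model A is matched, in order, to the still-unmatched
    topic of model B with the largest word overlap (first one on ties).
    The maximum score is words-per-topic times number of topics.
    """
    remaining = list(model_b)
    score = 0
    for topic_a in model_a:
        overlaps = [len(set(topic_a) & set(topic_b)) for topic_b in remaining]
        best = max(range(len(overlaps)), key=lambda i: overlaps[i])
        score += overlaps[best]
        remaining.pop(best)  # each topic of B may be matched only once
    return score

a = [["cell", "gene"], ["trial", "risk"]]
b = [["trial", "risk"], ["cell", "protein"]]
print(greedy_inter_model_similarity(a, b))  # 3 out of a maximum of 4
```

Note that because every topic of A must be matched to some topic of B, weak overlaps still contribute to the score, which is precisely the "forced matches" problem discussed next.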
However, I realised that this measure is not adequate for checking the similarity
between the models. By trying to match every topic in one model with every topic
in another model, the matches are sometimes very much forced.
I tried a similar approach as explained above; however, this time I only calculated if
the topic was found or not. This means if At1 can be matched with any of the topics
in model B. Moreover, I also accounted for the number of common words between
the topics, as explained in section 4.2.7.
This can be explained in the following manner. I take model A as my base model,
and try to match At1 with any of the topics in model B, with a specific inter-topic
similarity measure. For example, with an inter-topic similarity measure of 0.7 (7 out
of 10 words should be common), At1 matched with at least one topic in model B,
and At2 matched with none and so on. At the end of this example, the result is
that 32 topics from model A found a counterpart with a given inter-topic similarity
measure in model B. Then the inter-model similarity is (32/50) × 100 = 64%.
The logic behind this measure is that if the inter-model similarity score is high for
a high inter-topic similarity score, then the models are similar to each other, and
have a variety of topics that are not identical.
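The threshold-based measure can be sketched as follows; this is a hypothetical illustration (topics again assumed to be sets of topic words), not the thesis's own implementation.

```python
def inter_model_similarity(model_a, model_b, threshold=0.7, n_words=10):
    """Percentage of topics in model A for which at least one topic in
    model B shares >= threshold * n_words topic words (the inter-topic
    similarity measure described in the text)."""
    min_common = threshold * n_words
    matched = sum(
        1 for topic_a in model_a
        if any(len(topic_a & topic_b) >= min_common for topic_b in model_b)
    )
    return matched / len(model_a) * 100

# Toy models with 2-word topics: threshold 0.5 requires 1 shared word
a = [{"cell", "gene"}, {"virus", "vaccine"}]
b = [{"cell", "rna"}, {"lung", "liver"}]
print(inter_model_similarity(a, b, threshold=0.5, n_words=2))  # 50.0
```

In the thesis's setting, 32 of 50 matched topics would accordingly yield (32/50) × 100 = 64%.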
Chapter 4. Methodology
similarity (%) \ number of passes    10    20    30    40    50    60    70    80    90   100   200   300   400   500
 10                                93.2  91.6  91.6  90.4  88.8  92.8  90.4  93.6  88.4  90.4    89    91    91    94
 20                                79.6  80.4    82    82  80.4  80.4  85.2    80    82  85.2    83    84    85    84
 30                                68.4    74  78.4  75.2    76  79.2  81.2  79.6  78.8  80.4    84    82    84    79
 40                                61.6    62  70.8  66.8  69.2    72  76.4  73.2  75.6  78.4    76    78    76    79
 50                                51.2    54    56    56  60.8  61.6  66.8    66  66.4  70.4    67    74    69    70
 60                                37.2  40.8  43.2  41.6  48.4  49.6    54  52.4  48.8    54    58    61    51    49
 70                                23.6    28  28.4  26.8  34.4  34.4  39.6  36.8    32  42.8    41    49    41    34
 80                                11.6  18.4    16  17.2    20  24.4  26.8  18.8    16  28.4    20    32    22    20
 90                                 5.2   5.2   8.8   4.8   7.2  11.6    12     8  10.8    16     9    17    13    10
100                                 0.4   0.4     2   0.4   0.4   3.2     2   1.6   2.4   4.4     2     2     3     4

Table 4.16: Topic similarity based on the number of similar words over multiple passes
I calculated the inter-model similarity for different inter-topic similarity measures.
As my preliminary results were rather low, I changed another parameter of my LDA
model, the number of passes, which is the number of times the model goes through
the entire corpus. Then I calculated the scores for those models. For every change
in the number of passes, I ran the model 5 times. Table 4.16 shows
the average inter-model similarity scores.
One can see in Table 4.16 that the similarity between the models increases with
the number of passes. Moreover, there is a greater similarity between the models
if one takes the number of common words between the topics into consideration.
Intuitively, the lower the required minimum number of common words, the more
similar the models are to each other. For 100 passes, inter-topic similarity
thresholds of 50% and 60% yield model similarities of 70.4% and 54% respectively.
It can therefore be assumed that the models are stable: with 100 passes one achieves
a fairly adequate level of similarity, and beyond 100 passes the level of similarity
for the models does not increase significantly any more.
At this point, the hyper-parameters for the topic models have been set, and a fairly
stable set of topic models has been found. I will choose one of the models with 100
passes, namely the one used as the base model, as my topic model for the upcoming
sections.
4.2.9 Topic Labelling
There are many ways of automatically labelling topics within a topic model (see
2.4.1). However, despite their usability, they are either dependent on external APIs
(Lau et al. [2011]) or require further models to be trained (Bhatia et al. [2016]).
Instead of referring to a topic by merely a number, I propose a rudimentary
approach of referring to them by their topic number and the top three topic words in
a hyphenated construct. For example, topic 19 (see Table 5.1) can also be referred
to as topic 19-surgery-injury-complication. This will convey more information about
the content of the topic. A comprehensive list of all 50 topics can be found in the
appendix (see Table A.1).
5 Data and Topic Exploration
5.1 Document Topics Distribution
Checking the topic distribution per model can be a challenging task, because LDA
represents a document as a probability distribution of multiple topics.
For example, the document in Figure 5.1 is represented by topic 34-cancer-tumor-
breast (see Table 5.1) with a topic probability of 0.9034, whereas the document in
Figure 5.2 is represented by topic 19-surgery-injury-complication with a probability
of 0.4172 and topic 28-lung-liver-platelet with a probability of 0.4142. The topic
words from both topics 19 and 28 are present in the article in Figure 5.2.
In some cases it is easy to guess which topic best represents the document (e.g. topic
34-cancer-tumor-breast for Figure 5.1). However, for other documents this
distinction is not easy, as shown for Figure 5.2: the probability difference between
the top 2 topics is 0.4172 − 0.4142 = 0.003. There are other cases in the corpus where
the probability scores of the top topics are identical.
Topic number Topic words
19    surgery injury complication pain technique nerve catheter pressure operation vein
28    lung liver platelet respiratory mortality fibrosis admission sepsis count fluid
34    cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy
Table 5.1: Topics 19, 28, 34
5.2 Data Exploration
5.2.1 Topic Distribution
Trends in the data can be discovered by looking at the topic probability distribu-
tion within the corpus. The logic for this approach is as follows: For the 150,000
documents in the corpus, using Gensim, one can get the document topic probability
for each document that was used to create the model. For example, for a fictitious
document D4, the topic probabilities are 0.85 for topic 2 and 0.1 for topic 3 (see
Table 5.2). After gathering the topic probability scores for all the documents, one
can calculate the topic probability distribution for a given topic. For topic 3 in
the example shown in Table 5.2, the probability distribution can be visualised by
creating a histogram for all the 150,000 data points (documents in the corpus) for
which the document probability is greater than 0 (e.g. 0.45,..., 0.1).
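Gathering the per-topic probability lists that feed these histograms could look like the sketch below. The row format mirrors what Gensim's `get_document_topics` returns (lists of `(topic_id, probability)` pairs), but the data here are simply the fictitious Table 5.2 values.

```python
from collections import defaultdict

def topic_probability_columns(doc_topic_rows):
    """Turn per-document (topic_id, probability) lists into one list of
    probabilities per topic; each list is the set of data points for
    that topic's histogram (only probabilities > 0 are kept)."""
    columns = defaultdict(list)
    for row in doc_topic_rows:
        for topic_id, prob in row:
            if prob > 0:
                columns[topic_id].append(prob)
    return columns

# Fictitious Table 5.2 data: documents D1, D2, D4, D5 over topics 1-4
rows = [[(1, 0.25)], [(3, 0.45)], [(2, 0.85), (3, 0.1)], [(4, 0.3)]]
print(topic_probability_columns(rows)[3])  # [0.45, 0.1]
```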
Figure 5.1: Topic words in article (green: topic 34)
Figure 5.2: Topic words in article (green: topic 19, yellow: topic 28)
Based on the corpus of 150,000 PubMed documents, Figure 5.3 shows the topic
probability distribution for the corpus. The x-axis represents the topic probability,
and the y-axis shows the number of articles in each category. As can be seen, this
graph does not convey much information. This could be because in
many cases the articles exhibit a low topic probability towards this topic. In order
to visualize the data better, I only selected the cases where the topic probabilities
are greater than 0.1 (Figure 5.4).
Document ID Year Topic 1 Topic 2 Topic 3 Topic 4
D1    2000    0.25    0       0       0
D2    2000    0       0       0.45    0
D4    2001    0       0.85    0.1     0
D5    2001    0       0       0       0.3

Table 5.2: Fictitious topic probability distribution over multiple topics and documents
For most of the 50 topics, the probability distribution looks similar to that of topic 19
(see Figure 5.4). As for topics 30-adult-host-male and 41-image-volume-measurement,
one can observe sudden spikes in the topic distribution. For topic 41, there is a
sudden increase in the number of articles at the probability of approximately 0.5. For
Figure 5.3: Topic probability distribution of documents of topic 19
Figure 5.4: Topic probability distribution of documents of topic 19, where topic probability > 0.1
topic 39, spikes in the number of articles occur at topic probabilities of approximately
0.7 and 0.8. For topic 6-specie-specimen-margin, the probability distribution does not
fall exponentially, but stagnates a little. For topic 25-domain-molecule-chain, the
drop in number of articles with increasing probability is observed until 0.6, then the
number of articles with higher probability increases before falling again (see Figure
5.5). In general, it is observed that the higher the topic probability, the fewer the
articles that belong to that category. Beyond this observation, however, the
information is not useful for detecting trends in the data.
5.2.2 Average Topic Probability
To observe the temporal changes in the data, it is required to transform the infor-
mation shown before (see 5.2.1) to show the diachronic aspects. For a given topic,
an article has the topic probability k. During a given year y, there are ny articles
published. Thus, one can calculate the yearly average topic probability Ay (see
Equation 5.1). For example, in the fictitious example in Table 5.2, the average
yearly topic distribution for the year 2000 and topic 3 is (0+0.45)/2 = 0.225.
A_y = \frac{\sum_{i=1}^{n_y} k_i}{n_y}    (5.1)
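Equation 5.1 can be computed, for instance, as in the sketch below; the `records` format of `(year, {topic_id: probability})` pairs is a hypothetical illustration, populated here with the fictitious Table 5.2 values.

```python
from collections import defaultdict

def average_yearly_topic_probability(records, topic_id):
    """Equation 5.1: for each year y, average the probabilities k_i of
    the n_y articles published that year for the given topic."""
    sums, counts = defaultdict(float), defaultdict(int)
    for year, topic_probs in records:
        sums[year] += topic_probs.get(topic_id, 0.0)
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sums}

# Fictitious Table 5.2: year 2000, topic 3 -> (0 + 0.45) / 2 = 0.225
records = [(2000, {1: 0.25}), (2000, {3: 0.45}),
           (2001, {2: 0.85, 3: 0.1}), (2001, {4: 0.3})]
print(average_yearly_topic_probability(records, 3)[2000])  # 0.225
```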
This process results in showing temporal trends in the data based on the topics. I
looked at a time frame of 15 years, namely from 2000 to 2015. The following graphs
below show the results (Figures 5.7 to 5.10). The x-axis represents the year and the
Figure 5.5: Topic probability distribution of documents of topics 6, 10, 39, and 41, where topic probability > 0.1
y-axis represents the average topic probability. It should be mentioned that the data
points are not continuous; the lines drawn between them are only a visual aid.
In most cases the average topic probability fluctuates in a given time range or re-
mains more or less constant. However, there are cases where one can observe an
increase in the topic probability. As one can see in Figure 5.7, the topic probabil-
ities for the topics 23-trial-month-therapy and 33-lesion-diagnosis-biopsy fluctuate,
with the gradual tendency to increase over time. For topic 10-surface-temperature-
particle, after a dip in the early 2000s, the probability for this topic also sharply
rises.
In other cases, the opposite is true. The average topic probability gradually drops
over time. This is illustrated in Figure 5.8 where the topic probability has ei-
ther gradually dropped, as is the case for topic 11-antibody-vector-construct, or the
topic probability has dropped and remained stable over the years, e.g. topics 12-
activation-inhibitor-phosphorylation and 28-care-intervention-service.
There are also instances where a topic suddenly peaks and then its topic probability
Figure 5.6: Topic probability distribution of documents of topic 25, where topic probability > 0.1
Figure 5.7: Average topic probability of documents from 2000 to 2015 of topics 10, 23, and 33
wanes gradually over time. This is shown in Figure 5.9, where the topic 25-domain-
molecule-chain suddenly exhibits a surge in topic probability in the late 2000s and
then sharply declines over the upcoming years. Peaks are also observed for the topics
2-sequence-genome-specie and 5-parameter-correlation-probability.
Finally, Figure 5.10 shows that a topic can vanish over time. This is the case for
topic 50-exposure-skin-smoking, which suffers a sharp decline in 1995 and does not
reappear in the corpus after 1999.
Figure 5.8: Average topic probability of documents from 2000 to 2015 of topics 11, 12, 17, 21, 28, and 43
Figure 5.9: Average topic probability of documents from 2000 to 2015 of topics 2, 5, and 25
Thus, using the average yearly topic probability, one can observe temporal trends
in the data, namely changes in diachronic topic probability, where it increases (see
Figure 5.7), decreases (see Figure 5.8), exhibits peaks (see Figure 5.9), and stops
being popular (see Figure 5.10).
Figure 5.10: Average topic probability of documents from 1980 to 2005 of topic 50
5.3 Observing Diachronic Trends Using Topic Models
With the help of average topic probability as well as a diachronic corpus, it has been
shown that diachronic trends within a corpus can be detected using a topic model
(see 5.2.2). Moreover, trends in the data vary based on the type of topic. Thus,
using topic models one can detect surges as well as declines of certain topics within
these corpora. These observations support my claim that it is indeed possible to
detect temporal trends in a corpus by using topics generated from a topic modeling
algorithm (see 1.2).
5.4 Topic Exploration
For the upcoming sections, I will be analysing the following topics, namely 13-
woman-heart-pregnancy, 22-infection-virus-vaccine, 34-cancer-tumor-breast and 38-
infection-resistance-bacteria (see Table 5.3). I chose these topics not only because I
know the meaning of their topic words, but also because the topics themselves
have a cohesive theme. As for their average topic probability, these topics exhibit
a certain variability (see Figure 5.11). Thus, it is evident that these topics have
gained as well as lost popularity between 2000 and 2015. However, simply looking at
the average topic probability does not convey much information about the content of
these topics. Hence, in the following sections, I will look into the variability that
exists within the topics themselves.
Topic number Topic words
13    woman heart pregnancy pressure birth hypertension infant delivery mother week
22    infection virus vaccine antibody vaccination antigen replication titer influenza transmission
34    cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy
38    infection resistance bacteria isolates strain culture pathogen phage tuberculosis coli
Table 5.3: Four topics selected for data exploration
Figure 5.11: Average topic probability of documents from 2000 to 2015 of topics 13, 22, 34, and 38
5.4.1 Frequency of Topic Words in the Corpus
To analyse the contents of the individual topics, I looked at the relative frequency
of the topic words as they occur in the corpus. The relative frequency for each year
was calculated as follows:
yearly relative frequency = (number of documents in which the topic word occurred in a year) / (number of documents published that year)
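This yearly relative document frequency can be sketched as follows; the `(year, set_of_tokens)` document format and the toy data are hypothetical.

```python
from collections import defaultdict

def yearly_relative_frequency(documents, word):
    """Fraction of the documents published in each year that contain
    the given word at least once."""
    hits, totals = defaultdict(int), defaultdict(int)
    for year, tokens in documents:
        totals[year] += 1
        if word in tokens:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in totals}

docs = [(2000, {"birth", "infant"}), (2000, {"heart"}),
        (2001, {"birth"}), (2001, {"birth", "mother"})]
print(yearly_relative_frequency(docs, "birth"))  # {2000: 0.5, 2001: 1.0}
```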
The aim of this approach is to demonstrate the usage of the topics, and the
diachronic changes that have occurred to them. As seen in Figure 5.12, it is difficult
to see any trends, as the graph is highly cluttered by the 10 topic words. Hence, I
decided to divide the topic words into groups for better visibility of the trends. These
were created by dividing the topic words into 2 groups, namely pregnancy related
words (e.g. ‘birth’, ‘infant’, ‘mother’, and ‘pregnancy’) and words related to heart
disease (e.g. ‘heart’, ‘hypertension’, ‘pressure’). I discarded the words ‘week’ and
‘woman’ from the analysis as they are too general and can occur in many articles
regardless of the topic. As for the pregnancy related words, it can be seen that there
Figure 5.12: Relative frequency of topic words for topic 13-woman-heart-pregnancy
has been a steep increase in the usage of these words since 2001, but the usage has
somewhat stagnated since 2005 (see Figure 5.13). As for the words related to heart
disease, the usage of these words has steadily increased over time (see Figure 5.14).
Another observation made during this process was that by analysing the trends that
are present between the topic words, one can observe groups within a given topic.
Unlike in the example mentioned above, one can here clearly detect groups within the
topic words, as demonstrated in Figure 5.15. It is clearly visible that from 2005
onwards the topic words form three distinct groups. Based on the graph in Figure
5.15, one can divide the topic words into three groups, as shown in Examples 5.2 to
5.4.
(5.2) ‘antibody’, ‘infection’
(5.3) ‘replication’, ‘transmission’, ‘virus’, ‘antigen’
(5.4) ‘influenza’, ‘vaccination’, ‘vaccine’, ‘titer’
Among these immunology-related terms from topic 22-infection-virus-vaccine one
is able to observe logical groupings, as in Example
5.4, where the topic words ‘vaccination’ and ‘vaccine’ fall into the same semantic
category. Nonetheless, as these are low-frequency terms within the specific grouping
of topic words, it is somewhat difficult to see the changes in their usage within the
topic as a whole. Hence, I plotted the topic words from 5.4 on a separate graph
for better visibility of the temporal trends (see Figure 5.16). Unlike in Figure 5.15,
where due to the scaling of the image it appears that the relative frequency of the
terms from Example 5.4 had remained somewhat stable since 2007, in Figure 5.16 it
can be observed that the frequency of these topic words has fluctuated over time. I
applied a similar approach to visualize the topics in Example 5.3 (see Figure 5.17).
However, this visualization was not helpful to ascertain temporal trends amongst
the topic words within this group. Thus, with the means of visualisation it is
also possible to detect groups within a topic and observe how these groups evolve
temporally.
Figure 5.13: Relative frequency of pregnancy related words from topic 13
Figure 5.14: Relative frequency of heart disease related words from topic 13
It can be seen here that the internal trends within a topic can be examined by
looking at the diachronic relative frequency of the topic words. This task can be
made easier with the help of a dynamic interface (see Chapter 7).
5.4.2 Diachronic Shifts within a Topic
In the previous section it has been demonstrated that one can use the relative
frequency of the words that exist within a model to show diachronic shifts and trends
that exist within a topic. Thus, it has been shown that in using topic modeling one
can detect diachronic changes within the words of a given topic. Consequently, this
provides an answer to my second research question.
Figure 5.15: Relative frequency of topic words for topic 22-infection-virus-vaccine
Figure 5.16: Relative frequency of immunology related words from topic 22 (group 2)
Figure 5.17: Relative frequency of immunology related words from topic 22 (group 3)
5.5 Frequency of Popular Words within a Topic
A topic does not only consist of the topic words in it, but one also has to consider
other underlying trends within the articles of a given topic. These trends can be
analysed by visualizing the relative frequency of the most popular words in them. For
my analysis, I selected articles that belong to a certain topic. I selected these articles
based on their document topic probability. If a document has a topic probability
that is greater than zero for a specific topic, then it belongs to that topic. As a
consequence, a document can belong to multiple topics. Then from this subset of
articles I calculated the most frequent words. These are words that have the highest
absolute frequency in this subcorpus of articles. Then I used the relative frequency
of these popular words within the subcorpus and visualized them in a diachronic
manner.
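The selection and counting steps described above can be sketched as follows; the data structures (`doc_topics`, `doc_tokens`) and the toy data are hypothetical stand-ins for the thesis's corpus.

```python
from collections import Counter

def top_words_in_topic(doc_topics, doc_tokens, topic_id, k=10):
    """Select the documents with probability > 0 for the topic (so a
    document may belong to several topics), then return the k words
    with the highest absolute frequency in that subcorpus."""
    subcorpus = [doc_id for doc_id, probs in doc_topics.items()
                 if probs.get(topic_id, 0.0) > 0]
    counts = Counter()
    for doc_id in subcorpus:
        counts.update(doc_tokens[doc_id])
    return [word for word, _count in counts.most_common(k)]

doc_topics = {"D1": {13: 0.6}, "D2": {13: 0.2, 22: 0.3}, "D3": {22: 0.9}}
doc_tokens = {"D1": ["heart", "birth", "heart"],
              "D2": ["birth"], "D3": ["virus"]}
print(top_words_in_topic(doc_topics, doc_tokens, 13, k=2))
```

The resulting relative frequencies of these most frequent words can then be plotted per year, exactly as for the topic words.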
Figure 5.18: Relative frequency of words from topic 13 (top 1-5 words)
Figure 5.19: Relative frequency of words from topic 13 (top 6-10 words)
In Figures 5.18 and 5.19 one can see the diachronic relative frequency of the 10
most popular words in topic 13-woman-heart-pregnancy. Similarly, the diachronic
frequencies of the popular words in topic 22-infection-virus-vaccine, are displayed
in the Figures 5.20 and 5.21.
Figure 5.20: Relative frequency of words from topic 22 (top 1-5 words)
Figure 5.21: Relative frequency of words from topic 22 (top 6-9 words)
Specifically for the word ‘brain’, the diachronic relative frequencies differ signifi-
cantly between topics 13 and 22. Thus, one can assume that the articles
associated with these topics exhibit a different thematic focus, despite the usage of
common words (see Figures 5.18¹ and 5.21²).
5.5.1 Diachronic Popularity of Non-topic Word Related Terms
In the previous section, I was able to demonstrate that by using topic modeling one
can group a corpus into different topic groups using the document topic probability
(see 5.5). In these partitions, one can show diachronic trends in the most popular
words. I was able to detect diachronic changes in the usage of words within
documents that fall under a specific topic, consequently answering my third and
final research question.
¹ The word ‘brain’ is displayed by the turquoise line.
² The word ‘brain’ is displayed by the blue line.
6 Results and Discussion
6.1 Research Question Nr. 1
I could show with the help of the average topic probability that topics exhibit
diachronic trends within a corpus. I also visualized them over a period of 15 years
for certain topics. As mentioned before (see 2.3), LDA is an unsupervised machine
learning technique and there does not exist any ground truth with which I can
compare my results. Furthermore, each iteration of the model generates a different
output, which indeed makes LDA a weak candidate when it comes to replicating
the methods presented in this Master’s thesis. This means that even with the same
data and model parameters one could get different results based on the probabilities
calculated by the model. Thus, the results here are for demonstrating my hypothesis
as mentioned in the research questions. Therefore, the only way to demonstrate my
claim was with the help of visualizations, as the goal was to show diachronic trends
in the data.
LDA aids in automatic generation of subdivisions within the corpus. However, LDA
does not create the diachronic trends, as they already exist in the data set. However,
the topics made by LDA do exhibit thematic entities that are present in the corpus.
Hence, the topics exhibit semantic coherence between the topic words. This is the
case for topic 34-cancer-tumor-breast, where the topic is made up mostly of words
that are about cancer or somehow belong to cancer-related discourse. A similar
argument can also be made for topic 22-infection-virus-vaccine, where the topic
words belong to the theme of immunology. In other cases, the topics are mixed (see
2.3.1.2), but the subtopics in them demonstrate semantic coherence; for example,
topic 13-woman-heart-pregnancy is about pregnancy and heart disease. Therefore,
following the trend of such topics might be to some extent helpful; however, one
should also take the trends in the subtopics into consideration (see 6.2).
It should be emphasised that these topics are specific to the corpus and are not
about the general historical trends in the field of biomedical publication. Hence, the
results may vary based on a different subset of documents. As mentioned before
(see 4.2.4.1), the documents used for model creation only consist of 150 thousand
articles, which correspond to about 8-11% of the articles published each year, selected
at random. These topics could be of interest to anyone wishing to explore a data
set and check why there is a certain trend occurring in the corpus. It should also
be mentioned that in order to confirm the validity of a given topic one should
consult a domain expert.
6.2 Research Question Nr. 2
I was able to demonstrate diachronic shifts within a topic. These temporal changes
in the usage of topic words throughout the entire corpus show how certain terms
within a given topic can exhibit different levels of usage over time. The results also
showed that words belonging to a specific subtopic have relative frequencies that are
closer to each other.
A side effect of this approach is that one can use it to check the quality of a given
topic. As demonstrated by topic 13-woman-heart-pregnancy, vague and generic
terms that occur in a topic tend to exhibit high relative frequencies within the
corpus. The results are logical, as generic terms are expected to occur in multiple
documents.
In some cases, the topic words with low relative frequencies show an interesting
property. If these topic words are similar to each other, then they tend to be
grouped together, and can be detected in the visualization. This was the case for
topic 22-infection-virus-vaccine, where similar words had similar diachronic relative
frequencies. Hence, one can use the diachronic visualisation to detect thematic
subdivisions within a topic.
6.3 Research Question Nr. 3
Using LDA created corpus subdivisions, one can also check trends of words within
a topic. This approach as well cannot be quantified with the help of some pre-
existing ground truth as the results are entirely corpus dependent. However, if one
looks at the topics and the trends of the popular words within the topic related
subcorpus, it can be said that this approach is useful for analysing trends within
the data. For topic 13-woman-heart-pregnancy one observes that the usage of the word
‘brain’ is fairly high. Whereas, if one compares this topic, which is about pregnancy
and heart disease, with topic 22-infection-virus-vaccine, one observes a
different relative frequency of the word ‘brain’ and a very different diachronic trend
as well. This shows the different thematic focus within the topics, and that the most
popular words in each topic are different. Furthermore, the difference in diachronic
trends of the relative frequencies of the words that are common between certain
topics emphasize the temporal thematic focus of these topics. Hence, it is possible
to detect diachronic changes in the usage of words in a specific topic.
6.4 Summary
In summary, it can be said that even though the three aforementioned approaches
cannot be quantified, the results exhibit clear trends, which appear to be logical,
based on the topics that I analysed in detail. Visualisations are a key to judging
the viability of these diachronic trends. As an exploratory approach into analysing
the underlying trends in unlabelled data, visualisations are adequate to judge the
validity of the claims made in the research questions. As the aim of all three research
questions was to demonstrate visible diachronic trends in the data, a quantifiable
measure is not yet required. However, for future work in this domain, it would be
advisable to find some metric which quantifies the trends shown in the visualisations.
7 Website
As it is quite challenging to view the results of this Master’s thesis in a non-interactive
interface, I have built a companion website for diachronic topic visualization
(DiaTopVis), where one can access and view the data interactively. This section
introduces the companion website designed to visualize the results of my Master’s
thesis. Also in this section, the features and functionalities of this website are ex-
plained. The aim of this website is to provide the user with an easy to use interface
for exploring the topics and the underlying themes that exist within them. The in-
spiration for this project is based on the work done by Wang and McCallum [2006]
and Song et al. [2014]. The interface was likewise inspired by a previous project
done by me (Ghoshal et al. [2017]). The website consists of three primary sections
each providing an interactive interface to explore the answers to the three research
questions tackled in this paper. There are also sections where one can view the topic
models and the topics in them. Finally, there is a help section that assists the user
with website navigation and orientation, and provides explanations for each section.
7.1 Generating Charts
The graphs on the website are created using the C3 library¹. C3 is a JavaScript
library (based on D3²) that can be used to generate multiple formats of charts. An
advantage of C3 is that the graphs are interactive. The user can select or remove a
line within the chart by clicking on the legend that is associated with the line. In
addition, the y-axis is dynamic and immediately responds to the addition or removal
of a given line. Furthermore, by placing the cursor on any given point in the chart
the user is provided with all the values for a chosen cross section on the y-axis.
¹ http://c3js.org/examples.html
² https://d3js.org/
7.2 Website Sections
7.2.1 Observing Diachronic Trends in Topics
This section of the website provides the user with a tool to interactively observe the
average topic probabilities of a chosen topic. On the website this section, labelled as
‘Part 1: Generate diachronic topic distribution’, generates diachronic topic models.
In this section the user is guided through a series of steps which, when followed
correctly, result in a graph showing
the average topic probability of documents for a chosen number of topics
and a specified time range. In ‘Step 1’ the user can choose between multiple topics,
the choice being shown in check box format. In ‘Step 2’ the user specifies what time
range should be looked at. Finally, ‘Step 3’ generates the topic distribution (see
Figure 7.1 and 7.2).
Figure 7.1: Website: Part 1 User options
Figure 7.2: Website: Part 1 Example output for topics 2,3,4,5
7.2.2 Generate Frequency of Topic Words in the Corpus
In this section, the website provides a tool to look at the distribution of the topic
words in the corpus. On the website, this section is called ‘Part 2: Generate topic
words distribution’. In a first step the user can choose a topic from a drop-down
menu. In a second step the user specifies what time range should be looked at. The
final step, again, generates the topic distribution. The graphs here are similar to the
ones generated in 5.4.1. The x-axis shows the year and the y-axis shows the relative
frequency of the topic words in the entire corpus that was used to create this model.
After the graph is generated it shows the relative frequencies for all ten topic words.
As this view can be a little cluttered, the user can remove some of the topic words
by clicking on the name in the legend below. The advantage of this section is that
one has the opportunity to set a time scale and toggle the number of topic words
that one wishes to visualize only with a few clicks of a button (see Figure 7.3).
Figure 7.3: Website: Part 2 Example output for topic 13 (topics shown partially)
7.2.3 Frequency of Popular Words within a Topic
This part of the website can be used to visualize the popular words that occur in
documents that belong to a specific topic. This section is called ‘Part 3: Generate
distribution of top word(s) in a topic’ on the website. The graphs generated in this
section of the website are the ones that were used to demonstrate the popularity
of non-topic related terms (see 5.5). In this third part of the website, the user can
choose a specific topic from the drop-down menu. Then they can select the years for
which they would like to visualize the topics. As a next step, the user can choose the
range of popular words they wish to view. The range of popular words that shall
be depicted can be set using two HTML input buttons with a step attribute¹. That
means, for example, if the user sets the range of words to be shown between 2 and
4, then the generated graph will show the results for the second, third and fourth
most popular words. Finally, in a last step, the user can simply press a button to
generate the relative frequency of the words from a topic subcorpus. These are the
three key-sections of the website where the user has the opportunity to explore the
topic models with an interactive interface (see Figure 7.4).
Figure 7.4: Website: Part 3 Example output for topic 13 (top 2-5 words shown)
¹ https://www.w3schools.com/tags/att_input_step.asp
8 Diachronic Topic Modeling Pipeline
In the methodology section, I mentioned the steps taken in order to arrive at the
final topic model which was used to answer the research questions. This section has
a particular focus on the pipeline that takes as an input PubMed XML documents
and returns topic models, as well as the CSV files that can be used to create the
visualizations that I used for answering my research questions. Figure 8.1 shows the
structure of the pipeline, which is explained in the following sections.
8.1 Data Extraction
8.1.1 Extract Metadata
The first part of the pipeline consists of data extraction. In the data extraction part
a script reads the PubMed XML file and extracts any required metadata from the
file. For the purpose of this Master’s thesis only the information about the year of
publication was extracted from the file. It is however possible to extend the script
in order to allow the extraction of other metadata. The information about the year
of publication is saved to a separate file. The metadata information will be used in
the data mapping section (see 8.3).
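A minimal sketch of such a metadata extraction step is shown below. The element path `pub-date/year` is an assumption about the PubMed XML layout; the script used in the actual pipeline may traverse the file differently.

```python
import xml.etree.ElementTree as ET

# Minimal sketch; the element path is an assumption about the PubMed XML layout.
def extract_year(xml_string):
    """Return the publication year found in the article XML, or None."""
    root = ET.fromstring(xml_string)
    year = root.find(".//pub-date/year")
    return int(year.text) if year is not None else None

sample = "<article><front><pub-date><year>2014</year></pub-date></front></article>"
print(extract_year(sample))  # → 2014
```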
8.1.2 Extract Text
In the text extraction part of the pipeline, the article text from the corpus undergoes
different levels of text preprocessing and filtering. These processes are explained in
detail in the following sections (see 8.1.2.1 and 8.1.2.2).
8.1.2.1 Text Pre-processing
Here the article text is extracted from the XML file. Then the article text is divided
into sentences using a sentence splitter. As a next step these sentences are tokenized
[Figure 8.1 is a flow chart of the pipeline. Data Extraction comprises Extract Metadata and Extract Text; the latter consists of Token Pre-processing (Sentence Segmentation, Word Tokenization, optional POS Tagging, Token Lemmatization) and Token Filtering (remove stopwords, punctuation and numbers; remove short tokens, e.g. tokens with less than 3 characters; optionally keep only words with specific POS tags; optional user-defined criteria). This feeds into Corpus Creation, then LDA Topic Modeling (Create LDA Dictionary, optionally Edit LDA Dictionary, Create LDA Corpus, Create LDA Model), then Data Mapping, and finally Output: Tabular Data.]
Figure 8.1: Diachronic topic modeling pipeline
into words. There is an optional step in which one can filter the results based on POS tagging; this step is discussed in section 8.1.2.1.1. Then the tokens in the sentences are lemmatized.
8.1.2.1.1 POS Tagging of the Corpus
This is an optional part of the text extraction section where one can do a POS
tagging of the corpus and chose to keep only the tokens that fall under a chosen set
of POS tags which is defined by the user.
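The pre-processing steps above can be sketched with simple regex-based stand-ins. These are illustrative only; the thesis pipeline could rely on a toolkit such as NLTK for sentence splitting, tokenization, POS tagging and lemmatization, and `split_sentences`/`tokenize` are hypothetical names.

```python
import re

# Illustrative stand-ins for the pipeline's pre-processing steps.
def split_sentences(text):
    """Crude sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase and keep only alphanumeric word tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence.lower())

text = "Topic models are useful. They summarise large corpora."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(tokens)
# → [['topic', 'models', 'are', 'useful'], ['they', 'summarise', 'large', 'corpora']]
```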
8.1.2.2 Token Filtering
In this section, the corpus is reduced: stopwords, punctuation and numbers are removed from it. It is also possible to remove tokens that are below a certain length. For example, for my corpus I only kept tokens that had at least three characters in them. The output of this section
is sent to the corpus creation part of the pipeline (see 8.1.3).
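The filtering rules above can be sketched as follows. The stopword set is an illustrative subset, and `filter_tokens` is a hypothetical name; the pipeline's real stopword list and implementation may differ.

```python
STOPWORDS = {"the", "a", "of", "and", "in", "is"}  # illustrative subset only

def filter_tokens(tokens, min_length=3, stopwords=STOPWORDS):
    """Drop stopwords, numbers, punctuation tokens and tokens shorter than min_length."""
    return [t for t in tokens
            if t not in stopwords
            and not t.isdigit()
            and t.isalnum()          # drops pure-punctuation tokens
            and len(t) >= min_length]

print(filter_tokens(["the", "lda", "model", "of", "42", "a", "corpus"]))
# → ['lda', 'model', 'corpus']
```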
8.1.3 Corpus Creation
In this section of the pipeline, the data from the token filtering process is converted
into a Python list. The list is then saved as a JSON file. The purpose of creating a
corpus at this stage is twofold. Firstly, the corpus will be used to create the LDA
model (see 8.2.4), and secondly, at a later stage this corpus will also be used to
create word frequencies.
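Serialising the filtered documents as JSON can be sketched in a few lines. The file name and the sample documents are illustrative, not the pipeline's actual data.

```python
import json
import os
import tempfile

# Filtered, tokenised documents (illustrative data).
docs = [["topic", "model", "corpus"], ["gene", "expression", "cell"]]

# Save the Python list as a JSON file so later stages can reuse it.
path = os.path.join(tempfile.gettempdir(), "corpus.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(docs, f)

# Reloading yields the identical list of token lists.
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded == docs)  # → True
```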
8.2 LDA Topic Modeling
In this section, certain aspects such as corpus creation, dictionary creation and LDA
model creation are explained. An advantage of this part of the pipeline is that the
user can always reuse the data from the previous section to generate different models.
This data recycling aspect can be useful as it saves time.
8.2.1 Dictionary Creation
The files from the previously created corpus are used to create a dictionary that
will be used by the LDA model. This dictionary is created by Gensim and contains
information about the word frequencies in the corpus. At a later stage, this dictionary serves as one of the input parameters for the corpus creation that is required by Gensim to make an LDA model.
8.2.2 Editing the Original Dictionary
As mentioned multiple times in the methodology section (see Chapter 4), the dictionary plays a key role in determining the quality of the topic model. High frequency words in the dictionary may lead to the creation of identical topics and subsequently to a poor quality topic model. However, creating a new dictionary from a corpus can be a time-consuming endeavour. For this reason, this pipeline
contains a dictionary reduction section where the user can specify which parts of
the vocabulary should be reduced (e.g. high frequency words, low frequency words
or words with certain features that can be user-defined). The output is a reduced
dictionary which will be used to create the corpus.
8.2.3 LDA Corpus Creation
In this section Gensim creates a corpus based on the frequencies calculated by the
aforementioned dictionary (original or reduced). The corpus itself is not human-readable but contains information about word locations and frequencies in the documents.
8.2.4 LDA Model Creation
This part of the pipeline creates the LDA model. Gensim takes as an input the
dictionary and the LDA corpus and generates the model based on them. Here the
user has the option to set certain parameters of the model, such as the number of topics, the chunk size or the number of passes. Based on this information, Gensim creates an LDA model. From the output of the LDA model one can extract the
topics. Moreover, the model has additional information such as document topic
probability for all the documents that were used to create the corpus. The topics
as well as the document topic probabilities will be used to generate the diachronic
topic information for some parts of the desired output.
8.3 Data Mapping
In order to generate output that would answer my research questions, it is necessary
to map the information that was generated in the data extraction section (see 8.1)
and the LDA topic modeling section (see 8.2). The key issue here is mapping the documents in the corpus that was created during the data extraction process (see 8.1.3) to the document topic probabilities that were calculated during the LDA
model creation process (see 8.2.4).
8.3.1 Mapping: Creating Yearly Average Topic Probability
For calculating the average topic probability for every year, I first map the document IDs used by Gensim to the metadata information that was extracted before (see 8.1.1). Then I calculate the topic probabilities for all the documents in this corpus. This data is saved as a CSV file for later use. The average topic probabilities for every year are calculated using the data from this file and are saved for future use such as visualization.
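The averaging step can be sketched as below. `doc_years` and `doc_topic_probs` are hypothetical stand-ins for the mapped metadata and the per-document topic distributions; the real pipeline reads these from the CSV file.

```python
from collections import defaultdict

# Hypothetical mapped data: doc_years[i] is the publication year of document i,
# doc_topic_probs[i] maps topic id -> probability for document i.
doc_years = [2010, 2010, 2011]
doc_topic_probs = [{0: 0.8, 1: 0.2}, {0: 0.4, 1: 0.6}, {0: 0.1, 1: 0.9}]

# Collect per-topic probabilities by year, then average them.
yearly = defaultdict(lambda: defaultdict(list))
for year, probs in zip(doc_years, doc_topic_probs):
    for topic, p in probs.items():
        yearly[year][topic].append(p)

avg = {year: {t: sum(ps) / len(ps) for t, ps in topics.items()}
       for year, topics in yearly.items()}
print(round(avg[2010][0], 2))  # → 0.6 (mean of 0.8 and 0.4)
```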
8.3.2 Mapping: Generating Relative Frequencies for Topic Words
For calculating the diachronic trends of the topic words, I also map the document IDs to the metadata. Then, for each of the topic words, I extract their occurrences from the corpus files (see 8.1.3) and calculate the yearly relative frequency for those topic words using the method mentioned in section 5.4.1. The output is saved as a CSV file for visualization purposes.
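The yearly relative frequency of a word can be sketched as its count divided by the total number of tokens published that year. `docs_by_year` and the sample documents are illustrative, not the thesis data.

```python
# Tokenised documents grouped by publication year (illustrative data).
docs_by_year = {
    2010: [["gene", "cell", "gene"], ["cell", "model"]],
    2011: [["gene", "topic"]],
}

def yearly_relative_frequency(word):
    """Relative frequency of `word` per year: count / total tokens in that year."""
    out = {}
    for year, docs in docs_by_year.items():
        tokens = [t for doc in docs for t in doc]
        out[year] = tokens.count(word) / len(tokens)
    return out

print(yearly_relative_frequency("gene"))  # → {2010: 0.4, 2011: 0.5}
```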
8.3.3 Mapping: Generating Relative Frequencies for Popular
Words in Topic Subcorpora
Finally, for calculating the relative frequencies of words, I divide my corpus into multiple subcorpora, where each subcorpus represents a topic. For the documents that belong to a specific subcorpus, I count all the words and their absolute frequencies. Then I select the top 250 words with the highest absolute frequency. For these words, I calculate their relative frequencies using the same method as in the previous section (8.3.2). Again, the output is saved as a CSV file for visualization purposes.
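Selecting the most frequent words of a topic subcorpus can be sketched with `collections.Counter`. The subcorpus here is illustrative, and it uses a cut-off of 2 instead of the thesis's 250.

```python
from collections import Counter

# Subcorpus: all tokenised documents assigned to the topic under study (illustrative).
subcorpus = [["gene", "cell", "gene"], ["cell", "gene", "model"]]

counts = Counter(token for doc in subcorpus for token in doc)
# The thesis keeps the 250 most frequent words; 2 here for illustration.
top_words = counts.most_common(2)
print(top_words)  # → [('gene', 3), ('cell', 2)]
```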
8.4 Other Functions
This pipeline also contains other functions that can be useful to judge the quality
of a topic model. One function calculates the topic similarity of multiple models based on the number of shared words between them. Another function calculates the inter-topic similarity of a single topic model, which can be used as a measure of the quality of that model.
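A shared-word similarity between two topics, each given as its top-word list, could be sketched as a Jaccard overlap. The exact measure used in the pipeline is not specified, so this is an assumption, and the word lists are illustrative.

```python
# Sketch of a shared-word similarity between two topics (assumed measure).
def topic_overlap(words_a, words_b):
    """Jaccard similarity of two topic word lists."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

t1 = ["gene", "cell", "mutation", "expression"]
t2 = ["gene", "cell", "protein", "pathway"]
print(topic_overlap(t1, t2))  # → 0.3333333333333333 (2 shared of 6 distinct words)
```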
8.5 Summary
The pipeline described in this section helps the user create their own topic models
and CSV-tables for data visualisation. Unlike off-the-shelf topic modeling tools,
here the user is provided with some predefined pre-processing functions. Due to
the modular structure of this pipeline, it is possible to skip certain steps. This is
true for the data extraction part where one can skip the part of speech tagging.
In the topic modeling section, the modular structure supports the user by saving
time, as the output generated in the corpus creation (see 8.2.3) and dictionary creation (see 8.2.1) sections can always be reused by the LDA model (given these were created
from the same original corpus). Furthermore, the data mapping structure helps the
user to identify the original files from which the LDA model was created. This is
a clear advantage over the current version of Gensim, which does not keep track of
the original input files used to create the corpus and the dictionary.
9 Conclusion
In this Master’s thesis, I implemented topic modeling on biomedical texts to detect
temporal trends within the data. I took a systematic approach towards topic modeling of biomedical texts: by conducting a series of experiments, I arrived at the desired topic model, which was then used to answer my research questions.
For my first research question, I was able to demonstrate temporal trends within
the topics that were generated using an LDA model. With the means of the average
topic probability, temporal fluctuations in the popularity of the topics were observed.
For my second research question, I then delved deeper into the topics themselves
in order to investigate the impact of the topic words throughout the corpus. For
this I looked at the relative frequency of the topic words and observed temporal
trends within them. Moreover, I was able to observe diachronic groupings of topic
words. These groupings showed tendencies about the quality of the topic words.
Semantically similar topic words tend to be grouped together, whereas generic words tend to exhibit a high frequency and are grouped further away from the semantically coherent topic words.
For the third research question, I observed temporal trends using topic modeling
within documents that belong to a specific topic. I was able to demonstrate that
the popular words within a subcorpus of documents belonging to a specific topic
tend to have different diachronic relative frequencies. Moreover, the top 250 most
popular words within the corpus tend to be different as well, based on the topic they
are representing.
As a further feature, I created a website to demonstrate the results that have been
found. The website has a general three part structure, where each part concentrates
on the results of one of the research questions.
Finally, I was also able to create a pipeline that would enable the user to create
similar results as shown in this paper. The pipeline contains a text extraction
component with the focus on text preprocessing and a topic modeling component.
The aim of the pipeline is to facilitate the topic modeling process of PubMed XML
files and to map elements from the corpus document metadata and information from the topic model together to create data tables that can be used by researchers.
9.1 Future Work
In the scope of this Master’s thesis I was only able to scratch the surface of diachronic
topic modeling using a large corpus of biomedical texts. I would like to improve my current work, especially the website, so that it also shows the articles with the highest topic probabilities for a specific time frame. I aim to focus on finding the most relevant documents for a specific topic, which I was not able to do for this project, as I did not have access to domain experts who could have evaluated the output of my system.
In future, I plan on gathering articles on a specific topic from only a few sources and trying to match the temporal trends with actual historical developments in the field. For example, for articles belonging to the general theme of immunology, one could try to match the trends with historical events such as major outbreaks of diseases and discoveries of cures. I was not able to implement such an aspect for this project, as I first needed to test the viability of the framework.
References
S. Bhatia, J. H. Lau, and T. Baldwin. Automatic Labelling of Topics with Neural
Embeddings. ArXiv e-prints, Dec. 2016.
S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python.
O’Reilly Media, Inc., 1st edition, 2009. ISBN 0596516495, 9780596516499.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3(Jan):993–1022, 2003.
P. P. Bonissone. Machine Learning Applications, pages 783–821. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2015. ISBN 978-3-662-43505-2. doi:
10.1007/978-3-662-43505-2_41. URL
http://dx.doi.org/10.1007/978-3-662-43505-2_41.
J. Boyd-Graber, D. Mimno, and D. Newman. Care and Feeding of Topic Models:
Problems, Diagnostics, and Improvements. CRC Handbooks of Modern
Statistical Methods. CRC Press, Boca Raton, Florida, 2014. URL
docs/2014_book_chapter_care_and_feeding.pdf.
J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea
leaves: How humans interpret topic models. In Advances in neural information
processing systems, pages 288–296, 2009.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman.
Indexing by latent semantic analysis. Journal of the American society for
information science, 41(6):391, 1990.
S. ElShal, M. Mathad, J. Simm, J. Davis, and Y. Moreau. Topic modeling of
biomedical text. In 2016 IEEE International Conference on Bioinformatics and
Biomedicine (BIBM), pages 712–716, Dec 2016. doi:
10.1109/BIBM.2016.7822606.
P. Ghoshal, J. Goldzycher, and S. Clematide. Bbdia: Diachronic visualisation of
semantically related n-grams using word embeddings. Conference Poster, June
2017. Poster presentation at SwissText 2017: 2nd Swiss Text Analytics
Conference.
T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd
annual international ACM SIGIR conference on Research and development in
information retrieval, pages 50–57. ACM, 1999.
A. Holzinger, J. Schantl, M. Schroettner, C. Seifert, and K. Verspoor. Biomedical
Text Mining: State-of-the-Art, Open Problems and Future Challenges, pages
271–300. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. ISBN
978-3-662-43968-5. doi: 10.1007/978-3-662-43968-5_16. URL
http://dx.doi.org/10.1007/978-3-662-43968-5_16.
K. Hornik and B. Grün. topicmodels: An R package for fitting topic models.
Journal of Statistical Software, 40(13):1–30, 2011.
C.-C. Huang and Z. Lu. Community challenges in biomedical text mining over 10
years: success, failure and the future. Briefings in Bioinformatics, 17(1):
132–144, 2016. doi: 10.1093/bib/bbv024. URL
http://bib.oxfordjournals.org/content/17/1/132.abstract.
J. H. Lau, K. Grieser, D. Newman, and T. Baldwin. Automatic labelling of topic
models. In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11,
pages 1536–1545, Stroudsburg, PA, USA, 2011. Association for Computational
Linguistics. ISBN 978-1-932432-87-9. URL
http://dl.acm.org/citation.cfm?id=2002472.2002658.
E. Leopold. Models of Semantic Spaces, pages 117–137. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2007. ISBN 978-3-540-37522-7. doi:
10.1007/978-3-540-37522-7_6. URL
http://dx.doi.org/10.1007/978-3-540-37522-7_6.
A. K. McCallum. MALLET: A Machine Learning for Language Toolkit.
http://mallet.cs.umass.edu, 2002.
D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing
semantic coherence in topic models. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing, EMNLP ’11, pages 262–272,
Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN
978-1-937284-11-4. URL
http://dl.acm.org/citation.cfm?id=2145432.2145462.
D. Ramage and E. Rosen. Stanford TMT, 2009. URL
http://nlp.stanford.edu/software/tmt/tmt-0.4/.
R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large
Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for
NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
http://is.muni.cz/publication/884893/en.
M. Song, S. Kim, G. Zhang, Y. Ding, and T. Chambers. Productivity and
influence in bioinformatics: A bibliometric analysis using PubMed Central.
Journal of the Association for Information Science and Technology, 65(2):
352–371, 2014. ISSN 2330-1643. doi: 10.1002/asi.22970. URL
http://dx.doi.org/10.1002/asi.22970.
I. Titov and R. T. McDonald. A Joint Model of Text and Aspect Ratings for
Sentiment Summarization. In ACL, volume 8, pages 308–316. Citeseer, 2008.
A. J. van Altena, P. D. Moerland, A. H. Zwinderman, and S. D. Olabarriaga.
Understanding big data themes from scientific biomedical literature through
topic modeling. Journal of Big Data, 3(1):23, 2016. ISSN 2196-1115. doi:
10.1186/s40537-016-0057-0. URL
http://dx.doi.org/10.1186/s40537-016-0057-0.
H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods
for topic models. In Proceedings of the 26th annual international conference on
machine learning, pages 1105–1112. ACM, 2009.
H. Wang, Y. Ding, J. Tang, X. Dong, B. He, J. Qiu, and D. J. Wild. Finding
complex biological relationships in recent PubMed articles using Bio-LDA.
PLOS ONE, 6(3):1–14, 03 2011. doi: 10.1371/journal.pone.0017243. URL
https://doi.org/10.1371/journal.pone.0017243.
X. Wang and A. McCallum. Topics over Time: A non-Markov Continuous-time
Model of Topical Trends. In Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’06,
pages 424–433, New York, NY, USA, 2006. ACM. ISBN 1-59593-339-5. doi:
10.1145/1150402.1150450. URL
http://doi.acm.org/10.1145/1150402.1150450.
X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In
Proceedings of the 29th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 178–185. ACM, 2006.
Y. Yang, S. Pan, J. Lu, M. Topkara, and Y. Song. The Stability and Usability of
Statistical Topic Models. ACM Trans. Interact. Intell. Syst., 6(2):14:1–14:23,
July 2016. ISSN 2160-6455. doi: 10.1145/2954002. URL
http://doi.acm.org/10.1145/2954002.
B. Zheng, D. C. McLean, and X. Lu. Identifying biological concepts from a
protein-related corpus with a probabilistic topic model. BMC Bioinformatics, 7
(1):58, 2006. ISSN 1471-2105. doi: 10.1186/1471-2105-7-58. URL
http://dx.doi.org/10.1186/1471-2105-7-58.
P. Zhu, J. Shen, D. Sun, and K. Xu. Mining meaningful topics from massive
biomedical literature. In 2014 IEEE International Conference on Bioinformatics
and Biomedicine (BIBM), pages 438–443, Nov 2014. doi:
10.1109/BIBM.2014.6999197.
A Tables
Topic number  Topic words
1   strain growth medium culture production bacteria coli yeast mutant plate
2   sequence genome specie family position domain alignment amino tree database
3   genotype marker locus frequency polymorphism variant allele trait variation haplotype
4   mutation mutant embryo phenotype deletion stage loss mouse defect domain
5   parameter correlation probability error prediction estimate feature performance equation simulation
6   specie specimen margin length view dorsal head genus surface seta
7   specie water soil temperature community season abundance habitat climate diversity
8   frequency phase input stimulation stimulus neuron amplitude noise power electrode
9   transcript file probe mirnas microarray array mrna transcription pathway mirna
10  surface temperature particle energy solution layer water film property nanoparticles
11  antibody vector construct sequence plasmid transfection domain buffer blot lane
12  activation inhibitor phosphorylation pathway kinase apoptosis inhibition receptor death growth
13  woman heart pregnancy pressure birth hypertension infant delivery mother week
14  food insulin cholesterol obesity intake consumption diabetes weight alcohol mass
15  bone fracture cartilage spine knee teeth joint surface root pain
16  mouse antibody medium hour culture section animal week serum plate
17  migration adhesion microtubule actin formation image localization junction motility filament
18  brain mouse neuron cortex animal cord motor hippocampus nerve matter
19  surgery injury complication pain technique nerve catheter pressure operation vein
20  promoter methylation transcription histone chromatin chip element regulation modification motif
21  membrane fluorescence vesicle fusion microscopy mitochondrion transport image localization surface
22  infection virus vaccine antibody vaccination antigen replication titer influenza transmission
23  trial month therapy week baseline event outcome intervention medication cohort
24  network cluster module node database edge user tool feature file
25  domain molecule chain bond ligand conformation energy crystal position atom
26  child participant student school family parent people question experience behavior
27  care intervention service practice program management community provider staff hospital
28  lung liver platelet respiratory mortality fibrosis admission sepsis count fluid
29  score disorder item scale symptom depression pain questionnaire anxiety correlation
30  adult host male female larva stage mosquito fitness generation selection
31  chromosome cycle replication damage repair phase focus testis irradiation follicle
32  drug administration agent injection mg/kg combination toxicity dose efficacy therapy
33  lesion diagnosis biopsy examination tumor resection recurrence month mass feature
34  cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy
35  vessel segment artery wall plaque toxin aggregation formation flow lesion
36  serum kidney plasma vitamin donor biomarkers correlation creatinine marker laboratory
37  plant leaf seed root rice arabidopsis growth fruit wheat stage
38  infection resistance bacteria isolates strain culture pathogen phage tuberculosis coli
39  review search quality paper publication criterion strategy literature science heterogeneity
40  task trial participant memory stimulus performance word session face block
41  image volume measurement sensor intensity position field detection device resolution
42  primer reaction sequence product amplification cycle clone polymerase extraction detection
43  mouse macrophage cytokine production antibody inflammation activation receptor immune antigen
44  differentiation stem culture collagen proliferation marker growth progenitor medium fibroblast
45  stress metabolism acid enzyme iron production synthesis pathway oxygen reaction
46  receptor channel release calcium activation neuron stimulation solution action voltage
47  muscle exercise movement force motor hand training strength limb fiber
48  compound solution acid reaction water mass mixture extract spectrum fraction
49  country prevalence cost survey mortality status death hospital proportion household
50  exposure skin smoking smoker tobacco cigarette worker hair product meta
Table A.1: All 50 topics from final model
Curriculum Vitae
Personal Details
Parijat Ghoshal
Lerchenberg 31
8046 Zürich
Education
2015 Bachelor of Arts in English Languages and Literatures
at University of Bern
since 2015 M.A. studies in Multilingual Text Analysis at the
University of Zurich
Professional Activities
July 2016 – September 2016 Machine Learning Intern at Neue Zürcher Zeitung
since October 2016 Data Scientist at Neue Zürcher Zeitung
Selbstständigkeitserklärung (Declaration of Authorship)