
A practical application of Topic Mining on disaster data

Robert Monné
Master of Business Informatics
Capita Selecta (7.5 ECTS)
January 2016


Intro

As described in our previous research (Monné, van den Homberg & Spruit, 2016), a disaster situation poses many information challenges, one of which is the vast amount of unstructured information. We want to solve a small piece of that puzzle by creating a method to quickly analyze unstructured documents; this analysis can then be used to extract the most relevant information for a specific audience.

In earlier research we identified 84 information needs, which describe the information required by disaster responders in a flood situation in Bangladesh. We use these in the current research to extract relevant information from the unstructured data. A disaster puts time pressure on the decisions required for an effective response, and therefore on the timely retrieval of the information needed for those decisions. In our (and similar) contexts, NGOs and governments produce PDF reports that describe the disaster situation, or the situation before the disaster. These documents span hundreds of pages and are therefore not easily handled in a disaster situation, yet they contain a large amount of information that is useful to the disaster responder. We want to use algorithms to quickly extract the information the disaster responder requires.

Scope

There are multiple ways to analyze textual data: we can read the text manually, or we can try to analyze it automatically. One of the research fields focusing on the latter is Text Analytics. We briefly introduce the field to scope our research.

Text analytics algorithms can be applied in several ways. The first is clustering, which calculates the distances between documents and then tries to group them into coherent clusters. The second is classification, which automatically assigns documents to separate classes based on their characteristics. There is also sentiment analysis, which tries to find the opinions and sentiments of the writer of a text; this could, for example, concern the writer's attitude towards specific functionality of a laptop. This overview merely positions our paper and does not aim to be an exhaustive list.

Finally, there is topic mining, which is the process of extracting words from a text that have a high probability of representing the core of the document. We chose this approach to experiment with in our prototype.

Literature

We used the text analytics course provided by the University of Illinois, available on Coursera at https://www.coursera.org/course/textanalytics. This course gave us a clear understanding of the field and pointed us in the direction of topic mining. We used the tm and quanteda packages and the xpdf software to pre-process the data, and the topicmodels package to extract the topics; references for these can be found in the reference section.

Research Goal

Our goal is to create a replicable, practice-oriented text mining method that can be applied across cases. Users of our method would know beforehand which practical challenges and considerations occur when applying a topic model. In addition, we want to apply topic mining algorithms in a prototype, to validate the applicability of our method and the related techniques for this specific situation. This prototype can be further developed and applied in similar (disaster) situations: in a new situation, only the input data needs to be pre-processed in the same way as we did, after which the script can predict topics for the text segments.


Research Method

We conducted experimental and exploratory research, using the CRISP-DM process to help us determine the steps to take in a data-oriented experiment. We specifically do not aim to create new algorithms; we merely want to apply readily available ones in a new context. Based on the keywords for the techniques we found in the Coursera course, we used a plain internet search to find packages that matched the functionality we required. From this experiment we deduced the lessons learned and created a replicable method for text mining projects.

To implement the topic model in our experiment we used RStudio, a powerful user interface for R. R is a functional language used for statistical computing and graphics.

Results

Our method can be found in Figure 1, and is described and validated with an example in the following sections.

Document retrieval

The first step is to identify and retrieve the documents of interest. Two documents were already identified in earlier research (Monné et al., 2016); these documents cover a large part of the information needs of disaster responders. In our case the documents were downloaded and saved to a local hard disk, but for larger cases we envision many more documents, which could, for example, be stored in a document-store database.

Pre-processing

First the documents need to be converted to a format that is easily handled; the format in which we downloaded them (PDF) is not usable. We decided to convert the documents to TXT, the most basic format of text representation. For this step we used the tm package for R, which first requires creating a corpus from the PDF documents. The PDFs are converted to the corpus using the xpdf software, which is available online for free. This software only needs to be unzipped on your machine; you then add the respective folder to your Windows system path, so RStudio can find it. After the corpus is created we write it out completely to TXT files, because in the next step we use a different package that cannot handle a tm corpus.

library(tm)

setwd("C:/R/")
fp <- file.path(".", "docs")

# Read all PDFs in ./docs into a tm corpus; readPDF relies on the xpdf
# tools being available on the system path.
corp <- Corpus(DirSource(fp), readerControl = list(reader = readPDF))

# Write every document out as a plain TXT file for the next step.
writeCorpus(corp, "C:/R/preprocessed")

[Figure 1. Stepwise text mining process: document retrieval, pre-processing, split documents, process segments, fitting the model, process results, validate results.]


Split documents

There are multiple ways we can look at the data, ranging from a very high-level perspective to a very low and granular one. At the highest level we can identify two units of interest in our case, namely the District Disaster Management Plan (DDMP) and the Joint Needs Assessment (JNA). We could try to identify the topic of each whole document based on its contents; however, this analysis would yield no useful insight, since we already know the topics of these documents. We could also divide the documents solely by page, so an 80-page document would yield 80 units to be tagged. However, we deem this division impractical, since related information can be split by a page break, which increases the probability of incorrectly assigning a topic to the text. The smallest unit would be a single word, which is not feasible since the information needs are far more complex than single words. Tagging individual sentences would yield a sufficient sample size, since the number of sentences in the documents is fairly large (a couple of thousand). We chose to use the information encapsulated in the table of contents to divide the documents into portions that can be tagged. This way we are sure that all the information in a unit is related, and it leaves an interesting number of analyzable elements (116 to be precise).

So we wanted to split the corpus based on the table of contents. We extracted and manually cleaned the table of contents to create a usable format for the splitting (we removed page numbers and separated the headings with commas). Unfortunately, we discovered that the table-of-contents headings do not exactly match the headings in the text. We therefore chose to manually copy the relevant split points (i.e. the headings as they appear inside the text) and store them in a CSV file, which is used to actually segment the texts. We also needed to manually adjust some text, because the string "Annexure 2" occurs multiple times in the document (for example also in "Annexure 23"), resulting in wrong splits. We therefore modified "Annexure 2" to "Annexure 2-1" in the pre-processed TXT file, which yields a unique string that can conveniently be used for splitting.

The algorithm works as follows: first we read in the text from the earlier pre-processing step and convert it to a quanteda corpus. Then we use the CSV file described above to segment the corpus with a for-loop (and print some status indicators). We clean up our workspace with the rm function. Finally we use the CSV split file to label the documents and clean up the workspace again. We also found out that our documents are encoded in a non-standard format, which initially produced wrong results.

library(quanteda) # segmenting into blocks

# import the JNA in quanteda format for splitting into blocks
JNAtxt <- textfile("C:/R/preprocessed/JNA.pdf.txt", encodingFrom = "ASCII")
qcorpJNA <- corpus(JNAtxt)

# splitting the JNA on the split points stored in the CSV file
splitsJNA <- read.csv("C:/R/splitpoints/JNAsplits2.txt", header = T)
JNASplitted <- qcorpJNA
for (i in 1:length(splitsJNA[, 1])) {
  JNASplitted <- segment(JNASplitted, "other", delimiter = toString(splitsJNA[i, 1]))
  # status indicators
  print("i")
  print(i)
  print("length after")
  print(length(JNASplitted$documents[, 1]))
}
rm(i)
rm(JNAtxt)
rm(qcorpJNA)

# creating names (labels) for the resulting documents
JNAnames <- c("JNA intro")
for (i in 1:length(splitsJNA[, 1])) {
  JNAnames <- c(JNAnames, paste("JNA", toString(splitsJNA[i, 1])))
}
docnames(JNASplitted) <- JNAnames
rm(i)
rm(splitsJNA)
rm(JNAnames)

We used the same algorithm to segment the District Disaster Management Plan; only the input file and the split points differ.

Process segments

To fit a model that can predict the topic of the segments, we need to process the documents further and finally create a document-feature matrix from them. These steps can be performed with the quanteda package.

We start by combining the two corpora. Then we create a document-feature matrix with several processing settings. Stemming brings the words in a document back to their root form, so all words can be interpreted on the same level (e.g. "walk" vs. "walking"). However, we did not use this option, because it gave strange results, such as "disast" instead of "disaster". We remove punctuation because it is irrelevant for the analysis. We also remove stop words, because these are not plausible segment topics; stop words in English include "I", "me" and "yourself". We furthermore remove words that occur frequently in the corpus but are irrelevant as a topic, for example "Sirajganj" (the area the documents are written about) or "Upazila" (roughly comparable to a state in the US). Finally we remove numbers from the corpus, because these could lead the algorithm to fit on unique numbers that are really not a topic. Number and punctuation removal are standard in the dfm function.

We convert the quanteda document-feature matrix to a tm document-term matrix, because that format can be handled by the topicmodels package, which we use to fit the topic model.

Then we calculate a TF-IDF value per term; this is used to remove terms that are "too frequent", for example terms that occur in nearly every document, which are not suitable for fitting. In our case we draw the line at a TF-IDF of 0.012, which is just over the median over all terms and documents. The TF-IDF value increases when a word is frequent within a document, but decreases when the word is also frequent across the corpus; it is thus a statistic of the importance of a term in the corpus. Afterwards we remove documents that no longer contain any of the remaining terms.
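For reference, the per-term score computed in the code of this step can, in our reading, be written as follows (this is our reconstruction of the tapply expression below, not a formula taken from the package documentation):

$$\mathrm{tfidf}(t) \;=\; \operatorname{mean}_{d:\,t \in d}\!\left(\frac{n_{t,d}}{n_d}\right)\cdot \log_2\!\left(\frac{N}{\mathrm{df}_t}\right)$$

where $n_{t,d}$ is the count of term $t$ in segment $d$, $n_d$ the total number of terms in segment $d$, $N$ the number of segments, and $\mathrm{df}_t$ the number of segments containing $t$.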

Before the TF-IDF removal we had a matrix with 109 documents and 6412 features (words); afterwards we have 105 documents and 3204 features.

library(slam)  # row_sums/col_sums for the sparse matrix (assumption: may already be attached via tm)

# Combine the two segmented corpora and build a document-feature matrix,
# dropping English stop words and frequent but uninformative domain words.
DDMPJNA <- DDMPSplitted + JNASplitted
Bothdfm <- dfm(DDMPJNA,
               ignoredFeatures = c(stopwords("english"), "sirajganj", "sirajgonj",
                                   "district", "upazila", "flood",
                                   "unions", "assessment", "jna", "md"),
               stem = F)

# Convert to a tm document-term matrix, which topicmodels can handle.
Bothdfm <- convert(Bothdfm, "tm")

# calculating tf-idf
tfidf <- tapply(Bothdfm$v / row_sums(Bothdfm)[Bothdfm$i], Bothdfm$j, mean) *
  log2(nDocs(Bothdfm) / col_sums(Bothdfm > 0))
summary(col_sums(Bothdfm))
summary(tfidf)

# removing too frequent terms (and docs with 0 remaining terms)
dim(Bothdfm)
Bothdfm2 <- Bothdfm[, tfidf >= 0.012]
rm(Bothdfm)
dim(Bothdfm2)
Bothdfm2 <- Bothdfm2[row_sums(Bothdfm2) > 0, ]
dim(Bothdfm2)

Fitting the model

In this step we fit a Latent Dirichlet Allocation (LDA) model. The main reason we chose this model is that it supports multiple topics per document (the result is in fact a probability distribution over topics), as opposed to the single topic resulting from a unigram model. There are many settings we could tweak and modify, but we do not go into much depth, because we want the algorithm to be easily applicable by non-technical users. We are free to choose the number of topics; in the example code below there are 40 topics. We set the seed to get replicable results.

library(topicmodels)

k <- 40
SEED <- 2015
# Fit an LDA model with the default (VEM) estimation method and a fixed seed.
VEM <- LDA(Bothdfm2, k = k, control = list(seed = SEED))

Process results

The topicmodels package delivers a fitted model with the results: a vector in which every document/segment is related to a topic, and a data frame in which every topic is related to its most likely terms. A disadvantage is that the package does not support combining the two objects, which makes the results hard to understand quickly. For this reason we wrote a script that combines the two, so the results can be analyzed easily.

The 5 most likely terms per topic are extracted with the terms function, while the most likely topic for every document is extracted with the topics function. Now we are able to process the results further.

First we transpose the two objects to be column-oriented instead of row-oriented. We need to apply the transpose function twice to the Topics vector, because the first call only turns it into a one-row object and does not yet switch rows and columns. Then we append a column with the topic numbers to the Terms object (these numbers are in the row names of the result set, but row names are dropped by the merge function, so we set them as a separate column). We also append the row names of the Topics object (the related document names) as a column. We then have two tables: one of 2 x 104 (104 documents; the two columns are topic number and document title) and one of 6 x 15 (the topic number plus its 5 related terms; 15 topics). We use the colnames function to set the column names of the two tables. Finally we use the merge function from base R to combine the two tables.

Terms <- terms(VEM, 5)    # 5 most likely terms per topic
Topics <- topics(VEM, 1)  # most likely topic per document/segment
rm(k, SEED, tfidf)

Terms <- t(Terms)
Topics <- t(Topics)
Topics <- t(Topics)

# Add the topic number as a column (row names are lost by merge())
Terms <- cbind(Terms, c(1:length(Terms[, 1])))
# Add the document name as a column
Topics <- cbind(Topics, rownames(Topics))

colnames(Topics) <- c("Topic nr", "Heading")
colnames(Terms) <- c("Topic1", "Topic2", "Topic3", "Topic4", "Topic5", "Topic nr")

# Combine: every document/segment together with the terms of its assigned topic
TopicTerms <- merge(Topics, Terms, by = c("Topic nr"), all.x = T, all.y = F, sort = F)
rm(Terms, Topics)

Validate Results

The result table can be found in the appendix.

We determined by trial and error that 40 topics is the most suitable number for this case, since 10, 15 and 20 topics yielded far too few distinct topics. That led to text segments being assigned the same topic while they were totally unrelated.
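A more systematic alternative to this trial-and-error search, which we did not use in this experiment, would be to compare candidate numbers of topics by their perplexity on held-out segments. A minimal sketch, assuming the document-term matrix Bothdfm2 from the earlier steps and the perplexity function of the topicmodels package:

library(topicmodels)

# Hold out a fifth of the segments for evaluation (hypothetical split).
set.seed(2015)
test_idx <- sample(seq_len(nrow(Bothdfm2)), size = floor(nrow(Bothdfm2) / 5))
train <- Bothdfm2[-test_idx, ]
test  <- Bothdfm2[test_idx, ]

# Fit models for several candidate k and compare held-out perplexity
# (lower is better); this sketches the idea rather than a full evaluation.
for (k in c(10, 20, 40, 60)) {
  fit <- LDA(train, k = k, control = list(seed = 2015))
  cat("k =", k, "perplexity =", perplexity(fit, newdata = test), "\n")
}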

Semantics is an issue: the algorithm cannot link the topics to real-world information. For example:

DDMP 1.3.2 Area: char

From this example we know that the word "char" is very relevant for the Area segment, because "char" is the specific (Bangladeshi) term for the ground situation. An untrained responder would not know this, and we would therefore have liked to see a topic like "Area" for this chapter. But that term is not frequent enough in this segment, and therefore it will never be chosen as the topic by the LDA method.

We manually analysed every segment of text to see whether the topics from the algorithm matched our understanding of the text. Every topic-segment combination we found useful we marked with Yes, and all others with No. The results are mixed. We see a very clear distinction between the results for the JNA and the DDMP: the DDMP has only 19 of 75 topics usefully assigned, while the JNA has 24 of 30. This also means that the results for the total set are somewhat unsatisfactory: only 43 of 105 topics are usefully assigned.

Count of useful topic assignments per document

Useful         DDMP   JNA   Grand Total
No               56     6            62
Yes              19    24            43
Grand Total      75    30           105


Because we want to understand the results further, we gave every "not useful" marking a reason. These reasons and their occurrence counts can be found in the table below. There is a clear leader among the reasons for incorrect topic assignment: "Topic is not mentioned in the text". This basically means that we believe the text is really about something other than what the most frequent terms suggest. The LDA algorithm is based solely on the words mentioned in the actual text, and is therefore incapable of suggesting the terms we consider more probable as a topic.

The second most frequent issue is "Table as content", which also occurs in combination with "Numbers as content" and "Picture as content". These are segments that are not handled well by the algorithm. In a table, the header row has the highest probability of being related to the topic; this additional information, however, is not used by the algorithm, which weights all words equally. The numbers in the text are removed in the pre-processing part of the analysis, because these can never be a topic; this leaves some segments with little content, which leads to a wrong topic. The algorithm does not recognize images, and therefore cannot correctly assign topics to text segments that consist largely of images.

The third most frequent issue is "Topics are not related to information need", where the contents of the text are not related to an information need expressed in our previous research, for example text segments such as "Shortcoming of assessment".

Finally, the DDMP mentions a lot of region names; this leads to a high frequency of these words, which makes them seem probable as the topic of a segment, although they are not the actual subject. We could counter this by removing all region names in the "process segments" step.

Count of reasons for not-useful topic assignments

Reason                                                       DDMP   JNA   Grand Total
Topic is not mentioned in text (related information need)      14              14
Table as content                                                 9               9
Topics are not related to information need                       9               9
Table as content and Region names                                6               6
Picture as content                                               4     1         5
Region names frequently occur                                    5               5
Cannot reproduce                                                 2     2         4
Not related to information need                                        3         3
Table and Numbers                                                3               3
Table with many numbers                                          1               1
Numbers as content                                               1               1
Schools are shelters                                             1               1
(blank)
Grand Total                                                     55     6        61


Next steps

For further extraction of relevant details we propose an intelligent search engine that makes use of the topic models we created. We now have broad categories into which the documents can be divided. However, we cannot derive exact statistics, such as the number of people affected, from this analysis, while such statistics would be highly useful to disaster responders.

We could also have used a categorization package such as RTextTools, which can be trained to predict the category a document belongs to. However, we did not have enough data to create both a training set and a test set for this specific case. Nonetheless, for future research we see the possibility of creating a training set based on Wikipedia articles. This way we could create custom categories such as baseline information, situation overview, needs of the affected, etc., and train an algorithm to recognize and categorize these texts.
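A minimal sketch of this supervised alternative, assuming RTextTools is installed and that a labelled set of example texts is available (the data frame wiki with columns text and category is a hypothetical placeholder, not data we collected):

library(RTextTools)

# Hypothetical labelled training data, e.g. texts scraped from Wikipedia:
#   wiki$text     - the raw text of each example
#   wiki$category - a label such as "baseline information" or "situation overview"
dtm <- create_matrix(wiki$text, language = "english",
                     removeStopwords = TRUE, removeNumbers = TRUE)

labels <- as.numeric(factor(wiki$category))  # RTextTools expects numeric/factor labels
n <- nrow(wiki)

container <- create_container(dtm, labels,
                              trainSize = 1:floor(0.8 * n),
                              testSize = (floor(0.8 * n) + 1):n,
                              virgin = FALSE)

# Train a single classifier (SVM here) and classify the held-out part.
model <- train_model(container, "SVM")
results <- classify_model(container, model)
head(results)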

I perceive text mining as a very interesting field, and will continue to explore it in my professional career.


Conclusion and Research suggestions

For every interesting step in the process we draw conclusions and provide suggestions for improvement.

Splitting

We suggest incorporating a string similarity function for the split points in the "splitting" step. In our case we got incorrect results because the strings from the table of contents did not match the strings in the actual text. We could counter this by applying a string similarity algorithm, selecting the most similar sentence in the text, and using that as the split point.
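A minimal sketch of this idea, using the adist function from base R (generalized edit distance); the heading vector and the line-based split of the text are hypothetical placeholders, not taken from our experiment:

# For every table-of-contents heading, find the most similar line in the
# document text and use that line as the actual split point.
toc_headings <- c("Annexure 2", "Hazard analysis")   # hypothetical headings
text_lines   <- readLines("C:/R/preprocessed/JNA.pdf.txt")

best_split_points <- sapply(toc_headings, function(h) {
  d <- adist(h, text_lines, ignore.case = TRUE)  # edit distance to every line
  text_lines[which.min(d)]                       # closest matching line
})
best_split_points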

Process segments

The results provided by the topicmodels package are not really intuitive to use. We needed to process the two result datasets to analyse them more effectively.

Validating

Unfortunately the topics we identified were far from a 100% match. This is mostly due to 4 reasons, shared in the validation of our results. These 4 are:

1. Topic incorrect because: not mentioned in text
   This is basically a disagreement between the authors of this article and the assumptions of the algorithm: the LDA algorithm assumes that the topic of a text is mentioned in the text, which is not always the case.
   Suggestion: develop an algorithm that matches the topics of the text to the related information needs we are trying to find, for example by incorporating information from a dictionary or a thesaurus (see the sketch after this list).

2. Topic incorrect due to: table as content
   Tables carry very valuable information in their headers; this is not recognized by the LDA algorithm, which values every word equally irrespective of its position, and therefore leads to incorrect assignment of the topic.
   Suggestion: develop an algorithm that takes the position of words in a table into account. We know from earlier encounters that algorithms exist that assign a higher weight to words in a certain position of a sentence; we do not know of any research that incorporates the position of a word in a table.

3. Topic useless because: not related to information need
   This basically means that the topic is correctly assigned, but it is not applicable to our specific case, because it is not related to an "information need" we are interested in.
   Suggestion: use the topics to filter out the data not required by the disaster responders.

4. Topic incorrect due to wrong stop word removal
   We removed some clearly wrong topics from the text (disaster, Sirajganj, etc.), but we did not remove all sub-area names. This led to the name of a small town in the region becoming the assigned topic.
   Suggestion: use an iterative approach to remove the useless stop words specific to the text. We partly applied this iterative approach, since later in the process we identified stop words (such as disaster and Sirajganj) for our case and then applied them again in the "process segments" step.
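For suggestion 1, a minimal sketch of what such a matching step could look like, based on hand-made keyword lists per information need; the keyword lists here are hypothetical illustrations, not the information needs from our earlier research:

# Hypothetical keyword lists describing two information needs.
info_needs <- list(
  "Shelter capacity"    = c("shelter", "school", "capacity", "building"),
  "Affected population" = c("affected", "population", "people", "household")
)

topic_terms <- terms(VEM, 10)  # 10 most likely terms per fitted topic

# For every topic, count how many of its terms appear in each keyword list
# and report the best-matching information need (if any terms overlap at all).
matches <- apply(topic_terms, 2, function(tt) {
  overlap <- sapply(info_needs, function(kw) sum(tolower(tt) %in% kw))
  if (max(overlap) == 0) NA else names(info_needs)[which.max(overlap)]
})
matches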


References

Bettina Grün and Kurt Hornik (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30. URL http://www.jstatsoft.org/v40/i13/.

Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1-54. URL http://www.jstatsoft.org/v25/i05/.

Kenneth Benoit and Paul Nulty (2015). quanteda: Quantitative Analysis of Textual Data. R package version 0.9.0-1. URL https://CRAN.R-project.org/package=quanteda.

R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.