Mining twitter data pertaining to
psychosis for self-reported sleep
disturbance investigation
Author: Mladen Dinev
Supervisor: Goran Nenadic
The University of Manchester
School of Computer Science
May 2016
Abstract
Twitter provides a vast amount of wide-ranging data, giving mental health clinicians and
researchers a rich opportunity to perform analysis with automated text mining approaches.
The ability to study behavioural trends at a demographic and social level makes social media a
valuable cross-sectional source of information.
In this report, we discuss the relationship between psychotic(-like) experiences and sleep-
related problems in the general population using the social network platform Twitter. The
research focused on extracting self-reported sleep disturbance from users who had recently
been diagnosed with a mental illness. This study identified different semantic features in
the gathered datasets and revealed their impact on the problem domain.
The application used machine learning classifiers to predict tweets expressing self-reported
sleep disturbance, achieving 89% accuracy with a Support Vector Machine predictive model.
Various distinctive attributes were extracted from the corpus; however, the current results
indicate that a combination of semantic classes and individual word scores is the most
informative feature.
Keywords: text mining, machine learning, mental health, text normalisation
Acknowledgements
Firstly, I would like to express my gratitude to my supervisor Goran Nenadic for his
continuous guidance, support and advice throughout the whole period of
development.
Furthermore, I would like to thank Dr Rohan Morris and Natalie Berry, psychology researchers
at the University of Manchester, for sharing their domain knowledge and helping me to achieve
the aim of the project.
Finally, I would like to express my gratitude to Maksim Belousov, a Computer Science PhD
student, for his assistance, especially when I was struggling, and for pointing me in the right
direction towards the successful completion of the application.
This research could not have been done without their contribution.
Abbreviations
AES – Advanced Encryption Standard
API – Application Programming Interface
HMM – Hidden Markov Model
NB – Naïve Bayes
NER – Named Entity Recognition
PLE – Psychotic-like Experience
POD – Part of the Day
POS – Part-of-Speech Tags
SVM – Support Vector Machine
TF-IDF – Term Frequency - Inverse Document Frequency
Table of Contents
Table of Contents ....................................................................................................................... 1
List of Tables ............................................................................................................................... 4
List of Figures.............................................................................................................................. 5
Chapter 1 Context .................................................................................................................. 6
Introduction ................................................................................................................................ 6
1.1 Motivation .................................................................................................................... 6
1.2 Aim and objectives ....................................................................................................... 6
1.3 Ethical Approval ........................................................................................................... 7
1.4 Challenges .................................................................................................................... 8
1.5 Report Structure ........................................................................................................... 8
Chapter 2 Background ........................................................................................................... 9
2.1 Text Mining .................................................................................................................. 9
2.2 Machine Learning ....................................................................................................... 11
2.2.1 Supervised Learning ............................................................................................ 11
2.3 Text Mining Social Media ........................................................................................... 14
2.4 Summary .................................................................................................................... 15
Chapter 3 Design and Development .................................................................................... 16
3.1 Design ......................................................................................................................... 16
3.1.1 Requirements ...................................................................................................... 16
3.1.2 System design...................................................................................................... 17
3.1.3 Development environment ................................................................................. 18
3.1.4 Graphical User Interface ..................................................................................... 19
3.2 Development .............................................................................................................. 20
3.2.1 Collection of potential diagnostic tweets ........................................................... 20
3.3 Text normalisation ..................................................................................................... 21
3.3.1 Tokenisation ........................................................................................................ 21
3.3.2 Twitter object removal ........................................................................................ 22
3.3.3 Stemming ............................................................................................................ 22
3.3.4 Abbreviations ...................................................................................................... 22
3.3.5 Part of Speech Tagging ........................................................................................ 23
3.3.6 Term Frequency - Inverse Document Frequency ............................................... 23
3.3.7 Negation detection ............................................................................................. 24
3.4 Pre-filtering of diagnostic tweets ............................................................................... 24
3.5 Manual Annotation .................................................................................................... 25
3.6 Classification of self-reported sleep disturbance ...................................................... 26
3.6.1 Feature Construction .......................................................................................... 26
3.6.2 Feature Selection ................................................................................................ 28
3.7 Summary .................................................................................................................... 28
Chapter 4 Testing and Evaluation ........................................................................................ 29
4.1 Software Testing ........................................................................................................ 29
4.1.1 Unit testing .......................................................................................................... 29
4.1.2 Usability testing ................................................................................................... 29
4.2 Classifiers’ performance evaluation ........................................................................... 30
4.2.1 Cross-fold validation ........................................................................................... 30
4.2.2 Confusion Matrix ................................................................................................. 30
4.2.3 Precision, Recall, F-score. .................................................................................... 31
4.3 Summary .................................................................................................................... 31
Chapter 5 Analysis and results ............................................................................................. 32
5.1 Data statistics ............................................................................................................. 32
5.2 Accuracy, precision, recall of classifiers ..................................................................... 32
5.3 Error Analysis ............................................................................................................. 34
5.4 Experimental findings ................................................................................................ 34
5.4.1 Sentiment distribution amongst the timeline tweets ........................................ 35
5.4.2 Semantic Class Influence ..................................................................................... 35
5.4.3 Posting time ........................................................................................................ 36
5.5 Summary .................................................................................................................... 36
Chapter 6 Conclusion ........................................................................................................... 37
6.1 Reflection and Achievements .................................................................................... 37
6.2 Future work ................................................................................................................ 37
References ................................................................................................................................ 39
Appendix A ............................................................................................................................... 42
Appendix B ............................................................................................................................... 44
Appendix C ............................................................................................................................... 46
Appendix D ............................................................................................................................... 47
Terms ........................................................................................................................................ 48
List of Tables
Table 1.1 Objectives ................................................................................................................... 7
Table 3.1: Requirements ........................................................................................................... 17
Table 3.2: Constructed Features............................................................................................... 27
Table 3.3: Semantic Classes...................................................................................................... 27
Table 4.1: Test Cases ................................................................................................................ 29
Table 5.1: Data collection statistics.......................................................................................... 32
Table 5.2: Classifiers performance evaluation ......................................................................... 33
Table 5.3: Correctly classified sleep related instances ............................................................. 34
Table 5.4: Misclassified examples ............................................................................................ 34
Table 5.5: Semantic class frequency ........................................................................................ 36
List of Figures
Figure 1.1: Encryption pipeline ................................................................................................... 7
Figure 2.1: Text mining pipeline adapted from [20] ................................................................ 10
Figure 2.2: Supervised learning flow diagram ......................................................................... 12
Figure 2.3: Illustration of SVM components in linearly separable case ................................... 13
Figure 3.1: System Architecture ............................................................................................... 18
Figure 3.2: Predicted labels of timeline tweets (GUI) ............................................................... 20
Figure 3.3: Dependency graph of a tweet containing a negated word ................................... 24
Figure 3.4 Annotation Window ................................................................................................ 26
Figure 4.1: K fold cross-validation ............................................................................................ 30
Figure 4.2: Confusion Matrix .................................................................................................... 31
Figure 5.1: Sentiment Distribution ........................................................................................... 35
Figure 5.2: Time of posting trend ............................................................................................. 36
Chapter 1 Context
Introduction
This chapter introduces the problem domain, the aim and objectives set for this
project and the overall structure of the report. In addition, it outlines the most common
challenges in the area.
1.1 Motivation
The amount of information on the Web has been growing constantly in recent years, giving
researchers the opportunity to analyse large amounts of mental health data. Internet live
statistics show that, on average, around 6,000 tweets are generated every second, which
amounts to roughly 500,000,000 heterogeneous messages per day [2]. Tweets are a particularly
valuable source of knowledge because they contain diversified information at an international
level, produced at the speed of thought. In addition, Twitter is particularly useful because it has
well-documented APIs and, most importantly, provides its data freely. However, manually
establishing new patterns and trends using typical data analysis methods is an impossible
task for a human being; therefore automatic text mining tools are often employed.
Social media has no geographical boundaries, allowing large quantities of data to be analysed
automatically. Text mining systems give researchers the ability to mine human behavioural
trends efficiently from a wide range of angles. Previous studies analysing mental-health-related
Twitter data concluded that people who experience symptoms of psychological disorders
have high rates of social media usage [3]. Another study found that quantifiable mental health
signals related to bipolar disorder, major depressive disorder and post-traumatic stress
disorder were encoded in Twitter messages [4].
1.2 Aim and objectives
The first task of this project was to build an algorithm to detect Twitter users diagnosed
with psychosis. The next task, and the main aim of this project, was to implement a system
to identify self-reported sleep disturbance in timeline tweets published by the pre-filtered
diagnosed users.
This study also gathered different evidence in order to identify behavioural patterns
occurring before and after the diagnosis, taking into account the frequency of posted
tweets, the expressed emotions and the time of posting.
To achieve the set-out aim, the project was split into the smaller objectives shown in Table 1.1:

- Collect tweets automatically from Twitter using predefined search queries.
- Filter out invalid tweets and spam.
- Manually annotate a subset of diagnostic and sleep-related tweets through an annotation tool.
- Find different informative features with respect to the problem domain.
- Design and develop classifiers to automatically identify sleep-related issues.
- Analyse the data to identify potential trends.
- Plot and visualise the results on a graphical user interface.

Table 1.1 Objectives
1.3 Ethical Approval
In order to adhere to the formal ethical regulations for social media research, an ethics
application form was submitted to the Computer Science Department on the 10th of November and
approved on the 23rd (Approval Number: CS 218). A sensitive part of this research was the
identification and prevention of ethical issues, therefore the project was developed in
accordance with the guidelines formed by the Association of Internet Researchers and the
British Psychological Society [5] [6].
Moreover, to keep the identity of our users concealed, methods of anonymisation have
been applied. Firstly, all sensitive Twitter data fields were encrypted with an AES-128 cipher
and stored on external, password-protected university data storage [7]. Secondly, individual
profiling was avoided throughout the whole development and the analysis was
performed only on the aggregated dataset. Thirdly, occurrences of any user mentions were
automatically removed before the data transfer, so anonymity was preserved. Access
to the database was granted only to the people involved in the study, and each contributor
had their own login credentials, assuring and maintaining data protection.
Figure 1.1: Encryption pipeline
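The removal of user mentions described in this section can be done with a simple regular expression; this is only an illustrative sketch assuming the standard Twitter handle pattern (up to 15 word characters after an `@`), not the project's actual anonymisation code.

```python
import re

def remove_user_mentions(tweet: str) -> str:
    """Strip @username mentions so no user handles survive the data transfer."""
    # Twitter handles are 1-15 word characters following an '@' sign.
    return re.sub(r"@\w{1,15}", "", tweet).strip()

print(remove_user_mentions("@alice couldn't sleep again last night"))
```

Removing handles before storage means the anonymised text never contains a direct identifier, complementing the field-level encryption.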
1.4 Challenges
Despite the fact that there are many sophisticated text mining tools, analysing social media
still presents challenges due to the distinct and persistent characteristics of its data.
Firstly, Twitter restricts the length of messages to 140 characters, which may not
provide sufficient context for effective text processing methods. Another
limitation is the lack of external knowledge, which is often used to bridge the gaps in text
representation models [8].
Secondly, social media data contains non-standard language such as abbreviations,
acronyms, slang and interjections. This is a very challenging problem, requiring
Natural Language Processing methods that eliminate structural and lexical ambiguities and
identify the semantic meaning of the text.
Finally, one of the main challenges in mining social media is obtaining relevant
examples, mainly because networking platforms offer only a small chunk of the whole
available set [9]. In addition, without knowing the population distribution in the gathered
data, it is very hard to estimate whether the selected information is truly representative of
the larger group.
1.5 Report Structure
The report is structured as follows:
Chapter 2 Background - introduction to the problem domain and brief explanation of the related concepts, as well as presenting previous studies in the field.
Chapter 3 Design and Development – provides information about the functional requirements of the software and design rationale.
Chapter 4 Testing and Evaluation – discusses the testing methods used to validate and improve the results of the system.
Chapter 5 Analysis and results – presents observations and results after the performed analysis on the gathered dataset.
Chapter 6 Conclusion – summarizes the major inferences and findings drawn at the end of the project and presents potential future work.
Chapter 2 Background
Chapter two describes the fundamental concepts of text mining, machine learning and
natural language processing. Further details can be found in Appendix A and Appendix B.
2.1 Text Mining
Text mining, also known as knowledge discovery, refers to the process of deriving previously
unknown information from unstructured text. The process starts by extracting facts and events
from textual resources and then enables forming new hypotheses based on data mining and data
analysis methods. Text mining includes linguistic, statistical and machine learning techniques
that model and structure the information content for business intelligence, research or
investigation. However, data mining and text mining do not refer to the same thing: data
mining extracts implicit information from a given input, while in text mining the content is
explicitly stated in the text but needs automatic techniques to deduce its
interpretation [10].
Text mining was introduced in the 1980s and has since been applied to a wide range of
fields such as security, marketing and biomedical science [11] [12] [13]. Text mining is
associated with the concepts of natural language processing, information extraction,
information retrieval and sentiment analysis, which are described briefly in the next few
paragraphs. The whole text mining pipeline is presented in Figure 2.1. As can be seen, text
mining focuses on four main tasks – retrieving relevant documents from a large volume of
data, annotating and converting the material into a common format, extracting lexical and
semantic information and finally discovering new knowledge.
Natural Language Processing – the process of extracting meaning from natural language
text. The method seeks to answer the questions Who, When, Where, How and
Why. To gain insight into the semantic structure of a text, natural language processing
typically involves part-of-speech tagging, lexical dependency parsing and word sense
disambiguation.
Information Retrieval – an extension of document retrieval, describing the process of
returning documents relevant to user preferences.
Information Extraction – an important part of a text mining system is the information
extraction process, which aims to find events, facts, entities and relationships in
unrestricted natural language text. Unlike information retrieval, information extraction is
not undertaken by people and does not require deep domain-specific expertise. For
instance, named entity recognition (NER) detects predetermined classes
(organisations, people, locations, time expressions) in unstructured text. The term named
entity does not have a fixed meaning on its own, so it has to be defined in the context in which it
appears. Consider the following example:
“I used to work for Microsoft long time ago.”
“Microsoft” should be flagged as an organisation and “long time ago” as a temporal
expression. There are different NER tools which provide deep semantic understanding of
natural text using dictionary, machine learning or Hidden Markov Model (HMM) approaches
[14] [15] [16].
Sentiment Analysis – sentiment analysis, also known as opinion mining, is the automated
process of identifying the opinion expressed in a text, typically in a binary or trinary format.
Automatic recognition of people’s opinions and sentiment about a wide range of topics has
been applied in many fields including business, psychology and computer science [17] [18]
[19]. However, sentiment analysis over Twitter data faces many challenges due to the highly
diversified nature of social media content. For that purpose, an automatic sentiment analysis
tool developed on informal-language texts, SentiStrength, has been used in this research [17].
SentiStrength copes with spelling mistakes, negated words and untypical phrases which
boost or decrease the polarity of subsequent words.
Figure 2.1: Text mining pipeline adapted from [20]
2.2 Machine Learning
Machine learning is the science of giving computers the ability to learn without being
explicitly programmed. Alpaydin (2014) characterises data mining as a subdivision of
machine learning, referring to the process of extracting knowledge from a large volume of
data, typically a database [20].
Machine learning has proved to be an efficient approach in a variety of fields such as
fraud detection, spam filtering and pattern recognition. It is especially useful when
applied to data mining problems, because the learning does not require deep
understanding of the problem domain. In order to acquire knowledge, machine learning
algorithms automatically inspect the attributes and their implicit relationships in the
given dataset (the training data). After extracting information from the training examples, the
learner’s task is to generate a hypothesis about each target class. For that purpose, it applies
statistical theory to build mathematical models with adjustable parameters,
dependent on the problem in question. There are two types of models – predictive
and descriptive. Predictive models make predictions about future trends, while
descriptive models derive knowledge from a given dataset. Once the model has been fitted,
real-life input (the testing data) is supplied and classified accordingly.
Machine learning tasks, in turn, are typically classified into two types:
supervised learning and unsupervised learning. The next section outlines the difference
between the two techniques and provides details about the two classifiers used in this project.
2.2.1 Supervised Learning
In supervised learning, each example in the training set is composed of a pair: an input
object, represented by a feature vector, and a class label. Unlike supervised approaches,
unsupervised learning does not require labels for the input instances, but instead clusters
the data. However, acquiring a labelled dataset is not always trivial and in fact requires
good knowledge of the problem domain. Therefore, semi-supervised algorithms, which
combine supervised and unsupervised approaches, are often preferred. For this
particular study, supervised learning was chosen because of the availability of domain
experts.
The aim of supervised learning is to build a model which makes predictions based on
relations determined from the training set. Supervised algorithms should be able to
generalise, optimise and approximate efficiently from the given examples to generate accurate
decision rules. All supervised learning methods determine the structure and heterogeneity
(discrete or continuous) of the data and afterwards fit a learning model which classifies new
unseen instances. Figure 2.2 illustrates the different stages of the supervised learning
model.

There is a wide range of supervised learning algorithms, each with its strengths and
weaknesses; however, their performance depends highly on the extracted relations and on
the complexity of the problem domain.
Support Vector Machine
In machine learning, the Support Vector Machine (SVM) is a prediction model based on
the concept of decision planes. A decision plane separates the classes in a classification
problem by finding the maximum margin dividing the groups of points, so that the distance
between the decision plane and the nearest point from each class is maximised. As can be
expected, the training process needs to find the optimal decision plane which maximises the
margin over the training data. The points closest to the separating hyperplane are called support
vectors, hence the name Support Vector Machine. A mathematical formulation of the SVM
model is presented below.
For the linearly separable case, if {𝑥1, . . . , 𝑥n} are the data points and 𝑦i ∈ {1, −1} are
their class labels, then the two respective hyperplanes take the form:

𝑥i · 𝒘 + 𝑏 = +1 for H1
𝑥i · 𝒘 + 𝑏 = −1 for H2

where 𝒘 is the normal to the hyperplanes and 𝑏 the intercept term.
Figure 2.2: Supervised learning flow diagram adapted from [1]
The learning phase refers to the process of maximising the margin 1/||𝒘|| by solving the
Quadratic Programming optimisation:

min ½ ||𝒘||²  such that  𝑦i (𝑥i · 𝒘 + 𝑏) − 1 ≥ 0  ∀i

The intercept 𝑏 is then recovered from the set of support vectors S (of size Ns) as:

𝑏 = (1/Ns) ∑s∈S (𝑦s − ∑m∈S 𝑎m 𝑦m 𝑥m · 𝑥s)
Having the variables 𝑤 and 𝑏 we can define our separating hyperplane’s optimal orientation
and thus the solution of the SVM model. Then each new point x’ would be evaluated based
on the result of the following equation:
𝑦′ = 𝑠𝑔𝑛(𝑤 · 𝑥′ + 𝑏)
Figure 2.3: Illustration of SVM components in linearly separable case
SVM has been a widely used classifier in data mining and machine learning because of its
ability to overcome one of the most common machine learning issues – overfitting. SVM has
been prominent in dealing with both linearly separable and non-linearly separable data (via
the kernel trick), performing at state-of-the-art level. Figures showing the two different
cases are presented in Appendix A. Another strength of SVM is that it provides a fast and
effective means of learning, capable of processing multidimensional input, even in cases
with more than 10,000 features [21]. Text categorisation using SVM has been applied in
numerous fields, including mental health [22] [23].
Naïve Bayes

Naïve Bayes (NB) is a widely used classifier in text classification and anti-spam studies. NB is a supervised probabilistic learning method which assigns labels to data points represented as feature vectors. The NB classifier is based on Bayes’ Theorem, which describes the probability of an event given prior knowledge of its conditions. The key property of NB is the conditional independence assumption: feature values are assumed to be independent of one another given the class, so their probabilities can be estimated separately from the training dataset. The mathematical representation is illustrated below.

𝑃(𝑐|𝑥) = 𝑃(𝑥|𝑐) 𝑃(𝑐) / 𝑃(𝑥)

𝑃(𝑐) – the prior probability of a randomly picked document having class c.
𝑃(𝑥|𝑐) – the conditional probability of the predictor occurring in class c.
𝑃(𝑥) – the prior probability of the predictor.
𝑃(𝑐|𝑥) – the posterior probability of class c given the predictor.

Some of the NB strengths include handling missing data, support for categorical attributes and quick re-calculation.

This project also explored the performance of other classifiers, which are explained in detail
in Appendix A.
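The Naïve Bayes calculation described above can be sketched in a few lines of multinomial NB with add-one smoothing; the training tweets, class names and smoothing choice here are illustrative, not taken from the project's dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns priors and word counts."""
    priors, word_counts, totals = Counter(), defaultdict(Counter), Counter()
    for tokens, label in docs:
        priors[label] += 1
        word_counts[label].update(tokens)
        totals[label] += len(tokens)
    vocab = {w for c in word_counts.values() for w in c}
    return priors, word_counts, totals, vocab

def predict_nb(model, tokens):
    """Pick the class maximising log P(c) + sum log P(x|c), Laplace-smoothed."""
    priors, word_counts, totals, vocab = model
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / n)
        for tok in tokens:
            lp += math.log((word_counts[label][tok] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb([
    (["cant", "sleep", "again"], "sleep"),
    (["awake", "all", "night"], "sleep"),
    (["great", "day", "today"], "other"),
])
print(predict_nb(model, ["cant", "sleep"]))
```

Working in log space avoids numerical underflow, and the per-word probability estimates are exactly where the independence assumption enters: each token contributes its likelihood separately.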
2.3 Text Mining Social Media
Previously conducted studies have explored text mining of mental health data using various textual resources from social media. For instance, Shepherd et al. (2015) described the role of Twitter as a space in which people could openly discuss mental health problems [24]. They searched for tweets in a particular conversation containing the hashtag #dearmentalhealthprofessionals and performed their analysis on a unique anonymised dataset. Their results show that 515 communications related to mental health problems were found on Twitter.
Reavley and Pilkington (2014) reported the usefulness of Twitter content for measuring attitudes towards mental illness [25]. In their study, they collected tweets matching only the hashtags #schizophrenia and #depression during a 7-day period. After filtering, 5,907 depression tweets and 451 schizophrenia tweets were stored and used for analysis. From the set of potentially schizophrenia-related tweets, 76 provided evidence of people diagnosed with schizophrenia and 47 tweets expressed personal experience. They concluded that the schizophrenia tweets were 43% (193) neutral, 42% (191) supportive, almost 10% (44) explicitly anti-stigma, 5% expressing a stigmatising attitude and less than 1% reflecting personal experience of stigma.
In another study, McManus et al. (2015) collected Twitter posts to detect individuals with schizophrenia [26]. Using different features such as emoticons, posting time of day and dictionary terms, they trained and validated an SVM model and managed to achieve 92% precision and 71% recall. Their study identified 96 people with schizophrenia and showed that Twitter can also be a space where people with mental health problems release their emotions. Moreover, they concluded that the peak posting time of tweets from users with psychological disorder experiences lay in the early morning hours.
Belousov et al. (2015) gathered Twitter data to analyse the mental illness schizophrenia and
its symptoms [27]. Their pipeline consisted of four main tasks: filtering, pre-
processing, feature extraction and classification. During the first two processes all
ambiguities and noise were removed from the collected tweets to ensure consistency and
quality. They applied machine learning techniques to predict tweets related to the problem
domain. According to their paper, the Naïve Bayes classification model outperformed the
other classifiers and attained 92% accuracy. Their analysis
showed that 485 instances were classified positively, most of which were posted between
the hours of 10 PM and 2 AM.
2.4 Summary
In this chapter, a theoretical explanation of text mining and machine learning has been presented, along with previous studies evaluating the usefulness of social media content for mental health data analysis.
Chapter 3 Design and Development
This chapter covers the most challenging aspects of the software development cycle. A diagram of the designed architecture is provided to facilitate understanding of the overall project structure.
3.1 Design
3.1.1 Requirements
In order to understand the project requirements, it was important to define the project specifics and hence clarify what was going to happen with the data, how that was going to be achieved, and which technologies were most appropriate for the purpose. The very first step was to collect as much information as possible from my supervisor and the domain experts Rohan Morris and Natalie Berry, whose extensive knowledge in this area later prevented unnecessary assumptions and mistakes. I was aware that the requirements would change over time, so I followed one of the Agile practices [28] and started with a simple understanding of the problem. As the project progressed, the requirements evolved and the system became more sophisticated. Spending time working through the architecture and designing paper prototypes helped to capture the crucial functionality of the system in advance. Table 3.1 presents the final version of the functional and non-functional requirements with their respective priorities.
Requirements Priority
Define search queries with respect to the problem domain. High
Create automated scripts for collecting data from Twitter. High
Store results in a local non-relational database. High
Explore data to gain insight into the domain and refine the search queries. Medium
Implement methods to filter out spam and irrelevant tweets. Medium
Predetermine the subset of data passed for annotation. High
Transfer processed tweets to an encrypted data storage. High
Design a handy annotation tool for the psychology researchers. High
Develop an effective security strategy. Medium
Extract valuable and informative features from the selected tweets. High
Implement different classifiers and evaluate their performance. High
Craft a graphical user interface, allowing different modifications to the dataset. High
Analyse and plot the results. High
Table 3.1: Requirements
3.1.2 System design
The architecture of the system was broken down into smaller components, and each component was developed in isolation from the others in order to prevent conflicts. A key property of the system is that it allows quick and smooth integration of newly created modules. In addition, the system accommodates constant modifications to the database and allows different experiments to be carried out without affecting the original state of the data.
As previously mentioned, the project aimed to implement a text mining tool for data
analysis, therefore the main components of the system can be classified into five abstract
levels - data collection, pre-processing and knowledge extraction, prediction, analysis and
utilization. The aforementioned steps are depicted in Figure 3.1.
Data collection – retrieving Twitter messages related to psychosis and sleep
disturbance.
Text normalisation – pre-processing the natural language text in order to unify different variations of the same content by applying various text transformation methods.
Diagnostic filtering – dictionary-based approach filtering out irrelevant tweets.
Timeline extraction – tweets pulled out from diagnosed users’ timelines.
Classification – automatic detection of self-reported sleep disturbance.
Analysis – exploring trends within the timeline dataset.
Figure 3.1: System Architecture
3.1.3 Development environment
Python
Python was chosen as the main language for the implementation of this project because of the rich set of available Python-based text processing and manipulation libraries such as NLTK, matplotlib and SciPy, some of which played an essential role in the development of the software. In addition, Python is a comparatively simple and easy-to-understand language, and there are numerous books and tutorials on the Web that help beginners advance their knowledge quickly.
MongoDB
For the purpose of storing the collected tweets, the non-relational database MongoDB was selected [29]. The project dataset was expanding constantly, and a SQL database could have suffered performance degradation. Non-relational databases, by contrast, do not impose explicit, rigid schemas, which allows all types of data to be incorporated while retaining the ability to link entries from different collections efficiently. Furthermore, Twitter responses arrive in JSON format, which MongoDB supports natively, so MongoDB was preferable to SQL.
IntelliJ
The whole project was developed with the integrated development environment IntelliJ [30]. The platform provided useful features such as code refactoring, code navigation and code suggestions, which made the implementation process easier.
Classification toolkit
All machine learning classification tasks were developed using the scikit-learn library. It provides free access to a wide range of learning algorithms as well as feature transformation and feature selection modules.
3.1.4 Graphical User Interface
To give users the opportunity to analyse the collected information in more depth, a graphical user interface was implemented, enabling different human-computer interactions. The designed system allows users to inspect the assigned labels and presents the most informative features extracted during the learning process. In addition, users can dynamically retrieve timeline tweets from the project database and select one or more anonymised accounts for further analysis. A screenshot of the “Prediction” frame, presenting a set of timeline tweets and their respective classes in a table-like structure, is shown in Figure 3.2.
Figure 3.2: Predicted labels of timeline tweets (GUI)
3.2 Development
3.2.1 Collection of potential diagnostic tweets
Collecting data was the very first step of the development, and it was important to implement it in a way that allowed repeatable experiments to be carried out. In this study, both of the available Twitter APIs were used [31] [32]. The Search API returned tweets based on relevance and popularity, while the Streaming API exposed a live stream of tweets. We included a repeated data-exploration step in the project pipeline to examine different attributes of the gathered dataset and thereby improve the quality of the retrieved results. Term frequency-inverse document frequency was used as the weighting method in the data exploration process to reveal uniqueness and word relevance [33]; a detailed explanation and mathematical representation of the algorithm is presented later in this chapter. In order to further optimise the process of collecting tweets and avoid duplicates, two Twitter parameters were introduced: since_id and max_id. The first parameter restricts results to tweets published more recently than the tweet with the specified ID (since_id), while the second returns results older than the tweet with the specified ID (max_id). The values were automatically updated after each run of the program and applied in the next search of tweets.
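The since_id/max_id paging logic can be sketched as follows; `fake_search` is a hypothetical stand-in for the real Twitter Search API call, included only so the loop is runnable.

```python
# Sketch of the duplicate-free collection loop built around since_id/max_id.
# The fake store and search function below are illustrative placeholders.
TWEETS = [{"id": i} for i in range(1, 8)]          # fake store, ids 1..7

def fake_search(query, since_id=None, max_id=None):
    hits = [t for t in reversed(TWEETS)            # newest first, like Twitter
            if (since_id is None or t["id"] > since_id)
            and (max_id is None or t["id"] <= max_id)]
    return hits[:3]                                # paged responses

def collect_new_tweets(search, query, since_id=None):
    collected, max_id = [], None
    while True:
        batch = search(query, since_id=since_id, max_id=max_id)
        if not batch:
            break
        collected.extend(batch)
        max_id = min(t["id"] for t in batch) - 1   # page backwards through older tweets
    new_since = max((t["id"] for t in collected), default=since_id)
    return collected, new_since

tweets, since_id = collect_new_tweets(fake_search, "psychosis", since_id=2)
print(sorted(t["id"] for t in tweets), since_id)   # [3, 4, 5, 6, 7] 7
```

Persisting `new_since` between runs is what prevents the same tweet from being collected twice.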
In addition, to analyse behavioural patterns, the software extracted all timeline tweets from each user who had recently been diagnosed with psychosis. All gathered tweets, including their embedded attributes (tweet ID, user ID, timestamp, UTC offset, geo coordinates, time zone), were collected through the Search API and stored in a separate database collection. The aim of this process was to provide a quantifiable dataset on which to perform our analysis. However, after a discussion with the psychology researchers and my project supervisor, we decided to set time restrictions on the extraction process: we focused only on tweets published up to six weeks before the diagnosis or at any time after it.
Since finding diagnostic tweets was one of the crucial elements of our system, with a significant impact on the results of the other processes, we decided to perform the annotation manually. Manual annotation was undoubtedly time-consuming, but in our case we endeavoured to achieve 100% accuracy in our results, which would not have been easily accomplished even with an advanced machine learning classifier. Nevertheless, all collected tweets were pre-filtered and validated against different semantic rules before being sent to the researchers. More information about the pre-filtering process is given in Section 3.4.
The data collection started on 15 October 2015 and finished a week before the submission of the practical work, on 14 March 2016. Table 5.1 in Chapter 5 shows statistics retrieved from the project database at the last stage of development.
3.3 Text normalisation
The collected Twitter data went through a few text pre-processing methods to ensure high quality and consistency. These translated non-standard words into their canonical forms while preserving their contextual meaning. Many text normalisation pipelines include stopword removal, but in this particular study it was omitted, as we discovered the importance of personal pronouns in distinguishing objective from subjective tweets. Similar text normalisation tasks can be found in other text mining studies of social media [34] [35] [36].
3.3.1 Tokenisation
This process aimed to convert raw tweet content into linguistic units such as words, symbols and phrases. Although this task is generally straightforward, it becomes problematic and challenging when applied to microblogging messages. The main issues arise from the fact that tweets contain non-standard language such as abbreviations, acronyms, slang, emoji and hashtags. For that purpose, a tokenizer trained on Twitter data was used to provide better performance and precision [37].
Example:
I feel horrible because today I was diagnosed with schizophrenia :(
['I', 'feel', 'horrible', 'because', 'today', 'I', 'was', 'diagnosed', 'with', 'schizophrenia', ':(']
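A minimal tokenizer illustrating the idea can be sketched in a few lines; this is only an approximation of the Twitter-trained tokenizer used in the project, which handles many more cases (URLs, hashtags, emoji).

```python
import re

# Toy Twitter-aware tokenizer: emoticons first, then words (with optional
# internal apostrophe), then any other non-space character.
TOKEN_RE = re.compile(r"[:;=][-']?[()DPpO/\\|]|\w+['’]?\w*|[^\w\s]")

def tokenize(tweet):
    return TOKEN_RE.findall(tweet)

print(tokenize("I feel horrible because today I was diagnosed with schizophrenia :("))
```

Because the emoticon alternative is tried first, ":(" survives as a single token instead of being split into punctuation.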
3.3.2 Twitter object removal
After the tokenisation process, identified units as hashtags, links and special characters,
including emoji, were trimmed off as they did not bring any meaningful value to the
problem domain.
Example:
@Chris159 my aunt has schizophrenia but that’s because she used to take a bad batch of
drugs in the 90’s 😭.
[my aunt has schizophrenia but that’s because she used to take a bad batch of drugs in the
90’s.]
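A rough sketch of this clean-up step, assuming regex-based removal and treating non-ASCII characters as a crude proxy for emoji (the project's actual rules may have differed):

```python
import re

# Strip URLs, @mentions, #hashtags, then drop non-ASCII symbols (emoji etc.)
# and collapse the leftover whitespace.
def strip_twitter_objects(text):
    text = re.sub(r"https?://\S+", "", text)        # links
    text = re.sub(r"[@#]\w+", "", text)             # mentions and hashtags
    text = text.encode("ascii", "ignore").decode()  # emoji and other symbols
    return " ".join(text.split())

print(strip_twitter_objects("@Chris159 my aunt has schizophrenia \U0001F62D"))
```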
3.3.3 Stemming
The goal of stemming was to reduce the inflectional forms of words to a common root form in order to achieve a uniform representation. For example, stemming the words “sleep”, “sleeping” and “sleeps” produces “sleep”, which would otherwise be treated as three different, heterogeneous words. The removal of morphological affixes was done with the Porter stemmer provided by the NLTK package [38].
Example:
The only medication that has succeeded in controlling my bipolar disorder was Olanzapine.
['The', 'onli', 'medic', 'that', 'ha', 'succeed', 'in', 'control', 'my', 'bipolar', 'disord', 'wa',
'Olanzapin', '.' ]
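For illustration only, a toy suffix-stripper in the spirit of the Porter stemmer might look like this; the real NLTK Porter stemmer applies a much richer set of ordered rewrite rules.

```python
# Toy stemmer: strips a few common suffixes, but only when a reasonably
# long stem remains. This is a sketch, not the Porter algorithm itself.
SUFFIXES = ("ational", "ing", "ness", "edly", "ed", "es", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)] if word.endswith(suffix) else None
        if stem and len(stem) >= 3:
            return stem
    return word

print([crude_stem(w) for w in ["sleep", "sleeping", "sleeps"]])
```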
3.3.4 Abbreviations
Another very common issue with Twitter data was the usage of abbreviations. In the following example, “asap” is the short form of “as soon as possible”, which is easily understood by a human, but in a simple bag-of-words model these forms would be treated as different even though the semantic meaning is the same. For that purpose, we built our own dictionary of shortened words with their associated expanded forms.
Example:
I need to visit a doctor asap… I hear voices when I'm alone at home.
[I need to visit a doctor as soon as possible… I hear voices when i'm alone at home.]
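The dictionary-based expansion can be sketched as follows; the entries shown are illustrative placeholders, not the project's actual hand-built lexicon.

```python
# Small illustrative abbreviation dictionary (stand-in entries).
ABBREVIATIONS = {
    "asap": "as soon as possible",
    "tbh": "to be honest",
    "rn": "right now",
}

def expand_abbreviations(tokens):
    expanded = []
    for token in tokens:
        # replace known abbreviations with their multi-word expansion
        expanded.extend(ABBREVIATIONS.get(token.lower(), token).split())
    return expanded

print(expand_abbreviations(["I", "need", "to", "visit", "a", "doctor", "asap"]))
```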
3.3.5 Part of Speech Tagging
In corpus linguistics, part-of-speech (POS) tagging is an essential task corresponding to the process of labelling words with their respective part-of-speech tags. POS tags were a useful source of information because they facilitated word sense disambiguation and boosted the prediction accuracy of our classification models. The Tweet NLP part-of-speech tagger was selected for this project [37]. An example is given below to illustrate its usage; an explanation of each POS tag is given in Appendix C.
Example:
I've come to terms with my schizophrenia, my friends are not isolating me anymore
[["I've", 'L'], ['come', 'V'], ['to', 'P'], ['terms', 'N'], ['with', 'P'], ['my', 'D'], ['schizophrenia', 'N'], [',', ','], ['my', 'D'], ['friends', 'N'], ['are', 'V'], ['not', 'R'], ['isolating', 'V'], ['me', 'O'], ['anymore', 'R']]
3.3.6 Term Frequency - Inverse Document Frequency
A key role in the filtering and search-refinement process was played by the information retrieval method Term Frequency - Inverse Document Frequency (TF-IDF) [39]. The algorithm assigned a weight to each word of the corpus by computing its term frequency and its inverse document frequency. The term frequency measured how often a word x appeared in a tweet, while the inverse document frequency measured how much information it carried. Pure term frequency assumes that all terms are equally important; to scale down the scores of terms that occur in many tweets, the inverse document frequency was computed. It is based on the number of documents in the database containing the term x, thus reducing the effect of frequent but trivial terms. Taking as an example the article “a”, which by itself does not hold any useful meaning, it will be assigned a very low score, whereas discriminatory words like “sleep” or “awake” will be given high scores. For that reason, we excluded all common words from our corpus before proceeding to the TF-IDF calculation.
Some of the revealed keywords (excluding the words from the search queries) include:
“Depression”, “got”, “ill”, ”make”, ”mind”, ”substance”, “care”, ”mental”, “normal”,
“develop”
w_i,j = tf_i,j × log(N / df_i)

w_i,j - weight of term i within tweet j
tf_i,j - number of occurrences of term i in tweet j
N - total number of tweets
df_i - number of tweets containing term i
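The formula translates directly into code:

```python
import math

# w_ij = tf_ij * log(N / df_i), computed per tweet over a toy corpus.
def tfidf(corpus):
    N = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [{term: doc.count(term) * math.log(N / df[term]) for term in set(doc)}
            for doc in corpus]

corpus = [
    ["i", "cant", "sleep"],
    ["sleep", "is", "good"],
    ["i", "am", "awake"],
]
weights = tfidf(corpus)
# "cant" appears in 1 of 3 tweets (high weight), "sleep" in 2 of 3 (lower)
print(round(weights[0]["cant"], 3), round(weights[0]["sleep"], 3))
```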
3.3.7 Negation detection
Although detecting negation is an extremely challenging problem and was outside the scope of this project, simplified methods were implemented to capture the focus of negated words. For that purpose, dependency parsing [40] was used to obtain the grammatical relations between words, and an algorithm then inspected each vertex of the resulting graph for the occurrence of a negation modifier.
Figure 3.3: Dependency graph of a tweet containing a negated word
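As a lightweight stand-in for the dependency-graph inspection (which requires a full parser), a window-based sketch conveys the idea of marking tokens in the scope of a negation cue:

```python
# Hypothetical simplification: instead of walking a dependency graph, mark a
# fixed window of tokens after each negation cue with a NOT_ prefix.
NEGATION_CUES = {"not", "no", "never", "cant", "can't", "dont", "don't", "wont", "won't"}

def mark_negation(tokens, window=3):
    marked, scope = [], 0
    for token in tokens:
        if token.lower() in NEGATION_CUES:
            marked.append(token)
            scope = window          # open a negation scope
        elif scope > 0:
            marked.append("NOT_" + token)
            scope -= 1
        else:
            marked.append(token)
    return marked

print(mark_negation(["As", "usual", "I", "cant", "fall", "asleep"]))
```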
3.4 Pre-filtering of diagnostic tweets
The filtering process aimed to produce a high-quality diagnostic dataset to pass to the psychology researchers for annotation. Although the Twitter APIs give users the opportunity to define their own queries with simple or complex constraints, they do not capture the contextual meaning of the words and thus introduce noise. To improve the accuracy of the results, a semantic search was performed [41]. The filtering layer included a set of semantic rules based on the definitions of the searched terms, which pinpointed valid examples and minimised the number of irrelevant tweets in our data storage. For example, after the initial data exploration we noticed that the usage of personal pronouns, and mentions of drugs prescribed for the treatment of psychological disorders, led to relevant examples. We applied that knowledge in our algorithm and the number of relevant examples instantly increased. In addition, the lexical diversity analysis and the TF-IDF word scores identified some valuable terms such as “get”, “diagnose”, “suffer”, “have” and “ill”, which were also used as indicators of potential diagnostic messages.
3.5 Manual Annotation
As already stated, an important part of the classification process is the training set, which is used to discover potential predictive relationships. The training data in supervised learning typically comes as tuples (x, y), where x is the representation of a document in vector space, known as the vector space model, and y its respective class. Unfortunately, the specificity of the problem domain meant that no publicly available annotated data could be used. The lack of such data can also be explained by the fact that manually tagging text is an expensive process requiring domain expertise and a lot of time. Fortunately, Dr Rohan Morris and Natalie Berry, psychology researchers from the University of Manchester, volunteered to accomplish this task. Although several annotation programs were already available on the Web, a separate tool was designed and developed that facilitated the annotation process by including only the requisite functionality. To ensure secure communication, a high-level authentication layer was introduced, and unique login credentials were provided to the researchers in order to guarantee data protection and display the appropriate content.
Users had to classify each tweet as either negative, neutral or positive, providing their level
of confidence from 1 to 10. After signing in, each researcher was asked to annotate a set of
tweets - 75% of which were pre-filtered and the other 25% were selected randomly. The
dataset included random tweets to eliminate the selection bias and to increase the chances
of finding new unknown patterns, missed in the first filtered set.
In a statistical study such as this one, it was important to calculate the reliability of agreement between the annotators, so a statistical measure was computed. 100 examples were shared between the two annotators so that we could assess their accordance. Fleiss' kappa was chosen as the measure because of its ability to rate agreement between any number of participants, in contrast with other kappa statistics which work only for a fixed number of raters [42]. The value was regularly updated and shared with the researchers to guarantee the high quality of the annotated dataset. An example showing how Fleiss' kappa is computed can be seen in Appendix C.
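For reference, Fleiss' kappa is short to implement; the annotation counts below are illustrative, not the study's actual figures.

```python
# Fleiss' kappa: ratings[i][j] = number of annotators who assigned
# category j to item i (each item rated by the same number of annotators).
def fleiss_kappa(ratings):
    N = len(ratings)                       # number of items
    n = sum(ratings[0])                    # annotators per item
    k = len(ratings[0])                    # number of categories
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                   # observed agreement
    P_e = sum(p * p for p in p_j)          # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Two annotators, three classes (negative/neutral/positive), four items:
example = [[2, 0, 0], [0, 2, 0], [0, 0, 2], [1, 1, 0]]
print(round(fleiss_kappa(example), 3))  # 0.619
```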
User experience techniques were also integrated into the software development process to
enhance the usability and accessibility of the program.
3.6 Classification of self-reported sleep disturbance
3.6.1 Feature Construction
Many machine learning algorithms require the input text to be translated into a specific format. The most common model is the vector space model, which ignores the linguistic structure of the text and creates multidimensional vectors with one dimension per distinct word in the text. The performance evaluation of each model is presented in the chapter “Analysis and Results”. With regard to machine learning and data mining, feature construction, also known as feature engineering, addresses the problem of creating a set of features using domain knowledge. During the feature construction phase it is important not to lose any useful information, therefore a common practice is to compare the performance of a constructed feature with its raw form. The overall process can be described as follows [43]:
1. Generate an initial feature space Fo.
2. Transform the initial feature space into a new feature space Fn.
3. Select a subset of features Ft from Fn satisfying predefined criteria.
4. Use Ft for learning.
In the next few sections, the top three most informative features are described; however, the whole set of features generated throughout the project is shown in Table 3.2. Feature importance charts are presented in Appendix B.
Figure 3.4: Annotation Window
Initial Constructed Features
Word frequencies
Word scores
Time of posting as minutes after midnight
Mentions of drugs used for insomnia
Semantic classes
Sentiment polarity scores
POS tags
Table 3.2: Constructed Features
Semantic classes
Semantic classes were defined as groups of words sharing a semantic property. The psychology researchers manually labelled such terms during the annotation process and expanded the table of semantic classes from Belousov et al.'s research (2015). Table 3.3 lists all semantic classes, the number of entities they contain and some examples. All matched entities were collected via a dictionary look-up method, taking into account possible spelling variations.
Semantic Class Count Examples
Electronics 50 voice message, record, TV, radio, speaker
Fear Expression 13 scary, afraid, nervous, shocked, creepy
Health Problems 10 insane, mad, crazy, batty, irrational
Location and Content of Hallucination 9 whispering, in my mind, uncanny, seeing things
Physical Space Location 18 house, office, room, apartment
Psychosis Problems 6 bipolar disorder, psychotic, delirium
Relationships 129 child, roommate, brother, fiancé, sister
Religious Terms 19 christian, bible, church, jesus, allah
Sleep Terms 5 fever, cant sleep, lack of sleep, sleep paralysis
Swear Words 503 * * *
Table 3.3: Semantic Classes
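The dictionary look-up can be sketched as follows; the class entries are a small sample from Table 3.3, and “spelling variations” are approximated here by simple lower-casing and apostrophe stripping.

```python
# Illustrative semantic-class dictionary (sample entries only).
SEMANTIC_CLASSES = {
    "Sleep Terms": ["cant sleep", "lack of sleep", "sleep paralysis"],
    "Fear Expression": ["scary", "afraid", "nervous"],
}

def normalise(text):
    # crude handling of spelling variations: lower-case, drop apostrophes
    return text.lower().replace("'", "").replace("’", "")

def match_classes(tweet):
    text = normalise(tweet)
    return sorted({cls for cls, terms in SEMANTIC_CLASSES.items()
                   if any(normalise(term) in text for term in terms)})

print(match_classes("I'm afraid I can't sleep again"))
```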
Posting time of Tweets
During the data collection process, all geographical metadata embedded in the messages was gathered and stored safely on a university server. The information was then processed to find the exact local upload time of each tweet. In the ideal case, the algorithm took the geographical attributes (longitude, latitude), identified the time zone, and converted the initial timestamp to the user's local time, taking daylight saving time (DST) into account. Where geographical information was lacking, the algorithm approximated the local time by applying the UTC offset of the respective Twitter account.
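The fallback branch (no coordinates available) amounts to shifting the UTC timestamp by the account's offset; the offset value below is a hypothetical example.

```python
from datetime import datetime, timedelta, timezone

# Approximate local time from the account's UTC offset (seconds).
# DST cannot be handled on this branch, which is why it is an approximation.
def approx_local_time(created_at_utc, utc_offset_seconds):
    return created_at_utc + timedelta(seconds=utc_offset_seconds)

ts = datetime(2016, 3, 1, 6, 30, tzinfo=timezone.utc)
local = approx_local_time(ts, -5 * 3600)   # hypothetical account at UTC-5
print(local.hour, local.minute)  # 1 30
```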
Sentiment Polarity Score
For this feature, a separate function was implemented to assign a sentiment polarity score to each individual tweet. SentiStrength was not used in this particular case, because our method applied semantic rules tailored to the project specifics. For instance, the function incorporated some important words found during the data exploration process, and in the presence of such words the scores of subsequent words were either boosted or reduced. The sentiment score of each term was obtained as the sum of the sentiment scores of its synonyms, derived from the lexical database WordNet [44]. However, some words had more than one meaning, so the algorithm additionally used the part-of-speech tags to approximate the word sense. For normalisation purposes, the final score was divided by the number of synonyms in the respective word network.
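A toy version of this scorer, with a made-up lexicon and synonym sets standing in for the WordNet-derived data:

```python
# Placeholder sentiment lexicon and synonym sets (illustrative values only).
SENTIMENT = {"restless": -0.6, "sleepless": -0.8, "awake": -0.1,
             "calm": 0.5, "peaceful": 0.6}
SYNONYMS = {"insomniac": ["restless", "sleepless", "awake"],
            "relaxed": ["calm", "peaceful"]}

def term_score(term):
    synonyms = SYNONYMS.get(term)
    if not synonyms:
        return SENTIMENT.get(term, 0.0)
    # sum the synonyms' scores, then normalise by the synonym-set size
    return sum(SENTIMENT.get(s, 0.0) for s in synonyms) / len(synonyms)

print(round(term_score("insomniac"), 2))  # -0.5
```

The real implementation additionally consulted POS tags to pick the right sense before collecting synonyms.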
3.6.2 Feature Selection
Feature selection, also known as variable selection, is a critical part of the feature engineering process, seeking to select only the subset of features relevant to the problem domain. Feature selection requires a deep understanding of all the important aspects of the dataset and can be a very challenging task. For that purpose, we took full advantage of the rich set of feature selection tools provided by the data mining library scikit-learn [45] and employed a low-variance selection algorithm. We set our own variance threshold and removed all features whose variance fell below the specified value. This method was especially useful when the multidimensional TF-IDF and bag-of-words models were developed, as it eliminated redundant features (words) and thus helped prevent overfitting.
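The low-variance filter (what scikit-learn's VarianceThreshold does) is simple enough to show in plain Python:

```python
# Keep only feature columns whose variance exceeds the threshold,
# mirroring sklearn.feature_selection.VarianceThreshold.
def variance(column):
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)

def low_variance_filter(X, threshold=0.0):
    columns = list(zip(*X))
    keep = [j for j, col in enumerate(columns) if variance(col) > threshold]
    return [[row[j] for j in keep] for row in X], keep

X = [[0, 1, 3], [0, 0, 5], [0, 1, 4]]   # first column is constant
reduced, kept = low_variance_filter(X)
print(kept)  # [1, 2]
```

A constant column carries no information for separating classes, so it is always dropped at the default threshold of 0.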
3.7 Summary
In this chapter we discussed the implementation methodology and we gave an overview of
the functionality of the developed software.
Chapter 4 Testing and Evaluation
This chapter will touch upon the testing and evaluation techniques applied to validate the
results of the study.
4.1 Software Testing
4.1.1 Unit testing
Each major component of the software was verified and tested via Python Unit Tests [46].
Table 4.1 presents the test cases which have been inspected during the development
process. Each test case introduced a separate function, contributing to full test coverage and
validity. The test script was regularly run and updated to guarantee the normal behaviour of
the software. PyUnit was chosen as a testing framework because it provided a rich set of
tools for constructing and maintaining the test suite.
Test Case Assertion Result
Detect only English tweets True
Check if tweet contains links True
Check if tweet is a retweet True
Remove user mentions and spam True
Remove special characters (emoji) True
Remove punctuation from tweets True
Expand abbreviations from a collected dictionary
True
Table 4.1: Test Cases
4.1.2 Usability testing
The annotation tool was repeatedly modified and improved by means of usability testing. The domain experts provided verbal feedback after each alteration to the software, which was very helpful, as we had the opportunity to reflect on their comments in the next iteration. This gave direct input on how real users see the system and helped to identify performance issues as well as indicating the participants' overall satisfaction with the product.
4.2 Classifiers’ performance evaluation
4.2.1 K-fold cross-validation
The k-fold cross-validation method in this research provided a measure of the statistical reliability of the classifiers and showed how well each predictive model generalised to an independent dataset. The method is depicted in Figure 4.1: a given dataset is randomly partitioned into k folds, of which a single one is used for testing and the remaining k-1 for training. The process is repeated k times, enabling all data points from the selected dataset to be used for both training and testing.
Issues of imbalanced data were experienced throughout, so a variation of k-fold validation, known as stratified k-fold, was employed. This modification ensures an equal distribution of classes within each individual fold [47]. The performance of the classifier was also measured against an independent dataset, because relying only on patterns visible in the training data would cause overfitting [48]. For that purpose, I asked the researchers to annotate an additional set of tweets, which was later used to confirm the effectiveness of the classifier.
In this empirical study, the value of k was set to 10.
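Stratified fold assignment can be sketched as dealing each class round-robin into the k folds:

```python
from collections import defaultdict

# Within each class, examples are dealt round-robin into k folds, so every
# fold preserves the overall class balance.
def stratified_folds(labels, k):
    fold_of = [0] * len(labels)
    seen = defaultdict(int)
    for i, label in enumerate(labels):
        fold_of[i] = seen[label] % k
        seen[label] += 1
    return fold_of

labels = ["pos"] * 8 + ["neg"] * 4     # imbalanced, as in this study
folds = stratified_folds(labels, k=4)
for f in range(4):
    members = [labels[i] for i, fd in enumerate(folds) if fd == f]
    print(f, members.count("pos"), members.count("neg"))  # 2 pos, 1 neg each
```

In practice scikit-learn's StratifiedKFold also shuffles within classes; the point here is only that each fold ends up with the same class proportions.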
Figure 4.1: K-fold cross-validation
4.2.2 Confusion Matrix
In machine learning, confusion matrices are often used to evaluate the accuracy of a classifier based on the actual and predicted labels. Each column of the confusion matrix represents a predicted class, while each row shows the actual label. A typical confusion matrix reports the true positive, false positive, false negative and true negative instances after the classification process. An explanation of each term can be found on the last page of the report.
4.2.3 Precision, Recall, F-score
Accuracy itself is not typically inspected in isolation but is evaluated together with other statistical measures such as precision, recall and F-score.
Recall - the fraction of all relevant examples that were retrieved.
Precision - the fraction of retrieved examples that are relevant.
F-score - conveys the balance between precision and recall.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-score = 2 × (precision × recall) / (precision + recall)
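These three formulas compute directly from confusion-matrix counts; the numbers below are illustrative.

```python
# Precision, recall and F-score from confusion-matrix counts
# (TP = true positives, FP = false positives, FN = false negatives).
def metrics(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

p, r, f = metrics(tp=40, fp=10, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.8 0.8
```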
4.3 Summary
In summary, this chapter presented an overview of the conducted testing methods and
provided technical details about the used classification performance metrics.
Figure 4.2: Confusion Matrix
Chapter 5 Analysis and results
The role of this chapter is to outline the most noteworthy findings and results, obtained
after the performed analysis. Furthermore, the chapter provides performance metrics of the
predictive models.
5.1 Data statistics
As can be seen from Table 5.1, we managed to collect 29,890 potential diagnostic tweets and 46,953 potential sleep-related tweets. However, a significant number of those instances were noise or spam, which led to a data imbalance problem. The psychology researchers manually annotated 547 tweets from our potential diagnostic dataset and classified 43 (8%) examples as positive. The proportion of tweets expressing self-reported sleep disturbance was significantly higher: of 507 labelled examples, 354 (69%) were tagged positively. The difference in the number of positive examples between the two datasets could be explained by the fact that discussing mental health problems such as schizophrenia and bipolar disorder is often considered a taboo subject on social media, whereas tweets about sleep-related issues are more common and less likely to cause social anxiety. Nevertheless, it is important to state that all of the numbers mentioned in this section were derived from two independent datasets and the examples were used only for training purposes.
Diagnostic Tweets Sleep-related Tweets Timeline Tweets
Streaming API 8812 44588 0
Search API 21078 2365 3886
Number of unique words 23505 29622 7908
Total Tweets 29890 46953 3886
Table 5.1: Data collection statistics
5.2 Accuracy, precision, recall of classifiers
In an information retrieval context, it is vitally important to retrieve the maximum number of relevant examples. This often presents a huge challenge in studies exploring social media content because of the high volume of irrelevant and noisy data. Frequent false positives could even overwhelm the effect of the correct results; therefore the F-score, which considers both precision and recall, is commonly used.
Table 5.2 shows the classification performance of the four classifiers used in this project. Accuracy, precision, recall and F-measure served as statistical measures of their effectiveness. To determine whether a classifier was overfitting, we ran 10-fold stratified cross-validation; the results are shown in the table below. As can be seen, the SVM trained on the combination of TF-IDF scores and semantic classes achieved the best accuracy (0.89) and the highest precision (0.89). Surprisingly, the accuracy of the same classifier trained on the two features separately dropped to 0.83 and 0.66 respectively. In addition, POS tags were also identified as an informative feature, achieving almost 69% precision and 68% accuracy with the Random Forest classifier.
Decision Tree | Naïve Bayes | Support Vector Machines | Random Forest
Features p r f a | p r f a | p r f a | p r f a
Bag-of-words 0.83 0.83 0.83 0.83 0.86 0.85 0.85 0.84 0.87 0.87 0.87 0.87 0.82 0.82 0.82 0.81
TF-IDF 0.87 0.87 0.87 0.87 0.83 0.82 0.82 0.83 0.81 0.83 0.82 0.83 0.80 0.80 0.81 0.80
POS 0.65 0.65 0.65 0.65 0.64 0.64 0.64 0.64 0.69 0.64 0.67 0.68 0.68 0.68 0.67 0.68
POS Frequency 0.66 0.63 0.65 0.67 0.62 0.63 0.63 0.64 0.68 0.60 0.65 0.66 0.68 0.64 0.67 0.68
POD 0.77 0.64 0.68 0.66 0.63 0.60 0.62 0.62 0.75 0.73 0.74 0.75 0.68 0.69 0.69 0.69
Sentiment Score 0.74 0.74 0.74 0.74 0.64 0.65 0.65 0.65 0.66 0.64 0.65 0.65 0.68 0.62 0.66 0.66
Semantic classes 0.70 0.66 0.63 0.66 0.76 0.64 0.57 0.64 0.72 0.67 0.63 0.66 0.69 0.66 0.62 0.65
Semantic Classes & TF-IDF 0.87 0.87 0.87 0.87 0.79 0.71 0.68 0.71 0.89 0.83 0.87 0.89 0.87 0.86 0.86 0.86
Table 5.2: Classifiers performance evaluation
p - precision r - recall f - f-score a - accuracy
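The feature combination behind the best-performing model can be sketched with scikit-learn [45], the library used in this project. The corpus and the sleep lexicon below are hypothetical stand-ins for the annotated data and the semantic-class table:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the annotated corpus.
tweets = ["cant sleep again tonight", "great dinner with friends",
          "awake all night no rest", "watching the game now"] * 10
labels = np.array([1, 0, 1, 0] * 10)

# TF-IDF scores for every tweet.
tfidf = TfidfVectorizer().fit_transform(tweets)

# Stand-in for the semantic-class feature: a single binary column flagging
# whether the tweet contains a term from a (hypothetical) sleep lexicon.
lexicon = {"sleep", "awake", "rest"}
semantic = np.array([[float(any(w in lexicon for w in t.split()))]
                     for t in tweets])

# Concatenate the two feature blocks and evaluate with 10-fold stratified
# cross-validation, mirroring the setup behind Table 5.2.
X = hstack([tfidf, semantic])
scores = cross_val_score(LinearSVC(), X, labels,
                         cv=StratifiedKFold(n_splits=10), scoring="accuracy")
print(round(scores.mean(), 2))
```

On real tweets the semantic block would hold one column per semantic class rather than a single lexicon flag.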
Some tweets correctly classified as self-reported sleep disturbance are listed in Table 5.3.

Tweet                                                       Predicted Class   Actual Class
“I’m Satan and I can’t sleep, no rest of the wicked”        Positive          Positive
“As usual I cant fall asleep”                               Positive          Positive
“I probably won’t, I need to take a pill to fall asleep”    Positive          Positive
“I feel like sleeping on bed made of broken glass”          Positive          Positive
“My dream keeps me awake”                                   Positive          Positive
Table 5.3: Correctly classified sleep-related instances
5.3 Error Analysis
To perform the error analysis, the labels of the test examples were manually examined after cross-validation for signs of systematic error trends. Once the model had been trained and tested, the error analysis was used to indicate whether the learning algorithm was suffering from high bias or high variance. Table 5.4 presents some of the sleep-related tweets misclassified by the SVM classifier. The first example was incorrectly predicted because it contains two typically strong features – “time” and “sleep”. “Time” was even part of our semantic class table, which further boosted the classifier’s confidence that the example expresses sleep-related disturbance. However, we cannot omit the fact that the classifier successfully interpreted subjectivity and objectivity, which can be seen in all three examples.
Text                                 Predicted Class
“Alright time to go back to sleep”   Positive
“Noah sleep talking is creepy af”    Positive
“I can’t think straight”             Positive
Table 5.4: Misclassified examples
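One way to gather such misclassified examples after cross-validation is to use out-of-fold predictions. A sketch with scikit-learn on toy stand-in data (not the project corpus):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the annotated corpus (hypothetical data and labels).
tweets = ["cant sleep at all", "lovely day out", "awake again at 3am",
          "nice walk in the park"] * 5
labels = np.array([1, 0, 1, 0] * 5)

X = CountVectorizer().fit_transform(tweets)

# cross_val_predict returns an out-of-fold prediction for every example,
# so each tweet is labelled by a model that never saw it during training.
predicted = cross_val_predict(MultinomialNB(), X, labels, cv=5)

# Pair every misclassified tweet with its predicted and actual label,
# ready for manual inspection.
errors = [(t, p, g) for t, p, g in zip(tweets, predicted, labels) if p != g]
```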
5.4 Experimental findings
To investigate sleep-related phenomena amongst the collected timeline tweets, we conducted several experiments; the results are presented in this section.
5.4.1 Sentiment distribution amongst the timeline tweets
To investigate the emotions expressed before and after the diagnosis, we performed sentiment analysis. We noticed two emotional peaks, occurring within the range of 40 to 50 days before the diagnosis and 115 to 125 days after it. Figure 5.1 also provides another interesting finding: people seem to use social media less after they have been diagnosed, which is supported by the drastic decline in posted tweets within the period of 1 - 100 days after the diagnosis.
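Aggregating sentiment relative to the diagnosis date can be sketched as follows; the (offset, score) pairing is an illustrative assumption, not the project's stored format:

```python
from collections import defaultdict

def sentiment_by_day(scored_tweets):
    """Average sentiment score per day relative to the diagnosis date.

    `scored_tweets` is a list of (days_from_diagnosis, sentiment_score)
    pairs; negative offsets fall before the diagnosis.
    """
    buckets = defaultdict(list)
    for offset, score in scored_tweets:
        buckets[offset].append(score)
    return {day: sum(s) / len(s) for day, s in buckets.items()}

timeline = [(-45, -0.8), (-45, -0.6), (120, -0.7), (10, 0.2)]
averages = sentiment_by_day(timeline)  # e.g. day -45 averages to -0.7
```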
5.4.2 Semantic Class Influence
As previously stated, our research identified and stored the different semantic entities found in the gathered timeline tweets. The results show that the semantic class “Relationships” occurred most frequently, 325 times (33%), of which “friend”, “family”, “mom”, “baby”, “parent”, “father” and “brother” were the most common entities.
Semantic Class                          Count
Relationships                           325
Swear Words                             278
Electronics                             180
Physical Space Location                 81
Religious Terms                         30
Health Problems                         27
Fear Expression                         24
Psychosis Problems                      14
Location and Content of Hallucination   1
Sleep Terms                             1
Table 5.5: Semantic class frequency

Figure 5.1: Sentiment Distribution
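Counting semantic-class occurrences reduces to lexicon matching. A simplified sketch with abbreviated, illustrative lexicons (the project's semantic-class table is far larger):

```python
# Abbreviated, illustrative lexicons.
SEMANTIC_CLASSES = {
    "Relationships": {"friend", "family", "mom", "baby", "parent",
                      "father", "brother"},
    "Sleep Terms": {"insomnia", "nightmare"},
}

def count_semantic_classes(tweets):
    """Count lexicon hits per semantic class across a list of tweets."""
    counts = dict.fromkeys(SEMANTIC_CLASSES, 0)
    for tweet in tweets:
        words = set(tweet.lower().split())
        for cls, lexicon in SEMANTIC_CLASSES.items():
            counts[cls] += len(words & lexicon)
    return counts

counts = count_semantic_classes(["my brother and his friend came over",
                                 "another nightmare again"])
# → {'Relationships': 2, 'Sleep Terms': 1}
```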
5.4.3 Posting time
One of our most significant experimental findings is depicted in Figure 5.2, which illustrates the posting trends derived from our diagnostic dataset. According to it, the peak posting time lies between the hours of 8PM and 3AM, accounting for 42 (38%) of the 109 examples.
Figure 5.2: Time of posting trend
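The share of tweets falling in a posting window that wraps past midnight can be computed as follows (the hours below are hypothetical):

```python
def peak_window_share(hours, start=20, end=3):
    """Fraction of tweets posted between `start` (inclusive) and `end`
    (exclusive) on a 24h clock, where the window may wrap past midnight,
    e.g. 8PM-3AM."""
    in_window = sum(1 for h in hours if h >= start or h < end)
    return in_window / len(hours)

# Hypothetical posting hours (0-23) extracted from tweet timestamps.
hours = [21, 23, 1, 14, 2, 9, 22, 3]
share = peak_window_share(hours)  # 5 of the 8 tweets fall in the window
```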
5.5 Summary
This chapter presented the outcomes of the conducted research, providing paraphrased tweets showing sleep-related experiences. In addition, an evaluation of the classifiers’ performance was given, exposing their weaknesses and strengths.
Chapter 6 Conclusion
The chapter gives an overview of the achieved objectives and presents proposed future
work as an extension of the current software.
6.1 Reflection and Achievements
In terms of performance metrics, the project managed to achieve results similar to other text mining studies of social media. We provided evidence that Twitter content is a valuable source of information for mental health analysis and confirmed that sleep-related trends could be hidden in psychosis-like phenomena. In addition, a graphical user interface was implemented to assist future researchers in their analysis. Despite the numerous challenges and obstacles, our system achieved a satisfactory performance of 89% accuracy with the SVM machine learning classifier. Therefore, the project aim, to establish a robust system capable of automatic text mining, can be considered a success. The major achievements of this project are:
Development of an efficient diagnostic filter.
Implementation of an annotation tool employing a client-server model.
Automatic classification of self-reported sleep disturbance.
Implementation of an analysis tool.
Last but not least, an abstract based on the methodology proposed in this report was submitted to, and subsequently accepted by, the Population Data Linkage Conference [49]. The event focused on research papers dealing with data linkage and data science approaches that seek to improve healthcare services.
6.2 Future work
Although we incorporated numerous informative features, we did not manage to test the performance of classifiers trained on named entities. Tools already exist that chunk and classify text elements into pre-defined categories such as organisations, temporal expressions and percentages. We found the recognisers ManTime [50] and the Stanford Parser [51] particularly useful because of their ability to extract time patterns from general-domain texts, but due to time limitations the importance of this feature was not evaluated. We believe that such temporal entities could improve the classification of sleep-related tweets, since we spotted frequent usage of temporal expressions in the positively labelled tweets.
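As a rough illustration of such a temporal feature, short of running ManTime itself, a regular expression can flag common time patterns. The pattern below is a simplified stand-in and is not representative of the tool's actual coverage:

```python
import re

# Simplified stand-in for a temporal-expression recogniser: a few common
# time phrases observed in tweets.
TIME_PATTERN = re.compile(
    r"\b(\d{1,2}\s?(am|pm)|tonight|all night|every night|\d{1,2}:\d{2})\b",
    re.IGNORECASE)

def has_temporal_expression(tweet):
    """Binary feature: does the tweet mention a time pattern?"""
    return bool(TIME_PATTERN.search(tweet))

print(has_temporal_expression("still awake at 3am"))  # True
print(has_temporal_expression("lovely day out"))      # False
```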
Another task that could be added to the existing software is word-sense disambiguation. Although we approximated the sense of words by applying various natural language processing techniques, this process could be further improved. For example, an algorithm could detect the contextual meaning of a word by comparing all of its senses and taking its surrounding words into consideration. More advanced algorithms exploit the structural properties of a word network by applying graph-based [52] or tree-based [53] approaches.
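A Lesk-style approach illustrates the basic idea: choose the sense whose dictionary gloss overlaps most with the context. The sense inventory below is a toy illustration; a real system would use WordNet [44] glosses:

```python
# Toy sense inventory for illustration only.
SENSES = {
    "rest": {
        "sleep": "repose of the body, freedom from activity at night",
        "remainder": "the part that is left over after everything else",
    },
}

def simple_lesk(word, context):
    """Pick the sense whose gloss shares the most words with the context
    (ties are broken arbitrarily)."""
    context_words = set(context.lower().split())
    overlaps = {sense: len(context_words & set(gloss.split()))
                for sense, gloss in SENSES[word].items()}
    return max(overlaps, key=overlaps.get)

print(simple_lesk("rest", "cannot sleep at night no rest"))  # sleep
```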
6.3 Summary
To summarise, the project established a robust and user-friendly system to retrieve, analyse and visualise sleep-related phenomena amongst a set of tweets from people experiencing mental health problems. The proposed methodology coped with the inconsistent and heterogeneous nature of social media by carrying out different text normalisation and transformation tasks. The implemented software will be handed over to the psychology researchers at the University of Manchester for further data exploration.
References
1. Astroml. Machine Learning 101: General Concepts — Machine Learning for Astronomy with Scikit-learn. 2016 [cited 2016 4 Apr]; Available from: http://www.astroml.org/sklearn_tutorial/general_concepts.html.
2. Twitter Usage Statistics - Internet Live Stats. 2016; Available from: http://www.internetlivestats.com/twitter-statistics/.
3. Birnbaum, M.L., et al., Role of social media and the Internet in pathways to care for adolescents and young adults with psychotic disorders and non-psychotic mood disorders. Early Intervention in Psychiatry, 2015.
4. Coppersmith, G., M. Dredze, and C. Harman. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. 2014.
5. Buchanan, E., Ethical decision-making and internet research.
6. Hewson, C. and T. Buchanan. Ethics Guidelines for Internet-mediated Research. 2013. The British Psychological Society.
7. Pub, N.F., 197: Advanced encryption standard (AES). Federal Information Processing Standards Publication, 2001. 197: p. 441-0311.
8. Hu, X. and H. Liu, Text Analytics in Social Media, in Mining Text Data, C.C. Aggarwal and C. Zhai, Editors. 2012, Springer US: Boston, MA. p. 385-414.
9. Morstatter, F., et al., Is the sample good enough? Comparing data from Twitter's Streaming API with Twitter's Firehose. arXiv preprint arXiv:1306.5204, 2013.
10. Witten, I.H., Text mining.
11. Gegick, M., P. Rotella, and T. Xie. Identifying security bug reports via text mining: An industrial case study. In Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on. 2010.
12. Sullivan, D., Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales. 2001: John Wiley & Sons, Inc. 560.
13. Percha, B., Y. Garten, and R.B. Altman, Discovery and explanation of drug-drug interactions via text mining. Pacific Symposium on Biocomputing, 2012: p. 410-421.
14. Cohen, W.W. and S. Sarawagi. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004. ACM.
15. Baluja, S., V.O. Mittal, and R. Sukthankar, Applying Machine Learning for High-Performance Named-Entity Extraction. Computational Intelligence, 2000. 16(4): p. 586-595.
16. Zhou, G. and J. Su, Named entity recognition using an HMM-based chunk tagger, in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002, Association for Computational Linguistics: Philadelphia, Pennsylvania. p. 473-480.
17. Thelwall, M., et al., Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 2010. 61(12): p. 2544-2558.
18. Saif, H., Y. He, and H. Alani, Semantic Sentiment Analysis of Twitter, in The Semantic Web – ISWC 2012: 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part I, P. Cudré-Mauroux, et al., Editors. 2012, Springer Berlin Heidelberg: Berlin, Heidelberg. p. 508-524.
19. Rambocas, M. and J. Gama, Marketing research: The role of sentiment analysis. 2013, Universidade do Porto, Faculdade de Economia do Porto.
20. Alpaydin, E., Introduction to machine learning. 2014: MIT Press.
21. Leopold, E. and J. Kindermann, Text Categorization with Support Vector Machines. How to Represent Texts in Input Space? Machine Learning. 46(1): p. 423-444.
22. Joachims, T., Text categorization with support vector machines: Learning with many relevant features. 1998: Springer.
23. Diederich, J., A. Al-Ajmi, and P. Yellowlees, Ex-ray: Data mining and mental health. Applied Soft Computing, 2007. 7(3): p. 923-928.
24. Shepherd, A., et al., Using social media for support and feedback by mental health service users: thematic analysis of a Twitter conversation. BMC Psychiatry, 2015. 15(1): p. 1.
25. Reavley, N.J. and P.D. Pilkington, Use of Twitter to monitor attitudes toward depression and schizophrenia: an exploratory study. PeerJ, 2014. 2: p. e647.
26. McManus, K., et al., Mining Twitter data to improve detection of schizophrenia. AMIA Summits on Translational Science Proceedings, 2015. 2015: p. 122.
27. Belousov, M., Identifying signs of schizophrenia in Twitter using text mining techniques. 2015.
28. Cao, L. and B. Ramesh, Agile Requirements Engineering Practices: An Empirical Study. IEEE Software, 2008. 25(1): p. 60-67.
29. MongoDB. 2016; Available from: https://www.mongodb.org/.
30. IntelliJ IDEA. 2016; Available from: https://www.jetbrains.com/idea/.
31. The Search API. 2016; Available from: https://dev.twitter.com/rest/public/search.
32. The Streaming APIs. 2016; Available from: https://dev.twitter.com/streaming/overview.
33. Ramos, J., Using tf-idf to determine word relevance in document queries.
34. Coppersmith, G., et al., From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. NAACL HLT 2015, 2015: p. 1.
35. Clark, E. and K. Araki, Text normalization in social media: progress, problems and applications for a pre-processing system of casual English. Procedia - Social and Behavioral Sciences, 2011. 27: p. 2-11.
36. Bontcheva, K., et al., TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text. In RANLP. 2013.
37. Owoputi, O., et al., Improved part-of-speech tagging for online conversational text with word clusters. 2013. Association for Computational Linguistics.
38. Bird, S., E. Klein, and E. Loper, Natural Language Processing with Python. 2009: O'Reilly Media, Inc.
39. Kim, J., et al., Noise Removal Using TF-IDF Criterion for Extracting Patent Keyword, in Soft Computing in Big Data Processing, M.K. Lee, S.-J. Park, and J.-H. Lee, Editors. 2014, Springer International Publishing: Cham. p. 61-69.
40. Klein, D. and C.D. Manning, Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1. 2003. Association for Computational Linguistics.
41. Guha, R., R. McCool, and E. Miller, Semantic search. In Proceedings of the 12th international conference on World Wide Web. 2003. ACM.
42. Falotico, R. and P. Quatto, Fleiss' kappa statistic without paradoxes. Quality & Quantity, 2014. 49(2): p. 463-470.
43. Sondhi, P., Feature construction methods: a survey. 2009.
44. Miller, G.A., WordNet: a lexical database for English. Communications of the ACM, 1995. 38(11): p. 39-41.
45. Pedregosa, F., et al., Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011. 12: p. 2825-2830.
46. Unit testing framework — Python 3.5.1 documentation. 2016 [cited 2016 5 April]; Available from: https://docs.python.org/3/library/unittest.html.
47. Refaeilzadeh, P., L. Tang, and H. Liu, Cross-Validation, in Encyclopedia of Database Systems, L. Liu and M.T. Özsu, Editors. 2009, Springer US: Boston, MA. p. 532-538.
48. Elkan, C., Evaluating classifiers. 2012.
49. IPDLN Conference 2016. 2016; Available from: http://ipdlnconference2016.org/CallForAbstracts.
50. Filannino, M. and G. Nenadic, Temporal expression extraction with extensive feature type selection and a posteriori label adjustment. Data & Knowledge Engineering, 2015. 100, Part A: p. 19-33.
51. Finkel, J.R., T. Grenager, and C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. 2005. Association for Computational Linguistics.
52. Agirre, E., et al., A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2009. Association for Computational Linguistics.
53. Hatori, J. and Y. Miyao, Word Sense Disambiguation for All Words using Tree-Structured Conditional Random Fields.
Appendix A
Decision Tree – a supervised learning method modelled as a tree which maps input variables to labels based on decision rules. Each internal node tests one input variable and each leaf node is assigned a class label. The class label depends on the values of the input variables encountered along the path explored from the root node. A simple representation of a decision tree is shown in the figure below.
The advantages of decision trees include:
Handling multi-output problems.
Coping with both categorical and numerical input data.
Prediction cost that is logarithmic in the number of points used for training.
Implicit feature selection.
[Figures: Linearly Separable SVM; Non-linearly Separable SVM case]
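A minimal decision-tree sketch with scikit-learn, on hypothetical feature vectors:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors: [contains_sleep_term, sentiment_score];
# label 1 marks self-reported sleep disturbance.
X = [[1, -0.8], [1, -0.5], [0, 0.6], [0, 0.1]]
y = [1, 1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[1, -0.9]]))  # a single split separates this toy data
```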
Random Forest – an ensemble learning method for classification which fits a number of decision trees, each constructed from a random subset of the training data. After a large number of trees is generated, a majority vote is performed to select the class. From a computational point of view, Random Forests are appealing because they:
Are resistant to overfitting.
Handle multi-class classification.
Are considerably fast to train and predict.
Measure variable importance.
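The corresponding random-forest sketch, again on hypothetical vectors:

```python
from sklearn.ensemble import RandomForestClassifier

# Same kind of hypothetical [contains_sleep_term, sentiment_score]
# vectors as above, repeated for a slightly larger toy sample.
X = [[1, -0.8], [1, -0.5], [0, 0.6], [0, 0.1]] * 5
y = [1, 1, 0, 0] * 5

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.predict([[1, -0.7]]))
print(forest.feature_importances_)  # per-feature contribution estimates
```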
Appendix B
Appendix C
After the participants assigned labels to each tweet from our pre-annotated dataset, the Fleiss’ kappa value showed how consistent they were, using the following formula:

κ = (P̄ − P̄e) / (1 − P̄e)

where:
- κ is the kappa coefficient: κ < 0 shows no agreement, 0.00–0.19 is poor, 0.20–0.39 is fair, 0.40–0.59 is moderate, 0.60–0.79 is substantial and 0.80–1.00 is almost perfect agreement;
- 1 − P̄e describes the degree of agreement attainable above chance;
- P̄ − P̄e describes the degree of agreement actually achieved above chance.
The following example illustrates the practical use of Fleiss’ kappa with 4 categories, 4 tweets and 3 participants (12 ratings in total):

Tweet    Category 1   Category 2   Category 3   Category 4   Pi
1        0            2            1            0            0.33
2        3            0            0            0            1.00
3        1            1            1            0            0.00
4        0            0            1            2            0.33
Total    4            3            3            2
pj       0.33         0.25         0.25         0.16

κ = 0.242 / 0.826 = 0.29
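The computation can be implemented in a few lines (a minimal sketch of the standard formula, demonstrated on a perfect-agreement matrix, where κ is exactly 1):

```python
def fleiss_kappa(matrix, n_raters):
    """Fleiss' kappa for a ratings matrix with one row per item and one
    column per category; each cell holds the number of raters who chose
    that category for that item."""
    n_items = len(matrix)
    total = n_items * n_raters
    # Per-item agreement P_i.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in matrix]
    # Per-category proportions p_j.
    p_j = [sum(row[j] for row in matrix) / total
           for j in range(len(matrix[0]))]
    p_bar = sum(p_i) / n_items       # observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement (every rater picks the same category) yields kappa 1.
perfect = [[3, 0], [3, 0], [0, 3]]
print(fleiss_kappa(perfect, 3))  # 1.0
```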
Appendix D
Tag Description
, Punctuation
D Determiner
L Nominal + verb
N Noun
O Pronoun
P Pre- or postposition
R Adverb
S Nominal + possessive
V Verb
Terms
Annotation – manual classification of sleep related /diagnostic tweets into positive, negative
or neutral class respectively.
False negative (FN) – the predicted class was negative, but should have been positive.
False positive (FP) – the predicted class was positive, but should have been negative.
Overfitting – occurs when a classification model captures noise and so does not perform
well on the evaluation dataset.
Retweets – reposting of someone else’s tweet (a retweeted tweet is normally identified with “RT”).
Self-reported sleep disturbance – any abnormal sleep experiences expressed by the user
who shares the message.
Stopwords – common words which carry little meaningful content, such as prepositions, determiners, personal pronouns, etc.
Sentiment – the opinion expressed in a given tweet.
Timeline – chronologically sorted stream of tweets posted by a single user.
Training data – manually annotated tweets by the domain experts, used for machine
learning classification purposes.
True negative (TN) – the predicted class was negative, which corresponds to the right label.
True positive (TP) – the predicted class was positive, which corresponds to the right label.