Mining twitter data pertaining to
psychosis for self-reported sleep
disturbance investigation
Author: Mladen Dinev
Supervisor: Goran Nenadic
The University of Manchester
School of Computer Science
May 2016
Abstract
Twitter provides a vast amount of wide-ranging data, giving mental health clinicians and
researchers a rich opportunity to perform analysis with automated text mining approaches.
The ability to study behavioural trends at a demographic and social level makes social media a
valuable cross-sectional source of information.
In this report, we discuss the relationship between psychotic(-like) experiences and sleep-
related problems in the general population using the social network platform Twitter. The
research focused on extracting self-reported sleep disturbance from users who had recently
been diagnosed with a mental illness. This study identified different semantic features in
the gathered datasets and revealed their impact on the problem domain.
The application used machine learning classifiers to predict tweets expressing self-reported
sleep disturbance, achieving 89% accuracy with a Support Vector Machine predictive model.
Various distinctive attributes were extracted from the corpus; however, the current results
indicate that a combination of semantic classes and individual word scores is the most
informative feature.
Keywords: text mining, machine learning, mental health, text normalisation
Acknowledgements
Firstly, I would like to express my gratitude to my supervisor Goran Nenadic for his
continuous guidance, support and advice throughout the whole period of
development.
Furthermore, I would like to thank Dr Rohan Morris and Natalie Berry, psychology researchers
at the University of Manchester, for sharing their domain knowledge and helping me to achieve
the aim of the project.
Finally, I would like to express my gratitude to Maksim Belousov, a Computer Science PhD
student, for his assistance, especially when I was struggling, and for pointing me in the right
direction towards the successful completion of the application.
This research could not have been done without their contribution.
Abbreviations
AES – Advanced Encryption Standard
API – Application Programming Interface
HMM – Hidden Markov Model
NB – Naïve Bayes
NER – Named Entity Recognition
PLE – Psychotic-like Experience
POD – Part of the Day
POS – Part-of-Speech Tags
SVM – Support Vector Machine
TF-IDF – Term Frequency - Inverse Document Frequency
Table of Contents
Table of Contents ....................................................................................................................... 1
List of Tables ............................................................................................................................... 4
List of Figures.............................................................................................................................. 5
Chapter 1 Context .................................................................................................................. 6
Introduction ................................................................................................................................ 6
1.1 Motivation .................................................................................................................... 6
1.2 Aim and objectives ....................................................................................................... 6
1.3 Ethical Approval ........................................................................................................... 7
1.4 Challenges .................................................................................................................... 8
1.5 Report Structure ........................................................................................................... 8
Chapter 2 Background ........................................................................................................... 9
2.1 Text Mining .................................................................................................................. 9
2.2 Machine Learning ....................................................................................................... 11
2.2.1 Supervised Learning ............................................................................................ 11
2.3 Text Mining Social Media ........................................................................................... 14
2.4 Summary .................................................................................................................... 15
Chapter 3 Design and Development .................................................................................... 16
3.1 Design ......................................................................................................................... 16
3.1.1 Requirements ...................................................................................................... 16
3.1.2 System design...................................................................................................... 17
3.1.3 Development environment ................................................................................. 18
3.1.4 Graphical User Interface ..................................................................................... 19
3.2 Development .............................................................................................................. 20
3.2.1 Collection of potential diagnostic tweets ........................................................... 20
3.3 Text normalisation ..................................................................................................... 21
3.3.1 Tokenisation ........................................................................................................ 21
3.3.2 Twitter object removal ........................................................................................ 22
3.3.3 Stemming ............................................................................................................ 22
3.3.4 Abbreviations ...................................................................................................... 22
3.3.5 Part of Speech Tagging ........................................................................................ 23
3.3.6 Term Frequency - Inverse Document Frequency ............................................... 23
3.3.7 Negation detection ............................................................................................. 24
3.4 Pre-filtering of diagnostic tweets ............................................................................... 24
3.5 Manual Annotation .................................................................................................... 25
3.6 Classification of self-reported sleep disturbance ...................................................... 26
3.6.1 Feature Construction .......................................................................................... 26
3.6.2 Feature Selection ................................................................................................ 28
3.7 Summary .................................................................................................................... 28
Chapter 4 Testing and Evaluation ........................................................................................ 29
4.1 Software Testing ........................................................................................................ 29
4.1.1 Unit testing .......................................................................................................... 29
4.1.2 Usability testing ................................................................................................... 29
4.2 Classifiers’ performance evaluation ........................................................................... 30
4.2.1 Cross-fold validation ........................................................................................... 30
4.2.2 Confusion Matrix ................................................................................................. 30
4.2.3 Precision, Recall, F-score. .................................................................................... 31
4.3 Summary .................................................................................................................... 31
Chapter 5 Analysis and results ............................................................................................. 32
5.1 Data statistics ............................................................................................................. 32
5.2 Accuracy, precision, recall of classifiers ..................................................................... 32
5.3 Error Analysis ............................................................................................................. 34
5.4 Experimental findings ................................................................................................ 34
5.4.1 Sentiment distribution amongst the timeline tweets ........................................ 35
5.4.2 Semantic Class Influence ..................................................................................... 35
5.4.3 Posting time ........................................................................................................ 36
5.5 Summary .................................................................................................................... 36
Chapter 6 Conclusion ........................................................................................................... 37
6.1 Reflection and Achievements .................................................................................... 37
6.2 Future work ................................................................................................................ 37
References ................................................................................................................................ 39
Appendix A ............................................................................................................................... 42
Appendix B ............................................................................................................................... 44
Appendix C ............................................................................................................................... 46
Appendix D ............................................................................................................................... 47
Terms ........................................................................................................................................ 48
List of Tables
Table 1.1 Objectives ................................................................................................................... 7
Table 3.1: Requirements ........................................................................................................... 17
Table 3.2: Constructed Features............................................................................................... 27
Table 3.3: Semantic Classes...................................................................................................... 27
Table 4.1: Test Cases ................................................................................................................ 29
Table 5.1: Data collection statistics.......................................................................................... 32
Table 5.2: Classifiers performance evaluation ......................................................................... 33
Table 5.3: Correctly classified sleep related instances ............................................................. 34
Table 5.4: Misclassified examples ............................................................................................ 34
Table 5.5: Semantic class frequency ........................................................................................ 36
List of Figures
Figure 1.1: Encryption pipeline ................................................................................................... 7
Figure 2.1: Text mining pipeline adapted from [20] ................................................................ 10
Figure 2.2: Supervised learning flow diagram ......................................................................... 12
Figure 2.3: Illustration of SVM components in linearly separable case ................................... 13
Figure 3.1: System Architecture ............................................................................................... 18
Figure 3.2: Predicted labels of timeline tweets (GUI) ............................................................... 20
Figure 3.3: Dependency graph of a tweet containing a negated word ................................... 24
Figure 3.4 Annotation Window ................................................................................................ 26
Figure 4.1: K fold cross-validation ............................................................................................ 30
Figure 4.2: Confusion Matrix .................................................................................................... 31
Figure 5.1: Sentiment Distribution ........................................................................................... 35
Figure 5.2: Time of posting trend ............................................................................................. 36
Chapter 1 Context
Introduction
This chapter introduces the problem domain, the aim and objectives set for this
project and the overall structure of the report. In addition, it outlines the most common
challenges in the area.
1.1 Motivation
The amount of information on the Web has been growing constantly in recent years, giving
researchers the opportunity to analyse large amounts of mental health data. Internet live
statistics show that, on average, around 6,000 tweets are generated every second, which
amounts to roughly 500,000,000 heterogeneous messages per day [2]. Tweets are a particularly
valuable source of knowledge because they contain diversified information at an international
level, produced at the speed of thought. In addition, Twitter is particularly useful because it has
well-documented APIs and, most importantly, provides its data freely. However, manually
establishing new patterns and trends using typical data analysis methods is an impossible
task for a human being; therefore automatic text mining tools are often employed.
Social media has no geographical boundaries, allowing large quantities of data to be analysed
automatically. Text mining systems give researchers the ability to mine human behavioural
trends efficiently from a wide range of angles. Previous studies analysing mental-health-related
Twitter data concluded that people who experience symptoms of psychological disorders
have high rates of social media usage [3]. Another study found that quantifiable mental health
signals related to bipolar disorder, major depressive disorder and post-traumatic stress
disorder were encoded in Twitter messages [4].
1.2 Aim and objectives
The first task of this project was to build an algorithm to detect Twitter users diagnosed
with psychosis. The next task, and the main aim of this project, was to implement a system
to identify self-reported sleep disturbance in timeline tweets published by the pre-filtered
diagnosed users.
This study also gathered different evidence in order to identify behavioural patterns
occurring before and after the diagnosis, taking into account the frequency of posted
tweets, the expressed emotions and the time of posting.
To achieve the set-out aim, the project was split into the smaller objectives shown in Table 1.1:

- Collect tweets automatically from Twitter using predefined search queries.
- Filter out invalid tweets and spam.
- Manually annotate a subset of diagnostic and sleep-related tweets through an annotation tool.
- Find different informative features with respect to the problem domain.
- Design and develop classifiers to automatically identify sleep-related issues.
- Analyse the data to identify potential trends.
- Plot and visualise the results on a graphical user interface.

Table 1.1 Objectives
1.3 Ethical Approval
In order to adhere to the formal ethical regulations for social media research, an ethics
application form was submitted to the Computer Science Department on the 10th of November and
approved on the 23rd (Approval Number: CS 218). A sensitive part of this research was the
identification and prevention of ethical issues, therefore the project was developed in
accordance with the guidelines formed by the Association of Internet Researchers and the
British Psychological Society [5] [6].
Moreover, to keep the identity of our users concealed, methods of anonymisation have
been applied. Firstly, all sensitive Twitter data fields were encrypted with an AES-128 cipher
and stored on external, password-protected university data storage [7]. Secondly, individual
profiling was avoided throughout the whole development and the analysis was
performed only on the aggregated dataset. Thirdly, occurrences of any user mentions were
automatically removed before the data transfer, so anonymity was preserved. Access
to the database was granted only to the people involved in the study, and each contributor
had their own login credentials, assuring and maintaining data protection.
Figure 1.1: Encryption pipeline
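The removal of user mentions described in this section can be done with a simple regular expression; this is only an illustrative sketch assuming the standard Twitter handle pattern (up to 15 word characters after an `@`), not the project's actual anonymisation code.

```python
import re

def remove_user_mentions(tweet: str) -> str:
    """Strip @username mentions so no user handles survive the data transfer."""
    # Twitter handles are 1-15 word characters following an '@' sign.
    return re.sub(r"@\w{1,15}", "", tweet).strip()

print(remove_user_mentions("@alice couldn't sleep again last night"))
```

Removing handles before storage means the anonymised text never contains a direct identifier, complementing the field-level encryption.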
1.4 Challenges
Despite the fact that there are many sophisticated text mining tools, analysing social media
still presents challenges due to the distinct and persistent characteristics of its data.
Firstly, Twitter restricts the length of messages to 140 characters, which may not
provide sufficient context for effective text processing methods. Another
limitation is the lack of external knowledge, which is often used to bridge the gaps in text
representation models [8].
Secondly, social media data contains non-standard language such as abbreviations,
acronyms, slang and interjections. This is a very challenging problem, requiring
Natural Language Processing methods that eliminate structural and lexical ambiguities and
identify the semantic meaning of the text.
Finally, one of the main challenges in mining social media is obtaining relevant
examples, mainly because networking platforms offer only a small chunk of the whole
available set [9]. In addition, without knowing the population distribution in the gathered
data, it is very hard to estimate whether the selected information is truly representative of
the larger group.
1.5 Report Structure
The report is structured as follows:
Chapter 2 Background - introduction to the problem domain and brief explanation of the related concepts, as well as presenting previous studies in the field.
Chapter 3 Design and Development – provides information about the functional requirements of the software and design rationale.
Chapter 4 Testing and Evaluation – discusses the testing methods used to validate and improve the results of the system.
Chapter 5 Analysis and results – presents observations and results after the performed analysis on the gathered dataset.
Chapter 6 Conclusion – summarizes the major inferences and findings drawn at the end of the project and presents potential future work.
Chapter 2 Background
Chapter two describes the fundamental concepts of text mining, machine learning and
natural language processing. Further details can be found in Appendix A and Appendix B.
2.1 Text Mining
Text mining, also known as knowledge discovery, refers to the process of deriving previously
unknown information from unstructured text. The process starts by extracting facts and events
from textual resources and then enables forming new hypotheses based on data mining and data
analysis methods. Text mining includes linguistic, statistical and machine learning techniques
that model and structure the information content for business intelligence, research or
investigation. However, data mining and text mining do not refer to the same thing: data
mining extracts implicit information from a given input, while in text mining the content is
explicitly stated in the text but needs automatic techniques to deduce its
interpretation [10].
Text mining was introduced in the 1980s and has since been applied to a wide range of
fields such as security, marketing and biomedical science [11] [12] [13]. Text mining is
associated with the concepts of natural language processing, information extraction,
information retrieval and sentiment analysis, which are described briefly in the next few
paragraphs. The whole text mining pipeline is presented in Figure 2.1. As can be seen, text
mining focuses on four main tasks – retrieving relevant documents from a large volume of
data, annotating and converting the material into a common format, extracting lexical and
semantic information and finally discovering new knowledge.
Natural Language Processing – the process of extracting meaning from natural language
text. The method seeks to answer the questions Who, When, Where, How and
Why. To gain insight into the semantic structure of a text, natural language processing
typically involves part-of-speech tagging, lexical dependency parsing and word sense
disambiguation.
Information Retrieval – an extension of document retrieval, describing the process of
returning documents relevant to user preferences.
Information Extraction – an important part of a text mining system is the information
extraction process, which aims to find events, facts, entities and relationships in
unrestricted natural language text. Unlike information retrieval, information extraction is
not undertaken by people and does not require deep domain-specific expertise. For
instance, named entity recognition (NER) detects predetermined classes
(organisations, people, locations, time expressions) in unstructured text. The term named
entity does not have a fixed meaning on its own, so it has to be defined in the context in which it
appears. Consider the following example:
“I used to work for Microsoft long time ago.”
“Microsoft” should be flagged as an organisation and “long time ago” as a temporal
expression. There are different NER tools which provide deep semantic understanding of
natural text using dictionary, machine learning or Hidden Markov Model (HMM) approaches
[14] [15] [16].
Sentiment Analysis – sentiment analysis, also known as opinion mining, is the automated
process of identifying the opinion expressed in a text, typically in a binary or trinary format.
Automatic recognition of people’s opinions and sentiment about a wide range of topics has
been applied in many fields including business, psychology and computer science [17] [18]
[19]. However, sentiment analysis over Twitter data faces many challenges due to the highly
diversified nature of social media content. For that purpose, an automatic sentiment analysis
tool developed on informal-language texts, SentiStrength, has been used in this research [17].
SentiStrength copes with spelling mistakes, negated words and untypical phrases which
boost or decrease the polarity of subsequent words.
Figure 2.1: Text mining pipeline adapted from [20]
2.2 Machine Learning
Machine learning is the science of giving computers the ability to learn without being
explicitly programmed. Alpaydin (2014) characterises data mining as a subdivision of
machine learning, referring to the process of extracting knowledge from a large volume of
data, typically a database [20].
Machine learning has proved to be an efficient approach in a variety of fields such as
fraud detection, spam filtering and pattern recognition. It is especially useful when
applied to data mining problems, because the learning does not require deep
understanding of the problem domain. In order to acquire knowledge, machine learning
algorithms automatically inspect the attributes and their implicit relationships in the
given dataset (the training data). After extracting information from the training examples, the
learner’s task is to generate a hypothesis about each target class. For that purpose, it applies
statistical theory to build mathematical models with adjustable parameters,
dependent on the problem in question. There are two types of models – predictive
and descriptive. Predictive models make predictions about future trends, while
descriptive models derive knowledge from a given dataset. Once the model has been fitted,
real-life input (the testing data) is supplied and classified accordingly.
Machine learning tasks, in turn, are typically classified into two types:
supervised learning and unsupervised learning. The next section outlines the difference
between the two techniques and provides details about the two classifiers used in this project.
2.2.1 Supervised Learning
In supervised learning, each example in the training set is composed of a pair: an input
object, represented by a feature vector, and a class label. Unlike supervised approaches,
unsupervised learning does not require labels for the input instances, but instead clusters
the data. However, acquiring a labelled dataset is not always trivial and in fact requires
good knowledge of the problem domain. Therefore, semi-supervised algorithms, which
combine supervised and unsupervised approaches, are often preferred. For this
particular study, supervised learning was chosen because of the availability of domain
experts.
The aim of supervised learning is to build a model which makes predictions based on
relations determined from the training set. Supervised algorithms should be able to
generalise, optimise and approximate efficiently from the given examples to generate accurate
decision rules. All supervised learning methods determine the structure and heterogeneity
(discrete or continuous) of the data and afterwards fit a learning model which classifies new
unseen instances. Figure 2.2 illustrates the different stages of the supervised learning
model.

There is a wide range of supervised learning algorithms, each with its strengths and
weaknesses; however, their performance depends highly on the extracted relations and on
the complexity of the problem domain.
Support Vector Machine
In machine learning, the Support Vector Machine (SVM) is a prediction model based on
the concept of decision planes. A decision plane separates the classes in a classification
problem by finding the maximum margin dividing the groups of points, so that the distance
between the decision plane and the nearest point from each class is maximised. As can be
expected, the training process needs to find the optimal decision plane which maximises the
margin over the training data. The points closest to the separating hyperplane are called support
vectors, hence the name Support Vector Machine. A mathematical formulation of the SVM
model is presented below.
For the linearly separable case, if {𝑥1, . . . , 𝑥n} are the data points and 𝑦i ∈ {1, −1} are
their class labels, then the two respective hyperplanes take the form:

𝑥i · 𝒘 + 𝑏 = +1 for H1
𝑥i · 𝒘 + 𝑏 = −1 for H2

where 𝒘 is the normal to the hyperplanes and 𝑏 the intercept term.
Figure 2.2: Supervised learning flow diagram adapted from [1]
The learning phase refers to the process of maximising the margin 1/||𝒘|| by solving the
Quadratic Programming optimisation:

min ½ ||𝒘||²  such that  𝑦i (𝑥i · 𝒘 + 𝑏) − 1 ≥ 0  ∀i

The intercept 𝑏 is then recovered from the set of support vectors S (of size Ns) as:

𝑏 = (1/Ns) ∑s∈S (𝑦s − ∑m∈S 𝑎m 𝑦m 𝑥m · 𝑥s)
Having the variables 𝑤 and 𝑏 we can define our separating hyperplane’s optimal orientation
and thus the solution of the SVM model. Then each new point x’ would be evaluated based
on the result of the following equation:
𝑦′ = 𝑠𝑔𝑛(𝑤 · 𝑥′ + 𝑏)
Figure 2.3: Illustration of SVM components in linearly separable case
SVM has been a widely used classifier in data mining and machine learning because of its
ability to overcome one of the most common machine learning issues – overfitting. SVM has
been prominent in dealing with both linearly separable and non-linearly separable data (via
the kernel trick), performing at state-of-the-art level. Figures showing the two different
cases are presented in Appendix A. Another strength of SVM is that it provides a fast and
effective means of learning, capable of processing multidimensional input, even in cases
with more than 10,000 features [21]. Text categorisation using SVM has been applied in
numerous fields, including mental health [22] [23].
Naïve Bayes

Naïve Bayes (NB) is a widely used classifier in text classification and anti-spam studies. NB is a supervised probabilistic learning method which assigns labels to data points represented as feature vectors. The NB classifier is based on Bayes’ Theorem, which describes the probability of an event given prior knowledge of its conditions. The key property of NB is the conditional independence assumption: feature values are assumed to be independent of one another given the class, so their probabilities can be estimated separately from the training dataset. The mathematical representation is illustrated below.

𝑃(𝑐|𝑥) = 𝑃(𝑥|𝑐) 𝑃(𝑐) / 𝑃(𝑥)

𝑃(𝑐) – the prior probability of a randomly picked document having class c.
𝑃(𝑥|𝑐) – the conditional probability of the predictor occurring in class c.
𝑃(𝑥) – the prior probability of the predictor.
𝑃(𝑐|𝑥) – the posterior probability of class c given the predictor.

Some of the NB strengths include handling missing data, support for categorical attributes and quick re-calculation.

This project also explored the performance of other classifiers, which are explained in detail
in Appendix A.
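The Naïve Bayes calculation described above can be sketched in a few lines of multinomial NB with add-one smoothing; the training tweets, class names and smoothing choice here are illustrative, not taken from the project's dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns priors and word counts."""
    priors, word_counts, totals = Counter(), defaultdict(Counter), Counter()
    for tokens, label in docs:
        priors[label] += 1
        word_counts[label].update(tokens)
        totals[label] += len(tokens)
    vocab = {w for c in word_counts.values() for w in c}
    return priors, word_counts, totals, vocab

def predict_nb(model, tokens):
    """Pick the class maximising log P(c) + sum log P(x|c), Laplace-smoothed."""
    priors, word_counts, totals, vocab = model
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / n)
        for tok in tokens:
            lp += math.log((word_counts[label][tok] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb([
    (["cant", "sleep", "again"], "sleep"),
    (["awake", "all", "night"], "sleep"),
    (["great", "day", "today"], "other"),
])
print(predict_nb(model, ["cant", "sleep"]))
```

Working in log space avoids numerical underflow, and the per-word probability estimates are exactly where the independence assumption enters: each token contributes its likelihood separately.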
2.3 Text Mining Social Media
Previously conducted studies have explored text mining of mental health data using various textual resources from social media. For instance, Shepherd et al. (2015) described the role of Twitter as a space in which people could openly discuss mental health problems [24]. They searched for tweets in a particular conversation containing the hashtag #dearmentalhealthprofessionals and performed their analysis on a unique anonymised dataset. Their results show that 515 communications related to mental health problems were found on Twitter.
Reavley and Pilkington (2014) reported the usefulness of Twitter content for measuring attitudes towards mental illness [25]. In their study, they collected tweets matching only the hashtags #schizophrenia and #depression during a 7-day period. After filtering, 5,907 depression tweets and 451 schizophrenia tweets were stored and used for analysis. From the set of potentially schizophrenia-related tweets, 76 provided evidence of people diagnosed with schizophrenia and 47 tweets expressed personal experience. They concluded that the schizophrenia tweets were 43% (193) neutral, 42% (191) supportive, almost 10% (44) explicitly anti-stigma, 5% expressing a stigmatising attitude and less than 1% reflecting personal experience of stigma.
In another study, McManus et al. (2015) collected Twitter posts to detect individuals with schizophrenia [26]. Using different features such as emoticons, posting time of day and dictionary terms, they trained and validated an SVM model and managed to achieve 92% precision and 71% recall. Their study identified 96 people with schizophrenia and showed that Twitter can also be a space where people with mental health problems release their emotions. Moreover, they concluded that the peak posting time of tweets from users with psychological disorder experiences lay in the early morning hours.
Belousov et al. (2015) gathered Twitter data to analyse the mental illness schizophrenia and
its symptoms [27]. Their pipeline consisted of four main tasks: filtering, pre-
processing, feature extraction and classification. During the first two processes all
ambiguities and noise were removed from the collected tweets to ensure consistency and
quality. They applied machine learning techniques to predict tweets related to the problem
domain. According to their paper, the Naïve Bayes classification model outperformed the
other classifiers and attained 92% accuracy. Their analysis
showed that 485 instances were classified positively, most of which were posted between
the hours of 10 PM and 2 AM.
2.4 Summary
In this chapter, a theoretical explanation of text mining and machine learning has been presented, along with previous studies evaluating the usefulness of social media content for mental health data analysis.
Chapter 3 Design and Development
This chapter covers the most challenging aspects of the software development cycle. A diagram of the designed architecture is provided to facilitate understanding of the overall project structure.
3.1 Design
3.1.1 Requirements
In order to understand the project requirements, it was important to define the project specifics and hence clarify what was going to happen with the data, how that was going to be achieved, and which technologies were most appropriate for the purpose. The very first step was to collect as much information as possible from my supervisor and the domain experts Rohan Morris and Natalie Berry, whose extensive knowledge in this area later prevented unnecessary assumptions and mistakes. I was aware that the requirements would change over time, so I followed one of the Agile practices [28] and started with a simple understanding of the problem. As the project progressed, the requirements evolved and the system became more sophisticated. Spending time working through the architecture and designing paper prototypes helped to capture the crucial functionality of the system in advance. Table 3.1 presents the final version of the functional and non-functional requirements with their respective priorities.
Requirements Priority
Define search queries with respect to the problem domain. High
Create automated scripts for collecting data from Twitter. High
Store results in a local non-relational database. High
Explore data to gain insight into the domain and refine the search queries. Medium
Implement methods to filter out spam and irrelevant tweets. Medium
Predetermine the subset of data passed for annotation. High
Transfer processed tweets to an encrypted data storage. High
Design a handy annotation tool for the psychology researchers. High
Develop an effective security strategy. Medium
Extract valuable and informative features from the selected tweets. High
Implement different classifiers and evaluate their performance. High
Craft a graphical user interface, allowing different modifications to the dataset. High
Analyse and plot the results. High
Table 3.1: Requirements
3.1.2 System design
The architecture of the system was broken down into smaller components, and each component was developed in isolation from the others in order to prevent conflicts. A key property of the system is that it allows quick and smooth integration of newly created modules. In addition, the system accommodates constant modifications to the database and allows different experiments to be carried out without affecting the original state of the data.
As previously mentioned, the project aimed to implement a text mining tool for data
analysis, therefore the main components of the system can be classified into five abstract
levels - data collection, pre-processing and knowledge extraction, prediction, analysis and
utilization. The aforementioned steps are depicted in Figure 3.1.
Data collection – retrieving Twitter messages related to psychosis and sleep
disturbance.
Text normalisation – pre-processing the natural language text in order to unify different variations of the same content by applying various text transformation methods.
Diagnostic filtering – dictionary-based approach filtering out irrelevant tweets.
Timeline extraction – tweets pulled out from diagnosed users’ timelines.
Classification – automatic detection of self-reported sleep disturbance.
Analysis – exploring trends within the timeline dataset.
Figure 3.1: System Architecture
3.1.3 Development environment
Python
Python was chosen as the main language for the implementation of this project because of the rich set of available Python-based text processing and manipulation libraries such as NLTK, matplotlib and SciPy, some of which played an essential role in the development of the software. In addition, Python is a comparatively simple and easy-to-understand language, and there are numerous books and tutorials on the Web that help beginners advance their knowledge quickly.
MongoDB
For the purpose of storing the collected tweets, the non-relational database MongoDB was selected [29]. The project dataset was expanding constantly, and a SQL database could have suffered performance degradation. Non-relational databases, by contrast, do not impose explicit, rigid schemas, which allows all types of data to be incorporated while retaining the ability to link entries from different collections efficiently. Furthermore, Twitter responses arrive in JSON format, which MongoDB supports natively, so MongoDB was preferable to SQL.
IntelliJ
The whole project was developed with the integrated development environment IntelliJ [30]. The platform provided useful features such as code refactoring, code navigation and code suggestions, which made the implementation process easier.
Classification toolkit
All machine learning classification tasks were developed using the scikit-learn library. It provides free access to a wide range of learning algorithms as well as feature transformation and feature selection modules.
3.1.4 Graphical User Interface
To give users the opportunity to analyse the collected information in more depth, a graphical user interface was implemented, enabling different human-computer interactions. The designed system allows users to inspect the assigned labels and presents the most informative features extracted during the learning process. In addition, users can dynamically retrieve timeline tweets from the project database and select one or more anonymised accounts for further analysis. A screenshot of the “Prediction” frame, presenting a set of timeline tweets and their respective classes in a table-like structure, is shown in Figure 3.2.
Figure 3.2: Predicted labels of timeline tweets (GUI)
3.2 Development
3.2.1 Collection of potential diagnostic tweets
Collecting data was the very first step of the development, and it was important to implement it in a way that allowed repeatable experiments to be carried out. In this study, both of the available Twitter APIs were used [31] [32]. The Search API returned tweets based on relevance and popularity, while the Streaming API exposed a live stream of tweets. We included a repeated data-exploration step in the project pipeline to examine different attributes of the gathered dataset and thereby improve the quality of the retrieved results. Term frequency-inverse document frequency was used as the weighting method in the data exploration process to reveal uniqueness and word relevance [33]; a detailed explanation and mathematical representation of the algorithm is presented later in this chapter. In order to further optimise the process of collecting tweets and avoid duplicates, two Twitter parameters were introduced: since_id and max_id. The first parameter restricts results to tweets published more recently than the tweet with the specified ID (since_id), while the second returns results older than the tweet with the specified ID (max_id). The values were automatically updated after each run of the program and applied in the next search of tweets.
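The since_id/max_id paging logic can be sketched as follows; `fake_search` is a hypothetical stand-in for the real Twitter Search API call, included only so the loop is runnable.

```python
# Sketch of the duplicate-free collection loop built around since_id/max_id.
# The fake store and search function below are illustrative placeholders.
TWEETS = [{"id": i} for i in range(1, 8)]          # fake store, ids 1..7

def fake_search(query, since_id=None, max_id=None):
    hits = [t for t in reversed(TWEETS)            # newest first, like Twitter
            if (since_id is None or t["id"] > since_id)
            and (max_id is None or t["id"] <= max_id)]
    return hits[:3]                                # paged responses

def collect_new_tweets(search, query, since_id=None):
    collected, max_id = [], None
    while True:
        batch = search(query, since_id=since_id, max_id=max_id)
        if not batch:
            break
        collected.extend(batch)
        max_id = min(t["id"] for t in batch) - 1   # page backwards through older tweets
    new_since = max((t["id"] for t in collected), default=since_id)
    return collected, new_since

tweets, since_id = collect_new_tweets(fake_search, "psychosis", since_id=2)
print(sorted(t["id"] for t in tweets), since_id)   # [3, 4, 5, 6, 7] 7
```

Persisting `new_since` between runs is what prevents the same tweet from being collected twice.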
In addition, to analyse behavioural patterns, the software extracted all timeline tweets from each user who had recently been diagnosed with psychosis. All gathered tweets, including their embedded attributes (tweet ID, user ID, timestamp, UTC offset, geo coordinates, time zone), were collected through the Search API and stored in a separate database collection. The aim of this process was to provide a quantifiable dataset on which to perform our analysis. However, after a discussion with the psychology researchers and my project supervisor, we decided to set time restrictions on the extraction process: we focused only on tweets published up to six weeks before the diagnosis or at any time after it.
Since finding diagnostic tweets was one of the crucial elements of our system, with a significant impact on the results of the other processes, we decided to perform the annotation manually. Manual annotation was undoubtedly time-consuming, but in our case we endeavoured to achieve 100% accuracy in our results, which would not have been easily accomplished even with an advanced machine learning classifier. Nevertheless, all collected tweets were pre-filtered and validated against different semantic rules before being sent to the researchers. More information about the pre-filtering process is given in Section 3.4.
The data collection started on 15 October 2015 and finished a week before the submission of the practical work, on 14 March 2016. Table 5.1 in Chapter 5 shows statistics retrieved from the project database at the last stage of development.
3.3 Text normalisation
The collected Twitter data went through a few text pre-processing methods to ensure high quality and consistency. These translated non-standard words into their canonical forms while preserving their contextual meaning. Many text normalisation pipelines include stopword removal, but in this particular study it was omitted, as we discovered the importance of personal pronouns in distinguishing objective from subjective tweets. Similar text normalisation tasks can be found in other text mining studies of social media [34] [35] [36].
3.3.1 Tokenisation
This process aimed to convert raw tweet content into linguistic units such as words, symbols and phrases. Although this task is generally straightforward, it becomes problematic and challenging when applied to microblogging messages. The main issues arise from the fact that tweets contain non-standard language such as abbreviations, acronyms, slang, emoji and hashtags. For that purpose, a tokenizer trained on Twitter data was used to provide better performance and precision [37].
Example:
I feel horrible because today I was diagnosed with schizophrenia :(
['I', 'feel', 'horrible', 'because', 'today', 'I', 'was', 'diagnosed', 'with', 'schizophrenia', ':(']
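A minimal tokenizer illustrating the idea can be sketched in a few lines; this is only an approximation of the Twitter-trained tokenizer used in the project, which handles many more cases (URLs, hashtags, emoji).

```python
import re

# Toy Twitter-aware tokenizer: emoticons first, then words (with optional
# internal apostrophe), then any other non-space character.
TOKEN_RE = re.compile(r"[:;=][-']?[()DPpO/\\|]|\w+['’]?\w*|[^\w\s]")

def tokenize(tweet):
    return TOKEN_RE.findall(tweet)

print(tokenize("I feel horrible because today I was diagnosed with schizophrenia :("))
```

Because the emoticon alternative is tried first, ":(" survives as a single token instead of being split into punctuation.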
3.3.2 Twitter object removal
After the tokenisation process, identified units as hashtags, links and special characters,
including emoji, were trimmed off as they did not bring any meaningful value to the
problem domain.
Example:
@Chris159 my aunt has schizophrenia but that’s because she used to take a bad batch of
drugs in the 90’s 😭.
[my aunt has schizophrenia but that’s because she used to take a bad batch of drugs in the
90’s.]
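A rough sketch of this clean-up step, assuming regex-based removal and treating non-ASCII characters as a crude proxy for emoji (the project's actual rules may have differed):

```python
import re

# Strip URLs, @mentions, #hashtags, then drop non-ASCII symbols (emoji etc.)
# and collapse the leftover whitespace.
def strip_twitter_objects(text):
    text = re.sub(r"https?://\S+", "", text)        # links
    text = re.sub(r"[@#]\w+", "", text)             # mentions and hashtags
    text = text.encode("ascii", "ignore").decode()  # emoji and other symbols
    return " ".join(text.split())

print(strip_twitter_objects("@Chris159 my aunt has schizophrenia \U0001F62D"))
```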
3.3.3 Stemming
The goal of stemming was to reduce the inflectional forms of words to a common root form in order to achieve a uniform representation. For example, stemming the words “sleep”, “sleeping” and “sleeps” produces “sleep”, which would otherwise be treated as three different, heterogeneous words. The removal of morphological affixes was done with the Porter stemmer provided by the NLTK package [38].
Example:
The only medication that has succeeded in controlling my bipolar disorder was Olanzapine.
['The', 'onli', 'medic', 'that', 'ha', 'succeed', 'in', 'control', 'my', 'bipolar', 'disord', 'wa',
'Olanzapin', '.' ]
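For illustration only, a toy suffix-stripper in the spirit of the Porter stemmer might look like this; the real NLTK Porter stemmer applies a much richer set of ordered rewrite rules.

```python
# Toy stemmer: strips a few common suffixes, but only when a reasonably
# long stem remains. This is a sketch, not the Porter algorithm itself.
SUFFIXES = ("ational", "ing", "ness", "edly", "ed", "es", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)] if word.endswith(suffix) else None
        if stem and len(stem) >= 3:
            return stem
    return word

print([crude_stem(w) for w in ["sleep", "sleeping", "sleeps"]])
```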
3.3.4 Abbreviations
Another very common issue with Twitter data was the usage of abbreviations. In the following example, “asap” is the short form of “as soon as possible”, which is easily understood by a human, but in a simple bag-of-words model these forms would be treated as different even though the semantic meaning is the same. For that purpose, we built our own dictionary of shortened words with their associated expanded forms.
Example:
I need to visit a doctor asap… I hear voices when I'm alone at home.
[I need to visit a doctor as soon as possible… I hear voices when i'm alone at home.]
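The dictionary-based expansion can be sketched as follows; the entries shown are illustrative placeholders, not the project's actual hand-built lexicon.

```python
# Small illustrative abbreviation dictionary (stand-in entries).
ABBREVIATIONS = {
    "asap": "as soon as possible",
    "tbh": "to be honest",
    "rn": "right now",
}

def expand_abbreviations(tokens):
    expanded = []
    for token in tokens:
        # replace known abbreviations with their multi-word expansion
        expanded.extend(ABBREVIATIONS.get(token.lower(), token).split())
    return expanded

print(expand_abbreviations(["I", "need", "to", "visit", "a", "doctor", "asap"]))
```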
3.3.5 Part of Speech Tagging
In corpus linguistics, part-of-speech (POS) tagging is an essential task corresponding to the process of labelling words with their respective part-of-speech tags. POS tags were a useful source of information because they facilitated word sense disambiguation and boosted the prediction accuracy of our classification models. The Tweet NLP part-of-speech tagger was selected for this project [37]. An example is given below to illustrate its usage; an explanation of each POS tag is given in Appendix C.
Example:
I've come to terms with my schizophrenia, my friends are not isolating me anymore
[["I've", 'L'], ['come', 'V'], ['to', 'P'], ['terms', 'N'], ['with', 'P'], ['my', 'D'], ['schizophrenia', 'N'], [',', ','], ['my', 'D'], ['friends', 'N'], ['are', 'V'], ['not', 'R'], ['isolating', 'V'], ['me', 'O'], ['anymore', 'R']]
3.3.6 Term Frequency - Inverse Document Frequency
A key role in the filtering and search-refinement process was played by the information retrieval method Term Frequency - Inverse Document Frequency (TF-IDF) [39]. The algorithm assigned a weight to each word of the corpus by computing its term frequency and its inverse document frequency. The term frequency measured how often a word x appeared in a tweet, while the inverse document frequency measured how much information it carried. Pure term frequency assumes that all terms are equally important; to scale down the scores of terms that occur in many tweets, the inverse document frequency was computed. It is based on the number of documents in the database containing the term x, thus reducing the effect of frequent but trivial terms. Taking as an example the article “a”, which by itself does not hold any useful meaning, it will be assigned a very low score, whereas discriminatory words like “sleep” or “awake” will be given high scores. For that reason, we excluded all common words from our corpus before proceeding to the TF-IDF calculation.
Some of the revealed keywords (excluding the words from the search queries) include:
“Depression”, “got”, “ill”, ”make”, ”mind”, ”substance”, “care”, ”mental”, “normal”,
“develop”
w_i,j = tf_i,j × log(N / df_i)

w_i,j - weight of term i within tweet j
tf_i,j - number of occurrences of term i in tweet j
N - total number of tweets
df_i - number of tweets containing term i
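The formula translates directly into code:

```python
import math

# w_ij = tf_ij * log(N / df_i), computed per tweet over a toy corpus.
def tfidf(corpus):
    N = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [{term: doc.count(term) * math.log(N / df[term]) for term in set(doc)}
            for doc in corpus]

corpus = [
    ["i", "cant", "sleep"],
    ["sleep", "is", "good"],
    ["i", "am", "awake"],
]
weights = tfidf(corpus)
# "cant" appears in 1 of 3 tweets (high weight), "sleep" in 2 of 3 (lower)
print(round(weights[0]["cant"], 3), round(weights[0]["sleep"], 3))
```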
3.3.7 Negation detection
Although detecting negation is an extremely challenging problem and was outside the scope of this project, simplified methods were implemented to capture the focus of negated words. For that purpose, dependency parsing [40] was used to obtain the grammatical relations between words, and an algorithm then inspected each vertex of the resulting graph for the occurrence of a negation modifier.
Figure 3.3: Dependency graph of a tweet containing a negated word
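As a lightweight stand-in for the dependency-graph inspection (which requires a full parser), a window-based sketch conveys the idea of marking tokens in the scope of a negation cue:

```python
# Hypothetical simplification: instead of walking a dependency graph, mark a
# fixed window of tokens after each negation cue with a NOT_ prefix.
NEGATION_CUES = {"not", "no", "never", "cant", "can't", "dont", "don't", "wont", "won't"}

def mark_negation(tokens, window=3):
    marked, scope = [], 0
    for token in tokens:
        if token.lower() in NEGATION_CUES:
            marked.append(token)
            scope = window          # open a negation scope
        elif scope > 0:
            marked.append("NOT_" + token)
            scope -= 1
        else:
            marked.append(token)
    return marked

print(mark_negation(["As", "usual", "I", "cant", "fall", "asleep"]))
```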
3.4 Pre-filtering of diagnostic tweets
The filtering process aimed to produce a high-quality diagnostic dataset to pass to the psychology researchers for annotation. Although the Twitter APIs give users the opportunity to define their own queries with simple or complex constraints, they do not capture the contextual meaning of the words and thus introduce noise. To improve the accuracy of the results, a semantic search was performed [41]. The filtering layer included a set of semantic rules based on the definitions of the searched terms, which pinpointed valid examples and minimised the number of irrelevant tweets in our data storage. For example, after the initial data exploration we noticed that the usage of personal pronouns, and mentions of drugs prescribed for the treatment of psychological disorders, led to relevant examples. We applied that knowledge in our algorithm and the number of relevant examples instantly increased. In addition, the lexical diversity analysis and the TF-IDF word scores identified some valuable terms such as “get”, “diagnose”, “suffer”, “have” and “ill”, which were also used as indicators of potential diagnostic messages.
3.5 Manual Annotation
As already stated, an important part of the classification process is the training set, which is used to discover potential predictive relationships. The training data in supervised learning typically comes as tuples (x, y), where x is the representation of a document in vector space, known as the vector space model, and y its respective class. Unfortunately, the specificity of the problem domain meant that no publicly available annotated data could be used. The lack of such data can also be explained by the fact that manually tagging text is an expensive process requiring domain expertise and a lot of time. Fortunately, Dr Rohan Morris and Natalie Berry, psychology researchers from the University of Manchester, volunteered to accomplish this task. Although several annotation programs were already available on the Web, a separate tool was designed and developed that facilitated the annotation process by including only the requisite functionality. To ensure secure communication, a high-level authentication layer was introduced, and unique login credentials were provided to the researchers in order to guarantee data protection and display the appropriate content.
Users had to classify each tweet as either negative, neutral or positive, providing their level
of confidence from 1 to 10. After signing in, each researcher was asked to annotate a set of
tweets - 75% of which were pre-filtered and the other 25% were selected randomly. The
dataset included random tweets to eliminate the selection bias and to increase the chances
of finding new unknown patterns, missed in the first filtered set.
In a statistical study such as this one, it was important to calculate the reliability of agreement between the annotators, so a statistical measure was computed. 100 examples were shared between the two annotators so that we could assess their accordance. Fleiss' kappa was chosen as the measure because of its ability to rate agreement between any number of participants, in contrast with other kappa statistics which work only for a fixed number of raters [42]. The value was regularly updated and shared with the researchers to guarantee the high quality of the annotated dataset. An example showing how Fleiss' kappa is computed can be seen in Appendix C.
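For reference, Fleiss' kappa is short to implement; the annotation counts below are illustrative, not the study's actual figures.

```python
# Fleiss' kappa: ratings[i][j] = number of annotators who assigned
# category j to item i (each item rated by the same number of annotators).
def fleiss_kappa(ratings):
    N = len(ratings)                       # number of items
    n = sum(ratings[0])                    # annotators per item
    k = len(ratings[0])                    # number of categories
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                   # observed agreement
    P_e = sum(p * p for p in p_j)          # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Two annotators, three classes (negative/neutral/positive), four items:
example = [[2, 0, 0], [0, 2, 0], [0, 0, 2], [1, 1, 0]]
print(round(fleiss_kappa(example), 3))  # 0.619
```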
User experience techniques were also integrated into the software development process to
enhance the usability and accessibility of the program.
3.6 Classification of self-reported sleep disturbance
3.6.1 Feature Construction
Many machine learning algorithms require the input text to be translated into a specific format. The most common model is the vector space model, which ignores the linguistic structure of the text and creates multidimensional vectors with one dimension per distinct word in the text. The performance evaluation of each model is presented in the chapter “Analysis and Results”. With regard to machine learning and data mining, feature construction, also known as feature engineering, addresses the problem of creating a set of features using domain knowledge. During the feature construction phase it is important not to lose any useful information, therefore a common practice is to compare the performance of a constructed feature with its raw form. The overall process can be described as follows [43]:
1. Generate an initial feature space Fo.
2. Transform the initial feature space into a new feature space Fn.
3. Select a subset of features Ft from Fn satisfying predefined criteria.
4. Use Ft for learning.
In the next few sections, the top three most informative features are described; however, the whole set of features generated throughout the project is shown in Table 3.2. Feature importance charts are presented in Appendix B.
Figure 3.4: Annotation Window
Initial Constructed Features
Word frequencies
Word scores
Time of posting as minutes after midnight
Mentions of drugs used for insomnia
Semantic classes
Sentiment polarity scores
POS tags
Table 3.2: Constructed Features
Semantic classes
Semantic classes were defined as groups of words sharing a semantic property. The psychology researchers manually labelled such terms during the annotation process and expanded the table of semantic classes from Belousov et al.'s research (2015). Table 3.3 lists all semantic classes, the number of entities they contain and some examples. All matched entities were collected via a dictionary look-up method, taking into account possible spelling variations.
Semantic Class Count Examples
Electronics 50 voice message, record, TV, radio, speaker
Fear Expression 13 scary, afraid, nervous, shocked, creepy
Health Problems 10 insane, mad, crazy, batty, irrational
Location and Content of Hallucination 9 whispering, in my mind, uncanny, seeing things
Physical Space Location 18 house, office, room, apartment
Psychosis Problems 6 bipolar disorder, psychotic, delirium
Relationships 129 child, roommate, brother, fiancé, sister
Religious Terms 19 christian, bible, church, jesus, allah
Sleep Terms 5 fever, cant sleep, lack of sleep, sleep paralysis
Swear Words 503 * * *
Table 3.3: Semantic Classes
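The dictionary look-up can be sketched as follows; the class entries are a small sample from Table 3.3, and “spelling variations” are approximated here by simple lower-casing and apostrophe stripping.

```python
# Illustrative semantic-class dictionary (sample entries only).
SEMANTIC_CLASSES = {
    "Sleep Terms": ["cant sleep", "lack of sleep", "sleep paralysis"],
    "Fear Expression": ["scary", "afraid", "nervous"],
}

def normalise(text):
    # crude handling of spelling variations: lower-case, drop apostrophes
    return text.lower().replace("'", "").replace("’", "")

def match_classes(tweet):
    text = normalise(tweet)
    return sorted({cls for cls, terms in SEMANTIC_CLASSES.items()
                   if any(normalise(term) in text for term in terms)})

print(match_classes("I'm afraid I can't sleep again"))
```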
Posting time of Tweets
During the data collection process, all geographical metadata embedded in the messages was gathered and stored safely on a university server. The information was then processed to find the exact local upload time of each tweet. In the ideal case, the algorithm took the geographical attributes (longitude, latitude), identified the time zone, and converted the initial timestamp to the user's local time, taking daylight saving time (DST) into account. Where geographical information was lacking, the algorithm approximated the local time by applying the UTC offset of the respective Twitter account.
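The fallback branch (no coordinates available) amounts to shifting the UTC timestamp by the account's offset; the offset value below is a hypothetical example.

```python
from datetime import datetime, timedelta, timezone

# Approximate local time from the account's UTC offset (seconds).
# DST cannot be handled on this branch, which is why it is an approximation.
def approx_local_time(created_at_utc, utc_offset_seconds):
    return created_at_utc + timedelta(seconds=utc_offset_seconds)

ts = datetime(2016, 3, 1, 6, 30, tzinfo=timezone.utc)
local = approx_local_time(ts, -5 * 3600)   # hypothetical account at UTC-5
print(local.hour, local.minute)  # 1 30
```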
Sentiment Polarity Score
For this feature, a separate function was implemented to assign a sentiment polarity score to each individual tweet. SentiStrength was not used in this particular case, because our method applied semantic rules tailored to the project specifics. For instance, the function incorporated some important words found during the data exploration process, and in the presence of such words the scores of subsequent words were either boosted or reduced. The sentiment score of each term was obtained as the sum of the sentiment scores of its synonyms, derived from the lexical database WordNet [44]. However, some words had more than one meaning, so the algorithm additionally used the part-of-speech tags to approximate the word sense. For normalisation purposes, the final score was divided by the number of synonyms in the respective word network.
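A toy version of this scorer, with a made-up lexicon and synonym sets standing in for the WordNet-derived data:

```python
# Placeholder sentiment lexicon and synonym sets (illustrative values only).
SENTIMENT = {"restless": -0.6, "sleepless": -0.8, "awake": -0.1,
             "calm": 0.5, "peaceful": 0.6}
SYNONYMS = {"insomniac": ["restless", "sleepless", "awake"],
            "relaxed": ["calm", "peaceful"]}

def term_score(term):
    synonyms = SYNONYMS.get(term)
    if not synonyms:
        return SENTIMENT.get(term, 0.0)
    # sum the synonyms' scores, then normalise by the synonym-set size
    return sum(SENTIMENT.get(s, 0.0) for s in synonyms) / len(synonyms)

print(round(term_score("insomniac"), 2))  # -0.5
```

The real implementation additionally consulted POS tags to pick the right sense before collecting synonyms.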
3.6.2 Feature Selection
Feature selection, also known as variable selection, is a critical part of the feature engineering process, seeking to select only the subset of features relevant to the problem domain. Feature selection requires a deep understanding of all the important aspects of the dataset and can be a very challenging task. For that purpose, we took full advantage of the rich set of feature selection tools provided by the data mining library scikit-learn [45] and employed a low-variance selection algorithm. We set our own variance threshold and removed all features whose variance fell below the specified value. This method was especially useful when the multidimensional TF-IDF and bag-of-words models were developed, as it eliminated redundant features (words) and thus helped prevent overfitting.
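The low-variance filter (what scikit-learn's VarianceThreshold does) is simple enough to show in plain Python:

```python
# Keep only feature columns whose variance exceeds the threshold,
# mirroring sklearn.feature_selection.VarianceThreshold.
def variance(column):
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)

def low_variance_filter(X, threshold=0.0):
    columns = list(zip(*X))
    keep = [j for j, col in enumerate(columns) if variance(col) > threshold]
    return [[row[j] for j in keep] for row in X], keep

X = [[0, 1, 3], [0, 0, 5], [0, 1, 4]]   # first column is constant
reduced, kept = low_variance_filter(X)
print(kept)  # [1, 2]
```

A constant column carries no information for separating classes, so it is always dropped at the default threshold of 0.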
3.7 Summary
In this chapter we discussed the implementation methodology and we gave an overview of
the functionality of the developed software.
Chapter 4 Testing and Evaluation
This chapter will touch upon the testing and evaluation techniques applied to validate the
results of the study.
4.1 Software Testing
4.1.1 Unit testing
Each major component of the software was verified and tested via Python Unit Tests [46].
Table 4.1 presents the test cases which have been inspected during the development
process. Each test case introduced a separate function, contributing to full test coverage and
validity. The test script was regularly run and updated to guarantee the normal behaviour of
the software. PyUnit was chosen as a testing framework because it provided a rich set of
tools for constructing and maintaining the test suite.
Test Case Assertion Result
Detect only English tweets True
Check if tweet contains links True
Check if tweet is a retweet True
Remove user mentions and spam True
Remove special characters (emoji) True
Remove punctuation from tweets True
Expand abbreviations from a collected dictionary
True
Table 4.1: Test Cases
4.1.2 Usability testing
The annotation tool was repeatedly modified and improved by means of usability testing. The domain experts provided verbal feedback after each alteration to the software, which was very helpful, as we had the opportunity to reflect on their comments in the next iteration. This gave direct input on how real users see the system and helped to identify performance issues as well as indicating the participants' overall satisfaction with the product.
4.2 Classifiers’ performance evaluation
4.2.1 K-fold cross-validation
The k-fold cross-validation method in this research provided a measure of the statistical reliability of the classifiers and showed how well each predictive model generalised to an independent dataset. The method is depicted in Figure 4.1: a given dataset is randomly partitioned into k folds, of which a single one is used for testing and the remaining k-1 for training. The process is repeated k times, enabling all data points from the selected dataset to be used for both training and testing.
Issues of imbalanced data were experienced throughout, so a variation of k-fold validation, known as stratified k-fold, was employed. This modification ensures an equal distribution of classes within each individual fold [47]. The performance of the classifier was also measured against an independent dataset, because relying only on patterns visible in the training data would cause overfitting [48]. For that purpose, I asked the researchers to annotate an additional set of tweets, which was later used to confirm the effectiveness of the classifier.
In this empirical study, the value of k was set to 10.
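Stratified fold assignment can be sketched as dealing each class round-robin into the k folds:

```python
from collections import defaultdict

# Within each class, examples are dealt round-robin into k folds, so every
# fold preserves the overall class balance.
def stratified_folds(labels, k):
    fold_of = [0] * len(labels)
    seen = defaultdict(int)
    for i, label in enumerate(labels):
        fold_of[i] = seen[label] % k
        seen[label] += 1
    return fold_of

labels = ["pos"] * 8 + ["neg"] * 4     # imbalanced, as in this study
folds = stratified_folds(labels, k=4)
for f in range(4):
    members = [labels[i] for i, fd in enumerate(folds) if fd == f]
    print(f, members.count("pos"), members.count("neg"))  # 2 pos, 1 neg each
```

In practice scikit-learn's StratifiedKFold also shuffles within classes; the point here is only that each fold ends up with the same class proportions.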
Figure 4.1: K-fold cross-validation
4.2.2 Confusion Matrix
In machine learning, confusion matrices are often used to evaluate the accuracy of a classifier based on the actual and predicted labels. Each column of the confusion matrix represents a predicted class, while each row shows the actual label. A typical confusion matrix reports the true positive, false positive, false negative and true negative instances after the classification process. An explanation of each term can be found on the last page of the report.
4.2.3 Precision, Recall, F-score
Accuracy itself is not typically inspected in isolation but is evaluated together with other statistical measures such as precision, recall and F-score.
Recall - the fraction of all relevant examples that were retrieved.
Precision - the fraction of retrieved examples that are relevant.
F-score - conveys the balance between precision and recall.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-score = 2 × (precision × recall) / (precision + recall)
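These three formulas compute directly from confusion-matrix counts; the numbers below are illustrative.

```python
# Precision, recall and F-score from confusion-matrix counts
# (TP = true positives, FP = false positives, FN = false negatives).
def metrics(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

p, r, f = metrics(tp=40, fp=10, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.8 0.8
```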
4.3 Summary
In summary, this chapter presented an overview of the conducted testing methods and
provided technical details about the used classification performance metrics.
Figure 4.2: Confusion Matrix
Chapter 5 Analysis and results
The role of this chapter is to outline the most noteworthy findings and results, obtained
after the performed analysis. Furthermore, the chapter provides performance metrics of the
predictive models.
5.1 Data statistics
As can be seen from Table 5.1, we managed to collect 29,890 potential diagnostic tweets and 46,953 potential sleep-related tweets. However, a significant number of those instances were noise or spam, which led to a data imbalance problem. The psychology researchers manually annotated 547 tweets from our potential diagnostic dataset and classified 43 (8%) examples as positive. The proportion of tweets expressing self-reported sleep disturbance was significantly higher: of 507 labelled examples, 354 (69%) were tagged positively. The difference in the number of positive examples between the two datasets could be explained by the fact that discussing mental health problems such as schizophrenia and bipolar disorder is often considered a taboo subject on social media, whereas tweets about sleep-related issues are more common and less likely to cause social anxiety. Nevertheless, it is important to state that all of the numbers mentioned in this section were derived from two independent datasets and the examples were used only for training purposes.
Diagnostic Tweets Sleep-related Tweets Timeline Tweets
Streaming API 8812 44588 0
Search API 21078 2365 3886
Number of unique words 23505 29622 7908
Total Tweets 29890 46953 3886
Table 5.1: Data collection statistics
5.2 Accuracy, precision, recall of classifiers
In an information retrieval context, it is vitally important to retrieve the maximum number of relevant examples. This often presents a huge challenge in studies exploring social media content because of the high volume of irrelevant and noisy data. Frequent false positives could even overwhelm the effect of the correct results; therefore the F-score, which considers both precision and recall, is commonly used.
Table 5.2 shows the classification performance of the four classifiers used in this project. Accuracy, precision, recall and F-measure served as statistical measures of their effectiveness. To determine whether a classifier was overfitting, we ran 10-fold stratified cross-validation; the results are shown in the table below. As can be seen, the SVM trained on the combination of TF-IDF scores and semantic classes achieved the best accuracy (0.89) and the highest precision (0.89). Surprisingly, the accuracy of the same classifier trained on the two features separately dropped to 0.83 and 0.66 respectively. In addition, POS tags were also identified as an informative feature, achieving almost 69% precision and 68% accuracy with the Random Forest classifier.
Decision Tree | Naïve Bayes | Support Vector Machines | Random Forest
Features p r f a | p r f a | p r f a | p r f a
Bag-of-words 0.83 0.83 0.83 0.83 0.86 0.85 0.85 0.84 0.87 0.87 0.87 0.87 0.82 0.82 0.82 0.81
TF-IDF 0.87 0.87 0.87 0.87 0.83 0.82 0.82 0.83 0.81 0.83 0.82 0.83 0.80 0.80 0.81 0.80
POS 0.65 0.65 0.65 0.65 0.64 0.64 0.64 0.64 0.69 0.64 0.67 0.68 0.68 0.68 0.67 0.68
POS Frequency 0.66 0.63 0.65 0.67 0.62 0.63 0.63 0.64 0.68 0.60 0.65 0.66 0.68 0.64 0.67 0.68
POD 0.77 0.64 0.68 0.66 0.63 0.60 0.62 0.62 0.75 0.73 0.74 0.75 0.68 0.69 0.69 0.69
Sentiment Score 0.74 0.74 0.74 0.74 0.64 0.65 0.65 0.65 0.66 0.64 0.65 0.65 0.68 0.62 0.66 0.66
Semantic classes 0.70 0.66 0.63 0.66 0.76 0.64 0.57 0.64 0.72 0.67 0.63 0.66 0.69 0.66 0.62 0.65
Semantic Classes & TF-IDF 0.87 0.87 0.87 0.87 0.79 0.71 0.68 0.71 0.89 0.83 0.87 0.89 0.87 0.86 0.86 0.86
Table 5.2: Classifiers performance evaluation
p - precision r - recall f - f-score a - accuracy
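The feature combination behind the best-performing model can be sketched with scikit-learn [45], the library used in this project. The corpus and the sleep lexicon below are hypothetical stand-ins for the annotated data and the semantic-class table:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the annotated corpus.
tweets = ["cant sleep again tonight", "great dinner with friends",
          "awake all night no rest", "watching the game now"] * 10
labels = np.array([1, 0, 1, 0] * 10)

# TF-IDF scores for every tweet.
tfidf = TfidfVectorizer().fit_transform(tweets)

# Stand-in for the semantic-class feature: a single binary column flagging
# whether the tweet contains a term from a (hypothetical) sleep lexicon.
lexicon = {"sleep", "awake", "rest"}
semantic = np.array([[float(any(w in lexicon for w in t.split()))]
                     for t in tweets])

# Concatenate the two feature blocks and evaluate with 10-fold stratified
# cross-validation, mirroring the setup behind Table 5.2.
X = hstack([tfidf, semantic])
scores = cross_val_score(LinearSVC(), X, labels,
                         cv=StratifiedKFold(n_splits=10), scoring="accuracy")
print(round(scores.mean(), 2))
```

On real tweets the semantic block would hold one column per semantic class rather than a single lexicon flag.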
Some tweets correctly classified as self-reported sleep disturbance are listed in Table 5.3.

Tweet                                                       Predicted Class   Actual Class
“I’m Satan and I can’t sleep, no rest of the wicked”        Positive          Positive
“As usual I cant fall asleep”                               Positive          Positive
“I probably won’t, I need to take a pill to fall asleep”    Positive          Positive
“I feel like sleeping on bed made of broken glass”          Positive          Positive
“My dream keeps me awake”                                   Positive          Positive
Table 5.3: Correctly classified sleep-related instances
5.3 Error Analysis
To perform the error analysis, the labels of the test examples were manually examined after cross-validation for signs of systematic error trends. Once the model had been trained and tested, the error analysis was used to indicate whether the learning algorithm was suffering from high bias or high variance. Table 5.4 presents some of the sleep-related tweets misclassified by the SVM classifier. The first example was incorrectly predicted because it contains two typically strong features – “time” and “sleep”. “Time” was even part of our semantic class table, which further boosted the classifier’s confidence that the example expresses sleep-related disturbance. However, we cannot omit the fact that the classifier successfully interpreted subjectivity and objectivity, which can be seen in all three examples.
Text                                 Predicted Class
“Alright time to go back to sleep”   Positive
“Noah sleep talking is creepy af”    Positive
“I can’t think straight”             Positive
Table 5.4: Misclassified examples
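One way to gather such misclassified examples after cross-validation is to use out-of-fold predictions. A sketch with scikit-learn on toy stand-in data (not the project corpus):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the annotated corpus (hypothetical data and labels).
tweets = ["cant sleep at all", "lovely day out", "awake again at 3am",
          "nice walk in the park"] * 5
labels = np.array([1, 0, 1, 0] * 5)

X = CountVectorizer().fit_transform(tweets)

# cross_val_predict returns an out-of-fold prediction for every example,
# so each tweet is labelled by a model that never saw it during training.
predicted = cross_val_predict(MultinomialNB(), X, labels, cv=5)

# Pair every misclassified tweet with its predicted and actual label,
# ready for manual inspection.
errors = [(t, p, g) for t, p, g in zip(tweets, predicted, labels) if p != g]
```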
5.4 Experimental findings
To investigate sleep-related phenomena amongst the collected timeline tweets, we conducted several experiments; the results are presented in this section.
5.4.1 Sentiment distribution amongst the timeline tweets
To investigate the emotions expressed before and after the diagnosis, we performed sentiment analysis. We noticed two emotional peaks, occurring within the range of 40 to 50 days before the diagnosis and 115 to 125 days after it. Figure 5.1 also provides another interesting finding: people seem to use social media less after they have been diagnosed, which is supported by the drastic decline in posted tweets within the period of 1 - 100 days after the diagnosis.
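Aggregating sentiment relative to the diagnosis date can be sketched as follows; the (offset, score) pairing is an illustrative assumption, not the project's stored format:

```python
from collections import defaultdict

def sentiment_by_day(scored_tweets):
    """Average sentiment score per day relative to the diagnosis date.

    `scored_tweets` is a list of (days_from_diagnosis, sentiment_score)
    pairs; negative offsets fall before the diagnosis.
    """
    buckets = defaultdict(list)
    for offset, score in scored_tweets:
        buckets[offset].append(score)
    return {day: sum(s) / len(s) for day, s in buckets.items()}

timeline = [(-45, -0.8), (-45, -0.6), (120, -0.7), (10, 0.2)]
averages = sentiment_by_day(timeline)  # e.g. day -45 averages to -0.7
```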
5.4.2 Semantic Class Influence
As previously stated, our research identified and stored the different semantic entities found in the gathered timeline tweets. The results show that the semantic class “Relationships” occurred most frequently, 325 times (33%), of which “friend”, “family”, “mom”, “baby”, “parent”, “father” and “brother” were the most common entities.
Semantic Class                          Count
Relationships                           325
Swear Words                             278
Electronics                             180
Physical Space Location                 81
Religious Terms                         30
Health Problems                         27
Fear Expression                         24
Psychosis Problems                      14
Location and Content of Hallucination   1
Sleep Terms                             1
Table 5.5: Semantic class frequency

Figure 5.1: Sentiment Distribution
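Counting semantic-class occurrences reduces to lexicon matching. A simplified sketch with abbreviated, illustrative lexicons (the project's semantic-class table is far larger):

```python
# Abbreviated, illustrative lexicons.
SEMANTIC_CLASSES = {
    "Relationships": {"friend", "family", "mom", "baby", "parent",
                      "father", "brother"},
    "Sleep Terms": {"insomnia", "nightmare"},
}

def count_semantic_classes(tweets):
    """Count lexicon hits per semantic class across a list of tweets."""
    counts = dict.fromkeys(SEMANTIC_CLASSES, 0)
    for tweet in tweets:
        words = set(tweet.lower().split())
        for cls, lexicon in SEMANTIC_CLASSES.items():
            counts[cls] += len(words & lexicon)
    return counts

counts = count_semantic_classes(["my brother and his friend came over",
                                 "another nightmare again"])
# → {'Relationships': 2, 'Sleep Terms': 1}
```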
5.4.3 Posting time
One of our most significant experimental findings is depicted in Figure 5.2, which illustrates the posting trends derived from our diagnostic dataset. According to it, the peak posting time lies between the hours of 8PM and 3AM, accounting for 42 (38%) of the 109 examples.
Figure 5.2: Time of posting trend
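The share of tweets falling in a posting window that wraps past midnight can be computed as follows (the hours below are hypothetical):

```python
def peak_window_share(hours, start=20, end=3):
    """Fraction of tweets posted between `start` (inclusive) and `end`
    (exclusive) on a 24h clock, where the window may wrap past midnight,
    e.g. 8PM-3AM."""
    in_window = sum(1 for h in hours if h >= start or h < end)
    return in_window / len(hours)

# Hypothetical posting hours (0-23) extracted from tweet timestamps.
hours = [21, 23, 1, 14, 2, 9, 22, 3]
share = peak_window_share(hours)  # 5 of the 8 tweets fall in the window
```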
5.5 Summary
This chapter presented the outcomes of the conducted research, providing paraphrased tweets showing sleep-related experiences. In addition, an evaluation of the classifiers’ performance was given, exposing their weaknesses and strengths.
Chapter 6 Conclusion
The chapter gives an overview of the achieved objectives and presents proposed future
work as an extension of the current software.
6.1 Reflection and Achievements
In terms of performance metrics, the project managed to achieve results similar to other text mining studies of social media. We provided evidence that Twitter content is a valuable source of information for mental health analysis and confirmed that sleep-related trends could be hidden in psychosis-like phenomena. In addition, a graphical user interface was implemented to assist future researchers in their analysis. Despite the numerous challenges and obstacles, our system achieved a satisfactory performance of 89% accuracy with the SVM machine learning classifier. Therefore, the project aim, to establish a robust system capable of automatic text mining, can be considered a success. The major achievements of this project are:
Development of an efficient diagnostic filter.
Implementation of an annotation tool employing a client-server model.
Automatic classification of self-reported sleep disturbance.
Implementation of an analysis tool.
Last but not least, an abstract based on the methodology proposed in this report was submitted to, and subsequently accepted by, the Population Data Linkage Conference [49]. The event focused on research papers dealing with data linkage and data science approaches that seek to improve healthcare services.
6.2 Future work
Although we incorporated numerous informative features, we did not manage to test the performance of classifiers trained on named entities. Tools already exist that chunk and classify text elements into pre-defined categories such as organisations, temporal expressions and percentages. We found the recognisers ManTime [50] and the Stanford Parser [51] particularly useful because of their ability to extract time patterns from general-domain texts, but due to time limitations the importance of this feature was not evaluated. We believe that such temporal entities could improve the classification of sleep-related tweets, since we spotted frequent usage of temporal expressions in the positively labelled tweets.
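As a rough illustration of such a temporal feature, short of running ManTime itself, a regular expression can flag common time patterns. The pattern below is a simplified stand-in and is not representative of the tool's actual coverage:

```python
import re

# Simplified stand-in for a temporal-expression recogniser: a few common
# time phrases observed in tweets.
TIME_PATTERN = re.compile(
    r"\b(\d{1,2}\s?(am|pm)|tonight|all night|every night|\d{1,2}:\d{2})\b",
    re.IGNORECASE)

def has_temporal_expression(tweet):
    """Binary feature: does the tweet mention a time pattern?"""
    return bool(TIME_PATTERN.search(tweet))

print(has_temporal_expression("still awake at 3am"))  # True
print(has_temporal_expression("lovely day out"))      # False
```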
Another task that could be added to the existing software is word-sense disambiguation. Although we approximated the sense of words by applying various natural language processing techniques, this process could be further improved. For example, an algorithm could detect the contextual meaning of a word by comparing all of its senses and taking its surrounding words into consideration. More advanced algorithms exploit the structural properties of a word network by applying graph-based [52] or tree-based [53] approaches.
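A Lesk-style approach illustrates the basic idea: choose the sense whose dictionary gloss overlaps most with the context. The sense inventory below is a toy illustration; a real system would use WordNet [44] glosses:

```python
# Toy sense inventory for illustration only.
SENSES = {
    "rest": {
        "sleep": "repose of the body, freedom from activity at night",
        "remainder": "the part that is left over after everything else",
    },
}

def simple_lesk(word, context):
    """Pick the sense whose gloss shares the most words with the context
    (ties are broken arbitrarily)."""
    context_words = set(context.lower().split())
    overlaps = {sense: len(context_words & set(gloss.split()))
                for sense, gloss in SENSES[word].items()}
    return max(overlaps, key=overlaps.get)

print(simple_lesk("rest", "cannot sleep at night no rest"))  # sleep
```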
6.3 Summary
To summarise, the project established a robust and user-friendly system to retrieve, analyse and visualise sleep-related phenomena amongst a set of tweets from people experiencing mental health problems. The proposed methodology coped with the inconsistent and heterogeneous nature of social media by carrying out different text normalisation and transformation tasks. The implemented software will be handed over to the psychology researchers at the University of Manchester for further data exploration.
References
1. Astroml. Machine Learning 101: General Concepts — Machine Learning for Astronomy with Scikit-learn. 2016 [cited 2016 4 Apr]; Available from: http://www.astroml.org/sklearn_tutorial/general_concepts.html.
2. Twitter Usage Statistics - Internet Live Stats. 2016; Available from: http://www.internetlivestats.com/twitter-statistics/.
3. Birnbaum, M.L., et al., Role of social media and the Internet in pathways to care for adolescents and young adults with psychotic disorders and non-psychotic mood disorders. Early Intervention in Psychiatry, 2015.
4. Coppersmith, G., M. Dredze, and C. Harman. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. 2014.
5. Buchanan, E., Ethical decision-making and internet research.
6. Hewson, C. and T. Buchanan. Ethics Guidelines for Internet-mediated Research. 2013. The British Psychological Society.
7. Pub, N.F., 197: Advanced encryption standard (AES). Federal Information Processing Standards Publication, 2001. 197: p. 441-0311.
8. Hu, X. and H. Liu, Text Analytics in Social Media, in Mining Text Data, C.C. Aggarwal and C. Zhai, Editors. 2012, Springer US: Boston, MA. p. 385-414.
9. Morstatter, F., et al., Is the sample good enough? Comparing data from Twitter's Streaming API with Twitter's Firehose. arXiv preprint arXiv:1306.5204, 2013.
10. Witten, I.H., Text mining.
11. Gegick, M., P. Rotella, and T. Xie. Identifying security bug reports via text mining: An industrial case study. In Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on. 2010.
12. Sullivan, D., Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales. 2001: John Wiley & Sons, Inc. 560.
13. Percha, B., Y. Garten, and R.B. Altman, Discovery and explanation of drug-drug interactions via text mining. Pacific Symposium on Biocomputing, 2012: p. 410-421.
14. Cohen, W.W. and S. Sarawagi. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004. ACM.
15. Baluja, S., V.O. Mittal, and R. Sukthankar, Applying Machine Learning for High-Performance Named-Entity Extraction. Computational Intelligence, 2000. 16(4): p. 586-595.
16. Zhou, G. and J. Su, Named entity recognition using an HMM-based chunk tagger, in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002, Association for Computational Linguistics: Philadelphia, Pennsylvania. p. 473-480.
17. Thelwall, M., et al., Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 2010. 61(12): p. 2544-2558.
18. Saif, H., Y. He, and H. Alani, Semantic Sentiment Analysis of Twitter, in The Semantic Web – ISWC 2012: 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part I, P. Cudré-Mauroux, et al., Editors. 2012, Springer Berlin Heidelberg: Berlin, Heidelberg. p. 508-524.
19. Rambocas, M. and J. Gama, Marketing research: The role of sentiment analysis. 2013, Universidade do Porto, Faculdade de Economia do Porto.
20. Alpaydin, E., Introduction to machine learning. 2014: MIT Press.
21. Leopold, E. and J. Kindermann, Text Categorization with Support Vector Machines. How to Represent Texts in Input Space? Machine Learning. 46(1): p. 423-444.
22. Joachims, T., Text categorization with support vector machines: Learning with many relevant features. 1998: Springer.
23. Diederich, J., A. Al-Ajmi, and P. Yellowlees, Ex-ray: Data mining and mental health. Applied Soft Computing, 2007. 7(3): p. 923-928.
24. Shepherd, A., et al., Using social media for support and feedback by mental health service users: thematic analysis of a Twitter conversation. BMC Psychiatry, 2015. 15(1): p. 1.
25. Reavley, N.J. and P.D. Pilkington, Use of Twitter to monitor attitudes toward depression and schizophrenia: an exploratory study. PeerJ, 2014. 2: p. e647.
26. McManus, K., et al., Mining Twitter data to improve detection of schizophrenia. AMIA Summits on Translational Science Proceedings, 2015. 2015: p. 122.
27. Belousov, M., Identifying signs of schizophrenia in Twitter using text mining techniques. 2015.
28. Cao, L. and B. Ramesh, Agile Requirements Engineering Practices: An Empirical Study. IEEE Software, 2008. 25(1): p. 60-67.
29. MongoDB. 2016; Available from: https://www.mongodb.org/.
30. IntelliJ IDEA. 2016; Available from: https://www.jetbrains.com/idea/.
31. The Search API. 2016; Available from: https://dev.twitter.com/rest/public/search.
32. The Streaming APIs. 2016; Available from: https://dev.twitter.com/streaming/overview.
33. Ramos, J., Using tf-idf to determine word relevance in document queries.
34. Coppersmith, G., et al., From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. NAACL HLT 2015, 2015: p. 1.
35. Clark, E. and K. Araki, Text normalization in social media: progress, problems and applications for a pre-processing system of casual English. Procedia - Social and Behavioral Sciences, 2011. 27: p. 2-11.
36. Bontcheva, K., et al., TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text. In RANLP. 2013.
37. Owoputi, O., et al., Improved part-of-speech tagging for online conversational text with word clusters. 2013. Association for Computational Linguistics.
38. Bird, S., E. Klein, and E. Loper, Natural Language Processing with Python. 2009: O'Reilly Media, Inc.
39. Kim, J., et al., Noise Removal Using TF-IDF Criterion for Extracting Patent Keyword, in Soft Computing in Big Data Processing, M.K. Lee, S.-J. Park, and J.-H. Lee, Editors. 2014, Springer International Publishing: Cham. p. 61-69.
40. Klein, D. and C.D. Manning, Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1. 2003. Association for Computational Linguistics.
41. Guha, R., R. McCool, and E. Miller, Semantic search. In Proceedings of the 12th international conference on World Wide Web. 2003. ACM.
42. Falotico, R. and P. Quatto, Fleiss' kappa statistic without paradoxes. Quality & Quantity, 2014. 49(2): p. 463-470.
43. Sondhi, P., Feature construction methods: a survey. 2009.
44. Miller, G.A., WordNet: a lexical database for English. Communications of the ACM, 1995. 38(11): p. 39-41.
45. Pedregosa, F., et al., Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011. 12: p. 2825-2830.
46. Unit testing framework — Python 3.5.1 documentation. 2016 [cited 2016 5 April]; Available from: https://docs.python.org/3/library/unittest.html.
47. Refaeilzadeh, P., L. Tang, and H. Liu, Cross-Validation, in Encyclopedia of Database Systems, L. Liu and M.T. Özsu, Editors. 2009, Springer US: Boston, MA. p. 532-538.
48. Elkan, C., Evaluating classifiers. 2012.
49. IPDLN Conference 2016. 2016; Available from: http://ipdlnconference2016.org/CallForAbstracts.
50. Filannino, M. and G. Nenadic, Temporal expression extraction with extensive feature type selection and a posteriori label adjustment. Data & Knowledge Engineering, 2015. 100, Part A: p. 19-33.
51. Finkel, J.R., T. Grenager, and C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. 2005. Association for Computational Linguistics.
52. Agirre, E., et al., A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2009. Association for Computational Linguistics.
53. Hatori, J. and Y. Miyao, Word Sense Disambiguation for All Words using Tree-Structured Conditional Random Fields.
Appendix A
Decision Tree – a supervised learning method modelled as a tree which maps input variables to labels based on decision rules. Each internal node tests one input variable and each leaf node is assigned a class label. The class label depends on the values of the input variables encountered along the path explored from the root node. A simple representation of a decision tree is shown in the figure below.
The advantages of decision trees include:
Handling multi-output problems.
Coping with both categorical and numerical input data.
Prediction cost that is logarithmic in the number of points used for training.
Implicit feature selection.
[Figures: Linearly Separable SVM; Non-linearly Separable SVM case]
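A minimal decision-tree sketch with scikit-learn, on hypothetical feature vectors:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors: [contains_sleep_term, sentiment_score];
# label 1 marks self-reported sleep disturbance.
X = [[1, -0.8], [1, -0.5], [0, 0.6], [0, 0.1]]
y = [1, 1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[1, -0.9]]))  # a single split separates this toy data
```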
Random Forest – an ensemble learning method for classification which fits a number of decision trees, each constructed from a random subset of the training data. After a large number of trees is generated, a majority vote is performed to select the class. From a computational point of view, Random Forests are appealing because they:
Are resistant to overfitting.
Handle multi-class classification.
Are considerably fast to train and predict.
Measure variable importance.
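The corresponding random-forest sketch, again on hypothetical vectors:

```python
from sklearn.ensemble import RandomForestClassifier

# Same kind of hypothetical [contains_sleep_term, sentiment_score]
# vectors as above, repeated for a slightly larger toy sample.
X = [[1, -0.8], [1, -0.5], [0, 0.6], [0, 0.1]] * 5
y = [1, 1, 0, 0] * 5

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.predict([[1, -0.7]]))
print(forest.feature_importances_)  # per-feature contribution estimates
```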
Appendix B
Appendix C
After the participants assigned labels to each tweet from our pre-annotated dataset, the Fleiss’ kappa value showed how consistent they were, using the following formula:

κ = (P̄ − P̄e) / (1 − P̄e)

where:
- κ is the kappa coefficient: κ < 0 shows no agreement, 0.00–0.19 is poor, 0.20–0.39 is fair, 0.40–0.59 is moderate, 0.60–0.79 is substantial and 0.80–1.00 is almost perfect agreement;
- 1 − P̄e describes the degree of agreement attainable above chance;
- P̄ − P̄e describes the degree of agreement actually achieved above chance.
The following example illustrates the practical use of Fleiss’ kappa with 4 categories, 4 tweets and 3 participants (12 ratings in total):

Tweet    Category 1   Category 2   Category 3   Category 4   Pi
1        0            2            1            0            0.33
2        3            0            0            0            1.00
3        1            1            1            0            0.00
4        0            0            1            2            0.33
Total    4            3            3            2
pj       0.33         0.25         0.25         0.16

κ = 0.242 / 0.826 = 0.29
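The computation can be implemented in a few lines (a minimal sketch of the standard formula, demonstrated on a perfect-agreement matrix, where κ is exactly 1):

```python
def fleiss_kappa(matrix, n_raters):
    """Fleiss' kappa for a ratings matrix with one row per item and one
    column per category; each cell holds the number of raters who chose
    that category for that item."""
    n_items = len(matrix)
    total = n_items * n_raters
    # Per-item agreement P_i.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in matrix]
    # Per-category proportions p_j.
    p_j = [sum(row[j] for row in matrix) / total
           for j in range(len(matrix[0]))]
    p_bar = sum(p_i) / n_items       # observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement (every rater picks the same category) yields kappa 1.
perfect = [[3, 0], [3, 0], [0, 3]]
print(fleiss_kappa(perfect, 3))  # 1.0
```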
Appendix D
Tag Description
, Punctuation
D Determiner
L Nominal + verb
N Noun
O Pronoun
P Pre- or postposition
R Adverb
S Nominal + possessive
V Verb
Terms
Annotation – manual classification of sleep related /diagnostic tweets into positive, negative
or neutral class respectively.
False negative (FN) – the predicted class was negative, but should have been positive.
False positive (FP) – the predicted class was positive, but should have been negative.
Overfitting – occurs when a classification model captures noise and so does not perform
well on the evaluation dataset.
Retweets – reposting of someone else’s tweet (a retweeted tweet is normally identified with “RT”).
Self-reported sleep disturbance – any abnormal sleep experiences expressed by the user
who shares the message.
Stopwords – common words which carry little meaningful content, such as prepositions, determiners, personal pronouns, etc.
Sentiment – the opinion expressed in a given tweet.
Timeline – chronologically sorted stream of tweets posted by a single user.
Training data – manually annotated tweets by the domain experts, used for machine
learning classification purposes.
True negative (TN) – the predicted class was negative, which corresponds to the right label.
True positive (TP) – the predicted class was positive, which corresponds to the right label.