A Thesis for the Degree of Ph.D. in Engineering
Multimodal Feature Extraction for
Psychiatric Disorder Screening
November 2020
Graduate School of Science and Technology
Keio University
SUMALI, Brian
Preface
Mental health examinations or screenings are commonly performed by licensed
psychiatrists in health care facilities, but advances in technology have enabled the
development of clinical decision support systems. Initially, the features input to such
systems were the answers from examinations conducted by psychiatrists. Recently,
however, the focus has shifted to predicting a person's mental health without the
need for extensive tests.
The two most common psychiatric disorders are depression and dementia.
Depression is a mood disorder traditionally associated with a persistent feeling of
sadness, whilst dementia is a collection of symptoms commonly caused by progressive
neurological disorders. Both are popular research targets of many automated
psychiatric disorder screening studies. Unfortunately, most conventional studies did not
consider the existence of "pseudodementia". Although the symptoms of depression and
dementia are different, dementia-like symptoms are sometimes observed in a
depression patient; this is termed "pseudodementia". Distinguishing pseudodementia
is hard even for expert psychiatrists, and feature analysis may help solve the classification
problem. The focus of this dissertation is the similarities, differences, and characteristics
of the two psychiatric disorders: depression and dementia. Feature extraction and analysis
are performed on an audiovisual database of clinical psychiatric patients. Automated
psychiatric disorder screening using machine learning is also proposed.
Chapter 1 introduces the background of psychiatric illness and automatic psychiatric
disorder screening. A short review of conventional automatic psychiatric disorder
diagnosis algorithms is presented, along with the limitations and significance of this study.
Chapter 2 reviews automated psychiatric disorder screening. The background of
automatic screening and telemedicine, together with the common features and algorithms
utilized, is described in this chapter.
In chapter 3, facial feature analysis of both depression patients and dementia
patients is performed. Conventional facial landmarks were extracted and analyzed.
Visualizations of the features corresponding to depression and dementia, and of the
features most important for distinguishing between the two, are described in this chapter.
Chapter 4 proposes an improvement of facial landmark extraction for real-time
tracking. Conventional facial landmark extraction techniques are notoriously inaccurate
for non-forward-facing poses, and it is impractical to instruct a psychiatric patient to face
the camera at all times. The proposed facial landmark extraction algorithm is based on
Cascaded Compositional Learning and is robust even for arbitrary facial poses.
Chapters 5 and 6 are similar to chapter 3. In chapter 5, the focus is the speech
features of the psychiatric patients, while in chapter 6, both facial features and speech
features are utilized. Additionally, the similarities between facial landmarks and speech
features are examined in chapter 6.
Finally, this dissertation is summarized and concluded in chapter 7.
Table of Contents
Acknowledgements ........................................................................................................ ix
1. Introduction ........................................................................................................... 1
1.1. Background ........................................................................................................ 1
1.1.1. Depression .................................................................................................. 2
1.1.2. Dementia ..................................................................................................... 2
1.1.3. Similarities between depression and dementia ........................................... 3
1.2. Psychiatric disorder screening ........................................................................... 3
1.2.1. Automated psychiatric screening................................................................ 4
1.2.2. Review of conventional automated psychiatric screenings ........................ 5
1.3. Main contributions of this research.................................................................... 9
1.4. Thesis outline ................................................................................................... 10
2. Review of automated psychiatric disorders screening ..................................... 12
2.1. Introduction ...................................................................................................... 12
2.2. Sensory features for automatic psychiatric disorder screening........................ 13
2.2.1. Facial Features .......................................................................................... 13
2.2.2. Biosignals ................................................................................................. 14
2.2.3. Auditory features ...................................................................................... 15
2.3. Machine learning algorithms for automatic psychiatric disorder screening .... 15
2.3.1. Support vector machine ............................................................................ 16
2.3.2. Gradient Boosting Machine ...................................................................... 18
2.3.3. Random Forest .......................................................................................... 20
2.3.4. Naive Bayes .............................................................................................. 20
2.3.5. K-Nearest Neighborhood .......................................................................... 21
2.4. Summary .......................................................................................................... 23
3. Facial landmark analysis from static images.................................................... 24
3.1. Introduction ...................................................................................................... 24
3.2. Data Acquisition .............................................................................................. 24
3.3. Analysis............................................................................................................ 25
3.3.1. Preprocessing ............................................................................................ 27
3.3.2. Feature extraction ..................................................................................... 27
3.3.3. Statistical analysis..................................................................................... 28
3.3.4. Feature selection and machine learning.................................................... 28
3.4. Results .............................................................................................................. 29
3.5. Discussion ........................................................................................................ 31
3.6. Summary .......................................................................................................... 32
4. Robust facial landmark tracking algorithm ..................................................... 33
4.1. Introduction ...................................................................................................... 33
4.2. Conventional facial tracking analysis .............................................................. 33
4.3. Proposed method .............................................................................................. 35
4.3.1. Supervised descent method ...................................................................... 35
4.3.2. Compositional vector estimation .............................................................. 36
4.3.3. Training composite vectors ...................................................................... 37
4.4. Experiment ....................................................................................................... 38
4.5. Results and discussion ..................................................................................... 39
4.6. Conclusion ....................................................................................................... 40
5. Speech feature analysis for classification of depression and dementia .......... 42
5.1. Introduction ...................................................................................................... 42
5.2. Data Acquisition .............................................................................................. 42
5.3. Analysis............................................................................................................ 42
5.3.1. Audio signal analysis ................................................................................ 43
5.3.2. Statistical analysis..................................................................................... 46
5.3.3. Machine Learning ..................................................................................... 46
5.3.4. Evaluation Metrics .................................................................................... 48
5.4. Results .............................................................................................................. 49
5.4.1. Statistical analysis..................................................................................... 50
5.4.2. Machine learning ...................................................................................... 51
5.5. Discussion ........................................................................................................ 54
5.6. Conclusions ...................................................................................................... 56
6. Multimodal feature analysis in depression patients and dementia patients .. 57
6.1. Introduction ...................................................................................................... 57
6.2. Data acquisition ............................................................................................... 57
6.3. Analysis............................................................................................................ 58
6.3.1. Facial feature analysis .............................................................................. 58
6.3.2. Audio feature analysis .............................................................................. 58
6.3.3. Statistical analysis..................................................................................... 58
6.3.4. Machine learning ...................................................................................... 59
6.4. Results and Discussion .................................................................................... 59
6.4.1. Demographics ........................................................................................... 59
6.4.2. Statistical analysis..................................................................................... 60
6.4.3. Machine learning ...................................................................................... 61
6.5. Conclusion ....................................................................................................... 62
7. Conclusion ............................................................................................................ 64
7.1. Summary of this thesis ..................................................................................... 64
7.2. Conclusion ....................................................................................................... 65
7.3. Suggestions for future research ........................................................................ 66
References...................................................................................................................... 68
List of Figures
Figure 2.1 ........................................................................................................................ 14
Figure 2.2 ........................................................................................................................ 17
Figure 2.3 ........................................................................................................................ 18
Figure 2.4 ........................................................................................................................ 19
Figure 3.1 ........................................................................................................................ 26
Figure 4.1 ........................................................................................................................ 39
Figure 4.2 ........................................................................................................................ 40
Figure 5.1 ........................................................................................................................ 43
Figure 5.2 ........................................................................................................................ 48
List of Tables
Table 1.1 ........................................................................................................................... 8
Table 3.1 ......................................................................................................................... 30
Table 3.2 ......................................................................................................................... 31
Table 4.1 ......................................................................................................................... 40
Table 5.1 ......................................................................................................................... 45
Table 5.2 ......................................................................................................................... 49
Table 5.3 ......................................................................................................................... 50
Table 5.4 ......................................................................................................................... 51
Table 5.5 ......................................................................................................................... 52
Table 5.6 ......................................................................................................................... 52
Table 5.7 ......................................................................................................................... 53
Table 5.8 ......................................................................................................................... 54
Table 6.1 ......................................................................................................................... 60
Table 6.2 ......................................................................................................................... 62
Acknowledgements
This dissertation is the culmination of my studies as a Ph.D student at Mitsukura
laboratory, Keio University. First and foremost, I wish to express my deepest appreciation
to my supervisor, Prof. Dr. Yasue Mitsukura, for all her guidance and support from the
start of my Ph.D program up until its completion. Thank you very much for your
inspiration, encouragement, patience, support, confidence, and trust. I cannot thank you
enough for all the learning opportunities you have given to me.
I would also like to express my thanks to Prof. Emer. Dr. Nozomu Hamada for the help,
advice and comments during my research and writing of this dissertation. You are one of
the most inspirational figures I have ever met and I aim to improve myself to be a top-
notch researcher like you.
I extend my sincere appreciation to the colleagues and alumni of Mitsukura laboratory,
especially Mr. Takahiro Asano, Mr. Motonobu Fujioka, Mr. Toshiya Nakaigawa,
Mr. Hideto Watanabe, and Dr. Suguru Kanoga, for their ideas, encouragement, and advice.
Unfortunately, it is not possible to list all of them in this limited space.
I am also grateful to Dr. Taishiro Kishimoto and the members of Kishimoto laboratory,
especially Dr. Kuo-ching Liang, Dr. Michitaka Yoshimura, and Dr. Momoko Kitazawa,
for their support during the collaborative research period. Your advice and support have
been invaluable.
I would also like to thank Prof. Dr. Yoshimitsu Aoki, Prof. Dr. Toshiyuki Tanaka, and
Prof. Dr. Toshiyuki Murakami. Thank you very much for agreeing to be part of my
dissertation committee, and for giving me your time to discuss the revisions of the
dissertation. Without your support and advice, this dissertation would not have been
the same as presented here.
I am also indebted to the Ministry of Education, Culture, Sports, Science and Technology
(MEXT), Japan, for funding my Ph.D study. Without this funding, I would not have been
able to pursue my Ph.D study in Japan.
Last but certainly not least, I would like to thank my family and friends for their endless
support. I am sorry that I could not visit Indonesia frequently enough during my study
here, and thank you very much for the encouragement and moral support.
October 2020
Brian Sumali
Chapter 1
Introduction
1.1. Background
According to the WHO, the five most common psychiatric disorders affecting humans
worldwide are: depression, dementia, bipolar disorder, psychosis including schizophrenia,
and developmental disorders including autism [1]. Of those five, depression ranks first in
number of patients (an estimated 264 million), followed by dementia (an estimated 50
million). In Japan, dementia is a serious problem exacerbated by the population ageing
affecting the country. The Ministry of Health, Labour and Welfare (MHLW) approximates
at least 6 million cases of dementia in Japan (15% of the population) by the year 2020 [2].
On the other hand, depression is regarded as a common cause of suicide, which is also a
plight in Japan [3] [4]. Correctly diagnosing these psychiatric disorders is therefore
especially important for Japanese society.
The book detailing guidelines for the classification of psychiatric disorders is the
"Diagnostic and Statistical Manual of Mental Disorders (DSM)". It is published by the
American Psychiatric Association (APA) and is used by clinicians, researchers,
psychiatric drug regulation agencies, health insurance companies, pharmaceutical
companies, the legal system, and policy makers. It was first published in 1952 and has
since undergone revisions – inclusions of new mental disorders and removal of entries
no longer considered mental disorders. Its latest edition, DSM-5, was published in
2013 [5].
In a clinical setting, "mental health screening tools" are commonly utilized by psychiatrists
as guidelines for diagnosing a certain mental health issue. These tools are based on the
DSM-5, are specialized for diagnosing a specific psychiatric illness, and consist of
guidelines for interviews and tests. For example, for diagnosing dementia a psychiatrist
commonly uses a combination of the mini-mental state examination (MMSE) [6], clock-
drawing test (CDT) [7], clinical dementia rating (CDR) [8], and logical memory test (LM)
from the Wechsler Memory Scale [9]. Mental health examinations are similar in timing
to physical examinations, in that they are not performed every day but at a specific
interval, for example every two or three months.
1.1.1. Depression
Clinical depression, or major depressive disorder (MDD), is a mental disorder
characterized by two weeks or more of prolonged feelings of sadness, low spirits, and
helplessness, accompanied by changes in interest or sleep schedule [5]. Depression is
also the major cause of suicide, affecting around 50% of suicide victims. Additionally,
Japan's suicide rate was the sixth highest worldwide and the second highest among
eight industrialized nations in 2017, making depression a serious predicament for the
nation.
The cause of depression is not conclusively known. It is believed to be a combination of
genetic, environmental, and psychological factors [10] [11] [12]. Risk factors include a
family history of the condition, major life changes, certain medications, chronic health
problems, and substance abuse. About 40% of the risk appears to be related to
genetics [13] [14].
The diagnosis of MDD is mainly based on the person's mental health screening. There is
no laboratory test for diagnosing MDD; interviews and the patient's history are the
primary grounds for diagnosis. The Hamilton depression rating scale (HAMD / HDRS) [15]
is one of the commonly used mental health screening guidelines for depression.
However, mental and physical tests may be done to rule out conditions that may cause
similar symptoms. The conventional treatment options for MDD are counseling and
antidepressants [16] [17].
1.1.2. Dementia
Dementia is not a disease but a collection of symptoms. Dementia usually refers to loss
of memory, language, problem-solving, and other thinking abilities [5]. The cognitive
deficits of dementia impact daily life: a patient may forget, for example, their home
address or, in one of the worst cases, how to turn on the stove.
Dementia does not affect the patient alone; it also heavily affects the community
both socially and economically. It can be emotionally overwhelming for the families of
the patients and for their caregivers. The care for dementia is also costly: in Japan in
2014, an estimated 14.5 trillion JPY was spent on dementia care, and the cost per person
with dementia was approximately 5.95 million JPY [18].
Most of the underlying causes of dementia are incurable. Alzheimer's disease makes
up more than 50% of cases of dementia. Other common causes include vascular dementia
(25%), dementia with Lewy bodies (15%), and frontotemporal dementia [19]. A person
with dementia may have more than one underlying cause.
Diagnosis of dementia is usually based on the history of the illness and cognitive testing,
supplemented with medical imaging or genetic testing. The MMSE is one commonly used
cognitive test. There is no known cure for dementia. Most treatments are only
symptomatic and have limited effectiveness. The main expert consensus is to prevent
dementia by reducing its risk factors.
1.1.3. Similarities between depression and dementia
The symptoms of depression and dementia can be very similar or even overlap. But
despite their similarities, they are two distinct illnesses. The term pseudodementia
typically refers to dementia-like symptoms caused by curable mental disorders [20] [21]
[22] [23]; the most common cause of pseudodementia is depression. The shared
symptoms of depression and dementia include impaired cognitive ability, reduced
concentration, and feelings of apathy. Accordingly, both diseases seem to affect the
patient's memory and cognition. Results from extensive tests have shown that
pseudodementia patients perform better on memory tests than true dementia
patients. Additionally, because depression is treatable, it is very important to distinguish
depression from dementia. Currently, diagnosing pseudodementia is difficult even for
expert psychiatrists, and extensive testing for both depression and dementia is needed to
clinically diagnose pseudodementia [24].
1.2. Psychiatric disorder screening
Each psychiatric disorder screening differs based on the patient and their symptoms.
A conventional screening conducted by a psychiatrist in a health care service typically
includes an interview followed by tests. The verbal tests conducted by the psychiatrist follow
certain guidelines: "mental health screening tools". Although the DSM-5 contains numerous
potential mental health disorders, administering detailed assessments for all of them
is simply impossible.
Popular mental health screening tools for depression include the Hamilton depression
rating scale (HAMD / HDRS) [15], the Montgomery–Åsberg depression rating scale (MADRS)
[25], Beck's depression inventory (BDI) [26], and Young's mania rating scale (YMRS)
[27]. Sometimes, the Pittsburgh sleep quality index (PSQI) [28] is included to help
score the patient's sleep health. As an example, a psychiatrist using the HAMD for
diagnosing a depression patient scores the patient's depressed mood, feelings of
guilt, suicidal ideation, etc. indirectly via a structured interview, conducted like a
conversation. After the scoring, the psychiatrist tells the diagnosis result to the patient
along with the treatment options.
On the other hand, dementia screening tools are vastly different from those for depression.
The tools include the mini-mental state examination (MMSE) [6], clinical dementia rating
(CDR) [8], neuropsychiatric inventory questionnaire (NPI-Q) [29], clock-drawing test
(CDT) [7], logical memory test (LM) [9], and the Boston cookie theft task [30]. Despite the
seemingly straightforward nature of the tests, diagnosing someone with mild dementia is
often a challenge. Additionally, natural intelligence and knowledge differ from person
to person. In many cases, conclusive dementia screening needs additional tests
such as physical examination, brain imaging, and laboratory tests.
1.2.1. Automated psychiatric screening
Assessment and outcome monitoring are critical for the effective detection and treatment
of mental illness. Traditional methods of capturing social, functional, and behavioral data
are limited to the information that patients report back to their health care provider at
selected points in time. As a result, these data are not accurate accounts of day-to-day
functioning, as they are often influenced by biases in self-report. Telemedicine or
telehealth, the practice of utilizing electronic information and technologies to support
health care with the objective of removing the effect of physical distance between the
patient and health care providers, has been proposed [31]. Recent developments in mobile
technology, such as mobile applications on smartphones, have the potential to overcome
problems with traditional assessment and to improve telehealth quality by providing
information about patient symptoms, behavior, and functioning in real time [32] [33] [34].
Although the use of sensors and apps is widespread, most of the tools are not clinically
validated, and the reliability of the apps and sensors remains unverified.
1.2.2. Review of conventional automated psychiatric screenings
Conventional psychiatric screening relies solely on patient reporting and psychiatrist
observations. When conducting the assessment or interpreting the findings, it is important
to consider the cultural background of the patient, as behavioral patterns vary between
cultures. This has resulted in criticism of the subjectivity of conventional mental health
assessment [35] [36] and in proposals for machine learning and artificial intelligence (AI)
as objective mental health screening tools. In a recent survey [37], many psychiatrists
believed that AI and machine learning will significantly transform the way they work.
Psychiatrists also predicted that AI and machine learning could help ensure
more accurate diagnoses, reduce administrative burden, provide ceaseless monitoring,
personalize drug targets to reduce adverse effects, and integrate new streams of data from
various data sources.
In general, the most popular features utilized in automatic mental health assessment
are as follows [38]:
1. Self-report or questionnaire
The mental health questionnaire and self-reporting survey are the oldest methods
of technology-based data collection. The contents of these questionnaires consist
largely of standardized mood and disability questions. With the advancement of
technology, mobile and web platforms are commonly utilized as the media. The
questions on these platforms mimic some of the clinical screening tools. For example,
a depression-focused self-report platform may display questions from the 9- and 2-
question Patient Health Questionnaires (PHQ-9 and PHQ-2) [39].
2. Task performance results
Computer-assisted psychiatric diagnosis has been proposed since the 1980s [40].
The most common computer-assisted diagnosis software in recent years [41] [42]
[43] consists of game-like tasks. These tasks commonly measure a subject's attention,
concentration, and working memory. The subject's reaction time, number of
errors and retries, and other performance measures are collected and processed to
perform a diagnosis. Some of the diagnosis algorithms are adaptive, that is, they
do not have a fixed threshold but rather adapt the diagnosis threshold based on the
current user of the software. The adaptive algorithms also consider practice or
experience effects, which may boost the performance score of a subject.
3. Data from sensors
Wearable sensors such as those embedded in smartphones and smartwatches can
measure behavioral features such as physical activity and location. Wearable
sensors can also detect physiological data such as heart rate, galvanic skin
response, respiration rate, etc. These behavioral and physiological data are then
utilized to diagnose a person's mental state [44] [45].
Electroencephalogram (EEG) sensors are also an option for automatic screening
[46] [47], and the development of mobile EEG sensors enables remote
screening [48] [49] [50]. Other noninvasive sensors such as video cameras and
audio recording devices may detect a patient's emotions, body language, cognitive
burden, and mood [45] [51]. Portable magnetic resonance imaging (MRI) scanners
were announced in early 2020 [52], opening the possibility of utilizing MRI for
remote screening.
4. Data from social media
The use of social media for medical research is rising in popularity [53] [54]. The
contents extracted from social media have been utilized for personality assessment
[55], screening for depression [56], and suicide risk factor research [57]. Social
data collected from social media websites and smartphones may include a
combination of incoming and outgoing call and text frequency, length of texts and
calls, and number of people contacted, as well as the content of public messages
sent via social media. Although these data could serve as indicators of
psychopathology, the use of social media data is controversial, and privacy issues
are inseparable from research utilizing social data [58] [59].
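As an illustration of the questionnaire-based approach in item 1 above, scoring a self-report instrument such as the PHQ-9 reduces to summing item responses and mapping the total to a severity band. The sketch below is a hypothetical minimal implementation, not part of this thesis: the function name is illustrative, and the cutoffs used are the commonly published PHQ-9 severity ranges.

```python
def phq9_severity(responses):
    """Score a 9-item PHQ-9 questionnaire (each item rated 0-3).

    Returns the total score (0-27) and a severity band based on the
    commonly published PHQ-9 cutoffs (an assumption, not from this thesis).
    """
    if len(responses) != 9 or any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("PHQ-9 expects nine responses, each in 0..3")
    total = sum(responses)
    if total <= 4:
        band = "minimal"
    elif total <= 9:
        band = "mild"
    elif total <= 14:
        band = "moderate"
    elif total <= 19:
        band = "moderately severe"
    else:
        band = "severe"
    return total, band
```

An automated self-report platform would present the nine questions, collect the integer responses, and report the resulting band to the clinician rather than issue a diagnosis itself.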
To summarize, telehealth research is gaining popularity, especially for psychiatric
telehealth services. Remote physical examinations exist but face more challenges
than mental health assessment via virtual consultation. For example, stethoscopes
for telemedicine are available but require an additional purchase. Meanwhile, since the
basic psychiatric examinations consist of interviews and communication- or paper-based
tests, they do not require additional tools.
In general, the advantages of collecting mood and behavioral data from
smartphones, wearable sensors, and noninvasive sensors are many, and several studies
have developed their own algorithms or machine learning models to accurately
predict one's mental state. Table 1.1 presents short reviews of recent and influential
studies.
From the table, it can be concluded that machine learning studies focusing on early
detection and algorithms utilizing non-invasive inputs are rising in popularity. Additionally,
the databases of dementia studies are mostly not public. The AVEC datasets utilized in
depression studies are audiovisual databases of participants performing human-
computer interaction tasks in quiet settings. The labeling of AVEC subjects was performed
by computing the BDI score, a self-reporting questionnaire for depression.
Table 1.1 Conventional studies for automated psychiatric screening

Authors                     | Year | Objective           | Features                            | Methodology                            | Performance             | Dataset
----------------------------|------|---------------------|-------------------------------------|----------------------------------------|-------------------------|---------------------
Neevaleni and Devasana [60] | 2020 | Alzheimer's disease | Parameters (task result)            | SVM, decision tree                     | 85%                     | Clinical, not public
Yalamanchili et al. [61]    | 2020 | Depression          | Real-time speech (sensors)          | SMOTE + SVM (best)                     | 93%                     | DAIC-WOZ (AVEC2016)
He et al. [62]              | 2019 | Depression severity | Face landmarks (sensors)            | Dirichlet process Fisher vector + BoW  | RMSE = 9.20, MAE = 7.55 | AVEC2013, AVEC2014
Zhou et al. [63]            | 2019 | Depression severity | Raw face (sensors)                  | Deep learning (ResNet + hidden layers) | RMSE = 6.37, MAE = 8.43 | AVEC2014
Khatun et al. [64]          | 2019 | MCI                 | EEG (FPz) from audio ERPs (sensors) | SVM, RBF kernel                        | 87.9%                   | Not public
Lodha et al. [65]           | 2018 | Alzheimer's disease | MRI (sensors)                       | Neural networks                        | 98.36%                  | ADNI project
Konig et al. [66]           | 2015 | Alzheimer's disease | Speech (sensors)                    | SVM                                    | 87% (healthy vs. AD)    | Clinical, not public
Kloppel et al. [67]         | 2008 | Alzheimer's disease | MRI (sensors)                       | SVM                                    | 95% (healthy vs. AD)    | Clinical, not public
1.3. Main contributions of this research
One big problem when diagnosing dementia and depression is the existence of
“pseudodementia” – a dementia-like symptoms that was caused by depression [20]. This
is a serious problem as depression is a curable mood disorder and dementia on the other
hand is typically signifies an underlying progressive neurological disorder, most of which
are not curable. Additionally, for elderly patients, late-life depression is a risk factor,
symptom, or a prodrome for dementia, or in worst case a comorbidity illness. Albeit the
progresses in automated psychiatric screening, minimal attention is given for
pseudodementia research. As shown in the Table 1.1, the researchers were focused only
in screening one disease and disregard two-dimensional aspect of comorbidities or shared
symptoms. Additionally, the depression screening researches utilizing a patient’s
audiovisual features seems to be non-clinical; the label for the datasets are from self-
reported tools and without the supervision of clinician. Noninvasive dementia diagnosis
studies are still rare; most of the research seems to be focused utilizing biosignals and
brain imaging.
This dissertation focuses on analyzing the differences between depression and dementia,
then reports the differing features. The scope and limitations of this study are as follows:
• The database was obtained from The Project for Objective Measures Using
Computational Psychiatry Technology (PROMPT) by Dr. Kishimoto [68].
• The subjects are actual patients affected with depression, dementia, or a
comorbidity of dementia and depression.
• The distance between the recording apparatus and the subject is 70 cm.
• The recording apparatuses utilized in this dissertation are a microphone for
speech recording and a camera for observing the patient's face and movement.
• The microphone utilized for recording the patient's speech was a Classis RM30W
(Beyerdynamic GmbH & Co. KG) with a 16 kHz sampling rate.
• The video recording devices were the RealSense R200 (Intel Corporation) and the
Microsoft Kinect for Windows v2 (Microsoft Corporation). Some patients were
recorded with the Kinect and others with the RealSense.
• The labelling of each dataset and its test scoring was performed by licensed
clinical psychiatrists.
The main contributions of this dissertation are as follows:
1. Analysis of acoustic features, facial features, and their fusion from depression
patients and dementia patients.
2. Analysis of the similarities and differences between acoustic features and facial
features for each patient group.
3. Exploration of the possibility of classifying depression, dementia, and dementia
with depression using simple machine learning models.
4. Proposal of a pose-robust facial landmark tracking algorithm which is beneficial
for both automatic screening and telehealth in general.
1.4. Thesis outline
Chapter 2 reviews automatic psychiatric disorder screening in detail. The background
of automatic screening and telemedicine, together with the commonly utilized features
and algorithms, is described in this chapter.
Chapter 3 presents the facial feature analysis of both depression patients and dementia
patients. Conventional facial landmarks were extracted and analyzed. Visualizations of
the features corresponding to depression and dementia, and of the features most
important for distinguishing the two, are described in this chapter.
Chapter 4 proposes an improvement of facial landmark extraction for real-time tracking.
Conventional facial landmark extraction techniques are notoriously inaccurate for
non-frontal poses, and it is impractical to instruct a psychiatric patient to face the camera
at all times. The proposed facial landmark extraction algorithm is based on Cascaded
Compositional Learning and is robust even to arbitrary facial poses.
Chapters 5 and 6 are similar to chapter 3. Chapter 5 focuses on the speech features of
the psychiatric patients, while chapter 6 utilizes both facial features and speech features.
Additionally, the similarities between facial landmarks and speech features are examined
in chapter 6.
Finally, this dissertation is summarized and concluded in chapter 7.
Chapter 2
Review of automated psychiatric
disorder screening
2.1. Introduction
In a psychiatric screening, a licensed psychiatrist examines a patient for possible
psychiatric disorders. Similar to a physical examination, a psychiatric screening session
consists of interviews and tests with the purpose of diagnosing the mental health of the
patient. It is a structured way of observing and describing the psychological functions of
a patient. Psychological aspects conventionally monitored during a psychiatric screening
include the patient's attitude, behavior, mood, emotion, thought process, perception,
cognition, and judgement, which are inferred from the patient's facial expression, body
language, and speech [69].
Conventionally, patients need to go to a hospital or a health care service to be diagnosed
and treated. However, with the advances in information and communication technology,
some health care services are now available remotely. Telehealth is the use of digital
information and communication technologies, such as computers and mobile devices, to
access health care services remotely and manage a person's health. It also enables health
care services for patients with limitations in transportation or mobility, for example,
patients in rural areas or areas with travel limitations or bans. With the coronavirus
pandemic in 2020, more attention is being paid to telehealth research, for both psychiatric
and physical examination.
The advances in telehealth also pave the way for automatic telehealth screening, for both
physical and mental disorders. For example, automated testing or self-testing could be
used to identify, and perhaps assess, individuals with hearing loss: self-testing with an
intelligent automated system could offer accurate results and measure background noise
to ensure validity. Such systems are now being developed by researchers worldwide.
2.2. Sensory features for automatic psychiatric disorder screening
As reported in the previous chapter, self-reports or questionnaires, task performance
results, data from sensors, and data from social media are the most common features
utilized for automatic mental health screening. Sensory features commonly employed for
automated mental health screening include
a. Facial features (gaze, blink, emotion detection, etc.)
b. Biosignals (electroencephalogram, heart rate, respiration, etc.)
c. Auditory features (intensity, tone, speed of speech, etc.).
2.2.1. Facial Features
Patients with a number of psychiatric conditions may display abnormal facial expressions.
Facial expression or emotion detection has always been an easy task for humans, but
achieving the same task with a computer algorithm is quite challenging and has long been
an objective of the computer vision field. With the recent advancements in computer
vision and machine learning, it is possible to detect emotions from images. The detection
and processing of facial expressions are achieved through various methods such as optical
flow [70] [71], hidden Markov models [72], or artificial neural networks [73].
Aside from facial expression, gaze, blink, and eye movements are said to be important
clues to mental health clinicians [74] [75]. One such method involves using an eye-
tracking device to monitor the duration of a patient’s gaze when presented with
emotionally evocative stimuli, such as photos of happy or sad faces [76]. It is said that
people with depression tend to have an attentional bias for negative information, which
may be one factor that increases vulnerability to depressive episodes.
It must be noted that there is no uniform facial landmark map. The most popular facial
landmark map is perhaps the one from the Multi-PIE database [77], a database of human
faces with 68 pre-annotated facial landmarks. An example of this scheme is given in
Figure 2.1. Nevertheless, the effectiveness of facial landmarks often depends on the
objective, and several studies have reported using their own facial landmark mapping
schemes [78] [79], including 3D-based facial landmarks [80].
Figure 2.1 Multi-PIE 68-point facial landmark scheme
2.2.2. Biosignals
An electroencephalogram (EEG) is a method to record electrical activity of the brain. It
is noninvasive – the electrodes are placed in the scalp. The invasive method for recording
electrical brain activities are called electrocorticography (ECoG). EEG has been utilized
by mental health professionals for diagnosis [47]. During an EEG recording session,
small electrodes are placed on the scalp of the head, and they are attached to a computer.
This computer measures all the electrical impulses that brain cells trade with one another,
and all the portions of the brain that are at work. As the test moves forward, the provider
of the test might ask people to either relax or perform certain types of activities, like
problem solving or storytelling, and then measure how the electrical activity changes due
to those behaviors. During “relax” session, the subjects are typically instructed to keep
their eyes closed to reduce eyeblink artifacts.
Another biomarker is brain imaging from MRI. An MRI uses strong magnetic fields and
radio waves to develop a three-dimensional view of a body part. Obtaining this
information is relatively easy, and an MRI scan is typically completed in one hour.
However, MRI scans are typically costly, and the result might not be meaningful [81].
Nevertheless, some studies report that MRI scans alone might be able to accurately
predict psychiatric disorders [82] [83].
Electrocardiograms (ECGs) and heart rate variability (HRV) have been proven to be
beneficial for diagnosing bipolar disorder. In one study, the researchers computed what
is known to cardiologists as respiratory sinus arrhythmia (RSA). At the baseline
(beginning of the study), the subjects with major depression had significantly higher RSA
than those with bipolar disorder [84].
2.2.3. Auditory features
Similar to facial expression, emotional speech also seems to be beneficial for diagnosing
mental health [85]. Various changes in the autonomic nervous system can indirectly alter
a person's speech and are beneficial for recognizing emotion. For example, speech
produced in a state of excitement (fear, anger, or joy) becomes fast, loud, and precisely
enunciated, with a higher and wider range in pitch, whereas low mood emotions such as
tiredness, boredom, or sadness tend to generate slow, low-pitched, and slurred speech
[86].
Speech patterns have been known to provide indicators of mental disorders. One study in
1921 stated that depressed patients' voices tended to have lower pitch, more monotonous
speech, lower sound intensity, and lower speech rate as well as more hesitations,
stuttering, and whispering [87]. The advantages of using speech features compared to
other features are that the symptoms are often hard to disguise and that generalization
across languages may be possible, considering the similar human vocal anatomy [88].
Nevertheless, cultural effects on human behavior should be considered when interpreting
speech analysis.
2.3. Machine learning algorithms for automatic psychiatric disorder
screening
Machine learning algorithms commonly employed for automatic psychiatric disorder
screening include [89]:
a. Support Vector Machines (SVM),
b. Gradient Boosting Machine (GBM),
c. Random Forest,
d. Naive Bayes, and
e. K-Nearest Neighbors (KNN)
2.3.1. Support vector machine
A support vector machine (SVM) is a supervised machine learning model that uses
classification algorithms for two-group (binary) classification problems [90]. To
understand what an SVM is, the following keywords must be understood: hyperplane,
support vector, and margin.
Hyperplane: The objective of an SVM is to find a hyperplane that best divides a dataset
into two classes. A hyperplane in 2D data is a line. This line is the decision boundary:
any data point that falls on one side of it is classified as class zero, and any point that
falls on the other side as class one, where "class zero" and "class one" are the two
possible class labels in this example.
Support vector: Support vectors are the data points nearest to the hyperplane, the points
of a data set that, if removed, would alter the position of the dividing hyperplane.
Margin: The distance between the hyperplane and the nearest data point from either set
is known as the margin. Since the data points nearest to the hyperplane are the support
vectors, the margin is the distance between the support vectors and the hyperplane. The
optimal hyperplane of an SVM is the hyperplane that produces the smallest classification
error while producing the greatest possible margin.
After giving an SVM model sets of labeled training data for each category, the model is
able to categorize new data. The main idea is that, based on the labeled training data,
the algorithm tries to find the optimal hyperplane which can be used to classify new data
points. In two dimensions the hyperplane is a simple line, and Figure 2.2 illustrates the
determination of support vectors in a two-class problem with two-dimensional data
points.
Figure 2.2 An example of SVM for binary classification. Red dots represent the class
“0” and cyan dots represent the class “1”. The dashed green line represents the
hyperplane.
A support vector machine takes the red and cyan data points and outputs the hyperplane
that best separates the two classes. As shown in Figure 2.2, in cases where the data are
not completely linearly separable, the hyperplane is chosen to minimize misclassification
based on the given training data and labels.
However, since an SVM is inherently a linear separator, it is unable to separate
nonlinearly separable data such as that shown in Figure 2.3a. In this case, a "kernel trick"
is required. Adding a third dimension, for example z = x² + y², causes the data to be
projected into a new space, and a slice of that space is now linearly separable (see Figure
2.3b). Common kernels utilized with SVMs include the polynomial kernel (2nd and 3rd
order), radial basis function (RBF), sigmoid, and Gaussian.
(a) (b)
Figure 2.3 The kernel trick. (a) A nonlinearly separable dataset; (b) the same dataset
after adding a third dimension via the kernel z = x² + y², now separable by a linear
plane.
Computing an SVM classifier is equivalent to minimizing:

f(\vec{w}, b) = \left[\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\, 1 - y_i(\vec{w}\cdot\vec{x}_i - b)\right)\right] + \lambda\lVert\vec{w}\rVert^2 \qquad (2.1)

Here, \vec{w} is the normal vector to the hyperplane and b is the bias. y_i is either -1
or 1, indicating the class to which \vec{x}_i belongs. The term \lambda is the tradeoff
parameter. Since f(\vec{w}, b) is a convex function of \vec{w} and b, optimization
algorithms such as gradient descent can be used.
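The objective in Eq. (2.1) can be evaluated directly. The following is a minimal NumPy sketch; the two-point dataset, the chosen hyperplane, and the zero tradeoff parameter are illustrative assumptions, not values from this study:

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Eq. 2.1: mean hinge loss plus an L2 penalty on the normal vector w.
    X is an (n, d) data matrix, y holds labels in {-1, +1}, lam is the tradeoff."""
    margins = y * (X @ w - b)                  # y_i (w . x_i - b)
    hinge = np.maximum(0.0, 1.0 - margins)     # max(0, 1 - margin)
    return hinge.mean() + lam * np.dot(w, w)   # + lambda ||w||^2

# Two linearly separable points; this hyperplane classifies both with
# margin >= 1, so the hinge term vanishes.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
loss = svm_objective(w=np.array([1.0, 0.0]), b=0.0, X=X, y=y, lam=0.0)
```

An actual SVM solver would minimize this function over w and b, for example by gradient descent as noted above.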
2.3.2. Gradient Boosting Machine
Gradient boosting is an ensemble machine learning technique [91]. It produces a strong
model by combining an ensemble of weak prediction models, typically decision trees. A
decision tree consists of nodes and edges; terminal nodes that predict the outcome are
called "leaf nodes". An illustration of a decision tree is shown in Figure 2.4. Different
from a standard decision tree, the decision trees utilized in boosting algorithms are
typically "stumps" – one-level decision trees. A stump makes a prediction based on the
value of just a single input feature.
Figure 2.4 An illustration of a decision tree. Conditions are checked iteratively until a
decision in a leaf node is reached. In this figure, the objective is classification and the
leaf nodes mark the class to which the input belongs. A stump is a decision tree of height
one: it consists only of the root node and leaf nodes.
Boosting is a method of converting weak learners into strong learners. In boosting, each
new tree is fit on a modified version of the original dataset; the strong model is built by
sequentially adding the weak models. The gradient boosting machine (GBM), or gradient
tree boosting, is a generalization of the adaptive boosting algorithm (AdaBoost) [92].
AdaBoost works by weighting the observations, putting more weight on instances that
are difficult to classify and less on those already handled well. New weak learners are
then added sequentially, and the weights from previous training rounds cause the new
learners to focus more on the difficult instances. Predictions are made by majority voting
over the weak learners' predictions, weighted by their individual accuracy.
Gradient boosting re-defines boosting as a numerical optimization problem where the
objective is to minimize the loss function of the model by adding weak learners using a
gradient-descent-like procedure. As gradient boosting is based on minimizing a loss
function, different types of loss functions can be used, resulting in a flexible technique
that can be applied to regression, multi-class classification, and more.
First, a learner is trained to predict the observations in the training dataset, and the
error is calculated. In AdaBoost, the data is re-weighted based on the misclassifications
before training the second learner. In GBM, the error is defined by a loss function, and
the second learner is fit to the residual error produced by the first learner; learners
continue to be added in this fashion until a stopping threshold is reached.
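The residual-fitting loop described above can be sketched from scratch for squared loss, where the negative gradient is simply the residual. This is an illustrative toy implementation; the stump-fitting routine, the learning rate, and the four-point dataset are assumptions for demonstration, not the setup used in this thesis:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a stump (one-level tree) minimizing squared error on targets r."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pl, pr = left.mean(), right.mean()
        err = ((left - pl) ** 2).sum() + ((right - pr) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, pl, pr)
    _, t, pl, pr = best
    return lambda q: np.where(q <= t, pl, pr)

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Start from the mean prediction, then repeatedly fit a stump to the
    current residual (the negative gradient of squared loss) and add it."""
    pred = np.full_like(y, y.mean())
    for _ in range(n_rounds):
        residual = y - pred                      # what the ensemble still gets wrong
        pred = pred + lr * fit_stump(x, residual)(x)
    return pred

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 3.0, 3.0])
fitted = gradient_boost(x, y)                    # approaches y as rounds accumulate
```

Each round shrinks the remaining residual, so the ensemble's prediction converges toward the targets.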
2.3.3. Random Forest
Random forest is also an ensemble learning algorithm based on decision trees [93]. A
random forest divides the training data into random subsets of features and random
subsets of data points, then trains decision trees on those subsets. This is called
"bagging" (bootstrap aggregating). The random sampling increases diversity among the
decision trees and leads to more robust overall predictions. Both the random sampling
and the use of numerous decision trees lead to the name "random forest". For output
prediction, a majority vote is taken among the trained decision trees; in the case of
regression, the mean or median of the predictions may be used.
Disadvantages of the random forest algorithm include the computational cost of very
deep trees. As a random forest trains numerous decision trees in parallel, a setting of
deep decision trees may burden memory and processing power. It is said that the
computational cost increases more with the depth of the decision trees than with their
number. Random forest is also affected by an inherent weakness of bagging algorithms:
sensitivity to imbalanced datasets.
2.3.4. Naive Bayes
Naïve Bayes is a probabilistic model that finds the class achieving the maximum
probability computed from a chain of conditional probabilities – for example, the
probability of an object being an "apple" given that it is red, round, and around 8 cm in
diameter. Naïve Bayes is based on Bayes' theorem:

P(c|x) = \frac{P(x|c)\,P(c)}{P(x)} \qquad (2.2)

where P(c|x) is the probability of class c given predictor (feature) x. For an independent
feature vector X = \{x_1, x_2, \ldots, x_N\}, the above equation can be written as
P(c|X) = \frac{P(x_1|c)\,P(x_2|c)\cdots P(x_N|c)\,P(c)}{P(x_1)\,P(x_2)\cdots P(x_N)} \qquad (2.3)

During classification, P(c|X) is evaluated for each class c, and the Naïve Bayes
algorithm assigns the input features to the class with the highest probability. Since the
denominator P(x_1)P(x_2)\cdots P(x_N) is the same for every class c, it can be safely
ignored when comparing the likelihoods for classification.
Naïve Bayes is computationally inexpensive and performs well if the assumption of
independence holds. However, the main limitation of Naïve Bayes also lies in this
assumption of independent predictor features: in real life, completely independent
predictors are rare.
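The comparison in Eq. (2.3) can be sketched directly; since the denominator is shared across classes, only the numerators are compared. The likelihood and prior numbers below are hypothetical, chosen only to illustrate the "apple" example:

```python
import numpy as np

def naive_bayes_classify(likelihoods, priors):
    """Pick the class c maximizing P(x1|c)...P(xN|c)P(c) (Eq. 2.3).
    The denominator P(x1)...P(xN) is identical for every class, so it
    is ignored when comparing."""
    scores = {c: np.prod(likelihoods[c]) * priors[c] for c in priors}
    return max(scores, key=scores.get)

# Hypothetical numbers: per-class likelihoods of the observed
# features (red, round, ~8 cm in diameter).
likelihoods = {
    "apple":  [0.8, 0.9, 0.7],
    "tomato": [0.9, 0.9, 0.2],
}
priors = {"apple": 0.6, "tomato": 0.4}
label = naive_bayes_classify(likelihoods, priors)
```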
2.3.5. K-Nearest Neighbors
The k-nearest neighbors algorithm (k-NN) is a method originally proposed by Thomas
Cover in 1967 that can be used for both classification and regression [94]. k-NN is a
type of instance-based learning, where the function is only approximated locally and all
computation is deferred until function evaluation.
The k-NN algorithm works by computing the distance between a query data point and
all other data points, then designating the k nearest data points as neighbors. The label
of the queried data point is then determined by its neighbors, often by majority voting.
The k in k-NN refers to the number of nearest neighbors and is a parameter to be
optimized along with the distance metric. Popular distance metrics include:
1. Euclidean distance
The Euclidean distance is the straight-line distance between two points in Euclidean
space and is defined by the following equation:

d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \qquad (2.4)

Here, d(x, y) is the distance between two n-dimensional points x and y.
2. Manhattan distance
The Manhattan distance, also known as city block distance or taxicab distance, is
defined as the sum of the lengths of the projections of the line segment between the
points onto the coordinate axes:

d(x, y) = \sum_{i=1}^{n}|x_i - y_i| \qquad (2.5)

Here, d(x, y) is the distance between two n-dimensional points x and y.
3. Hamming distance
The Hamming distance is used for categorical features. Its formula is similar to the
Manhattan distance, except that each term contributes 0 when x_i = y_i and 1 when
x_i \neq y_i; the distance is therefore the count of positions at which the features differ.
4. Minkowski distance
The Minkowski distance is the generalization of both the Euclidean distance and the
Manhattan distance. The Minkowski distance of order P is defined as:

d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^P\right)^{1/P} \qquad (2.6)

where d(x, y) is the distance between two n-dimensional points x and y. For P = 1 it
is equal to the Manhattan distance, and for P = 2 it is equal to the Euclidean distance.
P must be at least 1, as P < 1 violates the triangle inequality.
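The metrics above reduce to a single implementation of Eq. (2.6) plus a categorical variant. A short NumPy sketch, with points chosen purely for illustration:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p (Eq. 2.6)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def hamming(x, y):
    """Hamming distance: number of positions where categorical features differ."""
    return int(np.sum(x != y))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
manhattan = minkowski(x, y, p=1)   # reduces to Eq. 2.5
euclidean = minkowski(x, y, p=2)   # reduces to Eq. 2.4
mismatches = hamming(np.array(["red", "round"]), np.array(["red", "oval"]))
```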
The main weaknesses of the k-NN algorithm are that the computational cost does not
scale well with a large number of samples and that an optimal value of k must be
determined. Since the computational cost scales with the number of samples, testing
candidate k values also becomes computationally expensive. k-NN is also strongly
affected by the curse of dimensionality and by outliers, and feature scaling needs to be
performed to ensure homogeneity of the features.
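The neighbor-voting procedure itself is only a few lines. Below is a minimal sketch using the Euclidean distance on an illustrative two-cluster toy dataset (not data from this study):

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify a query point by majority vote among its k nearest
    training points under the Euclidean distance."""
    d = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    neighbors = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(neighbors, return_counts=True)
    return values[np.argmax(counts)]

# Two clusters: class 0 near the origin, class 1 near (5, 5).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
label = knn_predict(X_train, y_train, np.array([4.8, 5.2]), k=3)
```

Note that all distances are computed at query time, which is the deferred-computation property (and the scaling weakness) described above.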
2.4. Summary
In this chapter, conventional inputs and machine learning algorithms for automatic
mental health screening were discussed: section 2.2 described the conventional inputs,
and section 2.3 discussed the conventional machine learning algorithms.
Chapter 3
Facial landmark analysis from static
images
3.1. Introduction
This chapter describes the feature analysis of facial landmarks from depression patients
and dementia patients. Section 3.2 describes the data acquisition protocol. Section 3.3
describes the analysis procedure, from preprocessing up to feature analysis, as well as
the machine learning experiment setup. Section 3.4 presents the results of the analysis
and machine learning, which are discussed in section 3.5. The chapter is then concluded
in section 3.6.
3.2. Data Acquisition
The data utilized in this chapter are from the PROMPT database specified in chapter 1.
The PROMPT database's facial recordings were obtained by videotaping patients'
clinical interview sessions with a therapist. The recording apparatuses were the RealSense
R200 (Intel Corporation) and the Microsoft Kinect for Windows v2 (Microsoft
Corporation), both with a frame rate of 30 frames per second (FPS).
The recordings for the database were conducted at Keio University Hospital and the
Joint Medical Research Institute. The experiment was approved by the Keio University
Hospital Ethics Committee (20160156, 20150427). A full psychiatric screening session
consisted of a free talk session of around 10 minutes followed by a rating session of 20
minutes or more. The interview setup and the details of the screening session are as
follows:
Interview setup: During the interview, the patient and the psychiatrist were seated across
a table from each other. The psychiatrist controls the start and end of the recordings.
The distance between the video device and the patient is around 70 cm.
Free talk session: The psychiatrist conducts a typical interview concerning the patient's
daily life and mood. If the current session is the patient's first visit, the psychiatrist may
also ask about the patient's clinical background, such as family and history of other
illnesses. The results of this session typically do not contribute to the assessment of the
patient; its main objectives are to prepare the patient for the rating session and possibly
to obtain the clinical background when no such information is available. Despite the name
"free talk", this session has guidelines and is a semi-structured clinical interview. The
length of this segment is around 10 minutes.
Rating session: In the rating session, the patient is interviewed based on clinical
assessment tools related to their mental health history. This may include additional tasks
and tests, such as a clock-drawing test and a memory test for dementia screening, or
personal questions related to depression screening, such as sleep habits (PSQI) and
depressive mood in recent weeks. A single rating segment typically lasts more than 20
minutes.
3.3. Analysis
To prevent the model from learning age-related features instead of disease features and
to increase the contrast between the features, we screened the data and included only
datasets satisfying the following criteria:
(1) Recording length of 5 minutes or more. The objective of this criterion is to ensure
that enough information is recorded in the dataset.
(2) Age between 57 and 84 years old. This is to remove the effect of aging, which is
known to be positively correlated with dementia symptoms.
(3) For dementia patients: a mini-mental state examination (MMSE) score of 24 or
less accompanied by a geriatric depression scale (GDS) score of 4 or less. This is
to ensure that the dementia patients are symptomatic and are not afflicted with
a depression comorbidity.
(4) For depression patients: a 17-item Hamilton depression rating scale (HAMD17)
score of 8 or more. Similar to the dementia patient criteria, this is to ensure that
the depression patients are symptomatic. Depression patients with dementia
comorbidities are always classified as dementia patients.
The qualifying data comprised 65 datasets from 46 subjects, consisting of 36 dementia
datasets (21 subjects) and 29 depression datasets (26 subjects). To protect the identity
of the subjects, facial landmark extraction was performed using Omron's OKAO Vision
[11] to extract the X-Y coordinates, as seen in Figure 3.1. These 40 facial landmark
points are processed and analyzed instead of the raw face images. The red squares are
the eyebrows, the green diamonds are the eyes, red crosses mark the nose, and yellow
circles are the subject's mouth. The blue dot between the eyes is the glabella, while the
other blue dots mark the outline of the subject's face.
OKAO Vision uses an AdaBoost algorithm to construct a cascade of facial region learners
from confidence-rated look-up tables (LUTs) of Haar features. The facial feature point
extraction utilizes Gabor wavelet transform coefficients as feature values and an SVM
as a classifier. The SVM classifier is trained on a database to output 1 at the predefined
feature point and 0 otherwise. When detecting a feature point, the SVM searches around
the eye or mouth area, and the position with the highest confidence is taken as the facial
feature point.
Figure 3.1 Facial landmarks extracted with OKAO vision (40 points).
3.3.1. Preprocessing
The obtained facial landmarks were then normalized such that the center of the face lies
at the origin (0,0), and each landmark coordinate was divided by the face size: X-coordinates
were divided by the face width and Y-coordinates by the face height, as shown in
Equations 3.1 and 3.2.

\hat{X}_i = \frac{X_i - X_c}{Width} \qquad (3.1)

\hat{Y}_i = \frac{Y_i - Y_c}{Height} \qquad (3.2)

Here, \hat{X}_i and \hat{Y}_i denote the normalized X and Y coordinates, respectively.
X_i, X_c, Y_i, Y_c are the X-coordinates of the features, the X-coordinate of the face
center, the Y-coordinates of the features, and the Y-coordinate of the face center,
respectively.
Another preprocessing step was performed to remove outliers by discarding frames in
which landmark values fall below the 1st percentile or above the 99th percentile.
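The two preprocessing steps can be sketched as follows. The coordinate values are illustrative, and the per-landmark interpretation of the percentile rule (percentiles computed per landmark across frames) is an assumption:

```python
import numpy as np

def normalize_landmarks(X, Y, xc, yc, width, height):
    """Eq. 3.1/3.2: center landmarks on the face center, scale by face size."""
    return (X - xc) / width, (Y - yc) / height

def remove_outlier_frames(frames):
    """Drop frames in which any landmark value falls below its 1st or above
    its 99th percentile, computed per landmark across all frames."""
    lo = np.percentile(frames, 1, axis=0)
    hi = np.percentile(frames, 99, axis=0)
    keep = np.all((frames >= lo) & (frames <= hi), axis=1)
    return frames[keep]

# Illustrative pixel coordinates for two landmarks in one frame.
Xn, Yn = normalize_landmarks(np.array([100.0, 140.0]), np.array([80.0, 120.0]),
                             xc=120.0, yc=100.0, width=200.0, height=250.0)

# 100 frames of a single 2-D landmark; the last two frames are outliers.
frames = np.vstack([np.zeros((98, 2)), [[10.0, 0.0]], [[0.0, 10.0]]])
clean = remove_outlier_frames(frames)
```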
3.3.2. Feature extraction
After the preprocessing, feature extraction was performed. The features extracted in this
study were: speed statistics of each landmark, speed statistics of the face center, area of
the mouth, and the standard deviation of the eye pupils' position. The speed of each
landmark was computed using Equations 3.3 and 3.4.

S_i = \sqrt{(\hat{x}_i - \hat{x}_{i+1})^2 + (\hat{y}_i - \hat{y}_{i+1})^2} \qquad (3.3)

\hat{S}_j = \sum_{i=30(j-1)+1}^{30j} S_i \qquad (3.4)

where \hat{x}_i and \hat{y}_i denote a preprocessed landmark's coordinates at frame i.
Here, S_i is the landmark's speed at frame i, and \hat{S}_j denotes the landmark's speed
during second j. The constant 30 represents the camera's frame rate of 30 frames per
second (FPS), as stated in the data acquisition section.
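Equations 3.3 and 3.4 translate directly into code. A sketch with a synthetic landmark trajectory (the constant-velocity trajectory is an illustrative assumption):

```python
import numpy as np

def per_frame_speed(x, y):
    """Eq. 3.3: displacement of a landmark between consecutive frames."""
    return np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2)

def per_second_speed(s, fps=30):
    """Eq. 3.4: sum the per-frame speeds over each block of `fps` frames."""
    n = len(s) // fps
    return s[: n * fps].reshape(n, fps).sum(axis=1)

# A landmark moving 0.01 units per frame along x, for 2 seconds at 30 FPS.
x = np.arange(61) * 0.01
y = np.zeros(61)
s = per_frame_speed(x, y)            # 60 per-frame speeds of 0.01 each
speed_per_sec = per_second_speed(s)  # two per-second speeds of 0.3 each
```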
The areas of the left eye, right eye, and mouth were computed using Equation 3.5, the
general equation (shoelace formula) for the area of an arbitrary polygon in 2D space:

A = \frac{1}{2}\left|\sum_{i=0}^{n-1}\left(\hat{X}_i\hat{Y}_{i+1} - \hat{X}_{i+1}\hat{Y}_i\right)\right| \qquad (3.5)

where, when i = n - 1, the index i + 1 wraps around to 0.
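Equation 3.5 is the shoelace formula; a compact NumPy sketch, checked on a unit square:

```python
import numpy as np

def polygon_area(x, y):
    """Eq. 3.5 (shoelace formula); np.roll supplies the wrap-around term
    where index i + 1 becomes 0."""
    return 0.5 * abs(np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))

# Unit square: the area must be exactly 1.
x = np.array([0.0, 1.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
area = polygon_area(x, y)
```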
3.3.3. Statistical analysis
To investigate the relationship between facial features and clinical symptoms, linear
correlations of the facial features against the corresponding clinical rating tools were
computed. The clinical rating tools were HAMD17 for the depression subjects and MMSE
for the dementia subjects. In addition, Wilcoxon rank-sum tests were performed to check
for statistically significant differences between the facial features of the dementia and
depression groups.
3.3.4. Feature selection and machine learning
Feature selection is an important process to reduce the number of features used as a
machine learning input. Several important reasons to perform feature selection are to
mitigate overfitting problem, to make the model easier to interpret, and to improve the
speed of a model’s learning session. Here, we utilized Least Absolute Shrinkage and
Selection Operator (LASSO) algorithm [95] as the feature selection algorithm.
By definition, LASSO is a regression algorithm which shrinks the variable coefficients, setting some to zero or near zero. This effectively performs feature selection, since variables with near-zero coefficients can be excluded from the computation. Mathematically, LASSO solves
$\min_{\alpha,\beta}\left(\frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \alpha - \sum_{j}\beta_j x_{ij}\right)^2 + \lambda\sum_{j}|\beta_j|\right)$ (3.6)
where $\alpha$ is a scalar intercept, $\beta$ is a vector of coefficients, $N$ is the number of observations, $y_i$ is the response at observation $i$, $x_{ij}$ is the $j$-th predictor at observation $i$, and $\lambda$ is a nonnegative regularization parameter. A high value of $\lambda$ results in stricter feature selection; for this analysis, $\lambda$ was computed automatically as the largest value that yields a non-null model.
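The objective in Equation 3.6 can be evaluated directly for a given intercept and coefficient vector. The sketch below is a plain-Python illustration of the formula, not the fitting procedure used in the thesis; the function name is hypothetical.

```python
def lasso_objective(X, y, alpha, beta, lam):
    """Value of the LASSO objective (Eq. 3.6): mean squared
    residual (halved) plus an L1 penalty on the coefficients."""
    n = len(y)
    sse = sum((y[i] - alpha - sum(b * x for b, x in zip(beta, X[i]))) ** 2
              for i in range(n))
    return sse / (2 * n) + lam * sum(abs(b) for b in beta)
```

With a perfect fit the residual term vanishes and only the penalty $\lambda\sum_j|\beta_j|$ remains, which is what drives coefficients toward zero as $\lambda$ grows.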
Only features that are selected at least 10% of the time with this algorithm are utilized for machine learning; the performance of LASSO itself is not considered, only the coefficients matter in this case. This ensures that LASSO filters out features that are selected only on rare occasions. The total number of features before feature selection is 41 (40 normalized facial landmarks + facial center point).
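The 10% selection-frequency rule can be sketched as follows; each mask is a per-run boolean vector indicating which features LASSO kept. The representation and function name are assumptions for illustration.

```python
def frequent_features(selection_masks, threshold=0.10):
    """Keep indices of features selected in at least `threshold`
    (here 10%) of the LASSO runs; rarely selected features are dropped."""
    n_runs = len(selection_masks)
    n_feat = len(selection_masks[0])
    counts = [sum(mask[f] for mask in selection_masks) for f in range(n_feat)]
    return [f for f in range(n_feat) if counts[f] / n_runs >= threshold]
```

A feature selected in 1 of 10 runs sits exactly at the 10% boundary and is kept; a feature never selected is dropped.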
We then examined the possibility of differentiating depression patients and dementia patients based on facial features by utilizing supervised machine learning, with 10-fold cross-validation to measure the models' performance. The machine learning models we used were support vector machines (SVM) with various kernels: linear, polynomial of order 2, and radial basis function (RBF). Additionally, random undersampling boosted trees (RUSBoost) and AdaBoost trees were also utilized. Neural networks were not considered as the number of samples in the dataset is scarce.
3.4. Results
The Wilcoxon rank-sum test between the dementia and depression groups found statistically significant differences (p<0.05) in the following features: left and right eye features including pupils but not eyebrows, glabella, nose features, and distance between eyelids. None of the mouth features showed a significant difference between the depression and dementia groups, nor did the jaw and ear features. Pearson's correlation analysis between the clinical screening tools and the features found almost no statistically significant correlations (p<0.05); one feature in the depression group, area of mouth, showed a significant negative correlation with HAMD17 (p=0.0154, R=-0.4457), and one feature in the dementia group, the distance between eyelids, showed a significant negative correlation with MMSE (p=0.0146, R=-0.4037).
Twenty-two (22) features were selected with the LASSO algorithm; these features were selected in all feature selection cases, hence the high number. These features were utilized for machine learning and are listed in Table 3.1.
Table 3.1 Features selected with LASSO

Average speed: Left pupil, left eye (bottom), right pupil, left nose, right nose, right eyebrow (bottom & left), right jaw, right ear (bottom)
Median speed: Right pupil, upper lip (bottom)
Standard deviation of speed: Left eye (top), right eye (top & bottom), glabella, left eyebrow (bottom), right eyebrow (top, bottom, and left), left jaw, right jaw, left ear (bottom), right ear (top)
95th percentile of speed: Right nose, left eyebrow (top & left), right eyebrow (bottom), left ear (bottom)
The best-performing SVM model was the one with the polynomial kernel of order 2, with an average accuracy of 81.37±15.03%. The second best was the SVM with linear kernel, with an average accuracy of 77.86±16.54%, whilst the RBF kernel performed the worst, with an average accuracy of 69.46±15.96%. The detailed results are described in Table 3.2.
Table 3.2 Machine learning Results
Models Accuracy
SVM-polynomial 81.37±15.03%
SVM-linear 77.86±16.54%
SVM-RBF 69.46±15.96%
RUSBoost 69.19±20.92%
AdaBoost 80.71±12.22%
3.5. Discussion
The Pearson’s correlation analysis result of the mental health screening tools and the
facial features was very interesting. For each group, only one feature showed statistically
significant correlation but none of those features were not significant difference between
the two groups. With this result, it seemed that the facial features chosen in this chapter
were not a good predictor for depression or dementia, but it might still be possible to
differentiate depression and dementia patient using those features as supported by the
result of Wilcoxon rank-sum test. Eye, glabella, and nose features seemed to be the best
features for differentiating depression and dementia as those features were also chosen
by LASSO in the machine learning experiment.
The best performance among the SVM models came from the polynomial kernel of order 2, and the worst from the RBF kernel. The AdaBoost trees performed quite well, taking overall second best with the lowest variance; however, the RUSBoost trees performed the worst. RUSBoost is a machine learning algorithm aiming to solve the class imbalance problem. In this case, the class sizes were 46 against 29 and appeared to be imbalanced, but RUSBoost did not seem to work properly. As stated in Chapter 1, there is no algorithm aiming to separate depression and dementia to date. Automatic depression screening studies with facial features as input report accuracies of 87.67% [96], 82% [97], and 74.5% [98]. Dementia studies based on facial features are almost non-existent, but automatic dementia screening studies using other types of input report accuracies of 94.73% (MRI) [99] and 88% (questionnaire) [100].
Overall, both AdaBoost and SVM with polynomial kernel might be considered the best
models in this case.
3.6. Summary
The results from the facial feature analysis show that dementia patients and depression patients have statistically significant differences in the facial features utilized. The two groups can be differentiated even with traditional machine learning techniques such as SVM. This result suggests the possibility of automatic pseudodementia screening by utilizing machine learning. The interesting result from the Pearson's correlation analysis suggests that other facial features, such as facial expressions, might be considered instead of speed statistics of facial landmarks and mouth area.
Chapter 4
Robust facial landmark tracking
algorithm
4.1. Introduction
In contrast with other chapters, this chapter focuses on the proposal of a robust facial landmark tracking algorithm. Real-time monitoring, or even faster diagnosis, is always preferable. The trade-off between performance and number of samples (and, indirectly, speed) has been an inherent problem for machine learning, and it is even more apparent in real-time processing. One huge advantage of faster machine-learning-based psychiatric disorder screening, besides the human-computer interaction factors, is the possibility of applying adaptive machine learning algorithms instead of static models.
In addition to real-time analysis, facial feature point tracking that is robust against camera shake and face orientation changes is very important. It is impractical and pointless to ask a clinical patient to stay still in front of the recording apparatus. Not only does it place a burden on the patient, the features needed for analysis are also polluted by the patient's conscious effort to keep facing the camera. Therefore, a facial feature point tracking algorithm that is robust to head movement is desirable.
The rest of this chapter is structured as follows. In section 4.2 the criticism of conventional
facial landmark tracking algorithms is described. In section 4.3 the proposed
improvement is described. The experiment is reported in section 4.4 and the results are discussed in section 4.5. Finally, the chapter is concluded in section 4.6.
4.2. Conventional facial tracking analysis
Conventional facial feature point detection methods typically estimate the coordinates of future feature points by a regression process. One such algorithm is the Supervised Descent Method (SDM) [101]. In SDM, the positions of estimated feature points are improved by iteratively applying the original estimation of feature points to input images, calculating local binary pattern (LBP) values around the original estimation, and multiplying the LBP with weights pre-trained by means of machine learning. As a result, the estimated feature points progressively converge towards the actual face in the image. The SDM optimizes the error function between the estimated face shape and the correct shape, and thus does not need recursive algorithms such as Newton's method, only weights from training images. In recent years, various improvements to the SDM have been proposed, concerning alternative feature extraction methods [102], weight learning algorithms such as deep learning [103], and the positioning of initial points [104].
The SDM will fail to learn the weights for updating feature points if there is no unique solution in the feature space, for example a non-front-facing picture, or if the optimization falls into a local optimum. To prevent this problem, [105] proposed to divide the training set such that the objective function always reaches its global optimum, by means of principal component analysis or similar algorithms. However, this makes it necessary to evaluate the regions at each step, in order to update the facial feature estimation, in the testing phase as well. Additionally, real-time implementation is impossible as the computational cost is high.
To solve this problem, [106] proposed Cascaded Compositional Learning (CCL), a direct improvement of [105]. During the training phase, the weights are learned for each region similarly to [105], but in the test phase, all possible facial feature points from all regions are obtained and then combined by applying the likelihood of each feature point. As a result, it is possible to predict the face shape while avoiding local pitfalls, and the method is superior in both computational cost and performance. The initial shape of each frame is suggested to be the average value of the face feature points.
Although this is beneficial for still images, in cases where prior information is available, such as in tracking problems, it is inefficient to recalculate all the steps from the start. Also, in cases where multiple people are in a single picture, there is a possibility that the algorithm switches to the other person in the next frame. Therefore, it is theoretically possible to improve the algorithm for the tracking case by incorporating transition information between frames. A simple solution would be to assume a linear motion model such as constant movement in one direction. However, this would cause the tracking to fail if an unexpected motion such as camera shake occurs. Additionally, in the field of computer vision tracking, optical flow has been reported to yield satisfactory results [107].
Therefore, a CCL-based facial feature point tracking method using optical flow is developed and reported in this chapter. For the very first frame, the algorithm uses the average face, similarly to the conventional studies. From the second frame onwards, the initial value is estimated using optical flow and then CCL is performed, achieving effective facial feature landmark tracking.
4.3. Proposed method
In general, facial feature point tracking by CCL can be divided into three parts: facial feature point mapping, regression learning of weights for each region, and shape estimation by composite vectors. In addition, a pre-trained random forest model for estimating feature values Φ from input image I and face shape S is necessary.
4.3.1. Supervised descent method
In SDM, facial feature points are expressed as $S = [x_1, y_1, \ldots, x_M, y_M]$ where $M$ is the number of feature points. In the general SDM, when an initial face shape $S^0$ and input image $I$ are given, shape prediction is performed by $N$-step regression. The regression equation of the $n$-th stage is as follows:
$S^n = S^{n-1} + W^n \Phi^n(I, S^{n-1})$ (4.1)
where $\Phi^n(I, S^{n-1})$ is a feature value at stage $n$ based on the input image $I$ and the face shape $S$ at the $(n-1)$-th stage, and $W^n$ is a weight vector for updating the face shape.
The feature value $\Phi$ can be obtained by means of the scale-invariant feature transform (SIFT), the histogram of oriented gradients (HOG), or the difference between pixel values of two images. However, for cases of side-view faces and occlusion, Local Binary Features (LBF) are said to be effective [102].
In SDM, Equation 4.1 is not solved by recursive algorithms such as Newton's method, but by utilizing pre-trained weights, with the aim of shortening the computation time. However, as described above, SDM tends to fall into local optima when there is no unique solution. Therefore, by dividing the loss function into K domains of homogeneous descent (DHD), optimal weights $W_k^n$ can be obtained per domain, negating the pitfall of local optima. The division into DHDs is achieved by dimensionality reduction with Principal Component Analysis (PCA). Conventionally, the weight $W_k^n$ in each region is calculated by ridge regression as described in the following equation.
$\min_{W_k^n}\sum_{i\in T_k}\left\|\hat{S}_i - S_i^{n-1} - W_k^n\Phi^n(I, S_i^{n-1})\right\|_2^2 + \lambda\left\|W_k^n\right\|_F^2$ (4.2)
Here, the integer $i$ refers to the $i$-th training image, $\hat{S}_i$ is the correct face shape, $T_k$ is the training image set of the $k$-th region, and $W_k^n$ is the $n$-th weight in the $k$-th region. In addition, $\lambda$ is a regularization parameter. $W_k^n$ is estimated by the Support Vector Regression (SVR) algorithm with the kernel trick [108].
As the testing phase revolves around moving pictures, each frame can be assumed to depend on the previous frame. Therefore, in the test phase of the proposed method, the feature points predicted by the optical flow from the previous frame are set as the initial values:
$S_t^0 = S_{t-1}^N + v(S_{t-1})$ (4.3)

where $v(S_{t-1})$ is the optical flow between the current and previous frame, defined by $v(S_{t-1}) = [\Delta x_1, \Delta y_1, \ldots, \Delta x_M, \Delta y_M]$.
4.3.2. Compositional vector estimation
The feature value Φ mentioned in the previous section is a quantity derived from Local Binary Features (LBF). The LBF are adaptively extracted by using the fern model on the difference in brightness between two points taken randomly from around the facial feature points S of the input image I. In CCL, class labels considering the appearance of feature points are also input to the LBF at the same time, in order to extract feature quantities based on Hough Forest [108]. Hough Forest is a method of selecting between two evaluation functions according to hierarchy when defining a branching function, and it can obtain features robust to conditions such as occlusion. In this case, the two branch functions were selected according to the ratio of the labels of the samples input to the node.
For each image in each iteration, the DHD region is first computed and then the weights $W_k^n$ for each DHD region are estimated. However, this estimation requires the correct face shape $\hat{S}_i$, which is not available in the testing phase. Therefore, in CCL, the regions are not estimated; instead, a composite vector $p = [p_1, p_2, \ldots, p_K]$ which combines the predicted shapes $S_k$ is computed. To calculate the composite vector, information such as the offset vector is deleted from the feature quantity $\Phi$ extracted from the tentative predicted shape $S_k$, and the composite vector feature quantity $\Phi'$ of compressed class label information is trained into the fern model.
4.3.3. Training composite vectors
The correspondence relationship between the composite vector feature quantity 𝛷’ and
the composite vector described in the previous section is represented by 𝑔 in the following
equation.
$\min_{g}\sum_{i\in T}\left\|\hat{S}_i - S_i\right\|_2^2 \quad \text{s.t.}\ \ S_i = \sum_{k=1}^{K} p_{ik} S_{ik},\ \ p_i = g(\Phi'),\ \ p_i \ge 0,\ \ \|p_i\|_1 = 1$ (4.4)
The fern model is used to learn the function $g$ relating $\Phi'$ to $p_i$. The branching parameter $\theta_j$ that minimizes the sum of the errors and the resulting composite vectors $p^{(d)}$ ($d \in \{L, R\}$) are stored respectively:
$\min_{\theta, p^{(d)}}\sum_{d=L,R}\sum_{i\in Q}\left\|\hat{S}_i - S_i\right\|_2^2 \quad \text{s.t.}\ \ S_i = \sum_{k=1}^{K} (p^{(d)})_k S_{ik},\ \ p^{(d)} \ge 0,\ \ \|p^{(d)}\|_1 = 1$ (4.5)
In the test phase, the feature $\Phi'_{test}$ is inputted and branched based on $\theta_j$, and then $p$ is computed by solving the following error minimization problem, using the set of training images held by the terminal nodes that have been reached, $Q_{set}$:
$\min_{p}\sum_{i\in Q_{set}}\left\|\hat{S}_i - S_i\right\|_2^2 \quad \text{s.t.}\ \ p \ge 0,\ \ \|p\|_1 = 1,\ \ p_k = 0 \ \text{for}\ k \notin K$ (4.6)
The face shape in each region is obtained by multiplying the obtained $p$ with the weights for each region calculated during learning. This process is repeated N times to ensure an accurate predicted face shape.
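The final combination step, in which per-region shape estimates are blended by the composite vector, can be sketched as below. This is an illustrative Python sketch of the weighted combination only, not the full CCL pipeline; names are assumptions.

```python
def composite_shape(p, shapes):
    """Blend per-region shape estimates S_k into one face shape using
    composite vector p (p_k >= 0, sum(p) == 1), as in CCL's test phase.
    Each shape is a flat coordinate list [x1, y1, ..., xM, yM]."""
    n_coords = len(shapes[0])
    return [sum(p[k] * shapes[k][c] for k in range(len(shapes)))
            for c in range(n_coords)]
```

Because the weights are nonnegative and sum to one, the blended shape is a convex combination of the candidate shapes, which keeps it inside the span of the regional estimates.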
4.4. Experiment
In order to verify the effectiveness of the proposed method, a comparative experiment
was conducted using the benchmark data set.
The AFLW dataset [109], one of the most challenging datasets, was utilized for training. AFLW is a dataset of face images in real environments obtained from the online photo sharing service Flickr. In the AFLW dataset, 21 facial feature points, including the right and left eyes, eyebrows, nose, mouth, both ears, and jaw, are included in each image. These were annotated manually by humans.
Approximately 20% of the images show occlusion of some facial feature points due to eyeglasses, hands, etc. In the experiment, 1,000 frames were randomly selected from all 24,386 frames, and, as in the previous study, a total of 19 of the 21 points, excluding the 2 points of both ears, were used. Each facial feature point is also classified into positive (+) and negative (-) labels for Hough Forest, depending on whether occlusion is occurring and whether the distance from the average face shape is within the search range at the time of feature extraction.
The test dataset is the 300-VW database [110] [111] [112]. The 90th movie, a moving image including face direction changes, was used for evaluation. Feature point annotations are associated with each of the 649 frames of the moving image. Among the 68 points included in the 300-VW data, only the 19 points corresponding to the ones selected in the AFLW dataset are used as the reference face shape for evaluation.
In the proposed method, based on face feature point detection by CCL, the optical flow from the previous frame is used as the initial value. The initial face shape in the comparison method is determined by fitting the average face to the face area established by the Viola-Jones algorithm. We used the mean error rate, the normalized average Euclidean distance between the correct face shape and the estimated face shape over all feature points, as the evaluation metric. In this experiment, a PC with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz and Matlab R2016a were used for evaluation. The parameters utilized are shown in Table 4.1.
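The mean error rate can be sketched as follows. This is a Python illustration, not the thesis's Matlab code; the thesis does not state the normalizing length explicitly, so the `norm` argument (commonly the inter-ocular distance in landmark benchmarks) is an assumption.

```python
import math

def mean_error_rate(pred, truth, norm):
    """Average point-to-point Euclidean error between predicted and
    ground-truth shapes, divided by a normalizing length `norm`
    (e.g. inter-ocular distance; the exact normalizer is assumed)."""
    errs = [math.hypot(px - tx, py - ty)
            for (px, py), (tx, ty) in zip(pred, truth)]
    return (sum(errs) / len(errs)) / norm
```

Normalizing by a face-scale length makes errors comparable across faces of different sizes in the frame.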
4.5. Results and discussion
Figure 4.1 shows the facial feature points utilized in this experiment. The blue dots
indicate feature points from 300VW and red diamonds are 19 of 21 facial feature points
utilized in AFLW dataset.
The experimental results are shown in Figure 4.2. The green dots are the facial feature points, and the yellow rectangles are the bounding boxes for face detection. The mean error rate of 3.97 obtained with the conventional method decreases to 3.02 with the proposed method, indicating the effectiveness of the proposed approach.
Figure 4.1 Comparison of facial feature points from AFLW dataset and 300-VW. The
blue dots indicate feature points from 300VW and red diamonds are 19 of 21 facial feature
points utilized in AFLW dataset. In this chapter, only these 19 landmarks are tracked.
Table 4.1 Parameters utilized in the experiment

L: Number of face features — 19
N: Number of iterations — 5
K: Number of DHDs — 8
Number of point pairs to sample — 500
Sample overlap (%) — 20
Number of decision trees — 5
Depth of decision trees — 4
Figure 4.2 Experiment result (conventional vs. proposed method)
4.6. Conclusion
In this chapter, a new facial feature point tracking method based on CCL with optical flow was described. It was shown that continuous tracking is possible even in situations where tracking previously failed due to face orientation fluctuation and occlusion. In the experiment, we compared the error with the conventional method using the 300-VW dataset and confirmed the effectiveness of the proposed method. Since parameter optimization has not yet been performed and tracking speed has not been taken into consideration, in future work we will search for the parameters with the best real-time accuracy and verify the effectiveness in a real environment.
Chapter 5
Speech feature analysis for classification
of depression and dementia
5.1. Introduction
This chapter describes the feature analysis of acoustic features from depression patients and dementia patients, in particular spectral and temporal features of speech. In section 5.2, the data acquisition protocol is described. In section 5.3, the analysis procedure is described, and the results are presented in section 5.4. The discussion of the results is given in section 5.5. The chapter is then concluded in section 5.6.
5.2. Data Acquisition
Similar to chapter 3, the data utilized in this chapter is from the PROMPT database. The PROMPT database's audio recordings were obtained by recording patients' clinical interview sessions with a therapist. The recording apparatus was a Beyerdynamic Classis RM30W microphone (beyerdynamic GmbH & Co. KG) with a 16 kHz sampling rate. For uniformity, the database filtration criteria are largely similar to those of chapter 3.
5.3. Analysis
The analysis is largely divided into two parts: statistical analysis and a machine learning experiment. The machine learning experiment is further divided into three stages. For the statistical analysis and the first and second stages of machine learning, several datasets were removed from the PROMPT database in consideration of age features and the presence of symptoms. This is similar to chapter 3's data filtration criteria with one difference: the minimum recording length. Successful visual recordings obtained in chapter 3 are fewer in number than successful audio recordings, and the audio recordings are on average longer than the visual recordings. In this chapter, the minimum length for a recording to be utilized is 10 minutes, twice the requirement utilized in chapter 3.
The third stage of the machine learning experiment utilized the datasets which were filtered out for the statistical analysis and the first and second stages. Figure 5.1 illustrates the dataset filtering for the statistical analysis and machine learning phases.
Figure 5.1 Dataset Filtration in Statistical Analysis
5.3.1. Audio signal analysis
In some rare cases, the recordings contained outliers, possibly caused by random errors, so preprocessing of the raw data needed to be conducted. We defined the outliers using the inter-quartile range (IQR). A point in the audio recording is defined to be an outlier if it satisfies one of the following conditions:
1. $X < Q_1 - 1.5 \cdot IQR$
2. $X > Q_3 + 1.5 \cdot IQR$
Here, X is the signal, Q1 is the lower (1st) quartile, Q3 is the upper (3rd) quartile, and IQR is the inter-quartile range, computed by subtracting Q1 from Q3. We then apply cubic smoothing spline fitting to the audio signal without the outliers. The objective of this method is twofold: (1) to interpolate the removed outliers, and (2) to remove subtle noise.
Additionally, intensity normalization was performed to ensure that the data are on an equal scale to each other and to reduce clipping in the audio signals. The normalization was conducted by rescaling the signal such that the maximum absolute value of its amplitude is 0.99. Continuous silence, in the form of trailing zeroes at the front and end of the recordings, was also deleted.
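The IQR outlier rule and the amplitude normalization can be sketched as below. This is a Python illustration of the two rules only (the spline interpolation step is omitted); function names are assumptions, and `statistics.quantiles` uses the default exclusive quartile method, which may differ slightly from the thesis's Matlab quartiles.

```python
import statistics

def iqr_outlier_mask(x):
    """Flag samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(x, n=4)  # [Q1, median, Q3]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < lo or v > hi for v in x]

def normalize_amplitude(x, target=0.99):
    """Rescale so the maximum absolute amplitude equals `target`."""
    peak = max(abs(v) for v in x)
    return [v * target / peak for v in x]
```

A single extreme spike in an otherwise flat signal is flagged by the mask, and normalization leaves the waveform shape unchanged while bounding its peak below clipping level.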
A total of ten acoustic features were extracted from the raw data: pitch, harmonics-to-noise ratio (HNR), zero-crossing rate (ZCR), Mel-frequency cepstral coefficients (MFCC), gammatone cepstral coefficients (GTCC), mean frequency, median frequency, signal energy, spectral centroid, and spectral rolloff point, with details in Table 5.1. These features were chosen as they represent both temporal and spectral properties of a signal. Additionally, some of these features relate closely to speech, which is a common biomarker for both depression and dementia [113] [114] [115]. These features were computed once every 10 ms by applying a 10 ms window with no overlap, and feature extraction was performed on the windowed signals. The total count of audio features is 36, with 14 MFCCs and 14 GTCCs. As we used data with a length of at least 10 min, a minimum of 60,000 datapoints were obtained for each feature. We then computed the mean, median, and standard deviation (SD) of the datapoints and used them for statistical analysis and machine learning, resulting in a total feature count of 108.
Table 5.1 Acoustic features utilized in this chapter

Pitch — [116]
HNR — [117]
ZCR — $ZCR(X) = \frac{1}{2N}\sum_{i}^{N}|sgn(X_i) - sgn(X_{i-1})|$
MFCC — [118]
GTCC — [119]
Mean frequency — Mean of the power spectrum of the signal
Median frequency — Median of the power spectrum of the signal
Signal energy — $E(X) = \sigma(X)/\mu(X)$
Spectral centroid — $c = \sum_{i=b_1}^{b_2} f_i s_i \big/ \sum_{i=b_1}^{b_2} s_i$ [120]
Spectral rolloff point — $\sum_{i=b_1}^{r} s_i = \frac{k}{100}\sum_{i=b_1}^{b_2} s_i$ [120]

For ZCR: N, sgn, and $X_i$ denote the length of the signal, the signum function extracting the sign of a real number (positive, negative, or zero), and the i-th sample of signal X, respectively. For mean frequency and median frequency: the power spectrum of the signal was obtained by performing a Fourier transform. For signal energy: E(X) is the signal energy of signal X, $\sigma(X)$ denotes the standard deviation of signal X, and $\mu(X)$ the mean of signal X. For spectral centroid: c denotes the spectral centroid, $f_i$ is the frequency in Hertz corresponding to bin i, $s_i$ is the spectral value at bin i, and $b_1$ and $b_2$ are the band edges, in bins, over which to calculate the spectral centroid. For spectral rolloff point: r is the spectral rolloff frequency, $s_i$ is the spectral value at bin i, and $b_1$ and $b_2$ are the band edges, in bins, over which to calculate the rolloff point.
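Two of the closed-form features from Table 5.1 can be sketched directly. This is an illustrative Python sketch (the thesis used Matlab); function names are assumptions.

```python
def zero_crossing_rate(x):
    """ZCR from Table 5.1: (1/2N) * sum_i |sgn(x_i) - sgn(x_{i-1})|."""
    sgn = lambda v: (v > 0) - (v < 0)  # signum: +1, 0, or -1
    n = len(x)
    return sum(abs(sgn(x[i]) - sgn(x[i - 1])) for i in range(1, n)) / (2 * n)

def spectral_centroid(freqs, spec):
    """Spectral centroid: amplitude-weighted mean of the bin frequencies."""
    return sum(f * s for f, s in zip(freqs, spec)) / sum(spec)
```

A signal that flips sign at every sample approaches the maximum ZCR, while a flat spectrum places the centroid at the middle of the band.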
5.3.2. Statistical analysis
To investigate the relationship between audio features and clinical symptoms, linear correlations of the acoustic features against the corresponding clinical rating tools were computed. The clinical rating tools were HAMD for depression subjects and MMSE for dementia subjects. In addition, two-tailed t-tests were performed to check for statistical significance, with p-values adjusted using Bonferroni correction. Additionally, the correlations of age and sex with the clinical rating tools were evaluated for validation purposes.
5.3.3. Machine Learning
Machine learning was performed in three stages: (1) to examine the possibility of automatic pseudodementia diagnosis with unsupervised learning, (2) to examine the possibility of automatic pseudodementia diagnosis with a supervised classifier, and (3) to validate its robustness against non-age-matched datasets. The unsupervised learning algorithm utilized for the first stage was k-means clustering, with k = 2 and the squared Euclidean distance metric. For stages 2 and 3, the machine learning model utilized was a binary classifier: a support vector machine (SVM) with linear kernel, 3rd-order polynomial kernel, and radial basis function (RBF) kernel [90]. The hyperparameter for both the linear kernel and the polynomial kernel is the cost parameter C, while the RBF kernel has two hyperparameters: C and gamma. Hyperparameter optimization was performed using a grid search algorithm with values ranging from 0.001 to 1000. The linear kernel was chosen as it allows the visualization of feature contributions, as opposed to SVMs with nonlinear kernels. For the second stage, the machine learning session was performed using nested 10-fold cross-validation, defined as follows:
1. Split the datasets into ten smaller groups, maintaining the ratio of the classes
2. Perform ten-fold cross validation using these datasets.
For each fold:
(a) Split the training group into ten smaller subgroups.
(b) Perform another ten-fold cross-validation using these subgroups.
For each inner fold:
i. Perform LASSO regression [95] and obtain the coefficients. The
performance of the LASSO model is not considered.
ii. Mark the features with coefficient of less than 0.01.
(c) Perform feature selection by removing features with 10 marks obtained from
step 2-b-ii.
(d) Train an SVM model based on features from (c).
3. Compute the average performance and standard deviation of the models.
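The inner-fold feature-marking rule (steps 2-b and 2-c above) can be sketched as follows. This is an illustrative Python sketch; `inner_coeffs` stands in for the LASSO coefficient vectors obtained from the ten inner folds, and the function name is an assumption.

```python
def select_features(inner_coeffs, tol=0.01):
    """Steps 2-b/2-c: in each inner fold, mark a feature whose LASSO
    coefficient magnitude is below `tol`; drop features marked in
    every fold (i.e. features LASSO never finds useful)."""
    n_folds = len(inner_coeffs)
    n_feat = len(inner_coeffs[0])
    marks = [sum(abs(c[j]) < tol for c in inner_coeffs)
             for j in range(n_feat)]
    return [j for j in range(n_feat) if marks[j] < n_folds]
```

A feature with a sizable coefficient in even one inner fold survives; only features shrunk to (near) zero in all ten folds are removed before training the SVM.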
In the third stage, an SVM model was trained using the age-matched subjects and the features selected in the second stage. The resulting model's performance was evaluated against the filtered-out subjects: young depression and old dementia subjects. In both cases, the dementia patients were labelled as class 0 (negative) and the depression patients as class 1 (positive). The phases are illustrated in Figure 5.2.
Figure 5.2 Flowchart of the supervised machine learning procedure. The first and second phases used age-matched symptomatic depression and dementia subjects. Since the first phase consists only of unsupervised clustering, it is omitted here. The second phase consists of conventional training and evaluation. The third phase involves applying the machine learning model trained on age-matched subjects to non-age-matched subjects.
5.3.4. Evaluation Metrics
We utilized eight metrics to evaluate the effectiveness of the machine learning model, all of which are computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In this study, the class depression was labelled as "positive" and dementia as "negative". The TP, FP, TN, and FN values were obtained from the confusion matrix. Based on the confusion matrices, the evaluation metrics of observed accuracy, true positive rate (TPR / sensitivity), true negative rate (TNR / specificity), positive predictive value (PPV / precision), negative predictive value (NPV), F1-score, Cohen's kappa, and Matthews correlation coefficient (MCC) can then be computed. The formulas for computing these metrics are described in Table 5.2. These are conventional metrics utilized in performance evaluation. Metrics related to inter-rater reliability, such as Cohen's kappa and MCC, were included to ensure validity of measurement in cases of imbalanced samples.
Table 5.2 Accuracy metrics

Accuracy (ACC) — $ACC = \frac{TP + TN}{TP + TN + FP + FN}$
True Positive Rate (TPR) — $TPR = \frac{TP}{TP + FN}$
True Negative Rate (TNR) — $TNR = \frac{TN}{TN + FP}$
Positive Predictive Value (PPV) — $PPV = \frac{TP}{TP + FP}$
Negative Predictive Value (NPV) — $NPV = \frac{TN}{TN + FN}$
F1 score — $F1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$
Cohen's kappa — $EXP = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{(TP + TN + FP + FN)^2}$, $Kappa = \frac{ACC - EXP}{1 - EXP}$
Matthews Correlation Coefficient (MCC) — $MCC = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
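The formulas in Table 5.2 can be computed from the four confusion-matrix counts in one pass. The sketch below is a Python illustration (the thesis used Matlab); the function name and the returned dictionary layout are assumptions.

```python
import math

def metrics(tp, fp, tn, fn):
    """Evaluation metrics from Table 5.2, computed from confusion counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    tpr = tp / (tp + fn)          # sensitivity
    tnr = tn / (tn + fp)          # specificity
    ppv = tp / (tp + fp)          # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (acc - exp) / (1 - exp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "TPR": tpr, "TNR": tnr, "PPV": ppv,
            "NPV": npv, "F1": f1, "kappa": kappa, "MCC": mcc}
```

A perfect balanced classifier (all TP and TN, no errors) attains ACC, kappa, and MCC of 1.0; kappa and MCC drop faster than ACC under class imbalance, which is why they are included here.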
5.4. Results
A total of 419 datasets (300 depression and 119 dementia) from 120 subjects (depression n = 77, dementia n = 43) were available from the PROMPT database. After age-matching, only 177 datasets (89 depression and 88 dementia) from 53 participants (depression n = 24, dementia n = 29) qualified for the first and second stages of machine learning. The test dataset for the third stage of machine learning consisted of the young depression patients and old dementia patients: 242 datasets (211 depression and 31 dementia) from 67 patients (depression n = 53, dementia n = 14). Details of the subject demographics are described in Table 5.3.
Table 5.3 Subject demographics

Demographics — Depression / Dementia
Symptomatic:
n (dataset / subject) — 300 / 77 vs 119 / 43
age (mean ± s.d. years) — 50.4 ± 15.1 vs 80.8 ± 8.3
sex (female %) — 54.5 vs 72.1
Age-matched:
n (dataset / subject) — 89 / 24 vs 88 / 29
age (mean ± s.d. years) — 67.8 ± 7.1 vs 77.0 ± 7.5
sex (female %) — 83.3 vs 72.4
Young depression, old dementia:
n (dataset / subject) — 211 / 53 vs 31 / 14
age (mean ± s.d. years) — 42.5 ± 10.4 vs 88.5 ± 1.9
sex (female %) — 41.5 vs 71.4
5.4.1. Statistical analysis
Pearson's correlation found significant correlations with the clinical interview tools for
GTCCs 1, 3, 12 and MFCCs 1, 3, 4, 7, 12. The average absolute correlation coefficient
R was 0.264 (SD 0.049). The highest absolute correlation with statistical significance
(p < 0.05) was |R| = 0.346 for depression and |R| = 0.400 for dementia. Features
significantly correlated with depression tended to yield weak to moderate negative
Pearson correlations (average |R| ± SD = 0.289 ± 0.05), whereas features significantly
correlated with dementia tended to yield weak to moderate positive correlations (average
|R| ± SD = 0.281 ± 0.06). The corrected two-tailed t-test showed significant differences
in HNR, ZCR, GTCC coefficients 4–14, mean frequencies, median frequencies, MFCC
coefficients 4–13, spectral centroid, and spectral rolloff points. No significant difference
was found in pitch and energy.
No significant correlation was found between sex and the clinical assessment tools
(depression R = 0.021, p = 0.853; dementia R = 0.142, p = 0.928). Age had no significant
correlation with the depression assessment tools (R = 0.097, p = 0.403), but a significant,
moderate correlation between age and the dementia assessment tools was found
(R = 0.424, p = 0.0046).
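The per-feature correlation screening described above can be sketched as follows; the feature matrix, clinical score vector, and feature names are hypothetical placeholders, not the actual PROMPT variables:

```python
import numpy as np
from scipy.stats import pearsonr

def screen_features(features, scores, names, alpha=0.05):
    """Correlate each feature column with a clinical score (e.g. HAMD or
    MMSE) and keep those reaching significance at level alpha."""
    significant = []
    for j, name in enumerate(names):
        r, p = pearsonr(features[:, j], scores)
        if p < alpha:
            significant.append((name, r, p))
    return significant
```

Running this once per clinical scale yields the sign pattern discussed below: the same feature can correlate negatively with one scale and positively with the other.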
5.4.2. Machine learning
This section presents the results of the machine learning experiments. The evaluation
results from unsupervised learning with the k-means algorithm are shown in Table 5.4.
For the SVM with linear kernel, 26 features were completely rejected in feature selection
and were therefore removed when building the model for the second phase. The rejected
features were related to pitch, GTCCs 1–3, MFCCs 1–3, signal energy, spectral centroid,
and spectral cutoff point. Feature selection for the SVM with 3rd-order polynomial kernel
removed 28 features, related to pitch, GTCCs and MFCCs (1–3, 12–13), signal energy,
spectral centroid, and spectral cutoff. LASSO with the RBF-SVM similarly rejected 28
features related to low-order (1–4) and high-order (10–13) MFCC and GTCC coefficients,
pitch, signal energy, spectral centroid, and spectral cutoff. Machine learning evaluation
results for phase 2 are shown in Table 5.5–Table 5.7, and the results for phase 3 are shown
in Table 5.8. Results with and without the LASSO algorithm are also shown in these
tables to confirm the effectiveness of feature selection. Here, the label "positive"
represents depression patients and "negative" represents dementia patients.
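A minimal sketch of LASSO-based feature selection, assuming scikit-learn and illustrative data; the function name, the alpha value, and the regression-style target are assumptions and do not reproduce the exact selection criterion used in this study:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_select(X, y, alpha=0.05):
    """Return indices of features whose LASSO coefficients are nonzero.
    Features with exactly zero coefficients are discarded, mirroring the
    rejection of non-contributing features described in the text."""
    Xs = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=alpha).fit(Xs, y)
    return np.flatnonzero(lasso.coef_)

# Hypothetical follow-up: keep only the selected columns and fit an SVM,
# e.g. SVC(kernel="rbf").fit(X[:, lasso_select(X, y)], labels)
```

Because LASSO's L1 penalty drives uninformative coefficients exactly to zero, the surviving column indices can be reused directly as the reduced feature set for each SVM kernel.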
Table 5.4 Unsupervised machine learning result

Metric                                    Score (%)
Accuracy (ACC)                            62.7
True Positive Rate (TPR)                  89.9
True Negative Rate (TNR)                  35.2
Positive Predictive Value (PPV)           58.3
Negative Predictive Value (NPV)           77.5
F1 score                                  70.8
Cohen's kappa                             25.2
Matthews Correlation Coefficient (MCC)    30.0
Table 5.5 Supervised machine learning result - SVM with linear kernel

                                          Training (mean ± SD %)       Testing (mean ± SD %)
Metric                                    No LASSO      With LASSO     No LASSO      With LASSO
Accuracy (ACC)                            90.1 ± 2.4    95.2 ± 0.7     84.2 ± 5.3    93.3 ± 7.7
True Positive Rate (TPR)                  94.4 ± 0.9    98.3 ± 0.9     88.8 ± 10.5   97.8 ± 4.7
True Negative Rate (TNR)                  85.7 ± 4.6    92.6 ± 1.2     79.6 ± 11.5   89.4 ± 13.7
Positive Predictive Value (PPV)           87.1 ± 3.5    92.1 ± 1.2     82.5 ± 8.8    90.4 ± 11.7
Negative Predictive Value (NPV)           93.8 ± 1.0    98.4 ± 0.8     88.8 ± 8.9    98.0 ± 4.2
F1 score                                  90.6 ± 2.0    95.1 ± 0.7     84.8 ± 5.5    93.5 ± 7.2
Cohen's kappa                             80.2 ± 4.7    90.5 ± 1.4     68.3 ± 10.5   86.7 ± 15.0
Matthews Correlation Coefficient (MCC)    80.5 ± 4.4    90.6 ± 1.4     69.8 ± 10.3   87.8 ± 13.5
Table 5.6 Supervised machine learning result - SVM with 3rd-order polynomial kernel

                                          Training (mean ± SD %)       Testing (mean ± SD %)
Metric                                    No LASSO      With LASSO     No LASSO      With LASSO
Accuracy (ACC)                            91.5 ± 3.1    94.6 ± 8.1     79.1 ± 7.6    89.7 ± 11.4
True Positive Rate (TPR)                  96.4 ± 2.4    99.1 ± 1.0     85.3 ± 10.8   96.7 ± 5.4
True Negative Rate (TNR)                  86.5 ± 4.0    90.0 ± 16.1    72.6 ± 14.3   83.1 ± 22.9
Positive Predictive Value (PPV)           87.9 ± 3.5    92.3 ± 9.9     76.9 ± 8.3    87.6 ± 13.8
Negative Predictive Value (NPV)           95.9 ± 2.7    98.9 ± 1.2     84.1 ± 9.9    96.9 ± 5.0
F1 score                                  91.9 ± 2.9    95.3 ± 6.1     80.3 ± 6.9    91.1 ± 8.2
Cohen's kappa                             82.9 ± 6.3    89.2 ± 16.2    58.0 ± 15.2   79.7 ± 21.7
Matthews Correlation Coefficient (MCC)    83.3 ± 6.2    90.1 ± 13.7    59.4 ± 14.6   81.8 ± 17.9
Table 5.7 Supervised machine learning result - SVM with RBF kernel

                                          Training (mean ± SD %)       Testing (mean ± SD %)
Metric                                    No LASSO      With LASSO     No LASSO      With LASSO
Accuracy (ACC)                            90.4 ± 6.2    95.6 ± 1.9     75.3 ± 12.4   88.7 ± 7.9
True Positive Rate (TPR)                  96.4 ± 2.9    98.8 ± 1.0     77.5 ± 16.6   91.0 ± 10.3
True Negative Rate (TNR)                  84.3 ± 10.2   92.4 ± 3.0     72.9 ± 17.3   86.1 ± 13.1
Positive Predictive Value (PPV)           86.7 ± 7.9    93.0 ± 2.6     75.6 ± 13.8   88.3 ± 10.4
Negative Predictive Value (NPV)           95.7 ± 3.7    98.6 ± 1.2     77.6 ± 14.6   91.3 ± 8.9
F1 score                                  91.2 ± 5.4    95.8 ± 1.7     75.7 ± 12.5   89.1 ± 7.9
Cohen's kappa                             80.8 ± 12.3   91.2 ± 3.7     50.5 ± 24.8   77.3 ± 15.9
Matthews Correlation Coefficient (MCC)    81.5 ± 11.7   91.4 ± 3.6     51.8 ± 25.0   78.3 ± 15.4
Table 5.8 Machine learning result against non-age-matched dataset

                                          Linear             Polynomial         RBF
Metric                                    All     LASSO      All     LASSO      All     LASSO
Accuracy (ACC)                            83.5    82.6       80.2    81.4       65.7    81.0
True Positive Rate (TPR)                  87.7    83.9       82.5    82.9       66.8    82.9
True Negative Rate (TNR)                  54.8    74.2       64.5    71.0       58.1    67.7
Positive Predictive Value (PPV)           93.0    95.7       94.1    95.1       91.6    94.6
Negative Predictive Value (NPV)           39.5    40.4       35.1    37.9       20.5    36.8
F1 score                                  90.2    89.4       87.9    88.6       77.3    88.4
Cohen's kappa                             36.5    42.8       34.6    39.3       13.9    37.3
Matthews Correlation Coefficient (MCC)    37.2    45.7       37.0    42.2       17.3    39.9
5.5. Discussion
As a result, significant correlations with the clinical interview tools were found for
GTCCs 1, 3, 12 and MFCCs 1, 3, 4, 7, 12. The sign of Pearson's rho differs: negative
correlations were observed for HAMD and positive correlations for MMSE. This suggests
that these features are important for both depression and dementia, and also for
differentiating between the two. Note also that the highest absolute correlation with
significance (p < 0.05) was 0.346 for HAMD and 0.400 for MMSE, suggesting weak to
moderate correlations between the audio features and the clinical rating scores.
The corrected t-test between these features (Figure 5) showed statistical differences only
in certain features. Interestingly, the standard deviation of a rather high-order MFCC
coefficient showed a significant difference, even though most of the information is
normally represented in the lower-order coefficients, whose distributions are important
for speech analysis.
Statistical comparison of acoustic features between the two groups found significant
differences in both temporal and spectral features. No significant difference between the
groups was found in pitch and energy, both in the family of temporal features. Although
the result of the unsupervised clustering algorithm was not satisfactory, both the accuracy
and the inter-rater agreement show that performance was better than chance, indicating
underlying patterns in the data. In the second part of machine learning, feature selection
was performed using the LASSO algorithm. Here, both pitch and signal energy features
were rejected, alongside several spectral features. Considering that pitch and signal
energy also showed no statistical significance in the t-test, it can be inferred that these
features do not contribute to the classification of depression and dementia. In contrast,
GTCCs 4–14 and MFCCs 4–14 showed statistically significant differences and were also
selected by the LASSO algorithm. GTCCs and MFCCs are similar features, related to the
tones of human speech. Although the GTCC was not developed for speech analysis, both
are commonly used in speech recognition systems [121], [122]. This finding is consistent
with the fact that a person's speech characteristics may be related to their mental health
[85].
Surprisingly, the best SVM result was obtained with the linear kernel, although the scores
were only slightly superior to those of the nonlinear SVMs. Additionally, the
effectiveness of the LASSO algorithm for feature selection was evaluated, with an
interesting result. In the second phase, all SVM models benefited from LASSO feature
selection, but in the third phase the nonlinear SVMs benefited the most. This might be
explained by the LASSO algorithm itself: LASSO is a penalized linear regression, and
the feature selection step essentially discards features that contribute nothing to that linear
model. The linear SVM is similar to such a model, so the selection was partly redundant
in this case.
Nevertheless, high accuracy and inter-rater agreement were obtained from the models in
both machine learning phases. For comparison, studies [123], [124], [125], [50], and
[126] reported accuracies of 87.2%, 81%, 81.23%, 89.71%, and 73%, respectively, for
predicting depression. [127] reports 73.6% accuracy for predicting dementia, and [128]
reports 99.9% TNR and 78.8% TPR. However, most of these studies compared healthy
subjects against symptomatic patients, whereas our study compared patients afflicted
with different mental disorders. Additionally, most conventional studies measured
depression by questionnaire rather than by clinical examination, so a fair comparison
cannot be made. The low NPV and inter-rater agreement scores in the third phase may be
due to the fact that the third-phase evaluation used a heavily imbalanced dataset with
more samples than the training phase. These results suggest the possibility of using audio
features for automatic pseudodementia screening.
5.6. Conclusions
We recorded the audio of clinical interview sessions of depression patients and dementia
patients in a clinical setting using an array microphone. Statistical analysis showed
significant differences in audio features between depressed patients and dementia patients.
A machine learning model was constructed and evaluated; considerable performance was
recorded in distinguishing depression patients from dementia patients. Feature
contribution analysis revealed MFCC and GTCC features to be the highest-contributing
features, with the 9th and 4th MFCC features contributing most. Based on our findings,
we conclude that automated pseudodementia screening with machine learning is feasible.
Chapter 6
Multimodal feature analysis in
depression patients and dementia
patients
6.1. Introduction
This chapter describes the analysis of both visual and acoustic features from depression
patients and dementia patients. Section 6.2 describes the data acquisition protocol, and
section 6.3 the analysis procedure; the results are presented and discussed in section 6.4.
The chapter is concluded in section 6.5.
6.2. Data acquisition
As in chapters 3 and 5, the data utilized in this chapter are from the PROMPT database.
The PROMPT database's audio recordings were obtained by recording patients' clinical
interview sessions with a therapist, using a Beyerdynamic Classis RM30W microphone
(Beyerdynamic GmbH & Co. KG) at a 16 kHz sampling rate. The visual recordings were
obtained using the following devices: RealSense R200 (Intel Corporation) and Microsoft
Kinect for Windows v2 (Microsoft Corporation), both with a frame rate of 30 frames per
second (FPS). Data screening was performed in consideration of the possible effect of
age and gender distribution on the severity of the psychiatric disorders. To maintain
uniformity with chapters 3 and 5, the screening criteria are largely unchanged. The
minimum recording length utilized in this chapter was 5 minutes, matching chapter 3;
5 minutes was selected instead of 10 minutes because few video recordings are 10 minutes
or longer, and the shorter threshold increases the number of samples.
After the screening, we successfully obtained data from 174 participants (MDD N = 97,
age mean ± s.d. = 49.39 ± 14.97, female ratio = 54.64%; dementia N = 77, age mean ±
s.d. = 79.86 ± 8.15, female ratio = 63.64%). The dataset was then split into a male
subgroup and a female subgroup: of the 174 participants, only 46 recordings from female
subjects (MDD N = 22, dementia N = 24) have multimodal features, and from male
subjects there were only 6 such recordings (MDD N = 1, dementia N = 5).
6.3. Analysis
6.3.1. Facial feature analysis
The features are similar to those utilized in chapters 3 and 5. As in chapter 3, facial
landmark extraction was performed using Omron's OKAO Vision. As preprocessing,
outlier removal was performed by applying cubic spline interpolation to frames in which
landmarks have values below the 1st percentile or above the 99th percentile. After
preprocessing, feature extraction was performed: the features extracted from facial
landmarks were speed statistics of each landmark and of the face center.
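The outlier-interpolation step can be sketched as follows for a single landmark coordinate trajectory; the function name is illustrative, and the SciPy-based implementation is an assumption rather than the code actually used in this study:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def despike_landmark(track):
    """Replace frames outside the 1st-99th percentile range of a landmark
    trajectory with values interpolated by a cubic spline fitted on the
    remaining (inlier) frames."""
    track = np.asarray(track, dtype=float)
    lo, hi = np.percentile(track, [1, 99])
    inlier = (track >= lo) & (track <= hi)
    t = np.arange(len(track))
    spline = CubicSpline(t[inlier], track[inlier])
    cleaned = track.copy()
    cleaned[~inlier] = spline(t[~inlier])  # fill outlier frames smoothly
    return cleaned
```

Applying this per landmark coordinate removes isolated tracking spikes before the speed statistics are computed.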
6.3.2. Audio feature analysis
Similar to chapter 5, audio intensity normalization was applied to the recordings as
preprocessing, such that the maximum absolute value of the signal is 0.99. Additionally,
continuous silence, in the form of leading and trailing zeros at the beginning and end of
the recordings, was removed. Afterwards, feature extraction was performed with a sliding
window of 10 ms with no overlap, and four features were extracted: Mel-frequency
cepstral coefficients (MFCC), harmonics-to-noise ratio (HNR), zero-crossing rate (ZCR),
and signal power.
The smaller feature set reflects the results of chapter 5: only features retained by the
LASSO algorithm in chapter 5 are utilized. MFCC was chosen because it corresponds to
human speech, while HNR, ZCR, and signal power were chosen to represent the temporal
characteristics of the audio recordings. Finally, the statistical features of MFCC, HNR,
and ZCR were computed and used as features: mean, median, standard deviation, kurtosis,
and skewness. GTCC was not used because it largely correlates with MFCC.
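The preprocessing and the windowed temporal features (ZCR and signal power) can be sketched as follows; MFCC and HNR extraction would require an additional audio library and is omitted here, and the function names are illustrative:

```python
import numpy as np

def preprocess(signal):
    """Trim leading/trailing zeros, then scale so max |amplitude| = 0.99."""
    signal = np.trim_zeros(np.asarray(signal, dtype=float))
    return 0.99 * signal / np.max(np.abs(signal))

def frame_features(signal, fs=16000, win_ms=10):
    """ZCR and signal power over non-overlapping 10 ms windows."""
    n = int(fs * win_ms / 1000)                       # 160 samples at 16 kHz
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    power = np.mean(frames ** 2, axis=1)              # mean-square power
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return zcr, power
```

The mean, median, standard deviation, kurtosis, and skewness would then be computed over each framewise feature sequence to form the final descriptors.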
6.3.3. Statistical analysis
The statistical analysis is similar to that of chapters 3 and 5. For each feature, Pearson's
correlation with the clinical assessment result was measured. Then, the Wilcoxon
rank-sum test was performed on features extracted from the MDD group against those
extracted from the dementia group. The objective of the correlation analysis is to find
features that correlate positively with one group's assessment score but negatively with
the other's, hence their utility as pseudodementia markers. The Wilcoxon test was
performed to identify features with statistically significant differences between the MDD
group and the dementia group. Then, statistical comparison between audio features and
facial features was performed.
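The group comparison can be sketched as follows, assuming SciPy's rank-sum implementation; the arrays and feature names are placeholders for the actual extracted features:

```python
import numpy as np
from scipy.stats import ranksums

def compare_groups(mdd_feats, dem_feats, names, alpha=0.05):
    """Wilcoxon rank-sum test per feature column between the MDD and
    dementia groups; return the names of features with p < alpha."""
    hits = []
    for j, name in enumerate(names):
        stat, p = ranksums(mdd_feats[:, j], dem_feats[:, j])
        if p < alpha:
            hits.append(name)
    return hits
```

The rank-sum test is preferred here over the t-test because it makes no normality assumption about the feature distributions.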
6.3.4. Machine learning
Various machine learning models were compared to determine which perform best for
automatic pseudodementia screening. The models utilized were support vector machines
with various kernels (linear, polynomial, radial basis function), AdaBoost, random forest,
and random under-sampling (RUS) boosted trees. As in chapters 3 and 5, LASSO was
utilized for feature selection with the same criterion (chosen at least 10% of the time).
Machine learning with multimodal inputs was also tested. However, since the available
male patients' recordings are heavily imbalanced, the multimodal experiment was not
performed on the male recordings alone; instead, the performance of machine learning
with multimodal features was examined using the combined dataset of the male and
female subgroups.
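The model comparison, with the validation scheme switched by sample size as described in section 6.4.3 (leave-one-out for the small male subgroup, 10-fold otherwise), can be sketched with scikit-learn as follows; RUSBoosted trees are omitted because they require a separate library (e.g. imbalanced-learn), and the sample-size threshold is illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut, StratifiedKFold

def compare_models(X, y, small_sample=60):
    """Mean cross-validated accuracy per classifier; leave-one-out is used
    when too few samples are available for a stable 10-fold split."""
    cv = LeaveOneOut() if len(y) < small_sample else StratifiedKFold(
        n_splits=10, shuffle=True, random_state=0)
    models = {
        "SVM linear": SVC(kernel="linear"),
        "SVM cubic": SVC(kernel="poly", degree=3),
        "SVM RBF": SVC(kernel="rbf"),
        "AdaBoost": AdaBoostClassifier(),
        "Random forest": RandomForestClassifier(random_state=0),
    }
    return {name: cross_val_score(m, X, y, cv=cv).mean()
            for name, m in models.items()}
```

Fixing the fold shuffling and forest seeds, as above, keeps the comparison between models repeatable.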
6.4. Results and Discussion
6.4.1. Demographics
The demographics of the screened dataset are shown in Table 6.1. As expected, datasets
from female patients greatly outnumbered those from male patients. Such imbalance may
bias the result towards recordings from female patients. Conversely, the small number of
samples in the male subgroup decreases statistical power, making type II errors likely.
Table 6.1 Number of recordings

Number of recordings     Female                   Male
                         MDD       Dementia      MDD       Dementia
Audio                    57        49            13        14
Face                     24        34            3         8
Both                     22        24            1         5
6.4.2. Statistical analysis
Pearson's correlation with the clinical assessment tools yielded interesting results. Facial
features of the MDD-female group showed a very strong significant correlation (p < 0.05)
between the median face-center speed and HAMD17 (R = 0.9998). The MDD-male group
showed no significant correlation between facial features and HAMD17. Audio features
showed significant correlations (p < 0.05) of moderate strength (average |R| = 0.6645) in
both the female and male subgroups, albeit in different features: in the male subgroup,
signal strength, HNR, and MFCCs 1 and 2; in the female subgroup, ZCR and MFCCs 1,
2, 3, and 4.
In the dementia group, facial features of the male subgroup showed no significant
correlation with MMSE. Facial features of the female subgroup showed significant
inverse correlations of moderate strength (average |R| = 0.4481) for the kurtosis and
skewness of the right pupil, right eye (top and bottom), glabella, and bottom nose (left
and right points). Audio features of the dementia group showed significant correlations
(p < 0.05) in both subgroups: in the male subgroup, the median of HNR (R = 0.6645) and
the skewness of MFCC 2 (R = -0.6324); in the female subgroup, all features (signal power,
ZCR, HNR, MFCCs 1–7; average |R| = 0.4331). The results of the Pearson correlation
analysis imply that dementia and MDD have different feature profiles, especially in the
MFCC audio features: the MFCC coefficients tend to correlate positively with MMSE
(the dementia score) and negatively with HAMD17 (the MDD score).
The Wilcoxon rank-sum test between the female dementia and MDD groups showed
significant differences in all facial and audio features, whilst the male group showed
significant differences only in audio features. This implies that audio features might be a
better predictor for screening pseudodementia. Nevertheless, the number of samples in
the male subgroup is very small, and the result might be biased.
6.4.3. Machine learning
The accuracies of the machine learning models are given in Table 6.2. The validation
method differs between the female and male groups because of the insufficient number
of male samples: leave-one-out validation is applied to the male group, and 10-fold
cross-validation to the female group. As shown in the results, acceptable classification
accuracy was achieved in both groups. For unimodal features, the best performing models
are the nonlinear SVMs and random forests, while boosted trees perform well only in the
male group with facial features as predictors. Overfitting seems to be a problem here, as
the boosted trees (AdaBoost) tend to have lower accuracy than random forest or
RUSBoosted trees.
Multimodal features seem to improve accuracy relative to the worst unimodal cases, but
models trained on unimodal features remain the best performing. Overall, multimodal
features did not improve accuracy compared with the best cases. However, some
combinations of features and machine learning model did not work well; for example,
male patients' facial features alone did not perform well, but with multimodal features the
performance was generally better. It seems that utilizing multimodal features improves
the worst-case performance but not the average performance. One hypothesis is that audio
features and facial features are correlated with each other, so combining them does not
drastically improve machine learning performance.
Table 6.2 Machine learning results (accuracy, %)

                      Female (10-fold)              Male (leave-one-out)    Both (10-fold)
Models                Audio    Face    Audio+Face   Audio    Face           Audio+Face
SVM Linear            84.9     79.3    84.3         92.6     63.6           89.8
SVM Cubic             84.9     79.3    84.2         96.3     63.6           70.5
SVM RBF               85.8     81      64.9         92.6     72.7           80.1
AdaBoost              53.8     58.5    70.0         51.9     72.7           69.7
Random Forest         80.2     86.2    73.8         81.5     45.5           62.2
RUSBoosted Trees      73.6     74.1    73.0         70.4     36.4           76.3
6.5. Conclusion
We recorded the clinical interview sessions of MDD patients and dementia patients in a
clinical setting and extracted audio and facial features from the recordings. An
imbalanced dataset of mostly female patients was obtained. Statistical analysis showed
significant correlations of both MDD patient scores and dementia patient scores with
audio and facial features. MFCC features show a positive correlation with the dementia
score and an inverse correlation with the MDD score, suggesting their effectiveness for
pseudodementia screening. Statistical differences in audio features between depressed
patients and dementia patients were also found. Several machine learning models were
constructed and evaluated. The resulting performance was considerable, with the worst
performance from the male group's models using facial feature predictors and the best
from the male group's models using audio feature predictors. Nevertheless, the male
group's sample size is considerably smaller than the female group's, so more reliable
results can be inferred from the female group: 85.8% accuracy using audio features,
86.2% using facial features, and 84.2% using both.
Based on our findings, we conclude that automated pseudodementia screening with
machine learning is feasible. Future work should address the overfitting problem, as
adding predictors increases the overfitting bias.
Chapter 7
Conclusion
7.1. Summary of this thesis
Diagnosis from a licensed medical practitioner is very important. The current norm is that
the patient and the practitioner ideally meet face-to-face in order to perform an
examination. However, with advances in communication science and technology, fewer
and fewer examinations require the medical practitioner to be physically present near the
patient. The practice of diagnosing, or sometimes even treating, a disease without the
medical practitioner being physically present is called telehealth or telemedicine.
In telehealth, the medical practitioner communicates with the patient via the internet or
other remote meeting platforms; or, in even rarer cases, the medical practitioner is
replaced by human-computer interface software. This is possible because smart devices
are becoming the norm of everyday life, and some of them are equipped with sensors to
detect human biomarkers. A person's blood pressure and heart rate, for example, can be
easily monitored using smartwatches. Needless to say, only a few devices and software
products are clinically validated as yet, and data from non-validated devices or software
may lead to misdiagnosis. The clinical validation of such devices is therefore critical:
before a device becomes publicly available for sale, appropriate clinical validation must
be performed and a license must be obtained from the appropriate licensing agency (in
Japan's case, an application must be made to the PMDA: Pharmaceuticals and Medical
Devices Agency).
The goals of telemedicine are mainly accessibility, communication quality, and self-care.
Accessibility in normal times refers to rural or isolated communities and to people with
limited mobility, time, or transportation options. With the recent coronavirus (COVID-19)
pandemic in 2020, another potential advantage of telehealth was found: accessibility for
those in quarantine. During the pandemic, telemedicine has become more alluring, both
for patients and for medical care providers. At the very least, the use of telemedicine
reduces or eliminates the risk of COVID-19 transmission.
Psychiatric screening using telehealth is also important. The social distancing and
remote-work recommendations from governments across the world have impacted not
only the economy but also the mental health of workers. Public health actions such as
social distancing can make people feel isolated and lonely and can increase stress and
anxiety. Additionally, fear and anxiety about a new disease and what could happen can
be overwhelming and cause strong emotions in adults and children. Also, factitious
disorders and medical mimics exist even in the field of psychiatry. This research is one
step in the direction of telehealth and engineering for diagnosing psychiatric disorders
whose symptoms may mimic each other.
Another type of telemedicine is remote patient monitoring (RPM), which allows
healthcare providers to monitor patients' health data from afar, usually while the patient
is at home. RPM can significantly cut down the time a patient needs to spend in hospital,
instead letting them recover under monitoring at home. It is especially effective for
chronic conditions such as heart disease, diabetes, and asthma. Technology that allows
patients to monitor themselves for these conditions has existed for many years, but today,
vital health data can be shared with doctors and other healthcare professionals remotely.
Cutting-edge equipment can transmit basic medical data to doctors automatically,
allowing them to provide a much better level of care and to keep an eye out for the earliest
signs of trouble. An improvement to a facial tracking algorithm, which is very important
for both automatic mental health assessment and remote patient monitoring, was also
covered in this dissertation.
7.2. Conclusion
In chapters 3, 5, and 6, analyses of acoustic features, facial features, and their fusion from
depression patients and dementia patients were performed. Correlation analysis between
features and mental health assessment tools found that acoustic features have statistically
significant correlations of moderate strength, whether positive or negative. On the other
hand, only two of the facial features chosen in this dissertation have significant
correlations with the assessment tools. Speed statistics of facial landmarks might not be
suitable for mental health assessment, and emotion recognition might be more beneficial
in this case. Nevertheless, the minimum recording length of 5 minutes may also play a
role in both the statistical analysis and the machine learning results: less information is
available in 5-minute facial recordings than in 10-minute audio recordings, which may
reduce accuracy or correlation values.
Comparison of the similarities and differences between acoustic features and facial
features of depression patients and dementia patients was performed. In contrast with the
correlation analysis, several acoustic and facial features were found to differ significantly
between the two groups. This implies the possibility of classification using the features
chosen in this dissertation.
A machine learning experiment for classifying depression and dementia patients was
performed. The machine learning models are relatively simple; nevertheless, satisfactory
results of more than 80% accuracy were obtained in all cases. With more data and more
complex machine learning algorithms, higher accuracy might be obtainable.
A pose-robust facial landmark tracking algorithm, beneficial both for automatic screening
and for telehealth in general, was also proposed. The algorithm yields low RMSE between
the estimated facial landmarks and the ground truth.
7.3. Suggestions for future research
This research is by no means complete. Future research directions include:
1. Usage of a generalized database
The current database is from the PROMPT study, which was conducted in Japan; as
a result, all subjects are Japanese. Although some features, such as speech, are said
to transcend language and culture, validation is still important. Validation of this
research using databases from non-East Asian countries is recommended.
2. Usage of deep learning algorithms
In this dissertation, all machine learning models are relatively simple. The reasoning
was that simple machine learning models are often interpretable and beneficial for
feature analysis. Now that the characteristics of depression and dementia have been
found, more attention to machine learning performance is recommended.
Additionally, deep learning algorithms generally perform better than traditional
machine learning algorithms such as the ones utilized in this dissertation.
3. Usage of more modalities or advanced features
The modalities utilized in this dissertation are speech and facial landmarks. These
are basic features, and more advanced features might improve performance.
Suggested additional modalities include the content of the conversation (e.g. via
natural language processing), social network analysis, wearable sensors, and
internet-of-things sensors for the human body.
4. Consideration of more psychiatric diseases
As stated above, numerous medical conditions frequently encountered in the
emergency department (ED) can mimic psychiatric disorders, not counting the
psychiatric disorders that can mimic one another. This is especially important
considering that most automatic diagnosis studies only consider a particular type of
disease.
References
[1] World Health Organization, "Mental disorders," [Online]. Available:
https://www.who.int/news-room/fact-sheets/detail/mental-disorders. [Accessed June 2020].
[2] Ministry of Health, Labour and Welfare of Japan (厚生労働省), "認知症施策の総合的な推進について(参考資料)," June 2020. [Online]. Available:
https://www.mhlw.go.jp/content/12300000/000519620.pdf. [Accessed July 2020].
[3] E. Ohnuki-Tierney, Illness and culture in contemporary Japan : an anthropological view, New
York: Cambridge University Press, 1984.
[4] National Police Agency (警察庁), "令和元年中における自殺の状況," 17 March 2020.
[Online]. Available:
https://www.npa.go.jp/safetylife/seianki/jisatsu/R02/R01_jisatuno_joukyou.pdf. [Accessed
July 2020].
[5] American Psychiatric Association, Diagnostic and statistical manual of mental disorders (5th
ed.), Arlington: American Psychiatric Association, 2013.
[6] M. Folstein, S. Folstein and P. McHugh, "“Mini-mental state”: A practical method for grading
the cognitive state of patients for the clinician," Journal of Psychiatric Research, vol. 12, no.
3, pp. 189-198, 1975.
[7] A. Budson and P. Solomon, "Chapter 2 - Evaluating the Patient with Memory Loss or
Dementia," in Memory Loss, Alzheimer's Disease, and Dementia A Practical Guide for
Clinicians, Edinburgh, Elsevier, 2017, pp. 5-38.
[8] C. Hughes, L. Berg, W. Danziger, L. Coben and R. Martin, "A new clinical scale for the staging
of dementia," Br J Psychiatry, vol. 140, no. 6, pp. 566-572, 1982.
[9] D. Wechsler, WMS-R: Wechsler Memory Scale–Revised, San Antonio: Psychological
Corporation, 1987.
[10] X. Meng, A. Brunet, G. Turecki, A. Liu, C. D'Arcy and J. Caron, "Risk factor modifications
and depression incidence: a 4-year longitudinal Canadian cohort of the Montreal Catchment
Area Study," BMJ Open, vol. 7, no. 6, 2017.
[11] X. Zhou, B. Bi, L. Zheng, Z. Li, H. Yang, H. Song and Y. Sun, "The Prevalence and Risk
Factors for Depression Symptoms in a Rural Chinese Sample Population," PLoS One, vol. 9,
no. 6, 2014.
[12] H. Razzak, A. Harbi and S. Ahli, "Depression: Prevalence and Associated Risk Factors in the
United Arab Emirates," Oman Med J. , vol. 34, no. 4, p. 274–282., 2019.
[13] M. Shadrina, E. Bondarenko and P. Slominsky, "Genetics Factors in Major Depression
Disease," Front Psychiatry, vol. 9, p. 334, 2018.
[14] F. Lohoff, "Overview of the Genetics of Major Depressive Disorder," Curr Psychiatry Rep.,
vol. 12, no. 6, pp. 539-546, 2010.
[15] M. Hamilton, "A rating scale for depression," J Neurol Neurosurg Psychiatry, vol. 23, no. 1, pp. 56-62, 1960.
[16] S. Gautam, A. Jain, M. Gautam, V. Vahia and S. Grover, "Clinical Practice Guidelines for the
management of Depression," Indian J Psychiatry, vol. 59, no. Suppl 1, 2017.
[17] J. Jakobsen, C. Gluud and I. Kirsch, "Should antidepressants be used for major depressive
disorder?," BMJ Evidence-Based Medicine, vol. 25, no. 130, 2020.
[18] M. Sado, A. Ninomiya, R. Shikimoto, B. Ikeda, T. Baba, K. Yoshimura and M. Mimura, "The
estimated cost of dementia in Japan, the most aged society in the world," PLoS One, vol. 13,
no. 11, 2018.
[19] P. Lichtenberg, D. Murman and A. Mellow, Handbook of Dementia: Psychological,
Neurological, and Psychiatric Perspectives, John Wiley & Sons, 2004.
[20] L. Kiloh, "Pseudo-dementia," Acta Psychiatrica Scandinavica, vol. 37, no. 4, pp. 336-351, 1961.
[21] A. Burns and D. Jolley, "Pseudodementia: History, mystery and positivity," in Troublesome
Disguises: Managing Challenging Disorders in Psychiatry: Second Edition, Wiley-Blackwell,
2015, pp. 218-230.
[22] B. Pitt and G. Yousef, "Depressive pseudodementia," Current Opinion in Psychiatry, vol. 10,
no. 4, pp. 318-321, 1997.
[23] S. Sahin, T. O. Onal, N. Cinar, M. Bozdemir, R. Cubuk and S. Karsidag, "Distinguishing
Depressive Pseudodementia from Alzheimer Disease: A Comparative Study of Hippocampal
Volumetry and Cognitive Tests," Dementia and Geriatric Cognitive Disorders Extra, vol. 7,
no. 2, pp. 230-239, 2017.
[24] H. Kang, F. Zhao, L. You, C. Giorgetta, V. D, S. S and P. R, "Pseudo-dementia: A
neuropsychological review," Ann Indian Acad Neurol, vol. 17, no. 2, pp. 147-154, 2014.
[25] S. Montgomery and M. Asberg, "A New Depression Scale Designed to be Sensitive to
Change," The British Journal of Psychiatry, vol. 134, no. 4, pp. 382-389, 1979.
[26] A. Beck, Depression: Causes and Treatment, Philadelphia: University of Pennsylvania Press,
1972.
[27] R. Young, J. Biggs, V. Ziegler and D. Meyer, "A rating scale for mania: reliability, validity
and sensitivity," British Journal of Psychiatry, vol. 133, no. 5, pp. 429-435, 1978.
[28] D. Buysse, C. Reynolds, T. Monk, S. Berman and D. Kupfer, "The Pittsburgh sleep quality
index: A new instrument for psychiatric practice and research," Psychiatry Research, vol. 28,
no. 2, pp. 193-213, 1989.
[29] J. Cummings, M. Mega, K. Gray, S. Rosenberg-Thompson, D. Carusi and J. Gornbein, "The
Neuropsychiatric Inventory: comprehensive assessment of psychopathology in dementia,"
Neurology, vol. 44, no. 12, 1994.
[30] E. Giles, K. Patterson and J. Hodges, "Performance on the Boston Cookie theft picture description task in patients with early dementia of the Alzheimer's type: Missing information," Aphasiology, vol. 10, no. 4, pp. 395-408, 1996.
[31] M. Field, "Telemedicine: A Guide to Assessing Telecommunications in Health Care," J Digit Imaging, vol. 10, no. 28, 1997.
[32] M. Langarizadeh, M. Tabatabaei, K. Tavakol, M. Naghipour, A. Rostami and F. Moghbeli,
"Telemental Health Care, an Effective Alternative to Conventional Mental Care: a Systematic
Review," Acta Inform Med, vol. 25, no. 4, pp. 240-246, 2017.
[33] G. Diamond, L. Suzanne, K. Bevans, J. Fein, M. Wintersteen, A. Tien and T. Creed,
"Development, validation, and utility of internet-based, behavioral health screen for
adolescents," Pediatrics, vol. 126, no. 1, 2010.
[34] Z. Adams, E. McClure, K. Gray, C. Danielson, F. Treiber and K. Ruggiero, "Mobile devices
for the remote acquisition of physiological and behavioral biomarkers in psychiatric clinical
research," Journal of Psychiatric Research, vol. 85, pp. 1-14, 2017.
[35] M. Castiglioni and F. Laudisa, "Toward psychiatry as a ‘human’ science of mind. The case of
depressive disorders in DSM-5," Front Psychol, vol. 5, 2014.
[36] M. Fakhoury, "Artificial Intelligence in Psychiatry," in Frontiers in Psychiatry. Advances in
Experimental Medicine and Biology, Singapore, Springer, 2019, pp. 119-125.
[37] P. Doraiswamy, C. Blease and K. Bodner, "Artificial intelligence and the future of psychiatry:
Insights from a global physician survey," Artificial Intelligence in Medicine, vol. 102, 2020.
[38] P. Arean, K. Ly and G. Andersson, "Mobile technology for mental health assessment,"
Dialogues Clin Neurosci, vol. 18, no. 2, pp. 163-169, 2016.
[39] K. Kroenke, R. Spitzer and J. Williams, "The PHQ-9: validity of a brief depression severity
measure," Journal of General Internal Medicine, vol. 16, no. 9, pp. 606-613, 2001.
[40] S. Hardt and D. MacFadden, "Computer assisted psychiatric diagnosis: experiments in software
design," Comput Biol Med, vol. 17, no. 4, pp. 229-237, 1987.
[41] V. Bagga, K. Kahol and S. Chandra, "Game Design for Pre-screening Patients with Mental
Health Complications Using ICT Tools," in International Conference on Ambient Media and
Systems, Athens, 2013.
[42] A. Dezfouli, H. Ashtiani, O. Ghattas, R. Nock, P. Dayan and C. Ong, "Disentangled
behavioural representations," in Neural Information Processing Systems Conference,
Vancouver, 2019.
[43] R. Mandryk, M. Birk, A. Lobel, M. Rooij, I. Granic and V. Abeele, "Games for the Assessment
and Treatment of Mental Health," in The ACM SIGCHI Annual Symposium on Computer-
Human Interaction in Play, Amsterdam, 2017.
[44] A. Sano, A. Phillips, A. Yu, A. McHill, S. Taylor, N. Jaques, C. Czeisler, E. Klerman and R. Picard, "Recognizing academic performance, sleep quality, stress level, and mental health using personality traits, wearable sensors and mobile phones," in 12th International Conference on Wearable and Implantable Body Sensor Networks, Cambridge, 2015.
[45] S. Abdullah and T. Choudhury, "Sensing Technologies for Monitoring Serious Mental
Illnesses," IEEE Multimedia, vol. 25, pp. 61-75, 2018.
[46] U. Acharya, S. Oh, Y. Hagiwara, J. Tan, H. Adeli and D. Subha, "Automated EEG-based screening of depression using deep convolutional neural network," Computer Methods and Programs in Biomedicine, vol. 161, pp. 103-113, 2018.
[47] J. Newson and T. Thiagarajan, "EEG Frequency Bands in Psychiatric Disorders: A Review of
Resting State Studies," Front Hum Neurosci, vol. 12, p. 521, 2018.
[48] Z. Wan, H. Zhang, J. Huang, H. Zhou, J. Yang and N. Zhong, "Single-Channel EEG-Based
Machine Learning Method for Prescreening Major Depressive Disorder," International
Journal of Information Technology & Decision Making, vol. 18, no. 5, pp. 1579-1603, 2019.
[49] G. Giannakakis, D. Grigoriadis and M. Tsiknakis, "Detection of stress/anxiety state from EEG
features during video watching," in 37th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society, Milan, 2015.
[50] R. Nakamura and Y. Mitsukura, "Feature analysis of electroencephalography in patients with
depression," in IEEE Life Sciences Conference, Montreal, 2018.
[51] A. Ozdas, R. Shiavi, S. Silverman, M. Silverman and D. Wilkes, "Investigation of vocal jitter
and glottal flow spectrum as possible cues for depression and near-term suicidal risk," IEEE
Transactions on Biomedical Engineering, vol. 51, no. 9, pp. 1530-1540, 2004.
[52] J. Shah, B. Cahn, S. By, E. Welch, L. Sacolick, M. Yuen, M. Mazurek, C. Wira, A. Leasure,
C. Matouk, A. Ward, S. Payabvash, R. Beekman, S. Brown, G. Falcone, K. Gobeske, N.
Petersen, A. Jasne, R. Sharma, J. Schindler, L. Sansing, E. Gilmore, G. Sze and Rose, "Portable,
Bedside, Low-field Magnetic Resonance Imaging in an Intensive Care Setting for Intracranial
Hemorrhage (270)," Neurology, vol. 94, no. 15 Supplement, 2020.
[53] M. Conway and D. O'Connor, "Social Media, Big Data, and Mental Health: Current Advances and Ethical Implications," Curr Opin Psychol, vol. 9, pp. 77-82, 2017.
[54] I. Pantic, "Online Social Networking and Mental Health," Cyberpsychol Behav Soc Netw, vol.
17, no. 10, pp. 652-657, 2014.
[55] G. Park, H. Schwartz, J. Eichstaedt, M. Kern, M. Kosinski, D. Stillwell, L. Ungar and M.
Seligman, "Automatic personality assessment through social media language," J Pers Soc
Psychol, vol. 108, no. 6, pp. 934-952, 2015.
[56] E. Kross, P. Verduyn, E. Demiralp, J. Park, D. Lee, N. Lin, H. Shablack, J. Jonides and O.
Ybarra, "Facebook use predicts declines in subjective well-being in young adults," PLoS One,
vol. 8, no. 8, 2013.
[57] J. Jashinsky, S. Burton, C. Hanson, J. West, C. Giraud-Carrier, M. Barnes and T. Argyle,
"Tracking suicide risk factors through Twitter in the US," Crisis, vol. 35, no. 1, pp. 51-59,
2014.
[58] L. Li, K. Ota, Z. Zhang and Y. Liu, "Security and Privacy Protection of Social Networks in Big
Data Era," Mathematical Problems in Engineering, 2018.
[59] M. Smith, C. Szongott, B. Henne and G. von Voigt, "Big data privacy issues in public social
media," in 6th IEEE International Conference on Digital Ecosystems and Technologies,
Campione d'Italia, 2012.
[60] J. Neelaveni and M. Devasana, "Alzheimer Disease Prediction using Machine Learning Algorithms," in 6th International Conference on Advanced Computing and Communication Systems, Coimbatore, 2020.
[61] B. Yalamanchili, N. Kota, M. Abbaraju, V. Nadella and S. Alluri, "Real-time Acoustic based
Depression Detection using Machine Learning Techniques," in 2020 International Conference
on Emerging Trends in Information Technology and Engineering, Vellore, 2020.
[62] L. He, D. Jiang and H. Sahli, "Automatic Depression Analysis Using Dynamic Facial
Appearance Descriptor and Dirichlet Process Fisher Encoding," IEEE Transactions on
Multimedia, vol. 21, no. 6, 2019.
[63] X. Zhou, P. Huang, H. Liu and S. Niu, "Learning content-adaptive feature pooling for facial
depression recognition in videos," Electronics Letters, vol. 55, no. 11, pp. 648-650, 2019.
[64] S. Khatun, B. Morshed and G. Bidelman, "A Single-Channel EEG-Based Approach to Detect
Mild Cognitive Impairment via Speech-Evoked Brain Responses," IEEE Transactions on
Neural Systems and Rehabilitation Engineering, vol. 27, no. 5, pp. 1063-1070, 2019.
[65] P. Lodha, A. Talele and K. Degaonkar, "Diagnosis of Alzheimer's Disease Using Machine
Learning," in Fourth International Conference on Computing Communication Control and
Automation, Pune, 2018.
[66] A. Konig, A. Satt, A. Sorin, R. Hoory, O. Toledo-Ronen, A. Derreumaux, V. Manera, F. Verhey, P. Aalten, P. Robert and R. David, "Automatic speech analysis for the assessment of patients with predementia and Alzheimer's disease," Alzheimers Dement (Amst), vol. 1, no. 1, pp. 112-124, 2015.
[67] S. Kloppel, C. Stonnington, C. Chu, B. Draganski, R. Scahill, J. Rohrer, N. Fox, C. Jack Jr., J.
Ashburner and R. Frackowiak, "Automatic classification of MR scans in Alzheimer's disease,"
Brain, vol. 131, no. 3, pp. 681-689, 2008.
[68] T. Kishimoto, A. Takamiya, K. Liang, K. Funaki, T. Fujita, M. Kitazawa, M. Yoshimura, Y.
Tazawa, T. Horigome, Y. Eguchi, T. Kikuchi, M. Tomita, S. Bun, J. Murakami, B. Sumali, T.
Warnita, A. Kishi, M. Yotsui, H. Toyoshiba, Y. Mitsukura, S. Koichi, Y. Sakakibara and M.
Mimura, "The Project for Objective Measures Using Computational Psychiatry Technology
(PROMPT): Rationale, Design, and Methodology," Contemporary Clinical Trials
Communications, 2020.
[69] P. Wright, J. Stern and M. Phelan, Core Psychiatry 3rd Edition, Elsevier, 2012.
[70] Y. Yacoob and L. Davis, "Recognizing human facial expressions from long image sequences
using optical flow," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18,
no. 6, pp. 636 - 642, 1996.
[71] K. Anderson and P. McOwan, "A real-time automated system for the recognition of human
facial expressions," IEEE Transactions on Systems, Man, and Cybernetics, Part B
(Cybernetics), vol. 36, no. 1, pp. 96 - 105, 2006.
[72] P. Aleksic and A. Katsaggelos, "Automatic facial expression recognition using facial animation
parameters and multistream HMMs," IEEE Transactions on Information Forensics and
Security, vol. 1, no. 1, pp. 3-11, 2006.
[73] C. Ding and D. Tao, "Robust Face Recognition via Multimodal Deep Face Representation,"
IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2049 - 2058, 2015.
[74] E. Shishido, S. Ogawa, S. Miyata, M. Yamamoto, T. Inada and N. Ozaki, "Application of eye trackers for understanding mental disorders: Cases for schizophrenia and autism spectrum disorder," Neuropsychopharmacology Reports, vol. 39, no. 2, pp. 72-77, 2019.
[75] Y. Li, Y. Xu, M. Xia, T. Zhang, J. Wang, X. Liu, Y. He and J. Wang, "Eye Movement Indices in the Study of Depressive Disorder," Shanghai Arch Psychiatry, vol. 28, no. 6, pp. 326-334, 2016.
[76] A. Peckham, S. Johnson and J. Tharp, "Eye Tracking of Attention to Emotion in Bipolar I
Disorder: Links to Emotion Regulation and Anxiety Comorbidity," Int J Cogn Ther., vol. 9,
no. 4, pp. 295-312, 2016.
[77] Carnegie Mellon University, "The CMU Multi-PIE Face Database," [Online]. Available: http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html. [Accessed July 2020].
[78] H. Song, J. Kang and S. Lee, "ConcatNet: A Deep Architecture of Concatenation-Assisted
Network for Dense Facial Landmark Alignment," in 25th IEEE International Conference on
Image Processing, Athens, 2018.
[79] H. Ouanan, M. Ouanan and B. Aksasse, "Facial landmark localization: Past, present and
future," in 4th IEEE International Colloquium on Information Science and Technology,
Tangier, 2016.
[80] Y. Liu, A. Jourabloo, W. Ren and X. Liu, "Dense Face Alignment," in IEEE International
Conference on Computer Vision Workshops, Venice, 2017.
[81] C. Zhuo, G. Li, X. Lin, D. Jiang, Y. Xu, H. Tian, W. Wang and X. Song, "The rise and fall of
MRI studies in major depressive disorder," Translational Psychiatry, vol. 9, no. 335, 2019.
[82] R. Bansal, L. Staib, A. Laine, X. Hao, D. Xu, J. Liu, M. Weissman and B. Peterson,
"Anatomical Brain Images Alone Can Accurately Diagnose Chronic Neuropsychiatric
Illnesses," PLoS One, vol. 7, no. 12, 2010.
[83] A. Sankar, T. Zhang, B. Gaonkar, J. Doshi, G. Erus, S. Costafreda, L. Marangell, C. Davatzikos and C. Fu, "Diagnostic potential of structural neuroimaging for depression from a multi-ethnic community sample," BJPsych Open, vol. 2, no. 4, pp. 247-254, 2016.
[84] B. Hage, B. Britton, D. Daniels, K. Heilman, S. Porges and A. Halaris, "Low cardiac vagal tone index by heart rate variability differentiates bipolar from major depression," The World Journal of Biological Psychiatry, vol. 20, no. 5, pp. 359-367, 2019.
[85] J. Moriarty, "Recognising and evaluating disordered mental states: a guide for neurologists,"
Journal of Neurology, Neurosurgery, and Psychiatry, vol. 76, 2005.
[86] D. A. Sauter, F. Eisner, A. J. Calder and S. K. Scott, "Perceptual cues in non-verbal vocal expressions of emotion," Q J Exp Psychol (Hove), vol. 63, no. 11, pp. 2251-2272, 2010.
[87] E. Kraepelin, "Manic depressive insanity and paranoia," J Nerv Ment Dis., vol. 53, no. 4, p.
350, 1921.
[88] D. Low, K. Bentley and S. Ghosh, "Automated assessment of psychiatric disorders using speech: A systematic review," Laryngoscope Investig Otolaryngol, vol. 5, no. 1, pp. 96-116, 2020.
[89] G. Cho, J. Yim, Y. Choi, J. Ko and S. Lee, "Review of Machine Learning Algorithms for
Diagnosing Mental Illness," Psychiatry Investig., vol. 16, no. 4, pp. 262-269, 2019.
[90] M. Hearst, S. Dumais, E. Osuna, J. Platt and B. Scholkopf, "Support vector machines," IEEE Intell Syst Appl, vol. 13, pp. 18-28, 1998.
[91] J. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
[92] Y. Freund and R. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[93] T. Ho, "Random decision forests," in Proceedings of 3rd International Conference on
Document Analysis and Recognition, Montreal, 1995.
[94] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on
Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[95] R. Tibshirani, "Regression Shrinkage and Selection Via the Lasso," JOURNAL OF THE
ROYAL STATISTICAL SOCIETY, SERIES B, vol. 58, no. 1, pp. 267-288, 1996.
[96] P. Kulkarni and M. Patil, "Clinical Depression Detection in Adolescent by Face," in
International Conference on Smart City and Emerging Technology, Mumbai, 2018.
[97] A. Pampouchidou, K. Marias, M. Tsiknakis, P. Simos, F. Yang, G. Lemaitre and F.
Meriaudeau, "Video-based depression detection using local Curvelet binary patterns in
pairwise orthogonal planes," in 38th Annual International Conference of the IEEE Engineering
in Medicine and Biology Society, Orlando, 2016.
[98] A. Pampouchidou, O. Simantiraki, C. Vazakopoulou, C. Chatzaki, M. Pediaditis, A. Maridaki,
K. Marias, P. Simos, F. Yang, F. Meriaudeau and M. Tsiknakis, "Facial geometry and speech
analysis for depression detection," in 39th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society, Seogwipo, 2017.
[99] B. N and H. Rajaguru, "Classification of Dementia Using Harmony Search Optimization Technique," in IEEE Region 10 Humanitarian Technology Conference, Malabe, 2018.
[100] F. Zhu, X. Li, D. Mcgonigle, H. Tang, Z. He, C. Zhang, G. Hung, P. Chiu and W. Zhou,
"Analyze Informant-Based Questionnaire for The Early Diagnosis of Senile Dementia Using
Deep Learning," IEEE Journal of Translational Engineering in Health and Medicine, vol. 9,
2019.
[101] X. Xiong and F. De la Torre, "Supervised descent method and its applications to Face
Alignment," in IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013.
[102] S. Ren, X. Cao, Y. Wei and J. Sun, "Face alignment at 3000 fps via regressing local binary features," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014.
[103] H. Lai, S. Xiao, Y. Pan, Z. Cui, J. Feng, C. Xu, J. Yin and S. Yan, "Deep Recurrent Regression for Facial Landmark Detection," preprint, 2016.
[104] S. Zhu, C. Li, C. Loy and X. Tang, "Face alignment by coarse-to-fine shape searching," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.
[105] X. Xiong and F. De la Torre, "Global supervised descent method," in IEEE Conference on
Computer Vision and Pattern Recognition, Boston, 2015.
[106] S. Zhu, C. Li, C. Loy and X. Tang, "Unconstrained Face Alignment via Cascaded
Compositional Learning," in IEEE Conference on Computer Vision and Pattern Recognition,
Las Vegas, 2016.
[107] 一. 川本, "Target region tracking using an optical-flow-driven motion model (オプティカルフロー駆動型運動モデルによる対象領域追跡)," in Meeting on Image Recognition and Understanding (画像の認識・理解シンポジウム), Hiroshima, 2007.
[108] H. Drucker, C. Burges, L. Kaufman, A. Smola and V. Vapnik, "Support vector regression
machines," in 9th International Conference on Neural Information Processing Systems,
Denver, 1996.
[109] M. Kostinger, P. Wohlhart, P. Roth and H. Bischof, "Annotated Facial Landmarks in the Wild:
A large-scale, real-world database for facial landmark localization," in IEEE International
Conference on Computer Vision Workshops, Barcelona, 2011.
[110] G. Chrysos, E. Antonakos, S. Zafeiriou and P. Snape, "Offline Deformable Face Tracking in Arbitrary Videos," in IEEE International Conference on Computer Vision Workshop, Santiago, 2015.
[111] J. Shen, S. Zafeiriou, G. Chrysos, J. Kossaifi, G. Tzimiropoulos and M. Pantic, "The first facial
landmark tracking in-the-wild challenge," in IEEE International Conference on Computer
Vision Workshop, Santiago, 2015.
[112] G. Tzimiropoulos, "Project-Out Cascaded Regression with an application to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.
[113] K. Mueller, B. Hermann, J. Mecollari and L. Turkstra, "Connected speech and language in mild
cognitive impairment and Alzheimer’s disease: A review of picture description tasks," J. Clin.
Exp. Neuropsychol, vol. 40, pp. 917-939, 2018.
[114] J. Mundt, A. Vogel, D. Feltner and W. Lenderking, "Vocal Acoustic Biomarkers of Depression Severity and Treatment Response," Biol. Psychiatry, vol. 72, pp. 580-587, 2012.
[115] J. Darby and H. Hollien, "Vocal and Speech Patterns of Depressive Patients," Folia Phoniatr.
Logop., vol. 29, pp. 279-291, 1977.
[116] S. Gonzalez and M. Brookes, "PEFAC - A pitch estimation algorithm robust to high levels of
noise," IEEE Trans. Audio, Speech Lang. Process., vol. 22, pp. 518-530, 2014.
[117] H. Kim, N. Moreau and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and
Retrieval, Chichester: John Wiley & Sons, Ltd, 2005.
[118] M. Sahidullah and G. Saha, "Design, analysis and experimental evaluation of block based
transformation in MFCC computation for speaker recognition," Speech Communication, vol.
54, pp. 543-565, 2012.
[119] X. Valero and F. Alias, "Gammatone Cepstral Coefficients: Biologically Inspired Features for Non-Speech Audio Classification," IEEE Trans. Multimedia, vol. 14, pp. 1684-1689, 2012.
[120] G. Peeters, "A large set of audio features for sound description (similarity and classification)
in the CUIDADO project," 2004. [Online]. Available:
http://recherche.ircam.fr/anasyn/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf.
[Accessed July 2020].
[121] N. Sugan, N. Sai Srinivas, N. Kar, L. Kumar, M. Nath and A. Kanhe, "Performance
Comparison of Different Cepstral Features for Speech Emotion Recognition," in 2018
International CET Conference on Control, Communication, and Computing,
Thiruvananthapuram, 2018.
[122] A. Adiga, M. Magimai and C. Seelamantula, "Gammatone wavelet Cepstral Coefficients for
robust speech recognition.," in 2013 IEEE International Conference of IEEE Region 10, Xi'an,
2013.
[123] M. Stanners, C. Barton, S. Shakib and H. Winefield, "Depression diagnosis and treatment
amongst multimorbid patients: a thematic analysis," BMC Fam Pract, vol. 15, p. 124, 2014.
[124] M. Gavrilescu and N. Vizireanu, "Predicting Depression, Anxiety, and Stress Levels from
Videos Using the Facial Action Coding System," Sensors, vol. 19, 2019.
[125] L. Wu, J. Pu, J. Allen and P. Pauli, "Recognition of Facial Expressions in Individuals with
Elevated Levels of Depressive Symptoms: An Eye-Movement Study," Depression Research
and Treatment, 2012.
[126] D. Gerhard, E. Wohleb and R. Duman, "Emerging treatment mechanisms for depression: focus
on glutamate and synaptic plasticity," Drug Discovery Today, 2016.
[127] G. Henderson, E. Ifeachor, N. Hudson, C. Goh, N. Outram, S. Wimalaratna, C. Del Percio and
F. Vecchio, "Development and assessment of methods for detecting dementia using the human
electroencephalogram," IEEE Trans. Biomed. Eng., vol. 53, 2006.
[128] H. Song, W. Du, X. Yu, W. Dong, W. Quan, W. Dang, H. Zhang, J. Tian and T. Zhou, "Automatic depression discrimination on FNIRS by using general linear model and SVM," in 7th International Conference on Biomedical Engineering and Informatics, Dalian, 2014.