
A Thesis for the Degree of Ph.D. in Engineering

Multimodal Feature Extraction for Psychiatric Disorder Screening

November 2020

Graduate School of Science and Technology

Keio University

SUMALI, Brian


Preface

Mental health examinations or screenings are commonly performed by licensed psychiatrists at a health care facility, but advances in technology have enabled the development of clinical decision support systems. Early systems took as input the answers from the examination conducted by the psychiatrist. Recently, however, the focus has shifted to predicting a person's mental health without the need for extensive tests.

The two most common psychiatric disorders are depression and dementia. Depression is a mood disorder traditionally associated with a persistent feeling of sadness, whilst dementia is a collection of symptoms commonly caused by progressive neurological disorders. Both are popular objectives of automated psychiatric disorder screening studies. Unfortunately, most conventional studies did not consider the existence of "pseudodementia". Although the symptoms of depression and dementia are different, dementia-like symptoms might sometimes be observed in a depression patient, which is termed "pseudodementia". Distinguishing pseudodementia is hard even for expert psychiatrists, and feature analysis may help solve the classification problem. The focus of this dissertation is the similarities, differences, and characteristics of the two psychiatric disorders: depression and dementia. Feature extraction and analysis are performed on an audiovisual database of clinical psychiatric patients. Automated psychiatric disorder screening using machine learning is also proposed.

Chapter 1 introduces the background of psychiatric illness and automatic psychiatric disorder screening. A short review of conventional automatic psychiatric disorder diagnosis algorithms is given, along with the limitations and significance of this study.

Chapter 2 reviews automated psychiatric disorder screening. The background of automatic screening and telemedicine, together with the common features and algorithms utilized, is described in this chapter.

In chapter 3, the facial features of both depression patients and dementia patients are analyzed. Conventional facial landmarks were extracted and analyzed. Visualizations of the features corresponding to depression and dementia, and of the features most important for distinguishing the two, are described in this chapter.

Chapter 4 proposes an improvement of facial landmark extraction for real-time tracking. Conventional facial landmark extraction techniques are notoriously inaccurate for non-forward-facing poses, and it is impractical to instruct a psychiatric patient to face the camera at all times. The proposed facial landmark extraction algorithm is based on Cascaded Compositional Learning and is robust even to arbitrary facial poses.

Chapters 5 and 6 are similar to chapter 3. Chapter 5 focuses on the speech features of the psychiatric patients, while chapter 6 utilizes both facial features and speech features. Additionally, the similarities between facial landmarks and speech features are examined in chapter 6.

Finally, this dissertation is summarized and concluded in chapter 7.


Table of Contents

Acknowledgements
1. Introduction
1.1. Background
1.1.1. Depression
1.1.2. Dementia
1.1.3. Similarities between depression and dementia
1.2. Psychiatric disorder screening
1.2.1. Automated psychiatric screening
1.2.2. Review of conventional automated psychiatric screenings
1.3. Main contributions of this research
1.4. Thesis outline
2. Review of automated psychiatric disorders screening
2.1. Introduction
2.2. Sensory features for automatic psychiatric disorder screening
2.2.1. Facial Features
2.2.2. Biosignals
2.2.3. Auditory features
2.3. Machine learning algorithms for automatic psychiatric disorder screening
2.3.1. Support vector machine
2.3.2. Gradient Boosting Machine
2.3.3. Random Forest
2.3.4. Naive Bayes
2.3.5. K-Nearest Neighborhood
2.4. Summary
3. Facial landmark analysis from static images
3.1. Introduction
3.2. Data Acquisition
3.3. Analysis
3.3.1. Preprocessing
3.3.2. Feature extraction
3.3.3. Statistical analysis
3.3.4. Feature selection and machine learning
3.4. Results
3.5. Discussion
3.6. Summary
4. Robust facial landmark tracking algorithm
4.1. Introduction
4.2. Conventional facial tracking analysis
4.3. Proposed method
4.3.1. Supervised descent method
4.3.2. Compositional vector estimation
4.3.3. Training composite vectors
4.4. Experiment
4.5. Results and discussion
4.6. Conclusion
5. Speech feature analysis for classification of depression and dementia
5.1. Introduction
5.2. Data Acquisition
5.3. Analysis
5.3.1. Audio signal analysis
5.3.2. Statistical analysis
5.3.3. Machine Learning
5.3.4. Evaluation Metrics
5.4. Results
5.4.1. Statistical analysis
5.4.2. Machine learning
5.5. Discussion
5.6. Conclusions
6. Multimodal feature analysis in depression patients and dementia patients
6.1. Introduction
6.2. Data acquisition
6.3. Analysis
6.3.1. Facial feature analysis
6.3.2. Audio feature analysis
6.3.3. Statistical analysis
6.3.4. Machine learning
6.4. Results and Discussion
6.4.1. Demographics
6.4.2. Statistical analysis
6.4.3. Machine learning
6.5. Conclusion
7. Conclusion
7.1. Summary of this thesis
7.2. Conclusion
7.3. Suggestions for future research
References

List of Figures

Figure 2.1, Figure 2.2, Figure 2.3, Figure 2.4, Figure 3.1, Figure 4.1, Figure 4.2, Figure 5.1, Figure 5.2

List of Tables

Table 1.1, Table 3.1, Table 3.2, Table 4.1, Table 5.1, Table 5.2, Table 5.3, Table 5.4, Table 5.5, Table 5.6, Table 5.7, Table 5.8, Table 6.1, Table 6.2


Acknowledgements

This dissertation is the culmination of my studies as a Ph.D. student at the Mitsukura laboratory, Keio University. First and foremost, I wish to express my deepest appreciation to my supervisor, Prof. Dr. Yasue Mitsukura, for all her guidance and support from the start of my Ph.D. program up until its completion. Thank you very much for your inspiration, encouragement, patience, support, confidence, and trust. I cannot thank you enough for all the learning opportunities you have given to me.

I would also like to express my thanks to Prof. Emer. Dr. Nozomu Hamada for the help, advice, and comments during my research and the writing of this dissertation. You are one of the most inspirational figures I have ever met, and I aim to improve myself to be a top-notch researcher like you.

I extend my sincere appreciation to the colleagues and alumni of the Mitsukura laboratory, especially Mr. Takahiro Asano, Mr. Motonobu Fujioka, Mr. Toshiya Nakaigawa, Mr. Hideto Watanabe, and Dr. Suguru Kanoga, for their ideas, encouragement, and advice. Unfortunately, it is not possible to list all of them in this limited space.

I am also grateful to Dr. Taishiro Kishimoto and the members of the Kishimoto laboratory, especially Dr. Kuo-ching Liang, Dr. Michitaka Yoshimura, and Dr. Momoko Kitazawa, for their support during the collaborative research period. Your advice and support have been invaluable.

I would also like to thank Prof. Dr. Yoshimitsu Aoki, Prof. Dr. Toshiyuki Tanaka, and Prof. Dr. Toshiyuki Murakami. Thank you very much for agreeing to be part of my dissertation committee, and thank you very much for giving me your time to discuss the revisions of the dissertation. Without your support and advice, this dissertation would not have been the same as presented here.

I am also indebted to the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan, for funding my Ph.D. study. Without the funding, I would not have been able to pursue my Ph.D. study in Japan.

Last but certainly not least, I would like to thank my family and friends for their endless support. I am sorry that I could not visit Indonesia frequently enough during my study here, and thank you very much for the encouragement and moral support.

October 2020

Brian Sumali


Chapter 1

Introduction

1.1. Background

According to the WHO, the five most common psychiatric disorders affecting humans worldwide are depression, dementia, bipolar disorder, psychosis including schizophrenia, and developmental disorders including autism [1]. Of those five, depression ranks first in number of patients (an estimated 264 million), followed by dementia (an estimated 50 million). In Japan, dementia is a serious problem brought on by the population ageing affecting the country. The Ministry of Health, Labour and Welfare (MHLW) estimates at least 6 million cases of dementia in Japan (15% of the elderly population) by the year 2020 [2]. On the other hand, depression is considered a common cause of suicide, which is also a plight in Japan [3] [4]. Correctly diagnosing these psychiatric disorders is especially important for Japanese society.

The book detailing the guidelines for the classification of psychiatric disorders is the "Diagnostic and Statistical Manual of Mental Disorders (DSM)". It is published by the American Psychiatric Association (APA) and is used by clinicians, researchers, psychiatric drug regulation agencies, health insurance companies, pharmaceutical companies, the legal system, and policy makers. It was first published in 1952 and has since been revised several times, with new mental disorders added and entries no longer considered mental disorders removed. Its latest edition, DSM-5, was published in 2013 [5].

In a clinical setting, "mental health screening tools" are commonly utilized by psychiatrists as guidelines for diagnosing a particular mental health issue. These tools are based on the DSM-5, are specialized for diagnosing a specific psychiatric illness, and consist of guidelines for interviews and tests. For example, to diagnose dementia a psychiatrist commonly uses a combination of the mini-mental state examination (MMSE) [6], the clock-drawing test (CDT) [7], the clinical dementia rating (CDR) [8], and the logical memory test (LM) from the Wechsler Memory Scale [9]. Mental health examinations are similar in timing to physical examinations, in that they are not performed every day but at a specific interval, for example every two or three months.

1.1.1. Depression

Clinical depression, or major depressive disorder (MDD), is a mental disorder characterized by two weeks or more of prolonged feelings of sadness, low spirits, and helplessness, accompanied by changes in interest or sleep schedule [5]. Depression is also the major cause of suicide, affecting around 50% of suicide victims. Additionally, Japan's suicide rate was ranked the sixth highest worldwide and the second highest among eight industrialized nations in 2017, making depression a serious predicament for the nation.

The cause of depression is not conclusively known. It is believed to be a combination of genetic, environmental, and psychological factors [10] [11] [12]. Risk factors include a family history of the condition, major life changes, certain medications, chronic health problems, and substance abuse. About 40% of the risk appears to be related to genetics [13] [14].

The diagnosis of MDD is mainly based on the person's mental health screening. There is no laboratory test for diagnosing MDD; interviews and the patient's history are the primary grounds for diagnosis. The Hamilton depression rating scale (HAMD / HDRS) [15] is one of the commonly used mental health screening guidelines for depression. However, mental and physical tests may be done to rule out conditions that may cause similar symptoms. The conventional treatment options for MDD are counseling and antidepressants [16] [17].

1.1.2. Dementia

Dementia is not a disease but a collection of symptoms. Dementia usually refers to loss of memory, language, problem-solving, and other thinking abilities [5]. The cognitive deficit of dementia affects daily life; for example, a patient may forget their home address or, in one of the worst cases, how to turn on the stove.

Dementia does not affect the patient alone; it also heavily affects the community, both socially and economically. It can be emotionally overwhelming for the families of the patients and for their caregivers. The care for dementia is also costly; in Japan in 2014, an estimated 14.5 trillion JPY was spent on dementia care, and the cost per person with dementia was approximately 5.95 million JPY [18].

Most of the underlying causes of dementia are incurable. Alzheimer's disease makes up more than 50% of dementia cases. Other common causes include vascular dementia (25%), dementia with Lewy bodies (15%), and frontotemporal dementia [19]. A person with dementia may have more than one underlying cause.

Diagnosis of dementia is usually based on the history of the illness and cognitive testing, together with medical imaging or genetic testing. The MMSE is one commonly used cognitive test. There is no known cure for dementia. Most treatments are only symptomatic and have limited effectiveness. The main consensus among experts is prevention of dementia by reducing its risk factors.

1.1.3. Similarities between depression and dementia

The symptoms of depression and dementia can be very similar or even overlap. But despite their similarities, they are two distinct illnesses. The term pseudodementia typically refers to dementia-like symptoms caused by curable mental disorders [20] [21] [22] [23]. The most common cause of pseudodementia is depression. The shared symptoms of depression and dementia include impaired cognitive ability, reduced concentration, and feelings of apathy. Accordingly, both diseases seem to affect the patient's memory and cognition. Results from extensive tests have shown that pseudodementia patients perform better on memory tests than true dementia patients. Additionally, because depression is treatable, it is very important to distinguish depression from dementia. Currently, diagnosing pseudodementia is difficult even for expert psychiatrists, and extensive testing for both depression and dementia is needed to clinically diagnose pseudodementia [24].

1.2. Psychiatric disorder screening

Each psychiatric disorder screening differs based on the patient and their symptoms. A conventional screening conducted by a psychiatrist at a health care service typically includes an interview followed by tests. The verbal tests conducted by the psychiatrist follow certain guidelines: "mental health screening tools". Although the DSM-5 contains numerous potential mental disorders, administering detailed assessments for all of them is simply impossible.

Popular mental health screening tools for depression include the Hamilton depression rating scale (HAMD / HDRS) [15], the Montgomery–Asberg depression rating scale (MADRS) [25], Beck's depression inventory (BDI) [26], and Young's mania rating scale (YMRS) [27]. Sometimes, the Pittsburgh sleep quality index (PSQI) [28] might be included to help score the patient's sleep health. As an example, a psychiatrist using HAMD to diagnose a depression patient tries to score the patient's depressed mood, feelings of guilt, suicide ideation, etc. indirectly via a structured interview, like a conversation. After the scoring, the psychiatrist tells the diagnosis result to the patient along with the treatment options.

On the other hand, dementia screening tools are vastly different from the depression ones. The tools include the mini-mental state examination (MMSE) [6], clinical dementia rating (CDR) [8], neuropsychiatric inventory questionnaire (NPI-Q) [29], clock-drawing test (CDT) [7], logical memory test (LM) [9], and the Boston cookie theft task [30]. Despite the seemingly straightforward nature of the tests, diagnosing someone with mild dementia is often a challenge. Additionally, natural intelligence and knowledge differ from person to person. In many cases, conclusive dementia screening needs additional tests such as physical examination, brain imaging, and laboratory tests.

1.2.1. Automated psychiatric screening

Assessment and outcome monitoring are critical for the effective detection and treatment of mental illness. Traditional methods of capturing social, functional, and behavioral data are limited to the information that patients report back to their health care provider at selected points in time. As a result, these data are not accurate accounts of day-to-day functioning, as they are often influenced by biases in self-report. Telemedicine or telehealth, the practice of utilizing electronic information and technologies to support health care with the objective of removing the effect of physical distance between the patient and health care providers, has been proposed [31]. Recent developments in mobile technology, such as mobile applications on smartphones, have the potential to overcome the problems with traditional assessment and to improve telehealth quality by providing information about patient symptoms, behavior, and functioning in real time [32] [33] [34]. Although the use of sensors and apps is widespread, most of the tools are not clinically validated, and the reliability of the apps and sensors remains unverified.

1.2.2. Review of conventional automated psychiatric screenings

Conventional psychiatric screening relies solely on patient reporting and psychiatrist observations. When conducting the assessment or interpreting the findings, it is important to consider the cultural background of the patient, as behavioral patterns vary between cultures. This has resulted in criticism of the subjectivity of conventional mental health assessment [35] [36] and in proposals of machine learning and artificial intelligence (AI) as objective mental health screening tools. In a recent survey [37], many psychiatrists stated the belief that AI and machine learning will significantly transform the way they work. Psychiatrists also predicted that AI and machine learning could help ensure more accurate diagnosis, reduce administrative burden, provide ceaseless monitoring, personalize drug targets to reduce adverse effects, and integrate new streams of data from various data sources.

In general, the most popular features utilized in automatic mental health assessment are as follows [38]:

1. Self-report or questionnaire

The mental health questionnaire and self-reporting survey are the oldest methods of technology-based data collection. These questionnaires consist largely of standardized mood and disability questions. With the advancement of technology, mobile and web platforms are commonly utilized as the media. The questions on the platform mimic some of the clinical screening tools. For example, a depression-based self-report platform may display questions from the 9- and 2-question Patient Health Questionnaires (PHQ-9 and PHQ-2) [39].

2. Task performance result

Computer-assisted psychiatric diagnosis has been proposed since the 1980s [40]. The most common computer-assisted diagnosis software in recent years [41] [42] [43] takes the form of game-like tasks. These tasks commonly measure a subject's attention, concentration, and working memory. The subject's reaction time, number of errors and retries, and other performance measures are collected and processed to perform a diagnosis. Some of the diagnosis algorithms are adaptive; that is, they do not have a fixed threshold but rather adapt the diagnosis threshold to the current user of the software. The adaptive algorithms also consider practice or experience effects, which may boost the performance score of a subject.

3. Data from sensors

Wearable sensors such as those embedded in smartphones and smartwatches can measure behavioral features such as physical activity and location. Wearable sensors can also detect physiological data such as heart rate, galvanic skin response, respiration rate, etc. These behavioral and physiological data are then utilized to diagnose a person's mental state [44] [45].

Electroencephalogram (EEG) sensors are also an option for automatic screening [46] [47], and the development of mobile EEG sensors also enables remote screening [48] [49] [50]. Other noninvasive sensors such as video cameras and audio recording devices may detect a patient's emotions, body language, cognitive burden, and mood [45] [51]. Portable magnetic resonance imaging (MRI) scanners were announced in early 2020 [52], opening the possibility of utilizing MRI for remote screening.

4. Data from social media

The use of social media for medical research is rising in popularity [53] [54]. Contents extracted from social media have been utilized for personality assessment [55], screening for depression [56], and suicide risk factor research [57]. Social data collected from social media websites and smartphones may include a combination of incoming and outgoing call and text frequency, length of texts and calls, and number of people contacted, as well as the content of public messages sent via social media. Although these data could serve as indicators of psychopathology, the use of social media data is controversial, and privacy issues are inseparable from research utilizing social data [58] [59].


To summarize, telehealth research is gaining popularity, especially for psychiatric telehealth services. Remote physical examinations exist but face more challenges than mental health assessment via virtual consultation. For example, stethoscopes for telemedicine are available but require an additional purchase. Meanwhile, since basic psychiatric examinations consist of interviews and communication- or paper-based tests, they do not require additional tools.

In general, the advantages of collecting mood and behavioral data from smartphones, wearable sensors, and noninvasive sensors are many, and several studies have developed their own algorithms or machine learning models to accurately predict one's mental state. Table 1.1 gives short reviews of recent and influential studies.

From the table, it can be concluded that machine learning studies focusing on early detection and algorithms utilizing non-invasive inputs are rising in popularity. Additionally, the databases of dementia studies are mostly not public. The AVEC datasets utilized in depression studies are audiovisual databases of participants performing human-computer interaction tasks in quiet settings. The labeling of AVEC subjects was performed by computing the BDI score, a self-reporting questionnaire for depression.

Table 1.1 Conventional studies for automated psychiatric screening

| Authors | Year | Objectives | Features | Methodology | Accuracy | Dataset |
|---|---|---|---|---|---|---|
| Neevaleni and Devasana [60] | 2020 | Alzheimer's disease | Parameters (task result) | SVM, Decision tree | 85% | Clinical, not public |
| Yalamanchili et al. [61] | 2020 | Depression | Realtime speech (sensors) | SMOTE + SVM (best) | 93% | DAIC-WOZ (AVEC2016) |
| He et al. [62] | 2019 | Depression severity | Face landmarks (sensors) | Dirichlet process Fisher vector + BoW | RMSE = 9.20, MAE = 7.55 | AVEC2013, AVEC2014 |
| Zhou et al. [63] | 2019 | Depression severity | Raw face (sensors) | Deep learning (ResNet + hidden layers) | RMSE = 6.37, MAE = 8.43 | AVEC2014 |
| Khatun et al. [64] | 2019 | MCI | EEG (FPz) from audio ERPs (sensors) | SVM, RBF kernel | 87.9% | Not public |
| Lodha et al. [65] | 2018 | Alzheimer's disease | MRI (sensors) | Neural networks | 98.36% | ADNI project |
| Konig et al. [66] | 2015 | Alzheimer's disease | Speech (sensors) | SVM | 87% (healthy vs AD) | Clinical, not public |
| Kloppel et al. [67] | 2008 | Alzheimer's disease | MRI (sensors) | SVM | 95% (healthy vs AD) | Clinical, not public |


1.3. Main contributions of this research

One big problem when diagnosing dementia and depression is the existence of "pseudodementia" – dementia-like symptoms caused by depression [20]. This is a serious problem, as depression is a curable mood disorder while dementia typically signifies an underlying progressive neurological disorder, most of which are not curable. Additionally, for elderly patients, late-life depression is a risk factor, symptom, or prodrome for dementia, or in the worst case a comorbid illness. Despite the progress in automated psychiatric screening, minimal attention has been given to pseudodementia research. As shown in Table 1.1, researchers have focused on screening one disease only, disregarding comorbidities and shared symptoms. Additionally, the depression screening research utilizing a patient's audiovisual features seems to be non-clinical; the labels for the datasets come from self-report tools without the supervision of a clinician. Noninvasive dementia diagnosis studies are still rare; most research seems to focus on biosignals and brain imaging.

This dissertation focuses on analyzing the differences between depression and dementia, then reporting the differing features. The limitations of this study are as follows:

• The database was obtained from The Project for Objective Measures Using Computational Psychiatry Technology (PROMPT) by Dr. Kishimoto [68].
• The subjects are actual patients affected by depression, dementia, or comorbid dementia and depression.
• The distance between the recording apparatus and the subject was 70 cm.
• The recording apparatuses utilized in this dissertation are a microphone for speech recording and a camera recorder for observing the patient's face and movement.
• The microphone utilized for recording the patient's speech was a Classis RM30W (Beyerdynamic GmbH & Co. KG) with a 16 kHz sampling rate.
• The video recording devices were a RealSense R200 (Intel Corporation) and a Microsoft Kinect for Windows v2 (Microsoft Corporation). Some patients were recorded with the Kinect and others with the RealSense.
• The labelling of each dataset and its test scoring was performed by licensed clinical psychiatrists.

The main contributions of this dissertation are as follows:

1. Analysis of acoustic, facial, and fused acoustic-facial features from depression patients and dementia patients.
2. Analysis of the similarities and differences between acoustic features and facial features for each patient group.
3. Exploration of the classification of depression, dementia, and dementia with depression using simple machine learning models.
4. Proposal of a pose-robust facial landmark tracking algorithm, which is beneficial for both automatic screening and telehealth in general.

1.4. Thesis outline

Chapter 2 reviews automated psychiatric disorder screening in detail. The background of automatic screening and telemedicine, together with the common features and algorithms utilized, is described in this chapter.

In chapter 3, the facial features of both depression patients and dementia patients are analyzed. Conventional facial landmarks were extracted and analyzed. Visualizations of the features corresponding to depression and dementia, and of the features most important for distinguishing the two, are described in this chapter.

Chapter 4 proposes an improvement of facial landmark extraction for real-time tracking. Conventional facial landmark extraction techniques are notoriously inaccurate for non-forward-facing poses, and it is impractical to instruct a psychiatric patient to face the camera at all times. The proposed facial landmark extraction algorithm is based on Cascaded Compositional Learning and is robust even to arbitrary facial poses.

Chapters 5 and 6 are similar to chapter 3. Chapter 5 focuses on the speech features of the psychiatric patients, while chapter 6 utilizes both facial features and speech features. Additionally, the similarities between facial landmarks and speech features are examined in chapter 6.


Finally, this dissertation is summarized and concluded in chapter 7.


Chapter 2

Review of automated psychiatric disorders screening

2.1. Introduction

In a psychiatric screening, a licensed psychiatrist examines a patient for possible psychiatric disorders. Similar to a physical examination, a psychiatric screening session consists of interviews and tests with the purpose of diagnosing the mental health of the patient. It is a structured way of observing and describing the psychological functions of a patient. Psychological aspects conventionally monitored during a psychiatric screening include the patient's attitude, behavior, mood, emotion, thought process, perception, cognition, and judgement, which are inferred from the patient's facial expression, body language, and speech [69].

Conventionally, patients need to go to a hospital or a health care service to be diagnosed and treated. However, with the advances in information and communication technology, some health care services are now available remotely. Telehealth is the use of digital information and communication technologies, such as computers and mobile devices, to access health care services remotely and manage a person's health. It also enables health care services for patients with limitations in transportation or mobility, for example, patients in rural areas or in areas with travel limitations or bans. With the coronavirus pandemic in 2020, more attention is being paid to telehealth research, for both psychiatric and physical examination.

The advances in telehealth also pave the way for automatic telehealth screening, for both physical and mental disorders. For example, automated testing or self-testing could be used to identify, and perhaps assess, individuals with hearing loss: self-testing with an intelligent automated system could offer accurate results and measure background noise to ensure validity. Such systems are now being developed by researchers worldwide.


2.2. Sensory features for automatic psychiatric disorder screening

As described in chapter 1, self-reports or questionnaires, task performance results, data from sensors, and data from social media are the most common features utilized for automatic mental health screening. Sensory features commonly employed for automated mental health screening include:

a. Facial features (gaze, blink, emotion detection, etc.)

b. Biosignals (electroencephalogram, heart rate, respiration, etc.)

c. Auditory features (intensity, tone, speed of speech, etc.).

2.2.1. Facial Features

Patients with a number of psychiatric conditions may display abnormal facial expressions. Facial expression or emotion detection has always been an easy task for humans, but achieving the same task with a computer algorithm is quite challenging and has long been an objective of the computer vision field. With recent advancements in computer vision and machine learning, it is possible to detect emotions from images. The detection and processing of facial expressions are achieved through various methods such as optical flow [70] [71], hidden Markov models [72], or artificial neural networks [73].

Aside from facial expression, gaze, blink, and eye movements are said to be important clues to mental health clinicians [74] [75]. One such method involves using an eye-tracking device to monitor the duration of a patient's gaze when presented with emotionally evocative stimuli, such as photos of happy or sad faces [76]. It is said that people with depression tend to have an attentional bias for negative information, which may be one factor that increases vulnerability to depressive episodes.

It must be noted that there is no uniform facial landmark map. The most popular facial landmark map is perhaps the one from the "Multi-PIE database" [77], a database of human faces with 68 pre-annotated facial landmarks. An example of this scheme is given in Figure 2.1. Nevertheless, the effectiveness of facial landmarks often depends on the objective, and several studies have reported using their own facial landmark mapping schemes [78] [79], even 3D-based facial landmarks [80].
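As an illustration, 68-point landmarks in the Multi-PIE ordering can be extracted with off-the-shelf tools. The following is a minimal sketch using the dlib library, assuming its separately distributed pre-trained 68-point model has been downloaded; the file paths are hypothetical.

```python
import dlib
import cv2

# Hypothetical paths; the .dat model file is distributed separately by dlib.
MODEL_PATH = "shape_predictor_68_face_landmarks.dat"
IMAGE_PATH = "frame.png"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(MODEL_PATH)

image = cv2.imread(IMAGE_PATH)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for face in detector(gray):
    shape = predictor(gray, face)
    # 68 (x, y) coordinates following the Multi-PIE annotation order
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(landmarks[:5])
```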


Figure 2.1 Multi-PIE 68-point facial landmark scheme

2.2.2. Biosignals

An electroencephalogram (EEG) is a method for recording the electrical activity of the brain. It is noninvasive – the electrodes are placed on the scalp. The invasive method for recording electrical brain activity is called electrocorticography (ECoG). EEG has been utilized by mental health professionals for diagnosis [47]. During an EEG recording session, small electrodes are placed on the scalp and attached to a computer. The computer measures the electrical impulses that brain cells trade with one another, revealing which portions of the brain are at work. As the test moves forward, the provider of the test might ask people to either relax or perform certain types of activities, like problem solving or storytelling, and then measure how the electrical activity changes due to those behaviors. During the "relax" session, subjects are typically instructed to keep their eyes closed to reduce eyeblink artifacts.

Another biomarker is brain imaging from MRI. An MRI scanner uses magnetic fields and radio waves to develop a three-dimensional view of a body part. Obtaining this kind of information is relatively easy, and an MRI scan is typically completed in one hour. However, MRI scans are typically costly, and the results might not be meaningful [81]. Nevertheless, some studies report that MRI scans alone might be able to accurately predict psychiatric disorders [82] [83].


Electrocardiograms (ECGs) and heart rate variability (HRV) have been shown to be beneficial for diagnosing bipolar disorder. In one study, the researchers computed what is known to cardiologists as respiratory sinus arrhythmia (RSA). At baseline (the beginning of the study), the subjects with major depression had significantly higher RSA than those with bipolar disorder [84].

2.2.3. Auditory features

Similar to facial expression, emotional speech also seems to be beneficial for diagnosing mental health [85]. Various changes in the autonomic nervous system can indirectly alter a person's speech and are useful for recognizing emotion. For example, speech produced in a state of excitement (fear, anger, or joy) becomes fast, loud, and precisely enunciated, with a higher and wider range in pitch, whereas low-mood emotions such as tiredness, boredom, or sadness tend to generate slow, low-pitched, and slurred speech [86].

Speech patterns have long been known to provide indicators of mental disorders. One study in 1921 stated that depressed patients' voices tended to have lower pitch, more monotonous speech, lower sound intensity, and lower speech rate, as well as more hesitations, stuttering, and whispering [87]. The advantages of using speech features compared to other features are that the symptoms are often hard to disguise and that generalization across languages may be possible, considering the similar human vocal anatomy [88]. Nevertheless, cultural effects on human behavior should be considered during the interpretation of speech analysis.
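As a concrete illustration, pitch and intensity features like those described above can be extracted with standard audio libraries. The following is a minimal sketch using librosa; the file name is hypothetical and the feature set is only a small illustrative subset.

```python
import librosa
import numpy as np

# Hypothetical file name; 16 kHz is a typical sampling rate for speech.
y, sr = librosa.load("speech_sample.wav", sr=16000)

# Fundamental frequency (pitch) contour via the probabilistic YIN algorithm
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

rms = librosa.feature.rms(y=y)[0]  # frame-wise intensity (RMS energy)

print(f"median pitch: {np.nanmedian(f0):.1f} Hz")       # f0 is NaN in unvoiced frames
print(f"mean intensity: {rms.mean():.4f}")
print(f"voiced-frame ratio: {np.mean(voiced_flag):.2f}")  # crude speech-activity proxy
```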

2.3. Machine learning algorithms for automatic psychiatric disorder screening

Machine learning algorithms commonly employed for automatic psychiatric disorder screening include [89]:

a. Support Vector Machines (SVM),

b. Gradient Boosting Machine (GBM),

c. Random Forest,

d. Naive Bayes, and


e. K-Nearest Neighborhood (KNN)

2.3.1. Support vector machine

A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group (binary) classification problems [90]. To understand what an SVM is, the following keywords must be understood: hyperplane, support vector, and margin.

Hyperplane: The objective of an SVM is to find a hyperplane that best divides a dataset into two classes. A hyperplane in 2D data is a line. This line is the decision boundary: any data point that falls on one side of it is classified as class zero, and anything that falls on the other side as class one, where "class zero" and "class one" are the two possible class labels in this example.

Support vector: Support vectors are the data points nearest to the hyperplane, the points of the data set that, if removed, would alter the position of the dividing hyperplane.

Margin: The distance between the hyperplane and the nearest data point from either set is known as the margin. Since the data points nearest to the hyperplane are the support vectors, the margin is the distance between the support vectors and the hyperplane. The optimal hyperplane of an SVM is the one that produces the smallest classification error while producing the greatest possible margin.

After being given sets of labeled training data for each category, an SVM model is able to categorize new data. The main idea is that, based on the labeled training data, the algorithm tries to find the optimal hyperplane which can be used to classify new data points. In two dimensions the hyperplane is a simple line; Figure 2.2 illustrates the determination of support vectors for a two-class problem with two-dimensional data points.


Figure 2.2 An example of SVM for binary classification. Red dots represent class "0" and cyan dots represent class "1". The dashed green line represents the hyperplane.

A support vector machine takes the red and cyan data points and outputs the hyperplane that best separates the two classes. As shown in Figure 2.2, in cases where the data are not completely linearly separable, the hyperplane is chosen to minimize misclassification based on the given training data and labels.

However, since the SVM is inherently a linear separator, it is unable to separate nonlinearly separable data such as that shown in Figure 2.3a. In this case, a "kernel trick" is required. Adding a third dimension, for example z = x² + y², causes the data to be projected into a new space, and a slice of that space is now linearly separable (see Figure 2.3b). Common kernels utilized with SVMs include the polynomial kernel (2nd and 3rd order), radial basis function (RBF), sigmoid, and Gaussian kernels.


Figure 2.3 The kernel trick. (a) A nonlinearly separable dataset; (b) the same dataset after adding a third dimension via the kernel z = x² + y², now separable by a linear plane.
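The effect of the kernel trick can be reproduced on synthetic data. Below is a minimal sketch (not the thesis experiment) comparing a linear SVM with an RBF-kernel SVM on radially separated points like those in Figure 2.3a, using scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC

# Two concentric classes: inner disc (class 0) and outer ring (class 1)
rng = np.random.default_rng(0)
radius = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
angle = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.concatenate([np.zeros(100), np.ones(100)])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicit nonlinear mapping, cf. z = x^2 + y^2

print("linear kernel accuracy:", linear.score(X, y))  # near chance (~0.5)
print("RBF kernel accuracy:", rbf.score(X, y))        # near 1.0
```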

Computing an SVM classifier is equivalent to solving:

$$f(\vec{w}, b) = \left[ \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\ 1 - y_i(\vec{w} \cdot \vec{x}_i - b)\right) \right] + \lambda \lVert \vec{w} \rVert^2 \qquad (2.1)$$

Here, $\vec{w}$ is the normal vector to the hyperplane and $b$ is the bias. $y_i$ is either -1 or 1, indicating the class to which $\vec{x}_i$ belongs. The term $\lambda$ is the tradeoff parameter. Since $f(\vec{w}, b)$ is a convex function of $\vec{w}$ and $b$, optimization algorithms such as gradient descent can be used.
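As a minimal sketch of how eq. (2.1) can be minimized, the following implements subgradient descent on the regularized hinge loss; the hyperparameter values are illustrative only.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize eq. (2.1) by subgradient descent.
    X: (n, d) feature matrix; y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        # The hinge term has a nonzero subgradient only where the margin < 1
        active = y * (X @ w - b) < 1
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, X):
    # Class is the side of the hyperplane the point falls on
    return np.sign(X @ w - b)
```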

2.3.2. Gradient Boosting Machine

Gradient boosting is an ensemble machine learning technique [91]. It produces a strong model from an ensemble of weak prediction models, typically decision trees. A decision tree consists of nodes and edges; terminal nodes that predict the outcome are called "leaf nodes". An illustration of a decision tree is shown in Figure 2.4. Unlike standard decision trees, the decision trees utilized in boosting algorithms are typically "stumps" – one-level decision trees. A stump makes a prediction based on the value of just a single input feature.


Figure 2.4 An illustration of a decision tree. Conditions are checked iteratively until a decision in a leaf node is reached. In this figure, the objective is classification, and the leaf nodes mark the class to which the input belongs. A stump is a decision tree of height one: it consists only of the root node and leaf nodes.

Boosting is a method of converting weak learners into strong learners. In boosting, each new tree is fit on a modified version of the original data set; the strong model is built by sequentially adding the weak models. Gradient boosting machine (GBM), or gradient tree boosting, is a generalization of the adaptive boosting algorithm (AdaBoost) [92]. AdaBoost works by weighting the observations, putting more weight on instances that are difficult to classify and less on those already handled well. New weak learners are added sequentially; the weights from previous training rounds cause the new learners to focus more on the difficult instances. Predictions are made by majority voting over the weak learners' predictions, weighted by their individual accuracy.

Gradient boosting re-defines boosting as a numerical optimization problem whose objective is to minimize the loss function of the model by adding weak learners using a gradient-descent-like procedure. As gradient boosting is based on minimizing a loss function, different types of loss functions can be used, resulting in a flexible technique that can be applied to regression, multi-class classification, etc.


The first learner is trained to predict the observations in the training dataset, and the error is calculated. In AdaBoost, the data are assigned weights based on the misclassifications when training the second learner. In GBM, the error is defined by a loss function, and the second learner is fit on the residual error produced by the first learner; this process continues, with each new learner predicting the remaining loss, until a threshold is reached.
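A minimal sketch of this residual-fitting loop for the squared loss is given below, using decision stumps from scikit-learn; it illustrates the principle rather than providing a full GBM implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=100, lr=0.1):
    """Gradient boosting with squared loss: each stump is fit on the
    residuals (the negative gradient) left by the current ensemble."""
    base = float(np.mean(y))           # initial constant prediction
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred           # negative gradient of 0.5 * (y - pred)^2
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        pred += lr * stump.predict(X)  # shrunken additive update
        stumps.append(stump)
    return base, stumps

def gbm_predict(base, stumps, X, lr=0.1):
    return base + lr * sum(s.predict(X) for s in stumps)
```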

2.3.3. Random Forest

Random forest is also an ensemble learning algorithm built on decision trees [93]. A random forest divides the training data into random subsets of features and random subsets of data points, and decision trees are trained on those subsets. This is called "bagging" (bootstrap aggregating). The random sampling increases diversity among the decision trees and leads to more robust overall predictions. The random sampling and the use of numerous decision trees together give the algorithm its name. During output prediction, a majority vote is taken over the trained decision trees; for regression, the average or median of the predictions may be used instead.

Disadvantages of the random forest algorithm include the computational cost of very deep trees. As a random forest trains numerous decision trees in parallel, deep decision trees may strain memory and processing power. The computational cost is said to grow more with the depth of the decision trees than with their number. Random forest is also affected by an inherent weakness of bagging algorithms: sensitivity to imbalanced datasets.
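In practice, random forests are rarely implemented from scratch; a minimal usage sketch with scikit-learn follows, where the feature matrix and labels are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for any feature matrix and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_depth limits per-tree cost; n_estimators sets the number of bagged trees
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```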

2.3.4. Naive Bayes

Naïve Bayes is a probabilistic model that finds the value achieving the maximum probability computed from a chain of conditional probabilities, for example, the probability that an object is an "apple" given that it is red, round, and around 8 cm in diameter. Naïve Bayes is based on Bayes' theorem:

$$P(c|x) = \frac{P(x|c)\,P(c)}{P(x)} \qquad (2.2)$$

where $P(c|x)$ is the probability of condition $c$ given predictor (feature) $x$. For an independent feature vector $X = \{x_1, x_2, \ldots, x_N\}$, the above equation can be written as

$$P(c|X) = \frac{P(x_1|c)\,P(x_2|c)\cdots P(x_N|c)\,P(c)}{P(x_1)\,P(x_2)\cdots P(x_N)} \qquad (2.3)$$

During classification, $P(c|X)$ is computed for each class $c$, and the Naïve Bayes algorithm classifies the inputted features as the class with the highest probability. Since the denominator $P(x_1)P(x_2)\cdots P(x_N)$ is the same for all classes $c$, it can be safely ignored when comparing likelihoods for classification.

Naïve Bayes is computationally inexpensive and performs well if the assumption of independence holds. However, its main limitation also lies in the assumption of independent predictor features: in real life, completely independent predictors are almost impossible.
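A minimal sketch of eq. (2.3) for categorical features is given below; the fruit data are invented solely to mirror the "apple" example above.

```python
from collections import defaultdict

# Invented toy data mirroring the "apple" example: (color, shape) -> class
samples = [("red", "round", "apple"), ("red", "round", "apple"),
           ("yellow", "long", "banana"), ("yellow", "round", "apple"),
           ("yellow", "long", "banana")]

prior = defaultdict(int)
likelihood = defaultdict(int)
for color, shape, label in samples:
    prior[label] += 1
    likelihood[(label, "color", color)] += 1
    likelihood[(label, "shape", shape)] += 1

def posterior_score(label, color, shape):
    # Numerator of eq. (2.3); the shared denominator is ignored
    p_c = prior[label] / len(samples)
    p_color = likelihood[(label, "color", color)] / prior[label]
    p_shape = likelihood[(label, "shape", shape)] / prior[label]
    return p_c * p_color * p_shape

query = ("red", "round")
best = max(prior, key=lambda c: posterior_score(c, *query))
print(best)  # -> "apple"
```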

2.3.5. K-Nearest Neighborhood

The k-nearest neighbors algorithm (k-NN) is a method originally proposed by Thomas Cover in 1967 that can be used for both classification and regression [94]. k-NN is a type of instance-based learning, where the function is only approximated locally and all computation is deferred until function evaluation.

The k-NN algorithm works by computing the distance between a query datapoint and all other datapoints, then appointing the k nearest datapoints as neighbors. The label of the queried datapoint is then determined by the neighbors, often by majority voting; a minimal sketch is given after the list of distance metrics below. The k in k-NN refers to the number of nearest neighbors and is a parameter to be optimized along with the distance metric. Popular distance metrics include:

1. Euclidean distance

Euclidean distance is the straight-line distance between two points in Euclidean space and is defined by the following equation:

$$d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \qquad (2.4)$$

Here, $d(x, y)$ is the distance between the two points $x$ and $y$, which are $n$-dimensional.

2. Manhattan distance

The Manhattan distance, also known as the city block distance or taxicab distance, is defined as the sum of the lengths of the projections of the line segment between the points onto the coordinate axes.

$$d(x, y) = \sum_{i=1}^{n}|x_i - y_i| \qquad (2.5)$$

Here, $d(x, y)$ is the distance between the two points $x$ and $y$, which are $n$-dimensional.

3. Hamming distance

Hamming distance is used for categorical features. Its formula is similar to that of the Manhattan distance, except that the per-feature difference is an indicator: if $x_i = y_i$, then $|x_i - y_i| = 0$, and if $x_i \neq y_i$, then $|x_i - y_i| = 1$.

4. Minkowski distance

The Minkowski distance is the generalization of both the Euclidean distance and the Manhattan distance. The Minkowski distance of order $P$ is defined as:

$$d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^P\right)^{\frac{1}{P}} \qquad (2.6)$$

where $d(x, y)$ is the distance between the two points $x$ and $y$, which are $n$-dimensional. For $P=1$, it is equal to the Manhattan distance, and for $P=2$, it is equal to the Euclidean distance. $P$ must be at least 1, as $P<1$ violates the triangle inequality.

The main weaknesses of the k-NN algorithm are that its computational cost does not scale well with a large number of samples and that an optimal value of k must be determined. Because the computational cost grows with the number of samples, searching for the optimal k also becomes computationally expensive. k-NN is also strongly affected by the curse of dimensionality and by outliers, and feature scaling must be performed to ensure homogeneity of the features.
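As an illustration, a minimal k-NN classifier using the Minkowski distance (which reduces to the Manhattan and Euclidean distances for P = 1 and P = 2) might look as follows; the values of k and P are illustrative.

```python
import numpy as np
from collections import Counter

def minkowski(x, y, p=2):
    """Minkowski distance (Eq. 2.6); p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def knn_predict(X_train, y_train, query, k=3, p=2):
    # Distance from the query point to every training point
    dists = [minkowski(x, query, p) for x in X_train]
    neighbors = np.argsort(dists)[:k]          # indices of the k nearest points
    votes = Counter(y_train[i] for i in neighbors)
    return votes.most_common(1)[0][0]          # majority vote

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([5.5, 4.8])))  # -> 1
```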


2.4. Summary

In this chapter, conventional inputs and machine learning algorithms for automatic mental health screening are discussed. In section 2.2 the conventional inputs are described, and in section 2.3 the conventional machine learning algorithms are discussed.


Chapter 3

Facial landmark analysis from static

images

3.1. Introduction

This chapter describes the feature analysis of facial landmarks from depression patients

and dementia patients. In section 3.2, the data acquisition protocol is described. In section

3.3, the analysis procedure is described, from preprocessing up to feature analysis. The

machine learning experiment setup is also described in this section. In section 3.4, the results of the analysis and machine learning are presented. Discussion of the results is given in section 3.5. The chapter is then concluded in section 3.6.

3.2. Data Acquisition

The data utilized in this chapter are from the PROMPT database, as specified in chapter 1. The PROMPT database's facial recordings were obtained by videotaping a patient's clinical interview session with a therapist. The recording apparatuses were the RealSense R200 (Intel Corporation) and the Microsoft Kinect for Windows v2 (Microsoft Corporation), both with a frame rate of 30 frames per second (FPS).

The recordings for the database were conducted at Keio University Hospital and the Joint Medical Research Institute. The experiment was approved by the Keio University Hospital Ethics Committee (20160156, 20150427). A full psychiatric screening session consisted of a free talk session of around 10 minutes followed by a rating session of 20 minutes or more. The interview setup and the details of the screening session are as follows:

Interview setup: During the interview, the patient and the psychiatrist were seated across a table from each other. The psychiatrist controls the start and end of the recordings. The distance between the video device and the patient is around 70 cm.

Free talk session: The psychiatrist conducts a typical interview concerning the patient's daily life and mood. If the current session is the patient's first visit, the psychiatrist may also ask about the patient's clinical background, such as family and history of other illnesses. The results of this session typically do not contribute to the assessment of the patient; its main objective is to prepare the patient for the rating session, while possibly also obtaining the clinical background in case no such information was available. Despite the name "free talk", this session has guidelines and is a semi-structured clinical interview. The length of this segment is around 10 minutes.

Rating session: In the rating session, the patient is interviewed based on clinical assessment tools related to their mental health history. These may include additional tasks and tests, such as a clock-drawing test and memory test for dementia screening, or personal questions, such as sleep habits (PSQI) and depressive mood in recent weeks, which relate to depression screening. A single rating segment typically lasts more than 20 minutes.

3.3. Analysis

To prevent the model from learning age-related features instead of disease features, and to increase the contrast between the groups, we screened the database and only included recordings which satisfy the following criteria:

(1) Recording length of 5 minutes or more. The objective of this criterion is to ensure that enough information is recorded in the dataset.

(2) Age between 57 and 84 years old. This is to remove the effect of aging, which is known to be positively correlated with dementia symptoms.

(3) For dementia patients: a mini-mental state examination (MMSE) score of 24 or less, accompanied by a geriatric depression scale (GDS) score of 4 or less. This is to ensure that the dementia patients are symptomatic and are not afflicted with depression co-morbidity.

(4) For depression patients: a 17-item Hamilton depression rating scale (HAMD17) score of 8 or more. As with the dementia patient criteria, this is to ensure that the depression patients are symptomatic. Depression patients that have co-morbidities with dementia are always classified as dementia patients.


The qualifying data consisted of 65 datasets from 46 subjects: 36 dementia datasets (21 subjects) and 29 depression datasets (26 subjects). To protect the identity of the subjects, facial landmark extraction was performed using Omron's OKAO Vision [11] to extract the X-Y coordinates, as seen in Figure 3.1. These 40 facial landmark points are processed and analyzed instead of the raw face images. The red squares are the eyebrows, the green diamonds are the eyes, red crosses mark the nose, and yellow circles are the subject's mouth. The blue dot in the middle of the eyes is the glabella, while the other blue dots mark the outline of the subject's face.

OKAO Vision uses an AdaBoost algorithm to construct a cascade of facial region learners from confidence-rated look-up tables (LUTs) of Haar features. The facial feature point extraction utilizes Gabor wavelet transform coefficients as feature values and an SVM as a classifier. The SVM classifier is trained on a database to output 1 at the predefined feature point and 0 otherwise. When detecting a feature point, the SVM is used to search around the eye or mouth area, and the position with the highest confidence is taken as the facial feature point.

Figure 3.1 Facial landmarks extracted with OKAO vision (40 points).


3.3.1. Preprocessing

The obtained facial landmarks were then normalized such that the center of the face lies at the origin (0,0), and each landmark coordinate was then divided by the face dimensions: X-coordinates were divided by the face width and Y-coordinates by the face height, as shown in Equations 3.1 and 3.2.

$$\hat{X}_i = \frac{X_i - X_c}{Width} \qquad (3.1)$$

$$\hat{Y}_i = \frac{Y_i - Y_c}{Height} \qquad (3.2)$$

Here, $\hat{X}_i$ and $\hat{Y}_i$ denote the normalized X and Y coordinates, respectively. $X_i$, $X_c$, $Y_i$, and $Y_c$ are the X-coordinate of a landmark, the X-coordinate of the face center, the Y-coordinate of a landmark, and the Y-coordinate of the face center, respectively.

Another preprocessing step was performed to remove outliers, by removing frames in which landmark values fall below the 1st percentile or above the 99th percentile.
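A minimal sketch of this preprocessing (Equations 3.1 and 3.2 plus the percentile-based frame rejection), assuming per-frame landmark coordinates stored as NumPy arrays:

```python
import numpy as np

def normalize_landmarks(X, Y, xc, yc, width, height):
    """Eq. 3.1 / 3.2: center on the face center, scale by face width/height.
    X, Y: (n_frames, n_landmarks); xc, yc, width, height: per-frame arrays."""
    Xn = (X - xc[:, None]) / width[:, None]
    Yn = (Y - yc[:, None]) / height[:, None]
    return Xn, Yn

def drop_outlier_frames(Xn, Yn):
    """Remove frames containing any landmark below the 1st or above the
    99th percentile of that landmark's own distribution."""
    coords = np.hstack([Xn, Yn])
    lo = np.percentile(coords, 1, axis=0)
    hi = np.percentile(coords, 99, axis=0)
    keep = np.all((coords >= lo) & (coords <= hi), axis=1)
    return Xn[keep], Yn[keep]
```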

3.3.2. Feature extraction

After the preprocessing, feature extraction was performed. The features extracted in this study were: speed statistics of each landmark, speed statistics of the face center, the area of the mouth, and the standard deviation of the eye pupils' position. The speed of each landmark was computed using Equations 3.3 and 3.4.

$$S_i = \sqrt{(\hat{X}_i - \hat{X}_{i+1})^2 + (\hat{Y}_i - \hat{Y}_{i+1})^2} \qquad (3.3)$$

$$\hat{S}_j = \sum_{i=30(j-1)+1}^{30j} S_i \qquad (3.4)$$

where $\hat{X}_i$ and $\hat{Y}_i$ denote a preprocessed landmark's coordinates at frame $i$. Here, $S_i$ is the landmark's speed at frame $i$, and $\hat{S}_j$ denotes the landmark's speed during second $j$. The constant 30 represents the camera's frame rate of 30 FPS, as stated in the data acquisition section.

The areas of the left eye, right eye, and mouth were computed using Equation 3.5, the general ("shoelace") formula for the area of an arbitrary polygon in 2D space.

$$A = \frac{1}{2}\left|\sum_{i=0}^{n-1}\left(\hat{X}_i\hat{Y}_{i+1} - \hat{X}_{i+1}\hat{Y}_i\right)\right| \qquad (3.5)$$

where, when $i = n-1$, the index $i+1$ wraps around to 0.
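A minimal sketch of the speed features (Equations 3.3 and 3.4) and the polygon-area feature (Equation 3.5), assuming normalized coordinates recorded at 30 FPS:

```python
import numpy as np

def speed_per_second(x, y, fps=30):
    """Eq. 3.3 / 3.4: per-frame Euclidean displacement of one landmark,
    summed over blocks of `fps` frames to give speed per second."""
    s = np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2)   # S_i, per frame
    n_sec = len(s) // fps
    return s[: n_sec * fps].reshape(n_sec, fps).sum(axis=1)  # S_j, per second

def polygon_area(xs, ys):
    """Eq. 3.5 (shoelace formula): area of the polygon whose vertices are
    the landmark points, in order; index n wraps around to 0."""
    return 0.5 * abs(np.sum(xs * np.roll(ys, -1) - np.roll(xs, -1) * ys))
```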

3.3.3. Statistical analysis

To investigate the relationship between facial features and clinical symptoms, linear correlations of the facial features against the corresponding clinical rating tools were computed. The clinical rating tools were HAMD17 for depression subjects and MMSE for dementia subjects. In addition, Wilcoxon rank-sum tests were performed to check for statistically significant differences between the facial features of the dementia and depression groups.

3.3.4. Feature selection and machine learning

Feature selection is an important process to reduce the number of features used as machine learning inputs. Important reasons to perform feature selection include mitigating overfitting, making the model easier to interpret, and improving the speed of the model's learning session. Here, we utilized the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm [95] for feature selection.

By definition, LASSO is a regression algorithm which shrinks the variable coefficients, setting some of them to zero or near zero. This effectively performs feature selection, since variables with near-zero coefficients can be excluded from the computation. Mathematically, LASSO solves

$$\min_{\alpha,\beta}\left(\frac{1}{2N}\sum_{i=1}^{N}\Bigl(y_i - \alpha - \sum_{j}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j}\left|\beta_j\right|\right) \qquad (3.6)$$


where $\alpha$ is a scalar intercept, $\beta$ is a vector of coefficients, $N$ is the number of observations, $y_i$ is the response at observation $i$, $x_{ij}$ is the value of the $j$-th predictor at observation $i$, and $\lambda$ is a nonnegative regularization parameter. A high value of $\lambda$ results in stricter feature selection; for this analysis, it was computed automatically as the largest possible value that still yields a non-null model.

Only features selected at least 10% of the time by this algorithm were utilized for machine learning; the predictive performance of LASSO itself was not considered, only its coefficients. This ensures that LASSO filters out features that are selected only on very rare occasions. The total number of features before feature selection was 41 (40 normalized facial landmarks plus the facial center point).
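As a minimal sketch of this selection rule (with an illustrative fixed regularization strength rather than the automatically computed $\lambda$ described above):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selection_frequency(X, y, alpha=0.1, n_runs=10, seed=0):
    """Fit LASSO on random subsamples and count how often each feature
    receives a non-negligible coefficient. Features kept in >= 10% of
    runs survive the filter. `alpha` and `n_runs` are illustrative."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_runs):
        idx = rng.choice(len(y), size=int(0.9 * len(y)), replace=False)
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        counts += np.abs(model.coef_) > 1e-6   # only the coefficients matter
    return counts / n_runs >= 0.10             # boolean mask of kept features
```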

We then examined the possibility of differentiating depression patients and dementia patients based on facial features by applying supervised machine learning, with 10-fold cross-validation to measure the models' performance. The machine learning models used were support vector machines (SVM) with various kernels: linear, polynomial of order 2, and radial basis function (RBF). Additionally, random undersampling boosted trees (RUSBoost) and AdaBoost trees were also utilized. Neural networks were not considered, as the number of samples in the dataset is small.

3.4. Results

The Wilcoxon rank-sum test between the dementia and depression groups found statistically significant differences (p < 0.05) in the following features: left and right eye features including the pupils but not the eyebrows, the glabella, nose features, and the distance between the eyelids. None of the mouth features showed a significant difference between the depression and dementia groups, nor did the jaw and ear features. Pearson's correlation analysis between the clinical screening tools and the features found almost no statistically significant correlations (p < 0.05); one feature in the depression group, the area of the mouth, showed a significant negative correlation with HAMD17 (p = 0.0154, R = -0.4457), and one feature in the dementia group, the distance between the eyelids, showed a significant negative correlation with MMSE (p = 0.0146, R = -0.4037).


Twenty-two (22) features were selected by the LASSO algorithm; these features were selected in all feature selection cases, hence the high number. These features were utilized for machine learning and are listed in Table 3.1.

Table 3.1 Features selected with LASSO

| Statistical feature | Landmarks |
|---|---|
| Average speed | Left pupil, left eye (bottom), right pupil, left nose, right nose, right eyebrow (bottom & left), right jaw, right ear (bottom) |
| Median speed | Right pupil, upper lip (bottom) |
| Standard deviation of speed | Left eye (top), right eye (top & bottom), glabella, left eyebrow (bottom), right eyebrow (top, bottom, and left), left jaw, right jaw, left ear (bottom), right ear (top) |
| 95th percentile of speed | Right nose, left eyebrow (top & left), right eyebrow (bottom), left ear (bottom) |

The best-performing SVM model was the one with the polynomial kernel of order 2, with an average accuracy of 81.37 ± 15.03%. The second best was the SVM with linear kernel, with an average accuracy of 77.86 ± 16.54%, whilst the RBF kernel performed the worst, with an average accuracy of 69.46 ± 15.96%. The detailed results are described in Table 3.2.


Table 3.2 Machine learning results

| Model | Accuracy |
|---|---|
| SVM-polynomial | 81.37 ± 15.03% |
| SVM-linear | 77.86 ± 16.54% |
| SVM-RBF | 69.46 ± 15.96% |
| RUSBoost | 69.19 ± 20.92% |
| AdaBoost | 80.71 ± 12.22% |

3.5. Discussion

The Pearson’s correlation analysis result of the mental health screening tools and the

facial features was very interesting. For each group, only one feature showed statistically

significant correlation but none of those features were not significant difference between

the two groups. With this result, it seemed that the facial features chosen in this chapter

were not a good predictor for depression or dementia, but it might still be possible to

differentiate depression and dementia patient using those features as supported by the

result of Wilcoxon rank-sum test. Eye, glabella, and nose features seemed to be the best

features for differentiating depression and dementia as those features were also chosen

by LASSO in the machine learning experiment.

The best performance among the SVM models was achieved with the polynomial kernel of order 2, and the worst with the RBF kernel. The AdaBoosted trees performed quite well, with AdaBoost taking overall second best with the smallest variance; however, the RUSBoosted trees performed the worst. RUSBoost is a machine learning algorithm aiming to solve the class imbalance problem. In this case, the class sizes were 36 against 29 datasets and appeared to be imbalanced, but RUSBoost did not seem to work properly. As stated in chapter 1, there is no algorithm aiming to separate depression and dementia to date. Automatic depression screening studies with facial features as input report accuracies of 87.67% [96], 82% [97], and 74.5% [98]. Dementia studies based on facial features are almost non-existent, but automatic dementia screening studies using other types of input report accuracies of 94.73% (MRI) [99] and 88% (questionnaire) [100].


Overall, both AdaBoost and SVM with polynomial kernel might be considered the best

models in this case.

3.6. Summary

The results of the facial feature analysis show statistically significant differences between dementia patients and depression patients in the facial features utilized. The two groups can be differentiated even with traditional machine learning techniques such as SVM. This result suggests the possibility of automatic pseudodementia screening using machine learning. The result of the Pearson's correlation analysis suggests that other facial features, such as facial expressions, might be considered instead of the speed statistics of the facial landmarks and the mouth area.


Chapter 4

Robust facial landmark tracking

algorithm

4.1. Introduction

In contrast with the other chapters, this chapter focuses on the proposal of a robust facial landmark tracking algorithm. Real-time monitoring, or even faster diagnosis, is always preferable. The trade-off between performance and the number of samples (and, indirectly, speed) is an inherent problem in machine learning, and it is even more apparent in real-time processing. Besides human-computer interaction factors, one major advantage of faster machine-learning-based psychiatric disorder screening is the possibility of applying adaptive machine learning algorithms instead of static models.

In addition to real-time analysis, facial feature point tracking that is robust against camera shake and face orientation changes is very important. It is impractical and pointless to ask a clinical patient to stay still in front of the recording apparatus: not only does it place a burden on the patient, but the features needed for analysis are also polluted by the patient's conscious effort to keep facing the camera. Therefore, a facial feature point tracking algorithm that is robust to head movement is desirable.

The rest of this chapter is structured as follows. In section 4.2, criticism of conventional facial landmark tracking algorithms is described. In section 4.3, the proposed improvement is described. The experiment is reported in section 4.4, and the results are discussed in section 4.5. Finally, the chapter is concluded in section 4.6.

4.2. Conventional facial tracking analysis

Conventional facial feature point detection methods typically estimate the coordinates of feature points by a regression process. One such algorithm is the Supervised Descent Method (SDM) [101]. In SDM, the positions of estimated feature points are improved by iteratively applying the current estimate of the feature points to the input image, calculating local binary pattern (LBP) values around the current estimate, and multiplying the LBP values by weights pre-trained by means of machine learning. As a result, the estimated feature points progressively converge towards the actual face in the image. SDM optimizes the error function between the estimated face shape and the correct shape, and thus does not need iterative solvers such as Newton's method, only weights learned from training images. In recent years, various improvements of SDM have been proposed, concerning alternative feature extraction methods [102], weight learning algorithms such as deep learning [103], and the positioning of the initial points [104].

SDM will fail to learn the weights for updating feature points if there is no unique solution in the feature space (for example, a non-front-facing picture) or if the optimization falls into a local optimum. To prevent this problem, [105] proposed dividing the training set, by means of principal component analysis or similar algorithms, such that the objective function always reaches the global optimum. However, this makes it necessary to evaluate the regions at each step of the facial feature estimation update in the testing phase as well. Additionally, real-time implementation is impossible, as the computational cost is high.

To solve this problem, [106] proposed Cascaded Compositional Learning (CCL), an algorithm which is a direct improvement of [105]. During the training phase, the weights are learned for each region as in [105], but in the test phase, candidate facial feature points from all regions are obtained and then combined by weighting each with its likelihood. As a result, it is possible to predict the face shape while avoiding local optima, with superior computational cost and performance. The suggested initialization for each frame is the average of the facial feature points.

Although this is beneficial for still images, in cases where prior information is available, such as a tracking problem, it is inefficient to recalculate all the steps from the start. Also, when multiple people appear in a single picture, there is a possibility that the algorithm switches to another person in the next frame. Therefore, it is theoretically possible to improve the algorithm for the tracking case by incorporating transition information between frames. A simple solution would be to assume a linear motion model, such as constant movement in one direction; however, this would cause the tracking to fail if an unexpected motion such as camera shake occurs. In the field of computer vision tracking, optical flow has been reported to yield satisfactory results [107]. Therefore, a CCL-based facial feature point tracking algorithm using optical flow is developed and reported in this chapter. For the very first frame, the algorithm uses the average face, as in the conventional studies. From the second frame onwards, the initial value is estimated using optical flow, and then CCL is performed to achieve effective facial landmark tracking.

4.3. Proposed method

In general, facial feature point tracking by CCL can be divided into three parts: facial feature point mapping, regression learning of the weights for each region, and shape estimation by combined vectors. In addition, a pre-trained random forest model for extracting the features $\Phi$ from an input image $I$ and face shape $S$ is necessary.

4.3.1. Supervised descent method

In SDM, the facial feature points are expressed as $S = [x_1, y_1, \ldots, x_M, y_M]$, where $M$ is the number of feature points. In the general SDM, given an initial face shape $S_0$ and an input image $I$, shape prediction is performed by an $N$-step regression. The regression equation of the $n$-th stage is as follows:

$$S_n = S_{n-1} + W_n\,\Phi_n(I, S_{n-1}) \qquad (4.1)$$

where $\Phi_n(I, S_{n-1})$ is the feature value at stage $n$, based on the input image $I$ and the face shape $S$ at stage $n-1$, and $W_n$ is a weight vector for updating the face shape.

The feature value $\Phi$ can be obtained by means of the scale-invariant feature transform (SIFT), the histogram of oriented gradients (HOG), or pixel-value differences between two images. For the cases of side-view faces and occlusion, Local Binary Features (LBF) are said to be effective [102].

In SDM, Equation 4.1 is not solved by iterative algorithms such as Newton's method, but by utilizing pre-trained weights, with the aim of shortening the computation time. However, as described above, SDM tends to fall into local optima when there is no unique solution. Therefore, by dividing the loss function into $K$ domains of homogeneous descent (DHD), optimal weights $W_k^n$ can be obtained for each domain, negating the pitfall of local optima. The division into DHDs is achieved by dimensionality reduction with Principal Component Analysis (PCA). Conventionally, the weight $W_k^n$ in each region is calculated by ridge regression, as described in the following equation.

$$\min_{W_k^n}\ \sum_{i\in T_k}\left\|\hat{S}_i - S_i^{\,n-1} - W_k^n\,\Phi_n(I, S^{n-1})\right\|_2^2 + \lambda\left\|W_k^n\right\|_F^2 \qquad (4.2)$$

Here, the integer $i$ refers to the $i$-th training image, $\hat{S}_i$ is the correct face shape, $T_k$ is the training image set of the $k$-th region, $W_k^n$ is the $n$-th stage weight in the $k$-th region, and $\lambda$ is a regularization parameter. $W_k^n$ is estimated by the Support Vector Regression (SVR) algorithm with the kernel trick [108].

As the testing phase revolves around moving pictures, each frame can be assumed to be dependent on the previous frame. Therefore, in the test phase of the proposed method, the feature points predicted by the optical flow from the previous frame are set as the initial values:

$$S_t^0 = S_{t-1}^N + v(S_{t-1}) \qquad (4.3)$$

where $v(S_{t-1}) = [\Delta x_1, \Delta y_1, \ldots, \Delta x_M, \Delta y_M]$ is the optical flow of the feature points between the previous and current frames.
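As an illustration of this initialization step, the following is a minimal sketch using OpenCV's pyramidal Lucas-Kanade optical flow; the use of cv2.calcOpticalFlowPyrLK here is an assumption for illustration, not the implementation used in this thesis, and the CCL refinement that follows it is omitted.

```python
import cv2
import numpy as np

def propagate_landmarks(prev_gray, curr_gray, prev_shape):
    """Eq. 4.3: shift last frame's landmarks by the optical flow v(S_{t-1})
    to obtain the initial shape S_t^0 for the current frame.
    prev_shape: (M, 2) array of landmark coordinates."""
    pts = prev_shape.reshape(-1, 1, 2).astype(np.float32)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, pts, None)
    flow = (next_pts - pts).reshape(-1, 2)          # [dx_1, dy_1, ...]
    init_shape = prev_shape + flow                  # S_t^0
    # Landmarks whose flow could not be tracked keep their previous position.
    lost = status.ravel() == 0
    init_shape[lost] = prev_shape[lost]
    return init_shape
```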

4.3.2. Compositional vector estimation

The feature value $\Phi$ mentioned in the previous section is a quantity derived from Local Binary Features (LBF). The LBF are adaptively extracted using a fern model applied to the difference in brightness between two points taken randomly from around the facial feature points $S$ of the input image $I$. In CCL, class labels reflecting the appearance of the feature points are also input to the LBF at the same time, in order to extract feature quantities based on the Hough Forest [108]. The Hough Forest is a method that selects between two evaluation functions according to the hierarchy level when defining a branching function, making it possible to obtain features robust to, for example, occlusion. In this work, the two branching functions were selected according to the ratio of labels of the samples input to each node.


For each image in each iteration, the DHD region is first computed and then the weight for each DHD region, $W_k^n$, is estimated. However, this estimation requires the correct face shape $\hat{S}_i$, which is not available in the testing phase. Therefore, in CCL, the regions are not estimated; instead, a composite vector $p = [p_1, p_2, \ldots, p_K]$ is used to combine the predicted shapes $S_k$. To calculate the composite vector, information such as the offset vectors is removed from the feature quantity $\Phi$ extracted at the tentative predicted shape $S_k$, and the composite-vector feature quantity $\Phi'$ of compressed class label information is fitted with the fern model.

4.3.3. Training composite vectors

The correspondence between the composite-vector feature quantity $\Phi'$ and the composite vector described in the previous section is represented by the function $g$ in the following equation.

$$\min_{g}\ \sum_{i\in T}\left\|\hat{S}_i - S_i^{\,n-1}\right\|_2^2 \quad \text{s.t.}\ \ S_i = \sum_{k=1}^{K} p_{ik}\,S_{ik},\quad p_i = g(\Phi'),\quad p_i \geq 0,\quad \|p_i\|_1 = 1 \qquad (4.4)$$

The fern model is used to learn the function $g$ mapping $\Phi'$ to $p_i$. The branching parameter $\theta_j$ that minimizes the sum of the errors and the resulting composite vectors $p^{(d)}$ ($d = \{L, R\}$, for the left and right child nodes) are stored.

$$\min_{\theta,\,p^{(d)}}\ \sum_{d=L,R}\ \sum_{i\in Q}\left\|\hat{S}_i - S_i^{\,n-1}\right\|_2^2 \quad \text{s.t.}\ \ S_i = \sum_{k=1}^{K}\bigl(p^{(d)}\bigr)_k S_{ik},\quad p^{(d)} \geq 0,\quad \bigl\|p^{(d)}\bigr\|_1 = 1 \qquad (4.5)$$

In the test phase, the feature $\Phi'_{test}$ is input and branched based on $\theta_j$, and $p$ is then computed by solving the following error minimization problem, using the set of training images $Q_{set}$ held by the terminal nodes that are reached:


$$\min_{p}\ \sum_{i\in Q_{set}}\left\|\hat{S}_i - S_i^{\,n-1}\right\|_2^2 \quad \text{s.t.}\ \ p \geq 0,\quad \|p\|_1 = 1,\quad p_k = 0\ \ \forall k \notin K \qquad (4.6)$$

(4.6)

The face shape in each region is obtained by multiplying the obtained $p$ with the per-region weights calculated during learning. This process is repeated $N$ times to obtain an accurate predicted face shape.

4.4. Experiment

In order to verify the effectiveness of the proposed method, a comparative experiment

was conducted using the benchmark data set.

The AFLW dataset [109], one of the most challenging datasets, was utilized for training. AFLW is a dataset of face images in real environments obtained from the online photo sharing service Flickr. In the AFLW dataset, each image is annotated with 21 facial feature points, including the right and left eyes, eyebrows, nose, mouth, both ears, and jaw. These were annotated manually by humans.

Approximately 20% of the images show occlusion of some facial feature points due to eyeglasses, hands, etc. In the experiment, 1,000 frames were randomly selected from all 24,386 frames, and, as in the previous study, a total of 19 of the 21 points were used, excluding the 2 ear points. Each facial feature point was also assigned a positive (+) or negative (-) label for the Hough Forest, depending on whether occlusion occurs and whether the distance from the average face shape is within the search range at the time of feature extraction.

The test dataset is the 300-VW database [110] [111] [112]. The 90th video, a moving image including face direction changes, was used for evaluation. The video consists of 649 frames, and feature point data are associated with each frame. Among the 68 points included in the 300-VW data, only the 19 points corresponding to the ones selected in the AFLW dataset are used as the reference face shape for evaluation.

In the proposed method, based on facial feature point detection by CCL, the optical flow from the previous frame is used as the initial value. The initial face shape in the comparison method is determined by fitting the average face to the face area detected by the Viola-Jones algorithm. We used the mean error rate, the normalized average Euclidean distance between the correct face shape and the estimated face shape over all feature points, as the evaluation metric. In this experiment, a PC with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz and Matlab R2016a were used for evaluation. The parameters utilized are shown in Table 4.1.

4.5. Results and discussion

Figure 4.1 shows the facial feature points utilized in this experiment. The blue dots

indicate feature points from 300VW and red diamonds are 19 of 21 facial feature points

utilized in AFLW dataset.

The experimental results are shown in Figure 4.2. The green dots are the facial feature points, and the yellow rectangles are the bounding boxes for face detection. The mean error rate of 3.97 obtained with the conventional method decreases to 3.02 with the proposed method, indicating the effectiveness of the proposed method.

Figure 4.1 Comparison of facial feature points from AFLW dataset and 300-VW. The

blue dots indicate feature points from 300VW and red diamonds are 19 of 21 facial feature

points utilized in AFLW dataset. In this chapter, only these 19 landmarks are tracked.


Table 4.1 Parameters utilized in the experiment

| Parameter | Value |
|---|---|
| L (number of facial feature points) | 19 |
| N (number of iterations) | 5 |
| K (number of DHDs) | 8 |
| Number of sampled two-point pairs | 500 |
| Sample overlap (%) | 20 |
| Number of decision trees | 5 |
| Depth of decision trees | 4 |

Figure 4.2 Experiment results for the conventional method and the proposed method.

4.6. Conclusion

In this chapter, a new facial feature point tracking method based on CCL with optical flow was described. It showed that continuous tracking is possible even in situations where tracking previously failed due to face orientation changes and occlusion. In the experiment, we compared the error with the conventional method using the 300-VW dataset and confirmed the effectiveness of the proposed method. Since parameter optimization has not yet been performed and tracking speed has not been taken into consideration, in future work we will search for the parameters with the best real-time accuracy and verify the effectiveness in real environments.


Chapter 5

Speech feature analysis for classification

of depression and dementia

5.1. Introduction

This chapter describes the feature analysis of acoustic features from depression patients and dementia patients, in particular the spectral and temporal features of speech. In section 5.2, the data acquisition protocol is described. In section 5.3, the analysis procedure is described, and the results are presented in section 5.4. Discussion of the results is given in section 5.5. The chapter is then concluded in section 5.6.

5.2. Data Acquisition

Similar to chapter 3, the data utilized in this chapter are from the PROMPT database. The PROMPT database's audio recordings were obtained by recording a patient's clinical interview session with a therapist. The recording apparatus was a Beyerdynamic Classis RM30W microphone (Beyerdynamic GmbH & Co. KG) with a 16 kHz sampling rate. For uniformity, the database filtration criteria are largely similar to those of chapter 3.

5.3. Analysis

The analysis is largely divided into two parts: statistical analysis and a machine learning experiment. The machine learning experiment is further divided into three stages. For the statistical analysis and the first and second stages of machine learning, several datasets were removed from the PROMPT database in consideration of age and the presence of symptoms. The criteria are similar to chapter 3's data filtration criteria, with one difference: the minimum recording length. Successful visual recordings obtained in chapter 3 are fewer in number than successful audio recordings, and the audio recordings are on average longer than the visual recordings. In this chapter, the minimum length for a recording to be utilized is 10 minutes, twice the requirement used in chapter 3.


The third stage of the machine learning experiment utilized the datasets which were filtered out from the statistical analysis and the first and second stages. Figure 5.1 illustrates the dataset filtering for the statistical analysis and machine learning phases.

Figure 5.1 Dataset Filtration in Statistical Analysis

5.3.1. Audio signal analysis

In some rare cases, the recordings contained outliers, possibly caused by random errors, so preprocessing of the raw data needed to be conducted. We defined the outliers using the inter-quartile range (IQR). A point in the audio recording is defined to be an outlier if it satisfies one of the following conditions:

1. $X < Q_1 - 1.5\,IQR$
2. $X > Q_3 + 1.5\,IQR$


Here, $X$ is the signal value, $Q_1$ is the lower (1st) quartile, $Q_3$ is the upper (3rd) quartile, and $IQR$ is the inter-quartile range, computed by subtracting $Q_1$ from $Q_3$. We then applied cubic smoothing spline fitting to the audio signal with the outliers removed. The objective of this step is twofold: (1) to interpolate the removed outliers, and (2) to remove subtle noise. Additionally, intensity normalization was performed, to ensure that the recordings are on an equal scale with each other and to reduce clipping in the audio signals. The normalization was conducted by rescaling the signal such that the maximum absolute value of its amplitude is 0.99. Continuous silence, in the form of leading and trailing zeroes at the start and end of the recordings, was also deleted.
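A minimal sketch of this preprocessing chain, assuming a 1-D floating-point signal; note that an interpolating cubic spline from SciPy is used here in place of the cubic smoothing spline, so the smoothing behaviour is approximated rather than reproduced.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def preprocess_audio(x):
    """IQR outlier removal + spline interpolation + 0.99 peak normalization
    + trimming of leading/trailing zeros, for a 1-D float signal x."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    good = (x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)
    t = np.arange(len(x))
    # Fit a cubic spline through the inliers and re-evaluate everywhere,
    # which interpolates over the removed outliers.
    x = CubicSpline(t[good], x[good])(t)
    x = 0.99 * x / np.max(np.abs(x))           # intensity normalization
    nz = np.flatnonzero(x)                     # trim leading/trailing silence
    return x[nz[0]: nz[-1] + 1]
```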

A total of ten acoustic features were extracted from the raw data: pitch, harmonics-to-noise ratio (HNR), zero-crossing rate (ZCR), Mel-frequency cepstral coefficients (MFCC), Gammatone cepstral coefficients (GTCC), mean frequency, median frequency, signal energy, spectral centroid, and spectral rolloff point, with details in Table 5.1. These features were chosen as they represent both the temporal and the spectral characteristics of a signal. Additionally, some of these features relate closely to speech, which is a common biomarker for both depression and dementia [113] [114] [115]. The features were computed once every 10 ms by applying a 10 ms window with no overlap and performing feature extraction on the windowed signals. The total count of audio features is 36, with 14 MFCC and 14 GTCC coefficients. As we used data with a length of at least 10 minutes, a minimum of 60,000 datapoints was obtained for each feature. We then computed the mean, median, and standard deviation (SD) of the datapoints and used them for statistical analysis and machine learning, resulting in a total feature count of 108.


Table 5.1 Acoustic features utilized in this chapter

| Feature | Mathematical function and references |
|---|---|
| Pitch | [116] |
| HNR | [117] |
| ZCR | $ZCR(X) = \frac{1}{2N}\sum_{i}^{N}\left|\mathrm{sgn}(X_i) - \mathrm{sgn}(X_{i-1})\right|$ |
| MFCC | [118] |
| GTCC | [119] |
| Mean frequency | Mean of the power spectrum of the signal |
| Median frequency | Median of the power spectrum of the signal |
| Signal energy | $E(X) = \frac{\sigma(X)}{\mu(X)}$ |
| Spectral centroid | $c = \frac{\sum_{i=b_1}^{b_2} f_i s_i}{\sum_{i=b_1}^{b_2} s_i}$ [120] |
| Spectral rolloff point | $\sum_{i=b_1}^{r} s_i = \frac{k}{100}\sum_{i=b_1}^{b_2} s_i$ [120] |

For ZCR: $N$, $\mathrm{sgn}$, and $X_i$ denote the length of the signal, the signum function extracting the sign of a real number (positive, negative, or zero), and the $i$-th sample of the signal $X$, respectively. For mean frequency and median frequency: the power spectrum of the signal was obtained by applying the Fourier transform. For signal energy: $E(X)$ is the signal energy of signal $X$, $\sigma(X)$ denotes the standard deviation of signal $X$, and $\mu(X)$ the mean of signal $X$. For the spectral centroid: $c$ denotes the spectral centroid, $f_i$ is the frequency in Hertz corresponding to bin $i$, $s_i$ is the spectral value at bin $i$, and $b_1$ and $b_2$ are the band edges, in bins, over which the spectral centroid is calculated. For the spectral rolloff point: $r$ is the spectral rolloff frequency, $s_i$ is the spectral value at bin $i$, and $b_1$ and $b_2$ are the band edges, in bins, over which the spectral rolloff point is calculated.
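As an illustration of the windowed extraction, the following sketch computes the ZCR and spectral centroid formulas from Table 5.1 over non-overlapping 10 ms frames (at 16 kHz, 160 samples per frame); the library-based features (pitch, HNR, MFCC, GTCC) follow the cited references and are omitted here.

```python
import numpy as np

def frame_signal(x, fs=16000, win_ms=10):
    n = int(fs * win_ms / 1000)                 # 160 samples per 10 ms frame
    n_frames = len(x) // n
    return x[: n_frames * n].reshape(n_frames, n)

def zcr(frame):
    """Zero-crossing rate as defined in Table 5.1."""
    s = np.sign(frame)
    return np.sum(np.abs(np.diff(s))) / (2 * len(frame))

def spectral_centroid(frame, fs=16000):
    """Magnitude-weighted mean frequency over the frame's spectrum."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * spec) / np.sum(spec)

frames = frame_signal(np.random.randn(16000))   # 1 s of dummy audio
feats = np.array([[zcr(f), spectral_centroid(f)] for f in frames])
# Per-feature summary statistics, as used for the analysis (mean/median/SD)
stats = np.concatenate([feats.mean(0), np.median(feats, 0), feats.std(0)])
```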


5.3.2. Statistical analysis

To investigate the relationship between the audio features and clinical symptoms, linear correlations of the acoustic features against the corresponding clinical rating tools were computed. The clinical rating tools were HAMD for depression subjects and MMSE for dementia subjects. In addition, two-tailed t-tests were performed to check for statistical significance, with the p-values adjusted using the Bonferroni correction. Additionally, the correlations of age and sex with the clinical rating tools were evaluated for validation purposes.

5.3.3. Machine Learning

Machine learning was performed in three stages: (1) to examine the possibility of automatic pseudodementia diagnosis with unsupervised learning, (2) to examine the possibility of automatic pseudodementia diagnosis with a supervised classifier, and (3) to validate its robustness against non-age-matched datasets. The unsupervised learning algorithm utilized for the first stage was k-means clustering, with parameters k = 2 and the squared Euclidean distance metric. For stages 2 and 3, the machine learning model utilized was a binary classifier: a support vector machine (SVM) with a linear kernel, a 3rd order polynomial kernel, and a radial-basis function (RBF) kernel [90]. The hyperparameter for both the linear kernel and the polynomial kernel is the cost parameter C, while the RBF kernel has two hyperparameters: C and gamma. Hyperparameter optimization was performed using a grid search with values ranging from $10^{-3}$ to 1000. The linear kernel was chosen because it allows the visualization of feature contributions, as opposed to SVMs with nonlinear kernels. For the second phase, the machine learning session was performed using nested 10-fold cross-validation, defined as follows (a sketch of this procedure is given after the list):

1. Split the datasets into ten smaller groups, maintaining the ratio of the classes.
2. Perform ten-fold cross-validation using these groups. For each fold:
   (a) Split the training group into ten smaller subgroups.
   (b) Perform another ten-fold cross-validation using these subgroups. For each inner fold:
       i. Perform LASSO regression [95] and obtain the coefficients. The performance of the LASSO model itself is not considered.
       ii. Mark the features with coefficients of less than 0.01.
   (c) Perform feature selection by removing features that received 10 marks in step 2-b-ii.
   (d) Train an SVM model based on the features retained in (c).
3. Compute the average performance and standard deviation of the models.
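The following is a minimal sketch of this nested procedure built on scikit-learn primitives; the LASSO regularization strength `alpha` and the SVM settings are illustrative placeholders, while the 0.01 coefficient threshold and the ten-mark rejection rule follow the description above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

def nested_cv(X, y, coef_thresh=0.01, alpha=0.1):
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        Xtr, ytr = X[train_idx], y[train_idx]
        marks = np.zeros(X.shape[1])
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
        for in_tr, _ in inner.split(Xtr, ytr):
            lasso = Lasso(alpha=alpha).fit(Xtr[in_tr], ytr[in_tr])
            marks += np.abs(lasso.coef_) < coef_thresh   # mark weak features
        keep = marks < 10     # drop features marked in all ten inner folds
        clf = SVC(kernel="linear", C=1.0).fit(Xtr[:, keep], ytr)
        scores.append(clf.score(X[test_idx][:, keep], y[test_idx]))
    return np.mean(scores), np.std(scores)
```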

In the third phase, an SVM model was trained using the age-matched subjects and the features selected in the second phase. The resulting model's performance was evaluated against the filtered-out subjects: young depression and old dementia subjects. In both cases, the dementia patients were labelled as class 0 (negative) and the depression patients as class 1 (positive). The phases are illustrated in Figure 5.2.


Figure 5.2 Flowchart of the supervised machine learning procedure. The first and second phases used age-matched symptomatic depression and dementia subjects. Since the first phase consists only of unsupervised clustering, it is omitted here. The second phase consists of conventional training and evaluation. The third phase involves evaluating the machine learning model trained on age-matched subjects against non-age-matched subjects.

5.3.4. Evaluation Metrics

We utilized eight metrics to evaluate the effectiveness of the machine learning models, all of which are computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In this study, the depression class was labelled as "positive" and the dementia class as "negative". All of the TP, FP, TN, and FN values were obtained from the confusion matrix. Based on the confusion matrices, the evaluation metrics of observed accuracy, true positive rate (TPR / sensitivity), true negative rate (TNR / specificity), positive predictive value (PPV / precision), negative predictive value (NPV), F1-score, Cohen's kappa, and Matthews correlation coefficient (MCC) can then be computed. The formulas for computing these metrics are described in Table 5.2. These are conventional evaluation metrics for performance evaluation; metrics related to inter-rater reliability, such as Cohen's kappa and MCC, were included to ensure the validity of the measurements in cases of imbalanced samples.

Table 5.2 Accuracy metrics

| Metric | Formula |
|---|---|
| Accuracy (ACC) | $ACC = \frac{TP + TN}{TP + TN + FP + FN}$ |
| True Positive Rate (TPR) | $TPR = \frac{TP}{TP + FN}$ |
| True Negative Rate (TNR) | $TNR = \frac{TN}{TN + FP}$ |
| Positive Predictive Value (PPV) | $PPV = \frac{TP}{TP + FP}$ |
| Negative Predictive Value (NPV) | $NPV = \frac{TN}{TN + FN}$ |
| F1 score | $F1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$ |
| Cohen's kappa | $EXP = \frac{(TP+FP)(TP+FN) + (TN+FN)(TN+FP)}{(TP+TN+FP+FN)^2}$, $\ Kappa = \frac{ACC - EXP}{1 - EXP}$ |
| Matthews correlation coefficient (MCC) | $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ |
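As a worked illustration, the following sketch computes all eight metrics of Table 5.2 directly from confusion-matrix counts:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Evaluation metrics of Table 5.2, from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    tpr = tp / (tp + fn)                      # sensitivity
    tnr = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                      # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (acc - exp) / (1 - exp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(ACC=acc, TPR=tpr, TNR=tnr, PPV=ppv,
                NPV=npv, F1=f1, kappa=kappa, MCC=mcc)
```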

5.4. Results

A total of 419 datasets (300 depression and 119 dementia) from 120 subjects (depression n = 77, dementia n = 43) were available from the PROMPT database. After age-matching, only 177 datasets (89 depression and 88 dementia) from 53 participants (depression n = 24, dementia n = 29) qualified for the first and second phases of machine learning. The filtered-out young depression patients and old dementia patients formed the test dataset used in the third phase of machine learning: 242 datasets (211 depression and 31 dementia) from 67 patients (depression n = 53, dementia n = 14). Details of the subject demographics are given in Table 5.3.


Table 5.3 Subject demographics

| Group | Demographic | Depression | Dementia |
|---|---|---|---|
| Symptomatic | n (datasets / subjects) | 300 / 77 | 119 / 43 |
| | age (mean ± s.d. years) | 50.4 ± 15.1 | 80.8 ± 8.3 |
| | sex (female %) | 54.5 | 72.1 |
| Age-matched | n (datasets / subjects) | 89 / 24 | 88 / 29 |
| | age (mean ± s.d. years) | 67.8 ± 7.1 | 77.0 ± 7.5 |
| | sex (female %) | 83.3 | 72.4 |
| Young depression, old dementia | n (datasets / subjects) | 211 / 53 | 31 / 14 |
| | age (mean ± s.d. years) | 42.5 ± 10.4 | 88.5 ± 1.9 |
| | sex (female %) | 41.5 | 71.4 |

5.4.1. Statistical analysis

Pearson’s correlation found significant correlations with clinical interview tools in

features of GTCCs 1, 3, 12 and MFCCs 1, 3, 4, 7, 12. The average absolute correlation

coefficient R was 0.264 and its SD was 0.049. The highest absolute correlation value with

statistical significance (p < 0.05) was |R| = 0.346 for depression and |R| = 0.400 for

dementia. Features with significant correlation related to depression tend to yield weak to

moderate negative Pearson correlation values (average absolute R ± SD = 0.289 ± 0.05)

while features with significant correlation related to dementia tend to yield weak to

moderate positive Pearson correlation values (average absolute R ± SD = 0.281 ± 0.06).

The corrected two-tailed t-tests showed significant differences in HNR, ZCR, GTCC coefficients 4–14, mean frequency, median frequency, MFCC coefficients 4–13, spectral centroid, and spectral rolloff point. No significant difference was found in pitch and energy.

No significant correlation was found between sex and the clinical assessment tools (depression R = 0.021, p = 0.853; dementia R = 0.142, p = 0.928). Age had no significant correlation with depression's clinical assessment tool (R = 0.097, p = 0.403), but a significant, moderate correlation between age and dementia's clinical assessment tool was found (R = 0.424, p = 0.0046).


5.4.2. Machine learning

In this section, the results of the machine learning experiments are presented. The evaluation results of unsupervised learning with the k-means algorithm are shown in Table 5.4. For the SVM with linear kernel, 26 features were completely rejected in the feature selection, resulting in their removal during the creation of the model for the second phase. The rejected features were related to pitch, GTCCs 1–3, MFCCs 1–3, signal energy, spectral centroid, and spectral rolloff point. Feature selection for the SVM with 3rd order polynomial kernel resulted in the removal of 28 features, related to pitch, GTCCs and MFCCs (1–3, 12–13), signal energy, spectral centroid, and spectral rolloff. LASSO with the RBF-SVM similarly rejected 28 features, related to low-order (1–4) and high-order (10–13) MFCC and GTCC coefficients, pitch, signal energy, spectral centroid, and spectral rolloff. The machine learning evaluation results for phase 2 are shown in Tables 5.5-5.7, and the results for phase 3 are shown in Table 5.8. Results with and without the LASSO algorithm are also shown in these tables, to confirm the effectiveness of the feature selection. Here, the label "positive" represents depression patients and "negative" dementia patients.

Table 5.4 Unsupervised machine learning result

| Metric | Score (%) |
|---|---|
| Accuracy (ACC) | 62.7 |
| True Positive Rate (TPR) | 89.9 |
| True Negative Rate (TNR) | 35.2 |
| Positive Predictive Value (PPV) | 58.3 |
| Negative Predictive Value (NPV) | 77.5 |
| F1 score | 70.8 |
| Cohen's kappa | 25.2 |
| Matthews correlation coefficient (MCC) | 30.0 |


Table 5.5 Supervised machine learning result - SVM with linear kernel (mean ± SD %)

| Metric | Training, no LASSO | Training, with LASSO | Testing, no LASSO | Testing, with LASSO |
|---|---|---|---|---|
| Accuracy (ACC) | 90.1 ± 2.4 | 95.2 ± 0.7 | 84.2 ± 5.3 | 93.3 ± 7.7 |
| True Positive Rate (TPR) | 94.4 ± 0.9 | 98.3 ± 0.9 | 88.8 ± 10.5 | 97.8 ± 4.7 |
| True Negative Rate (TNR) | 85.7 ± 4.6 | 92.6 ± 1.2 | 79.6 ± 11.5 | 89.4 ± 13.7 |
| Positive Predictive Value (PPV) | 87.1 ± 3.5 | 92.1 ± 1.2 | 82.5 ± 8.8 | 90.4 ± 11.7 |
| Negative Predictive Value (NPV) | 93.8 ± 1.0 | 98.4 ± 0.8 | 88.8 ± 8.9 | 98.0 ± 4.2 |
| F1 score | 90.6 ± 2.0 | 95.1 ± 0.7 | 84.8 ± 5.5 | 93.5 ± 7.2 |
| Cohen's kappa | 80.2 ± 4.7 | 90.5 ± 1.4 | 68.3 ± 10.5 | 86.7 ± 15.0 |
| Matthews correlation coefficient (MCC) | 80.5 ± 4.4 | 90.6 ± 1.4 | 69.8 ± 10.3 | 87.8 ± 13.5 |

Table 5.6 Supervised machine learning result - SVM with 3rd order polynomial kernel (mean ± SD %)

| Metric | Training, no LASSO | Training, with LASSO | Testing, no LASSO | Testing, with LASSO |
|---|---|---|---|---|
| Accuracy (ACC) | 91.5 ± 3.1 | 94.6 ± 8.1 | 79.1 ± 7.6 | 89.7 ± 11.4 |
| True Positive Rate (TPR) | 96.4 ± 2.4 | 99.1 ± 1.0 | 85.3 ± 10.8 | 96.7 ± 5.4 |
| True Negative Rate (TNR) | 86.5 ± 4.0 | 90.0 ± 16.1 | 72.6 ± 14.3 | 83.1 ± 22.9 |
| Positive Predictive Value (PPV) | 87.9 ± 3.5 | 92.3 ± 9.9 | 76.9 ± 8.3 | 87.6 ± 13.8 |
| Negative Predictive Value (NPV) | 95.9 ± 2.7 | 98.9 ± 1.2 | 84.1 ± 9.9 | 96.9 ± 5.0 |
| F1 score | 91.9 ± 2.9 | 95.3 ± 6.1 | 80.3 ± 6.9 | 91.1 ± 8.2 |
| Cohen's kappa | 82.9 ± 6.3 | 89.2 ± 16.2 | 58.0 ± 15.2 | 79.7 ± 21.7 |
| Matthews correlation coefficient (MCC) | 83.3 ± 6.2 | 90.1 ± 13.7 | 59.4 ± 14.6 | 81.8 ± 17.9 |

Table 5.7 Supervised machine learning result - SVM with RBF kernel (mean ± SD %)

| Metric | Training, no LASSO | Training, with LASSO | Testing, no LASSO | Testing, with LASSO |
|---|---|---|---|---|
| Accuracy (ACC) | 90.4 ± 6.2 | 95.6 ± 1.9 | 75.3 ± 12.4 | 88.7 ± 7.9 |
| True Positive Rate (TPR) | 96.4 ± 2.9 | 98.8 ± 1.0 | 77.5 ± 16.6 | 91.0 ± 10.3 |
| True Negative Rate (TNR) | 84.3 ± 10.2 | 92.4 ± 3.0 | 72.9 ± 17.3 | 86.1 ± 13.1 |
| Positive Predictive Value (PPV) | 86.7 ± 7.9 | 93.0 ± 2.6 | 75.6 ± 13.8 | 88.3 ± 10.4 |
| Negative Predictive Value (NPV) | 95.7 ± 3.7 | 98.6 ± 1.2 | 77.6 ± 14.6 | 91.3 ± 8.9 |
| F1 score | 91.2 ± 5.4 | 95.8 ± 1.7 | 75.7 ± 12.5 | 89.1 ± 7.9 |
| Cohen's kappa | 80.8 ± 12.3 | 91.2 ± 3.7 | 50.5 ± 24.8 | 77.3 ± 15.9 |
| Matthews correlation coefficient (MCC) | 81.5 ± 11.7 | 91.4 ± 3.6 | 51.8 ± 25.0 | 78.3 ± 15.4 |


Table 5.8 Machine learning result against the non-age-matched dataset (%)

| Metric | Linear, all | Linear, LASSO | Polynomial, all | Polynomial, LASSO | RBF, all | RBF, LASSO |
|---|---|---|---|---|---|---|
| Accuracy (ACC) | 83.5 | 82.6 | 80.2 | 81.4 | 65.7 | 81.0 |
| True Positive Rate (TPR) | 87.7 | 83.9 | 82.5 | 82.9 | 66.8 | 82.9 |
| True Negative Rate (TNR) | 54.8 | 74.2 | 64.5 | 71.0 | 58.1 | 67.7 |
| Positive Predictive Value (PPV) | 93.0 | 95.7 | 94.1 | 95.1 | 91.6 | 94.6 |
| Negative Predictive Value (NPV) | 39.5 | 40.4 | 35.1 | 37.9 | 20.5 | 36.8 |
| F1 score | 90.2 | 89.4 | 87.9 | 88.6 | 77.3 | 88.4 |
| Cohen's kappa | 36.5 | 42.8 | 34.6 | 39.3 | 13.9 | 37.3 |
| Matthews correlation coefficient (MCC) | 37.2 | 45.7 | 37.0 | 42.2 | 17.3 | 39.9 |

5.5. Discussion

As a result, we found significant correlations with the clinical interview tools in GTCCs 1, 3, and 12 and MFCCs 1, 3, 4, 7, and 12. The signs of the Pearson's rho differed: negative correlations were observed for HAMD and positive correlations for MMSE. This suggests that these features were important for both depression and dementia, and also important for differentiating depression from dementia. Another point to note is that the highest absolute correlation value with significance (p < 0.05) was 0.346 for HAMD and 0.400 for MMSE, suggesting a weak to moderate correlation between the audio features and the clinical rating scores.

The corrected t-test between these features in Figure 5 showed statistical differences only in certain features. Interestingly, the standard deviation of a rather high-order MFCC coefficient showed a significant difference; normally, most of the information is represented in the lower-order coefficients, and their distributions are important for speech analysis.

Statistical comparison of the acoustic features between the two groups found significant differences in both temporal and spectral acoustic features. No significant difference between the two groups was found in pitch and energy, both from the family of temporal features. Although the result of the unsupervised clustering algorithm was not satisfactory, both the accuracy and the inter-rater agreement show that the performance was better than chance, indicating underlying patterns in the data. In the second part of the machine learning experiment, feature selection was performed using the LASSO algorithm. Here, both the pitch and signal energy features were rejected, alongside some other spectral features. Considering that pitch and signal energy also showed no statistical significance in the t-test, it can be inferred that these features do not contribute to the classification of depression and dementia. In contrast, GTCCs 4–14 and MFCCs 4–14 showed statistically significant differences and were also selected by the LASSO algorithm. GTCCs and MFCCs are similar features, related to the tones of human speech. Although GTCC was not developed for speech analysis, both are commonly used in speech recognition systems [121] [122]. This finding is consistent with the fact that a person's speech characteristics might be related to their mental health [85].

Surprisingly, the best result among the SVMs was obtained with the linear kernel, although the scores were only slightly superior to those of the nonlinear SVMs. Additionally, the effectiveness of the LASSO algorithm for feature selection was evaluated, and an interesting result was found: in the second phase, all the SVM models benefited from LASSO feature selection, but in the third phase, the nonlinear SVMs benefited the most from it. This might be explained by the nature of the LASSO algorithm. As LASSO regression is a penalized linear regression, and the feature selection step essentially discards features that contribute nothing to the LASSO regression, the linear SVM may behave similarly, making the selection partly redundant in its case.

Nevertheless, high accuracy and inter-rater agreement were obtained from the models in both machine learning phases. For comparison, studies [123] [124] [125] [50] and [126] report accuracies of 87.2%, 81%, 81.23%, 89.71%, and 73% for predicting depression, respectively. [127] reports 73.6% accuracy for predicting dementia, and [128] reports a 99.9% TNR and a 78.8% TPR. However, most of these studies compared healthy subjects against symptomatic patients, while our study compared patients afflicted with different mental disorders. Additionally, most conventional studies measure depression by questionnaire and not by clinical examination, so this cannot be said to be a fair comparison. The low NPV and inter-rater agreement scores in the third phase may be due to the fact that the third-phase evaluation used a heavily imbalanced dataset with a larger number of samples than the training phase. These results suggest the possibility of using audio features for automatic pseudodementia screening.

5.6. Conclusions

We recorded the audio of the clinical interview sessions of depression patients and dementia patients in a clinical setting using an array microphone. Statistical analysis showed significant differences in audio features between the depression patients and the dementia patients. A machine learning model was constructed and evaluated; considerable performance was recorded for distinguishing depression patients from dementia patients. Feature contribution analysis revealed the MFCC and GTCC features to be the highest contributing features, with the 9th and 4th MFCC coefficients contributing the most. Based on our findings, we conclude that automated pseudodementia screening with machine learning is feasible.


Chapter 6

Multimodal feature analysis in

depression patients and dementia

patients

6.1. Introduction

This chapter describes the feature analysis of both visual and acoustic features from depression patients and dementia patients. In section 6.2, the data acquisition protocol is described. In section 6.3, the analysis procedure is described; the results are presented and discussed in section 6.4. The chapter is then concluded in section 6.5.

6.2. Data acquisition

Similar to chapters 3 and 5, the data utilized in this chapter are from the PROMPT database. The PROMPT database's audio recordings were obtained by recording a patient's clinical interview session with a therapist. The recording apparatus was a Beyerdynamic Classis RM30W microphone (Beyerdynamic GmbH & Co. KG) with a 16 kHz sampling rate. The visual recordings were obtained using the following devices: RealSense R200 (Intel Corporation) and Microsoft Kinect for Windows v2 (Microsoft Corporation), both with a frame rate of 30 frames per second (FPS). A data screening in consideration of the possible effect of the age and gender distribution on the severity of the psychiatric disorders was performed. To maintain uniformity with chapters 3 and 5, the screening criteria are largely unchanged. The minimum length of the recordings utilized in this chapter was 5 minutes, matching chapter 3. The reason why 5 minutes was selected instead of 10 minutes is that few video recordings are 10 minutes or longer; the shorter threshold increases the number of samples.

After the screening, we successfully obtained data from 174 participants (MDD N = 97,

age mean ± s.d. = 49.39 ± 14.97, female ratio = 54.64%; dementia N = 77, age mean ±

Page 68: Multimodal Feature Extraction for Psychiatric Disorder

58

s.d. = 79.86 ± 8.15, female ratio = 63.64%). The dataset is then split into male subgroup

and female subgroup: Out of the 174 participants, only 46 recordings from female

subjects (MDD N=22, dementia N=24) have multimodal features and from male subjects

there were only 6 recordings (MDD N=1, dementia N=5).

6.3. Analysis

6.3.1. Facial feature analysis

The features are similar to the ones utilized in chapters 3 and 5. As in chapter 3, facial landmark extraction was performed using Omron's OKAO Vision. As preprocessing, outlier removal was performed by applying cubic spline interpolation to frames in which landmark values fall below the 1st percentile or above the 99th percentile. After the preprocessing, feature extraction was performed. The features extracted from the facial landmarks were speed statistics of each landmark and speed statistics of the face center.
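The following sketch illustrates this preprocessing and feature step: frames flagged as percentile outliers are re-estimated with a cubic spline, and per-landmark speed is then summarized. Function names and everything except the 1st/99th percentile thresholds are illustrative assumptions, not the thesis code.

# Illustrative sketch: percentile-based outlier removal via cubic spline
# interpolation, followed by speed statistics of one landmark trajectory.
import numpy as np
from scipy.interpolate import CubicSpline

def clean_track(x):
    """Re-estimate frames outside the 1st-99th percentile of one coordinate."""
    lo, hi = np.percentile(x, [1, 99])
    good = (x >= lo) & (x <= hi)
    t = np.arange(len(x))
    x_clean = x.copy()
    x_clean[~good] = CubicSpline(t[good], x[good])(t[~good])
    return x_clean

def speed_stats(x, y, fps=30.0):
    """Summary statistics of a landmark's per-frame speed (pixels/second)."""
    speed = np.hypot(np.diff(x), np.diff(y)) * fps
    return {"mean": speed.mean(), "median": np.median(speed),
            "std": speed.std()}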

6.3.2. Audio feature analysis

Similar to chapter 5, audio intensity normalization was applied to the recordings as preprocessing, such that the maximum absolute amplitude of the signal is 0.99. Additionally, continuous silence in the form of zero-valued samples at the beginning and end of the recordings was removed. Afterwards, feature extraction was performed with a sliding window of 10 ms with no overlap, and four features were extracted: Mel-frequency cepstral coefficients (MFCC), harmonics-to-noise ratio (HNR), zero-crossing rate (ZCR), and signal power.
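A rough sketch of this preprocessing chain is shown below: peak normalization to 0.99, removal of leading and trailing zero-valued silence, and framing into non-overlapping 10 ms windows. The function name is illustrative; the frame-level feature extractors (MFCC, HNR, ZCR, power) would then operate on each row of the result.

# Sketch of the audio preprocessing: peak-normalize, trim zero-valued
# silence at both ends, and cut into non-overlapping 10 ms frames.
import numpy as np

def preprocess_audio(signal, fs=16000, frame_ms=10):
    signal = 0.99 * signal / np.max(np.abs(signal))   # max |amplitude| = 0.99
    nz = np.flatnonzero(signal)                       # trim leading/trailing zeros
    signal = signal[nz[0]:nz[-1] + 1]
    frame_len = fs * frame_ms // 1000                 # 160 samples at 16 kHz
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)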

The smaller set of features reflects the results of chapter 5: only features that were retained by the LASSO algorithm in chapter 5 are utilized. MFCC was chosen because it corresponds to human speech, while HNR, ZCR, and signal power were chosen to represent the temporal characteristics of the recordings. Finally, the statistical features of MFCC, HNR, and ZCR were computed and used as features: mean, median, standard deviation, kurtosis, and skewness. GTCC was not used because its coefficients largely correlate with the MFCCs.
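As a brief illustration, the statistical summarization step might look like the following, where each frame-level feature track (one MFCC coefficient, HNR, or ZCR over time) is reduced to the five scalars named above; this is a sketch, not the thesis implementation.

# Sketch: reduce one frame-level feature track to the five summary
# statistics used in this chapter.
import numpy as np
from scipy.stats import kurtosis, skew

def summarize(track):
    return {"mean": np.mean(track), "median": np.median(track),
            "std": np.std(track), "kurtosis": kurtosis(track),
            "skewness": skew(track)}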

6.3.3. Statistical analysis

The statistical analysis is similar to that of chapters 3 and 5. For each feature, Pearson's correlation with the clinical assessment result was measured. Then, the Wilcoxon rank-sum test was performed on features extracted from the MDD group against features extracted from the dementia group. The objective of the correlation analysis is to find features that correlate positively with one group's assessment score but negatively with the other's, hence their utility as pseudodementia markers. The Wilcoxon test was performed to identify features with a statistically significant difference between the MDD group and the dementia group. Then, the statistical comparison between audio features and facial features was performed.
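Both tests are available in SciPy; the sketch below shows the pattern on placeholder data (the group sizes mirror the female subgroup in this chapter, but all values are synthetic).

# Sketch of the two tests: Pearson correlation of a feature against a
# clinical score, and a Wilcoxon rank-sum test between the patient groups.
import numpy as np
from scipy.stats import pearsonr, ranksums

rng = np.random.default_rng(1)
feat_mdd = rng.normal(0.0, 1.0, size=22)    # one feature, MDD subgroup
feat_dem = rng.normal(0.5, 1.0, size=24)    # same feature, dementia subgroup
hamd17 = rng.normal(15.0, 5.0, size=22)     # matching clinical scores (synthetic)

r, p_corr = pearsonr(feat_mdd, hamd17)      # feature vs. assessment score
_, p_diff = ranksums(feat_mdd, feat_dem)    # MDD vs. dementia distributions
print(f"Pearson R = {r:.3f} (p = {p_corr:.3f}), rank-sum p = {p_diff:.3f}")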

6.3.4. Machine learning

Various machine learning models were compared to check which perform best for automatic pseudodementia screening. The models utilized were support vector machines with various kernels (linear, polynomial, radial basis function), AdaBoost, random forest, and random under-sampling (RUS) boosted trees. Similar to chapters 3 and 5, LASSO was utilized for feature selection with the same criterion (features chosen at least 10% of the time). Machine learning experiments with multimodal inputs were also conducted. However, since the available male patients' recordings are heavily imbalanced, the multimodal experiment was not performed on male recordings alone. In its place, the performance of machine learning with multimodal features was examined using the combined dataset of both male and female subgroups.
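The "chosen at least 10% of the time" criterion can be read as a stability-selection rule: LASSO is refit on repeated subsamples, and a feature is kept if its coefficient is nonzero in at least 10% of the runs. The sketch below illustrates this reading; the subsample fraction, alpha, and run count are assumptions, not the values used in this study.

# Sketch of the stability-style criterion: keep features whose LASSO
# coefficient is nonzero in at least 10% of repeated subsampled fits.
import numpy as np
from sklearn.linear_model import Lasso

def stable_features(X, y, n_runs=100, alpha=0.05, min_rate=0.10, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
        counts += Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_ != 0
    return np.flatnonzero(counts / n_runs >= min_rate)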

6.4. Results and Discussion

6.4.1. Demographics

The demographics of the screened dataset are shown in Table 6.1. As expected, recordings from female patients greatly outnumber those from male patients. Such imbalance may bias the results toward recordings from female patients. Conversely, the small number of samples in the male subgroup also decreases statistical power, making type II errors likely.


Table 6.1 Number of recordings

                           Female               Male
Recordings             MDD   Dementia       MDD   Dementia
Audio                   57      49           13      14
Face                    24      34            3       8
Both                    22      24            1       5

6.4.2. Statistical analysis

Pearson's correlation with the clinical assessment tools yields interesting results. The facial features of the MDD-female group show a very strong significant correlation (p < 0.05) between the median face-center speed and HAMD17 (R = 0.9998). The MDD-male group shows no significant correlation between facial features and HAMD17. Audio features show significant correlations (p < 0.05) of moderate strength (average |R| = 0.6645) in both the female and male subgroups, albeit in different features. The features with significant correlation in the male subgroup are signal strength, HNR, and MFCCs 1 and 2, while in the female subgroup they are ZCR and MFCCs 1, 2, 3, and 4.

For the dementia group, the facial features of the male subgroup show no significant correlation with MMSE. The facial features of the female subgroup show significant inverse correlations of moderate strength (average |R| = 0.4481) in the kurtosis and skewness of the right pupil, the right eye (top and bottom), the glabella, and the bottom of the nose (left and right points). The audio features of the dementia group show significant correlations (p < 0.05) in both the male and female subgroups. In the male subgroup, the median of HNR (R = 0.6645) and the skewness of MFCC 2 (R = -0.6324) show significant correlations. In the female subgroup, all features show significant correlations: signal power, ZCR, HNR, and MFCCs 1-7 (average |R| = 0.4331). The results of the Pearson correlation analysis imply that dementia and MDD have different feature profiles, especially in the MFCC audio features: the MFCC coefficients tend to correlate positively with MMSE (dementia score) and negatively with HAMD17 (MDD score).

The Wilcoxon rank-sum test between the female dementia group and the female MDD group shows significant differences in all facial and audio features, whilst the male group shows significant differences only in audio features. This implies that audio features might be a better predictor for screening pseudodementia. Nevertheless, the male subgroup contains very few samples, and the result might be biased.

6.4.3. Machine learning

The accuracies of the machine learning models are reported in Table 6.2. The validation method differs between the female and male groups because of the insufficient number of male samples: leave-one-out validation was applied to the male group, and 10-fold cross-validation to the female group. As shown in the results, acceptable classification accuracy was achieved in both groups. For unimodal features, the best performing models are the nonlinear SVMs and random forests, while boosted trees only perform well on the male group with facial features as predictors. Overfitting appears to be a problem here, as the boosted trees (AdaBoost) tend to have lower accuracy than random forest or RUSBoosted trees.
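The two validation schemes can be reproduced with scikit-learn as sketched below; the subgroup sizes echo Table 6.1, but the data and class balance are synthetic placeholders, not the PROMPT recordings.

# Sketch: 10-fold cross-validation for the larger female subgroup and
# leave-one-out validation for the small male subgroup (synthetic data).
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_f, y_f = rng.normal(size=(46, 20)), rng.integers(0, 2, size=46)
X_m, y_m = rng.normal(size=(6, 20)), np.array([0, 0, 1, 1, 1, 1])

clf = SVC(kernel="linear")
acc_f = cross_val_score(clf, X_f, y_f,
                        cv=KFold(n_splits=10, shuffle=True, random_state=0))
acc_m = cross_val_score(clf, X_m, y_m, cv=LeaveOneOut())
print(f"female 10-fold: {acc_f.mean():.3f}, male LOO: {acc_m.mean():.3f}")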

Multimodal features seem to improve accuracy relative to the worst unimodal cases, but models trained on unimodal features are still the best performing overall; compared with the best unimodal cases, multimodal features did not improve accuracy. However, some combinations of features and models did not work well; for example, the male patients' facial features alone did not perform well, yet with multimodal features the performance was generally better. It appears that utilizing multimodal features improves the worst-case performance but not the average performance. One hypothesis is that the audio features and facial features are correlated with each other, so that combining them does not drastically improve machine learning performance.


Table 6.2 Machine learning results (accuracy, %)

                       Female (10-fold)          Male (leave-one-out)   Both (10-fold)
Models               Audio   Face  Audio+Face      Audio   Face          Audio+Face
SVM Linear            84.9   79.3     84.3          92.6   63.6             89.8
SVM Cubic             84.9   79.3     84.2          96.3   63.6             70.5
SVM RBF               85.8   81.0     64.9          92.6   72.7             80.1
AdaBoost              53.8   58.5     70.0          51.9   72.7             69.7
Random Forest         80.2   86.2     73.8          81.5   45.5             62.2
RUSBoosted Trees      73.6   74.1     73.0          70.4   36.4             76.3

6.5. Conclusion

We recorded the clinical interview sessions of MDD patients and dementia patients in a clinical setting and extracted audio and facial features from the recordings. An imbalanced dataset of mostly female patients was obtained. Statistical analysis shows significant correlations of both MDD patient scores and dementia patient scores with audio and facial features. MFCC features show positive correlation with the dementia score and inverse correlation with the MDD score, suggesting their effectiveness for pseudodementia screening. Statistical differences in audio features between depressed patients and dementia patients were also found. Several machine learning models were constructed and evaluated. The resulting performance was considerable, with the worst performance obtained by the male group's models using facial feature predictors and the best by the male group's models using audio feature predictors. Nevertheless, the sample size of the male group is much smaller than that of the female group, so more reliable results can be inferred from the female group. The female group's results were 85.8% accuracy using audio features, 86.2% using facial features, and 84.2% using both.

Based on our findings, we conclude that automated pseudodementia screening with machine learning is feasible. Suggestions for future work include addressing the overfitting problem, as adding predictors increases the risk of overfitting.


Chapter 7

Conclusion

7.1. Summary of this thesis

Diagnosis from a licensed medical practitioner is very important. The current norm is that the patient and the practitioner meet face-to-face in order to perform an examination. However, with advances in communication science and technology, fewer and fewer examinations require the medical practitioner to be physically present near the patient. The practice of diagnosing, or sometimes even treating, a disease without the medical practitioner being physically present is called telehealth or telemedicine.

In telehealth, the medical practitioner communicates with the patient via the internet or other remote meeting platforms; in even rarer cases, the medical practitioner is replaced by human-computer interface software. This is possible because smart devices have become the norm of everyday life, and some of them are equipped with sensors that detect human biomarkers. A person's blood pressure and heart rate, for example, can easily be monitored using a smartwatch. Needless to say, only a few devices and software packages are clinically validated so far, and data from non-validated devices or software may lead to misdiagnosis if used. Clinical validation of such devices is therefore critical: before a device is publicly available for sale, appropriate clinical validation must be performed and a license must be obtained from the appropriate licensing agency (in Japan's case, application must be made to the PMDA: Pharmaceuticals and Medical Devices Agency).

The goals of telemedicine mainly concern accessibility, communication quality, and self-care. Accessibility in normal times refers to rural or isolated communities and to people with limited mobility, time, or transportation options. With the coronavirus (COVID-19) pandemic in 2020, however, another advantage of telehealth was found: accessibility for those in quarantine. During the pandemic, telemedicine became more alluring, both for patients and for medical care providers. At the very least, the use of telemedicine reduces or eliminates the risk of COVID-19 transmission.

Psychiatric screening using telehealth is also important. The social distancing and remote work recommendations from governments across the world affected not only the economy but also the mental health of workers. Public health actions, such as social distancing, can make people feel isolated and lonely and can increase stress and anxiety. Additionally, fear and anxiety about a new disease and what could happen can be overwhelming and cause strong emotions in adults and children. Moreover, factitious disorders and medical mimics exist even in the psychiatric field. This research is one step in the direction of telehealth and engineering for diagnosing psychiatric disorders whose symptoms may mimic each other.

Another type of telemedicine is remote patient monitoring (RPM). RPM allows healthcare providers to monitor patients' health data from afar, usually while the patient is at home. RPM can significantly cut down the time a patient needs to spend in the hospital, instead letting them recover under monitoring at home. It is especially effective for chronic conditions, ranging from heart disease to diabetes to asthma. Technology that allows patients to monitor themselves for these conditions has existed for many years, but today vital health data can be shared with doctors and other healthcare professionals remotely. Cutting-edge equipment can transmit basic medical data to doctors automatically, allowing them to provide a much better level of care and keep an eye out for the earliest signs of trouble. An improvement to the facial tracking algorithm, which is very important for both automatic mental health assessment and remote patient monitoring, was also covered in this dissertation.

7.2. Conclusion

In chapters 3, 5, and 6, analyses of acoustic features, facial features, and the fusion of both from depression patients and dementia patients were performed. Correlation analysis between the features and mental health assessment tools found that acoustic features have statistically significant correlations of moderate strength, whether positive or negative. On the other hand, only two of the facial features chosen in this dissertation have significant correlations with the assessment tools. Speed statistics of facial landmarks might not be suitable for mental health assessment, and emotion recognition might be more beneficial in this case. Nevertheless, the minimum recording length of 5 minutes may also play a role in both the statistical analysis and the machine learning results: less information is available in 5-minute facial recordings than in 10-minute audio recordings, which might result in lower accuracy or weaker correlations.

An analysis of the similarities and differences in acoustic and facial features between depression patients and dementia patients was performed. In contrast with the correlation analysis, several acoustic and facial features were found to differ significantly between the two groups. This implies the possibility of classification using the features chosen in this dissertation.

Machine learning experiments for classifying depression and dementia patients were performed. The machine learning models are relatively simple; nevertheless, satisfactory results of more than 80% accuracy were obtained in all cases. With more data and more complex machine learning algorithms, higher accuracy might be obtainable.

A pose-robust facial landmark tracking algorithm, which is beneficial for both automatic screening and telehealth in general, was also proposed. The algorithm yields low RMSE between the estimated facial landmarks and the ground truth.

7.3. Suggestions for future research

This research is by no means complete. Future research directions include:

1. Usage of a generalized database

The current database is from the PROMPT study, which was conducted in Japan; as a result, all subjects are Japanese. Although some features, such as speech, are said to transcend language and culture, validation is still important. Validation of this research using databases from non-East-Asian countries is recommended.

2. Usage of deep learning algorithms

In this dissertation, all machine learning models are relatively simple. The rationale was that simple machine learning models are often interpretable and therefore beneficial for feature analysis. Now that the characteristics of depression and dementia have been identified, more attention to machine learning performance is recommended. Additionally, deep learning algorithms generally perform better than traditional machine learning algorithms such as the ones utilized in this dissertation.

3. Usage of more modalities or advanced features

The modalities utilized in this dissertation are speech and facial landmarks. These are basic features, and more advanced features might be beneficial for improving performance. Suggested additional modalities include the content of conversation (e.g., via natural language processing), social network analysis, wearable sensors, and internet-of-things sensors for the human body.

4. Consideration of more psychiatric diseases

As stated above, numerous medical conditions frequently encountered in the emergency department can mimic psychiatric disorders, not counting the psychiatric disorders that can mimic one another. This is especially important considering that most automatic diagnosis systems only consider one particular type of disease.


References

[1] World Health Organization, "Mental disorders," [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/mental-disorders. [Accessed June 2020].
[2] Ministry of Health, Labour and Welfare of Japan (厚生労働省), "認知症施策の総合的な推進について(参考資料)" [On the comprehensive promotion of dementia policy measures (reference material)], June 2020. [Online]. Available: https://www.mhlw.go.jp/content/12300000/000519620.pdf. [Accessed July 2020].
[3] E. Ohnuki-Tierney, Illness and Culture in Contemporary Japan: An Anthropological View, New York: Cambridge University Press, 1984.
[4] National Police Agency (警察庁), "令和元年中における自殺の状況" [The state of suicide in 2019], 17 March 2020. [Online]. Available: https://www.npa.go.jp/safetylife/seianki/jisatsu/R02/R01_jisatuno_joukyou.pdf. [Accessed July 2020].
[5] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (5th ed.), Arlington: American Psychiatric Association, 2013.
[6] M. Folstein, S. Folstein and P. McHugh, "'Mini-mental state': A practical method for grading the cognitive state of patients for the clinician," Journal of Psychiatric Research, vol. 12, no. 3, pp. 189-198, 1975.
[7] A. Budson and P. Solomon, "Chapter 2 - Evaluating the Patient with Memory Loss or Dementia," in Memory Loss, Alzheimer's Disease, and Dementia: A Practical Guide for Clinicians, Edinburgh: Elsevier, 2017, pp. 5-38.
[8] C. Hughes, L. Berg, W. Danziger, L. Coben and R. Martin, "A new clinical scale for the staging of dementia," Br J Psychiatry, vol. 140, no. 6, pp. 566-572, 1982.
[9] D. Wechsler, WMS-R: Wechsler Memory Scale-Revised, San Antonio: Psychological Corporation, 1987.
[10] X. Meng, A. Brunet, G. Turecki, A. Liu, C. D'Arcy and J. Caron, "Risk factor modifications and depression incidence: a 4-year longitudinal Canadian cohort of the Montreal Catchment Area Study," BMJ Open, vol. 7, no. 6, 2017.
[11] X. Zhou, B. Bi, L. Zheng, Z. Li, H. Yang, H. Song and Y. Sun, "The Prevalence and Risk Factors for Depression Symptoms in a Rural Chinese Sample Population," PLoS One, vol. 9, no. 6, 2014.
[12] H. Razzak, A. Harbi and S. Ahli, "Depression: Prevalence and Associated Risk Factors in the United Arab Emirates," Oman Med J, vol. 34, no. 4, pp. 274-282, 2019.
[13] M. Shadrina, E. Bondarenko and P. Slominsky, "Genetics Factors in Major Depression Disease," Front Psychiatry, vol. 9, p. 334, 2018.
[14] F. Lohoff, "Overview of the Genetics of Major Depressive Disorder," Curr Psychiatry Rep, vol. 12, no. 6, pp. 539-546, 2010.
[15] M. Hamilton, "A rating scale for depression," J Neurol Neurosurg Psychiatry, vol. 23, no. 1, pp. 56-62, 1960.
[16] S. Gautam, A. Jain, M. Gautam, V. Vahia and S. Grover, "Clinical Practice Guidelines for the management of Depression," Indian J Psychiatry, vol. 59, no. Suppl 1, 2017.
[17] J. Jakobsen, C. Gluud and I. Kirsch, "Should antidepressants be used for major depressive disorder?," BMJ Evidence-Based Medicine, vol. 25, no. 130, 2020.
[18] M. Sado, A. Ninomiya, R. Shikimoto, B. Ikeda, T. Baba, K. Yoshimura and M. Mimura, "The estimated cost of dementia in Japan, the most aged society in the world," PLoS One, vol. 13, no. 11, 2018.
[19] P. Lichtenberg, D. Murman and A. Mellow, Handbook of Dementia: Psychological, Neurological, and Psychiatric Perspectives, John Wiley & Sons, 2004.
[20] L. Kiloh, "Pseudo-dementia," Acta Psychiatrica Scandinavica, vol. 37, no. 4, pp. 336-351, 1961.
[21] A. Burns and D. Jolley, "Pseudodementia: History, mystery and positivity," in Troublesome Disguises: Managing Challenging Disorders in Psychiatry, 2nd ed., Wiley-Blackwell, 2015, pp. 218-230.
[22] B. Pitt and G. Yousef, "Depressive pseudodementia," Current Opinion in Psychiatry, vol. 10, no. 4, pp. 318-321, 1997.
[23] S. Sahin, T. O. Onal, N. Cinar, M. Bozdemir, R. Cubuk and S. Karsidag, "Distinguishing Depressive Pseudodementia from Alzheimer Disease: A Comparative Study of Hippocampal Volumetry and Cognitive Tests," Dementia and Geriatric Cognitive Disorders Extra, vol. 7, no. 2, pp. 230-239, 2017.
[24] H. Kang, F. Zhao, L. You, C. Giorgetta, D. Venkatesh, S. Sarkhel and R. Prakash, "Pseudo-dementia: A neuropsychological review," Ann Indian Acad Neurol, vol. 17, no. 2, pp. 147-154, 2014.
[25] S. Montgomery and M. Asberg, "A New Depression Scale Designed to be Sensitive to Change," The British Journal of Psychiatry, vol. 134, no. 4, pp. 382-389, 1979.
[26] A. Beck, Depression: Causes and Treatment, Philadelphia: University of Pennsylvania Press, 1972.
[27] R. Young, J. Biggs, V. Ziegler and D. Meyer, "A rating scale for mania: reliability, validity and sensitivity," British Journal of Psychiatry, vol. 133, no. 5, pp. 429-435, 1978.
[28] D. Buysse, C. Reynolds, T. Monk, S. Berman and D. Kupfer, "The Pittsburgh sleep quality index: A new instrument for psychiatric practice and research," Psychiatry Research, vol. 28, no. 2, pp. 193-213, 1989.
[29] J. Cummings, M. Mega, K. Gray, S. Rosenberg-Thompson, D. Carusi and J. Gornbein, "The Neuropsychiatric Inventory: comprehensive assessment of psychopathology in dementia," Neurology, vol. 44, no. 12, 1994.
[30] E. Giles, K. Patterson and J. Hodges, "Performance on the Boston Cookie theft picture description task in patients with early dementia of the Alzheimer's type: Missing information," Aphasiology, vol. 10, no. 4, pp. 395-408, 1996.
[31] M. Field, "Telemedicine: A Guide to Assessing Telecommunications in Health Care," J Digit Imaging, vol. 10, no. 28, 1997.
[32] M. Langarizadeh, M. Tabatabaei, K. Tavakol, M. Naghipour, A. Rostami and F. Moghbeli, "Telemental Health Care, an Effective Alternative to Conventional Mental Care: a Systematic Review," Acta Inform Med, vol. 25, no. 4, pp. 240-246, 2017.
[33] G. Diamond, L. Suzanne, K. Bevans, J. Fein, M. Wintersteen, A. Tien and T. Creed, "Development, validation, and utility of internet-based, behavioral health screen for adolescents," Pediatrics, vol. 126, no. 1, 2010.
[34] Z. Adams, E. McClure, K. Gray, C. Danielson, F. Treiber and K. Ruggiero, "Mobile devices for the remote acquisition of physiological and behavioral biomarkers in psychiatric clinical research," Journal of Psychiatric Research, vol. 85, pp. 1-14, 2017.
[35] M. Castiglioni and F. Laudisa, "Toward psychiatry as a 'human' science of mind. The case of depressive disorders in DSM-5," Front Psychol, vol. 5, 2014.
[36] M. Fakhoury, "Artificial Intelligence in Psychiatry," in Frontiers in Psychiatry, Advances in Experimental Medicine and Biology, Singapore: Springer, 2019, pp. 119-125.
[37] P. Doraiswamy, C. Blease and K. Bodner, "Artificial intelligence and the future of psychiatry: Insights from a global physician survey," Artificial Intelligence in Medicine, vol. 102, 2020.
[38] P. Arean, K. Ly and G. Andersson, "Mobile technology for mental health assessment," Dialogues Clin Neurosci, vol. 18, no. 2, pp. 163-169, 2016.
[39] K. Kroenke, R. Spitzer and J. Williams, "The PHQ-9: validity of a brief depression severity measure," Journal of General Internal Medicine, vol. 16, no. 9, pp. 606-613, 2001.
[40] S. Hardt and D. MacFadden, "Computer assisted psychiatric diagnosis: experiments in software design," Comput Biol Med, vol. 17, no. 4, pp. 229-237, 1987.
[41] V. Bagga, K. Kahol and S. Chandra, "Game Design for Pre-screening Patients with Mental Health Complications Using ICT Tools," in International Conference on Ambient Media and Systems, Athens, 2013.
[42] A. Dezfouli, H. Ashtiani, O. Ghattas, R. Nock, P. Dayan and C. Ong, "Disentangled behavioural representations," in Neural Information Processing Systems Conference, Vancouver, 2019.
[43] R. Mandryk, M. Birk, A. Lobel, M. Rooij, I. Granic and V. Abeele, "Games for the Assessment and Treatment of Mental Health," in The ACM SIGCHI Annual Symposium on Computer-Human Interaction in Play, Amsterdam, 2017.
[44] A. Sano, A. Philips, A. Zu, A. McHill, S. Taylor, N. Jaques, C. Czeisler, E. Klerman and R. Picard, "Recognizing academic performance, sleep quality, stress level, and mental health using personality traits, wearable sensors and mobile phones," in 12th International Conference on Wearable and Implantable Body Sensor Networks, Cambridge, 2015.
[45] S. Abdullah and T. Choudhury, "Sensing Technologies for Monitoring Serious Mental Illnesses," IEEE Multimedia, vol. 25, pp. 61-75, 2018.
[46] U. Acharya, S. Oh, Y. Hagiwara, J. Tan, H. Adeli and D. Subha, "Automated EEG-based screening of depression using deep convolutional neural network," Computer Methods and Programs in Biomedicine, vol. 161, pp. 103-113, 2018.
[47] J. Newson and T. Thiagarajan, "EEG Frequency Bands in Psychiatric Disorders: A Review of Resting State Studies," Front Hum Neurosci, vol. 12, p. 521, 2018.
[48] Z. Wan, H. Zhang, J. Huang, H. Zhou, J. Yang and N. Zhong, "Single-Channel EEG-Based Machine Learning Method for Prescreening Major Depressive Disorder," International Journal of Information Technology & Decision Making, vol. 18, no. 5, pp. 1579-1603, 2019.
[49] G. Giannakakis, D. Grigoriadis and M. Tsiknakis, "Detection of stress/anxiety state from EEG features during video watching," in 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Milan, 2015.
[50] R. Nakamura and Y. Mitsukura, "Feature analysis of electroencephalography in patients with depression," in IEEE Life Sciences Conference, Montreal, 2018.
[51] A. Ozdas, R. Shiavi, S. Silverman, M. Silverman and D. Wilkes, "Investigation of vocal jitter and glottal flow spectrum as possible cues for depression and near-term suicidal risk," IEEE Transactions on Biomedical Engineering, vol. 51, no. 9, pp. 1530-1540, 2004.
[52] J. Shah, B. Cahn, S. By, E. Welch, L. Sacolick, M. Yuen, M. Mazurek, C. Wira, A. Leasure, C. Matouk, A. Ward, S. Payabvash, R. Beekman, S. Brown, G. Falcone, K. Gobeske, N. Petersen, A. Jasne, R. Sharma, J. Schindler, L. Sansing, E. Gilmore, G. Sze and Rose, "Portable, Bedside, Low-field Magnetic Resonance Imaging in an Intensive Care Setting for Intracranial Hemorrhage (270)," Neurology, vol. 94, no. 15 Supplement, 2020.
[53] M. Conway and D. O'Connor, "Social Media, Big Data, and Mental Health: Current Advances and Ethical Implications," Curr Opin Psychol, vol. 9, pp. 77-82, 2017.
[54] I. Pantic, "Online Social Networking and Mental Health," Cyberpsychol Behav Soc Netw, vol. 17, no. 10, pp. 652-657, 2014.
[55] G. Park, H. Schwartz, J. Eichstaedt, M. Kern, M. Kosinski, D. Stillwell, L. Ungar and M. Seligman, "Automatic personality assessment through social media language," J Pers Soc Psychol, vol. 108, no. 6, pp. 934-952, 2015.
[56] E. Kross, P. Verduyn, E. Demiralp, J. Park, D. Lee, N. Lin, H. Shablack, J. Jonides and O. Ybarra, "Facebook use predicts declines in subjective well-being in young adults," PLoS One, vol. 8, no. 8, 2013.
[57] J. Jashinsky, S. Burton, C. Hanson, J. West, C. Giraud-Carrier, M. Barnes and T. Argyle, "Tracking suicide risk factors through Twitter in the US," Crisis, vol. 35, no. 1, pp. 51-59, 2014.
[58] L. Li, K. Ota, Z. Zhang and Y. Liu, "Security and Privacy Protection of Social Networks in Big Data Era," Mathematical Problems in Engineering, 2018.
[59] M. Smith, C. Szongott, B. Henne and G. von Voigt, "Big data privacy issues in public social media," in 6th IEEE International Conference on Digital Ecosystems and Technologies, Campione d'Italia, 2012.
[60] J. Neevaleni and M. Devasana, "Alzheimer Disease Prediction using Machine Learning Algorithms," in 6th International Conference on Advanced Computing and Communication Systems, Coimbatore, 2020.
[61] B. Yalamanchili, N. Kota, M. Abbaraju, V. Nadella and S. Alluri, "Real-time Acoustic based Depression Detection using Machine Learning Techniques," in 2020 International Conference on Emerging Trends in Information Technology and Engineering, Vellore, 2020.
[62] L. He, D. Jiang and H. Sahli, "Automatic Depression Analysis Using Dynamic Facial Appearance Descriptor and Dirichlet Process Fisher Encoding," IEEE Transactions on Multimedia, vol. 21, no. 6, 2019.
[63] X. Zhou, P. Huang, H. Liu and S. Niu, "Learning content-adaptive feature pooling for facial depression recognition in videos," Electronics Letters, vol. 55, no. 11, pp. 648-650, 2019.
[64] S. Khatun, B. Morshed and G. Bidelman, "A Single-Channel EEG-Based Approach to Detect Mild Cognitive Impairment via Speech-Evoked Brain Responses," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 5, pp. 1063-1070, 2019.
[65] P. Lodha, A. Talele and K. Degaonkar, "Diagnosis of Alzheimer's Disease Using Machine Learning," in Fourth International Conference on Computing Communication Control and Automation, Pune, 2018.
[66] A. Konig, A. Satt, A. Sorin, R. Hoory, O. Toledo-Ronen, A. Derreumaux, V. Manera, F. Verhey, P. Aalten, P. Robert and R. Davida, "Automatic speech analysis for the assessment of patients with predementia and Alzheimer's disease," Alzheimers Dement (Amst), vol. 1, no. 1, pp. 112-124, 2015.
[67] S. Kloppel, C. Stonnington, C. Chu, B. Draganski, R. Scahill, J. Rohrer, N. Fox, C. Jack Jr., J. Ashburner and R. Frackowiak, "Automatic classification of MR scans in Alzheimer's disease," Brain, vol. 131, no. 3, pp. 681-689, 2008.
[68] T. Kishimoto, A. Takamiya, K. Liang, K. Funaki, T. Fujita, M. Kitazawa, M. Yoshimura, Y. Tazawa, T. Horigome, Y. Eguchi, T. Kikuchi, M. Tomita, S. Bun, J. Murakami, B. Sumali, T. Warnita, A. Kishi, M. Yotsui, H. Toyoshiba, Y. Mitsukura, S. Koichi, Y. Sakakibara and M. Mimura, "The Project for Objective Measures Using Computational Psychiatry Technology (PROMPT): Rationale, Design, and Methodology," Contemporary Clinical Trials Communications, 2020.
[69] P. Wright, J. Stern and M. Phelan, Core Psychiatry, 3rd ed., Elsevier, 2012.
[70] Y. Yacoob and L. Davis, "Recognizing human facial expressions from long image sequences using optical flow," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 6, pp. 636-642, 1996.
[71] K. Anderson and P. McOwan, "A real-time automated system for the recognition of human facial expressions," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 1, pp. 96-105, 2006.
[72] P. Aleksic and A. Katsaggelos, "Automatic facial expression recognition using facial animation parameters and multistream HMMs," IEEE Transactions on Information Forensics and Security, vol. 1, no. 1, pp. 3-11, 2006.
[73] C. Ding and D. Tao, "Robust Face Recognition via Multimodal Deep Face Representation," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2049-2058, 2015.
[74] E. Shishido, S. Ogawa, S. Miyata, M. Yamamoto, T. Inada and N. Ozaki, "Application of eye trackers for understanding mental disorders: Cases for schizophrenia and autism spectrum disorder," Neuropsychopharmacology Reports, vol. 39, no. 2, pp. 72-77, 2019.
[75] Y. Li, Y. Xu, M. Xia, T. Zhang, J. Wang, X. Liu, Y. He and J. Wang, "Eye Movement Indices in the Study of Depressive Disorder," Shanghai Arch Psychiatry, vol. 28, no. 6, pp. 326-334, 2016.
[76] A. Peckham, S. Johnson and J. Tharp, "Eye Tracking of Attention to Emotion in Bipolar I Disorder: Links to Emotion Regulation and Anxiety Comorbidity," Int J Cogn Ther, vol. 9, no. 4, pp. 295-312, 2016.
[77] Carnegie Mellon University, "The CMU Multi-PIE Face Database," [Online]. Available: http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html. [Accessed July 2020].
[78] H. Song, J. Kang and S. Lee, "ConcatNet: A Deep Architecture of Concatenation-Assisted Network for Dense Facial Landmark Alignment," in 25th IEEE International Conference on Image Processing, Athens, 2018.
[79] H. Ouanan, M. Ouanan and B. Aksasse, "Facial landmark localization: Past, present and future," in 4th IEEE International Colloquium on Information Science and Technology, Tangier, 2016.
[80] Y. Liu, A. Jourabloo, W. Ren and X. Liu, "Dense Face Alignment," in IEEE International Conference on Computer Vision Workshops, Venice, 2017.
[81] C. Zhuo, G. Li, X. Lin, D. Jiang, Y. Xu, H. Tian, W. Wang and X. Song, "The rise and fall of MRI studies in major depressive disorder," Translational Psychiatry, vol. 9, no. 335, 2019.
[82] R. Bansal, L. Staib, A. Laine, X. Hao, D. Xu, J. Liu, M. Weissman and B. Peterson, "Anatomical Brain Images Alone Can Accurately Diagnose Chronic Neuropsychiatric Illnesses," PLoS One, vol. 7, no. 12, 2012.
[83] A. Sankar, T. Zhang, B. Gaonkar, J. Doshi, G. Erus, S. Costafreda, L. Marangell, C. Davatzikos and C. Fu, "Diagnostic potential of structural neuroimaging for depression from a multi-ethnic community sample," BJPsych Open, vol. 2, no. 4, pp. 247-254, 2016.
[84] B. Hage, B. Britton, D. Daniels, K. Heilman, S. Porges and A. Halaris, "Low cardiac vagal tone index by heart rate variability differentiates bipolar from major depression," The World Journal of Biological Psychiatry, vol. 20, no. 5, pp. 359-367, 2019.
[85] J. Moriarty, "Recognising and evaluating disordered mental states: a guide for neurologists," Journal of Neurology, Neurosurgery, and Psychiatry, vol. 76, 2005.
[86] D. A. Sauter, F. Eisner, A. J. Calder and S. K. Scott, "Perceptual cues in non-verbal vocal expressions of emotion," Q J Exp Psychol (Hove), vol. 63, no. 11, pp. 2251-2272, 2010.
[87] E. Kraepelin, "Manic depressive insanity and paranoia," J Nerv Ment Dis, vol. 53, no. 4, p. 350, 1921.
[88] D. Low, K. Bentley and S. Ghosh, "Automated assessment of psychiatric disorders using speech: A systematic review," Laryngoscope Investig Otolaryngol, vol. 5, no. 1, pp. 96-116, 2020.
[89] G. Cho, J. Yim, Y. Choi, J. Ko and S. Lee, "Review of Machine Learning Algorithms for Diagnosing Mental Illness," Psychiatry Investig, vol. 16, no. 4, pp. 262-269, 2019.
[90] M. Hearst, S. Dumais, E. Osuna, J. Platt and B. Scholkopf, "Support vector machines," IEEE Intell Syst Appl, vol. 13, pp. 18-28, 1998.
[91] J. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
[92] Y. Freund and R. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[93] T. Ho, "Random decision forests," in Proceedings of 3rd International Conference on Document Analysis and Recognition, Montreal, 1995.
[94] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[95] R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267-288, 1996.
[96] P. Kulkarni and M. Patil, "Clinical Depression Detection in Adolescent by Face," in International Conference on Smart City and Emerging Technology, Mumbai, 2018.
[97] A. Pampouchidou, K. Marias, M. Tsiknakis, P. Simos, F. Yang, G. Lemaitre and F. Meriaudeau, "Video-based depression detection using local Curvelet binary patterns in pairwise orthogonal planes," in 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Orlando, 2016.
[98] A. Pampouchidou, O. Simantiraki, C. Vazakopoulou, C. Chatzaki, M. Pediaditis, A. Maridaki, K. Marias, P. Simos, F. Yang, F. Meriaudeau and M. Tsiknakis, "Facial geometry and speech analysis for depression detection," in 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Seogwipo, 2017.
[99] B. N and H. Rajaguru, "Classification of Dementia Using Harmony Search Optimization Technique," in IEEE Region 10 Humanitarian Technology Conference, Malambe, 2018.
[100] F. Zhu, X. Li, D. Mcgonigle, H. Tang, Z. He, C. Zhang, G. Hung, P. Chiu and W. Zhou, "Analyze Informant-Based Questionnaire for The Early Diagnosis of Senile Dementia Using Deep Learning," IEEE Journal of Translational Engineering in Health and Medicine, vol. 9, 2019.
[101] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013.
[102] S. Ren, X. Cao, Y. Wei and J. Sun, "Face alignment at 3000 FPS via regressing local binary features," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014.
[103] H. Lai, S. Xiao, Y. Pan, Z. Cui, J. Feng, C. Xu, J. Yin and S. Yan, "Deep Recurrent Regression for Facial Landmark Detection," preprint, 2016.
[104] S. Zhu, C. Li, C. Loy and X. Tang, "Face alignment by coarse-to-fine shape searching," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.
[105] X. Xiong and F. De la Torre, "Global supervised descent method," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.
[106] S. Zhu, C. Li, C. Loy and X. Tang, "Unconstrained Face Alignment via Cascaded Compositional Learning," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016.
[107] 一. 川本, "オプティカルフロー駆動型運動モデルによる対象領域追跡" [Target region tracking with an optical-flow-driven motion model], in 画像の認識・理解シンポジウム (Meeting on Image Recognition and Understanding), Hiroshima, 2007.
[108] H. Drucker, C. Burges, L. Kaufman, A. Smola and V. Vapnik, "Support vector regression machines," in 9th International Conference on Neural Information Processing Systems, Denver, 1996.
[109] M. Kostinger, P. Wohlhart, P. Roth and H. Bischof, "Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization," in IEEE International Conference on Computer Vision Workshops, Barcelona, 2011.
[110] G. Chrysos, E. Antonakos, S. Zafeiriou and P. Snape, "Offline Deformable Face Tracking in Arbitrary Videos," in IEEE International Conference on Computer Vision Workshop, Santiago, 2015.
[111] J. Shen, S. Zafeiriou, G. Chrysos, J. Kossaifi, G. Tzimiropoulos and M. Pantic, "The first facial landmark tracking in-the-wild challenge," in IEEE International Conference on Computer Vision Workshop, Santiago, 2015.
[112] G. Tzimiropoulos, "Project-Out Cascaded Regression with an application to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.
[113] K. Mueller, B. Hermann, J. Mecollari and L. Turkstra, "Connected speech and language in mild cognitive impairment and Alzheimer's disease: A review of picture description tasks," J Clin Exp Neuropsychol, vol. 40, pp. 917-939, 2018.
[114] J. Mundt, A. Vogel, D. Feltner and W. Lenderking, "Vocal Acoustic Biomarkers of Depression Severity and Treatment Response," Biol Psychiatry, vol. 72, pp. 580-587, 2012.
[115] J. Darby and H. Hollien, "Vocal and Speech Patterns of Depressive Patients," Folia Phoniatr Logop, vol. 29, pp. 279-291, 1977.
[116] S. Gonzalez and M. Brookes, "PEFAC - A pitch estimation algorithm robust to high levels of noise," IEEE Trans. Audio, Speech Lang. Process., vol. 22, pp. 518-530, 2014.
[117] H. Kim, N. Moreau and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Chichester: John Wiley & Sons, Ltd, 2005.
[118] M. Sahidullah and G. Saha, "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition," Speech Communication, vol. 54, pp. 543-565, 2012.
[119] X. Valero and F. Alias, "Gammatone Cepstral Coefficients: Biologically Inspired Features for Non-Speech Audio Classification," IEEE Trans. Multimedia, vol. 14, pp. 1684-1689, 2012.
[120] G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," 2004. [Online]. Available: http://recherche.ircam.fr/anasyn/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf. [Accessed July 2020].
[121] N. Sugan, N. Sai Srinivas, N. Kar, L. Kumar, M. Nath and A. Kanhe, "Performance Comparison of Different Cepstral Features for Speech Emotion Recognition," in 2018 International CET Conference on Control, Communication, and Computing, Thiruvananthapuram, 2018.
[122] A. Adiga, M. Magimai and C. Seelamantula, "Gammatone wavelet Cepstral Coefficients for robust speech recognition," in 2013 IEEE International Conference of IEEE Region 10, Xi'an, 2013.
[123] M. Stanners, C. Barton, S. Shakib and H. Winefield, "Depression diagnosis and treatment amongst multimorbid patients: a thematic analysis," BMC Fam Pract, vol. 15, p. 124, 2014.
[124] M. Gavrilescu and N. Vizireanu, "Predicting Depression, Anxiety, and Stress Levels from Videos Using the Facial Action Coding System," Sensors, vol. 19, 2019.
[125] L. Wu, J. Pu, J. Allen and P. Pauli, "Recognition of Facial Expressions in Individuals with Elevated Levels of Depressive Symptoms: An Eye-Movement Study," Depression Research and Treatment, 2012.
[126] D. Gerhard, E. Wohleb and R. Duman, "Emerging treatment mechanisms for depression: focus on glutamate and synaptic plasticity," Drug Discovery Today, 2016.
[127] G. Henderson, E. Ifeachor, N. Hudson, C. Goh, N. Outram, S. Wimalaratna, C. Del Percio and F. Vecchio, "Development and assessment of methods for detecting dementia using the human electroencephalogram," IEEE Trans. Biomed. Eng., vol. 53, 2006.
[128] H. Song, W. Du, X. Yu, W. Dong, W. Quan, W. Dang, H. Zhang, J. Tian and T. Zhou, "Automatic depression discrimination on FNIRS by using general linear model and SVM," in 7th International Conference on Biomedical Engineering and Informatics, Dalian, 2014.