
  • 8/10/2019 Evaluation using Regression Analysis

    1/101

Institute for Data Processing, Technische Universität München

Master's thesis

Evaluation of Acceleration Data for Falls Risk Prediction using Classification and Regression Analysis

    Christoph Bachhuber

    October 6, 2014


Christoph Bachhuber. Evaluation of Acceleration Data for Falls Risk Prediction using Classification and Regression Analysis. Master's thesis, Technische Universität München, Munich, Germany, 2014.

Supervised by Prof. Dr.-Ing. K. Diepold and Cristina Soaz, M.Sc.; submitted on October 6, 2014 to the Department of Electrical Engineering and Information Technology of the Technische Universität München.

© 2014 Christoph Bachhuber

Institute for Data Processing, Technische Universität München, 80290 München, Germany, http://www.ldv.ei.tum.de.

This work is licensed under the Creative Commons Attribution 3.0 Germany License. To view a copy of this licence, visit http://creativecommons.org/licenses/by/3.0/de/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California 94105, USA.


Abstract

Falls are a major problem not only for older adults, but also for health care systems, which have to spend enormous amounts of money on hospitalizations and long-term rehabilitation as a result of injurious falls. We aim to assess the risk of falling as easily, accurately and inexpensively as possible, to be able to identify those persons at high risk of functional decline and trigger appropriate clinical interventions. A new method is functional assessment with inertial sensors like the actibelt. We analyzed the predictive value of the data recorded with this device in 171 community-dwelling female seniors with a mean age of 68 years included in the EU-Project VPHOP.

Results showed that acceleration-based parameters provide an improvement in risk prediction over classical features. We could prove 18 variables retrieved by the inertial sensor to be valid or plausible. To identify variables with predictive power, we applied the following feature selection and dimension reduction techniques: sparse Partial Least Squares (sPLS), Elastic Nets (EN) and Principal Components Analysis (PCA). We found that PCA is useless for this kind of data, whereas EN are the most applicable according to our results. From a range of different classification methods, Logistic Regression (LR) and Support Vector Machines (SVM) showed the best performance. LR presented a higher specificity in the classification of fallers/non-fallers, with a True Positive Rate (TPR), or rate of correctly classified fallers, equal to 0.30 and a True Negative Rate (TNR), or rate of correctly classified non-fallers, equal to 0.92. The results of SVM were more sensitive, with TPR = 0.60 and TNR = 0.85. The latter results are more suitable because the consequences of incorrectly classifying a non-faller are less harmful than vice versa. The commonly used accuracy has been shown to be insufficient to describe the performance of classifiers on datasets with strongly differing class sizes. Instead, we suggest the use of Youden's J or the F-score.


    Contents

1. Introduction
   1.1. Motivation
   1.2. Goals
   1.3. Classification and Regression Models
   1.4. Related Work in Falls Risk Prediction
   1.5. Notation and Used Software
   1.6. Contributions
   1.7. Outline

2. Datasets
   2.1. Falls Risk Factors and Falls Risk Prediction
   2.2. Set A: Clinical Data
   2.3. Set B: actibelt Data

3. Preprocessing
   3.1. Filling in Missing Values
   3.2. Standardization Versus Normalization
   3.3. Outliers
   3.4. Handling the Class Imbalance and Overlapping

4. Reducing the Number of Features
   4.1. Principal Component Analysis
   4.2. Partial Least Squares
   4.3. Sparse Partial Least Squares
   4.4. Elastic Nets

5. Classification
   5.1. Logistic Regression
   5.2. k Nearest Neighbors
      5.2.1. Classification Model
      5.2.2. Learning Algorithm
      5.2.3. Choice of the Distance Metric
      5.2.4. The Parameter k
      5.2.5. Comparison and Remarks
   5.3. Support Vector Machines
      5.3.1. Classification Model
      5.3.2. Learning Algorithm
      5.3.3. Parameter Choice
      5.3.4. Comparison and Remarks
   5.4. k-fold Cross Validation
   5.5. Measures for Prediction
      5.5.1. Classification Performance
      5.5.2. Regression Performance
   5.6. Challenges and Difficulties with Classification
      5.6.1. Under- and Overfitting
      5.6.2. Curse of Dimensionality

6. Results
   6.1. kNN
   6.2. Logistic Regression
   6.3. Support Vector Machines
   6.4. Comparison of Feature Selectors via Youden's J Statistic
   6.5. Conventional Metrics
   6.6. Feature Analysis and Comparison to Other Authors
   6.7. Alternative Approaches
      6.7.1. Multiple Fallers
      6.7.2. Balanced Classes

7. Conclusions

Appendix A. Definitions
   A.1. Ellipse Properties
   A.2. Scales and Correlation Coefficients
   A.3. P-Value Correction
   A.4. Boxplots

Appendix B. Dataset Variables
   B.1. Dataset A
   B.2. Dataset B
   B.3. Variables Chosen by Elastic Nets


    1. Introduction

    1.1. Motivation

Falls are common in the elderly: 32% of people aged 75 years or older fall every year [72]. Consequences of falls can be divided into three main groups. The first are physical consequences for the person: 24% of falls involve serious injuries, and 6% cause fractures [72]. Some falls are even lethal [68], be it directly through injuries or indirectly through a radical change to a passive, bedridden lifestyle. Injuries due to falls are the leading cause of death in that stage of life [56]. The second group comprises the psychological impact of falls. Due to fear of falling and possibly decreased physical abilities following an injury, older people narrow their activities and can lose a certain degree of independence. This way, a fall can push them into a cycle of deteriorating mobility, activity and quality of life. The third group are the economic consequences for society as a whole: in the U.S. alone, $20 billion had to be expended [68] on treating fallers in the year 2000.

Overall, falls are socially and economically too important to be ignored. The common approach is to identify people at high risk of falling and to consequently initiate countermeasures like physical training [58] to prevent falls. Other means are reducing risk factors in daily life, like inappropriate medication, unsafe footwear [54] or obstacles in the home [32]. To identify fallers, many different methods have been proposed. Currently used traditional protocols like the Tinetti Balance Assessment Tool [71] employ an expert to judge the performance of the patient. This is subjective and suboptimal in terms of the cost and time needed for the assessment. In addition, these protocols have been shown to offer very limited distinction of fallers from non-fallers [60]. A new, promising approach is the use of inertial sensors during those assessments for analyzing the mobility of subjects.

    1.2. Goals

We aim to predict the risk of falling based on the subjects' functional status, assessed through the performance of several standardized clinical functional tests recorded with a waist-worn accelerometer. We do not have extrinsic variables, like the quality of lighting or loose cables in a person's household, to describe the probability that a dangerous situation occurs. We also investigated the risk of falling in multiple fallers, i.e. persons who fell at least twice in the measured time period.

Figure 1.1.: Timeline of data collection

The rationale behind this approach is that frequent falls are more likely to reflect impairments or diseases [47], which could be detected more easily by inertial sensors.

The available data for our prediction model was retrieved from the EU-Project VPHOP. The process of data collection is depicted in figure 1.1. As shown, it contains four main groups of data. The first are retrospective falls, for which the patients were asked whether they fell at least once in the year before the main assessment. During this assessment, a clinical questionnaire was answered and functional tests were conducted while wearing an inertial sensor. The clinical questionnaire yields the second group, so-called intrinsic variables about age, diseases and others. The third group are the protocol tests, which give functional data. The fourth group, prospective fall data, was retrieved in the post-assessment, which took place one year after the main assessment. Test subjects were asked how often they fell during the last year. These prospective falls are our class label, so we try to predict whether a subject is a faller or non-faller in the year after the main assessment.

1.3. Classification and Regression Models

We employ classification and regression methods for that risk prediction. Both kinds of algorithms have in common the use of one or more predictor variables, typically concatenated into a vector x. They utilize these variables to predict the value of a dependent or response variable y = f(x) using the previously computed model f(·). The main difference lies in the scale and dimension of y: regression algorithms can handle multidimensional, ratio-scaled dependent variables. Scales are extensively explained in appendix A.2. For now, we can think of ratio-scaled variables as the real numbers. An example is predicting a person's ear circumference from his or her age.

In contrast to that, classification algorithms are designed to handle a one-dimensional, nominal-scaled dependent variable y, which is in this case also called the class label. Thus, the response variable is not any real number, but is taken from a set of values which can be compared for (in-)equality, but not for difference or ratio. One case of classification is classifying cancerous cells by their DNA as one of the two classes "malignant" or "benign", called binary classification. This widespread kind of classification is the task we perform in our investigations.

Regression algorithms can also be applied to classification problems. This is done by assigning numbers to the class labels, for example 0 to benign cells and 1 to malignant ones. Since the output of f(x) can be any real number that does not have to coincide with a class label, the class label closest to the number resulting from regression is chosen. Here, this can be done by a simple threshold t = 0.5: all results above or equal to the threshold are classified as malignant, all below as benign.
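This thresholding scheme can be sketched in a few lines. The following is an illustrative Python sketch with invented toy data (the thesis' own computations were done in R); it fits an ordinary least-squares line to the 0/1 labels and cuts at t = 0.5:

```python
# Fit a least-squares line to binary labels (0 = benign, 1 = malignant),
# then classify by thresholding the regression output at t = 0.5.
# The 1-D predictor x and all values below are invented for illustration.

def fit_line(xs, ys):
    """Closed-form simple linear regression: returns slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

x_train = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y_train = [0, 0, 0, 1, 1, 1]          # class labels used as regression targets

slope, intercept = fit_line(x_train, y_train)

def classify(x, t=0.5):
    """Threshold the real-valued regression output to get a class label."""
    return 1 if slope * x + intercept >= t else 0

print([classify(x) for x in [1.5, 5.0, 7.5]])   # → [0, 1, 1]
```

The regression output for the middle point is about 0.59, above the threshold, so it is assigned to class 1 even though f(x) itself is not a valid class label.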

Having introduced variables, which can also be called features or factors, we next need to present samples. Samples are realizations of the variables and come in pairs (x_i, y_i), i = 1, 2, ..., n. In other words, one of the n samples is a binding of the realizations of the dependent variable y and the predictor variables x. So, a generic dataset with n samples and p predictor variables is a matrix of size n × (p + 1), if the dependent variable y is also stored in that matrix. Classifier or regression models are trained on these samples, for which the dependent variable y is known. This is called supervised learning. Subsequently, the models are used to predict the dependent variables or labels of samples (x, ?) with an unknown dependent variable. In contrast to supervised learning, unsupervised learning has only samples (x, ?) with unknown label y available and has to create its own labels based on the distribution of the samples in space. Typically, this is only used for classification tasks. One example of unsupervised learning is k-means [49].
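To make the supervised/unsupervised contrast concrete, here is a minimal k-means sketch (illustrative Python with invented 1-D data, not code from the thesis): no labels y are given, and the algorithm invents cluster assignments purely from the distribution of the samples.

```python
# Minimal k-means on 1-D data: alternate between assigning each sample to
# its nearest center and moving each center to the mean of its members.

def kmeans_1d(xs, centers, iterations=20):
    labels = [0] * len(xs)
    for _ in range(iterations):
        # assignment step: each sample joins its nearest center
        labels = [min(range(len(centers)), key=lambda j: abs(x - centers[j]))
                  for x in xs]
        # update step: each center moves to the mean of its samples
        for j in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

samples = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]      # two obvious groups, no labels
labels, centers = kmeans_1d(samples, centers=[0.0, 6.0])
print(labels)    # → [0, 0, 0, 1, 1, 1]
```

The first three samples end up sharing one self-created label and the last three the other, without any supervision.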

The last thing to be defined here is class imbalance. This describes the scenario where the a-priori probabilities of the classes are not equal, represented in real-world datasets by different numbers of samples in each class. An extreme example is fraud detection (fraud/non-fraud), in which the class of frauds is diminutive compared to the class of non-frauds. Problems like these are among the most challenging in data mining [80]. The datasets we deal with also show notable imbalance, which is why we carefully analyze problems that stem from imbalance and present possible solutions in section 3.4.

    Metrics for this kind of data are displayed in sections 2.2 and 5.5.1.


    1.4. Related Work in Falls Risk Prediction

In most research disciplines, there are papers which review the status quo of research. This is also the case for falls risk assessment: Howcroft et al. [36] reviewed 40 papers about falls risk assessment using inertial sensors, comparing variables, models and results. Oliver et al. [53, 54] criticized falls risk prediction tools for hospital inpatients, because their predictive value was found to be almost negligible. Perell et al. [56] gave an overview of 21 articles, describing methods, variables and results. In comparison to Howcroft, that article focuses more on the use of different class labels, in this context called scales, for falls risk assessment. All three papers focus less on the algorithmic level and are instead result-oriented. In section 6.6, we compare our results to other papers in the field. To be able to do this, we analyze the previous publications here. We detected algorithmic and methodological flaws in some assessment models presented in table 1.1, which we divided into two main categories: limitations in the data collection methodology and limitations in statistical methods. Obviously faulty methods are marked in red in that table.

The first issue in data collection is the class label used, for which we use prospective falls. Others like Giansanti and Gietzelt use the Tinetti or STRATIFY model, respectively. Thus, both of them do not predict actual falls, but the respective scores. They achieve good prediction performance because each test is similar to the traditional model. Their efforts resemble validity tests, comparable to what we do in section 2.3. Follow-up papers should investigate the predictive power for actual falls, because the Tinetti score, for example, is regarded as the gold standard in the field, but it is in fact far from perfect fall prediction [60]. In their prospective study, Raîche et al. [60] performed falls risk prediction using Tinetti's test scores on 225 community-dwelling individuals and achieved TPR = 0.70 and TNR = 0.52. Another possible class label, used by Caby, Greene and Kojima, are retrospective falls. This is close to correct, but not perfect. Previous falls influence the performance of subjects during test protocols [69], either because the patient is influenced by fear of falling or because a physical consequence indicates the previous fall and will ease its classification. Such physical consequences can be severe injuries, which cause a period of reduced activity leading to decreased performance in test protocols. Fear of falling also leads to lower performance in functional assessments [10]. In summary, it is presumably easier to classify retrospective falls than prospective ones.

Another problem in data collection is the class sizes. We know that around 30% of persons of 65 years or older fall every year [21, 39, 55], creating an imbalanced dataset. Caby, König, Marschollek and Weiss use far more balanced datasets. Weiss even uses more fallers than non-fallers, which is particularly interesting since they use only multiple fallers, who account for just 9.3% of all subjects in the study of Lord et al. [48]; in Weiss' paper, multiple fallers even outnumber all other subjects. The authors included a predefined number of healthy non-fallers as "control subjects" after having collected a sample of fallers. These procedures bypass all problems stemming from imbalanced sets,


Author/Method           | Validation    | TPR  | TNR  | Acc  | Remarks
Caby et al. [9]         | 20-fold CV    | 1    | 1    | 1    | only retrospective falls, 20 subjects, inpatients, edited class labels
Doi et al. [19]         | not specified | 0.69 | 0.84 | 0.81 |
Giansanti et al. [27]   | fixed split   | 0.98 | 0.97 | 0.97 | class label: Tinetti score
Greene et al. [29]      | 10-fold CV    | 0.77 | 0.76 | 0.77 | only retrospective 5-year fall history
Gietzelt et al. [28]    | 10-fold CV    | 0.89 | 0.91 | 0.90 | class label: STRATIFY score
König et al. [43]       | not specified | 0.74 | 0.76 | 0.75 |
Kojima et al. [42]      | not specified | 0.61 | 0.68 | 0.62 | only retrospective falls
Marschollek et al. [50] | 10-fold CV    | 0.58 | 0.96 | 0.80 | hospital inpatients
Weiss et al. [78]       | not specified | 0.91 | 0.83 | 0.88 | 41 subjects, fallers are majority, multiple fallers, no p-value correction

Table 1.1.: Best classification results of authors performing FRP with inertial sensors


which enables better prediction performance. The drawback of such methods is that the retrieved models do not reflect reality, which provides more non-fallers than fallers. Greene et al. also have more fallers than non-fallers, but here the reason is that they look at a 5-year falling history. This is a feasible procedure, but the authors trade off prediction accuracy for temporal precision.

The final issue in data collection is manual intervention. One kind of manual intervention is the extensive exclusion of patients. Some authors like Weiss apply multiple exclusion criteria, such as diseases and the type and number of falls. Like the previously presented balanced class sizes, this is thought to improve prediction, but it narrows the generality of the models. The other kind of intervention is present, for example, in Caby's article, in which the authors used the one-year history of falls but manually edited 10% of the class labels to fit an expert's opinion. The effects of this can be severe, since it does not only improve the results of classification, but also affects how the model is built. Such a procedure can therefore extremely improve the results, in particular for so few subjects: Caby et al. recruited 20 subjects for their assessments.

The second family of dangers is of mathematical nature. As can be seen in table 1.1, four authors do not state anything about their validation method, which lets us assume that they used auto-validation. Auto-validation means using the same samples for building a classification model and for testing it. We assume the use of that method because it is the easiest way to use available data without spending thought on the validation method. We checked whether the software used by these four performs cross-validation by default. All of them used MATLAB in the 2008 to 2011 releases, which does not include default cross-validation for the classifiers used. Thus, all signs indicate the use of auto-validation. This method does not reveal overfitting and can therefore extremely overestimate the predictive power of a model. Overfitting is not that severe for linear classifiers like logistic regression, used by Doi and König, since both have a sufficient number of samples compared to their number of variables. Nevertheless, auto-validation still overestimates the performance of the classifier.

The second main mathematical problem is missing p-value correction. When exploring the usefulness of a set of variables, the family-wise error rate has to be taken into account and countered by a suitable control method such as Bonferroni correction. Neither Greene nor Weiss apply this, which is why Greene et al. find 29 significant variables.
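As a reminder of what such a correction does: with m tests, Bonferroni compares each raw p-value against α/m, or equivalently multiplies it by m and caps it at 1. An illustrative Python sketch with invented p-values:

```python
# Bonferroni correction: multiply each p-value by the number of tests m
# and cap at 1; a variable counts as significant only if the adjusted
# p-value is still below alpha.

def bonferroni(p_values):
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.001, 0.004, 0.03, 0.20]      # invented p-values for 4 variables
adj = bonferroni(raw)
significant = [p < 0.05 for p in adj]
print(significant)                     # → [True, True, False, False]
```

Without the correction, three of the four invented variables would look significant at α = 0.05; with it, only two survive.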

A last remark is that Howcroft et al. compare the performances of different authors in their review. They make the careful assumption that intelligent computing methods like neural networks or Bayesian classifiers may be better for falls risk prediction than others [36]. We argue that with so many methodological differences between the papers, such a comparison is not permissible. There are simply too many factors that influence the result, rendering the influence of the classification method insignificant.


    1.5. Notation and Used Software

In this paper, scalars are lower case roman (y), vectors are lower case bold roman (y) and matrices are upper case bold roman (Y). All vectors are column vectors unless otherwise specified. Random variables are depicted in sans serif fonts (y, y and Y). The norm used is, if not specified otherwise, the Euclidean norm. Reports of statistical significance are given in the following form: r(167) = 0.5 with p < 0.01. In this example, the correlation value (r) with 167 degrees of freedom is equal to 0.5. The probability (p) of obtaining the sample values which led to the current correlation value is smaller than 0.01 when assuming uncorrelated variables. Thus, the null hypothesis "the correlation value is equal to zero" can be discarded. For all computations, R 3.0.2 [59] was used.
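Reports of the form r(167) = 0.5, p < 0.01 can be reproduced numerically: Pearson's r on n samples has df = n − 2 degrees of freedom, and significance is assessed via t = r·sqrt(df/(1 − r²)) (R's cor.test reports exactly these quantities). An illustrative Python sketch on invented data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t value for H0: r = 0, with df = n - 2 degrees of freedom."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r))

xs = [1, 2, 3, 4, 5, 6, 7, 8]                    # invented data
ys = [2.1, 1.9, 3.2, 3.8, 5.1, 4.9, 7.2, 6.8]
r = pearson_r(xs, ys)
t = t_statistic(r, len(xs))
# Compare |t| against the critical value of the t distribution with
# df = 6 (two-sided, alpha = 0.01: about 3.71) to decide on H0.
print(round(r, 3), round(t, 2))
```

Here |t| far exceeds the critical value, so the null hypothesis of zero correlation would be rejected at p < 0.01; the resulting report would read r(6) = 0.97 with p < 0.01.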

    1.6. Contributions

We identified three main problems in the datasets which hinder straightforward usage of classification models. The first two are imbalanced class sizes and largely overlapping classes. To quantify the degree of overlapping, we improved the noise measure introduced by Murphy et al. [52]. The third group of problems is caused by the curse of dimensionality, which affects our high-dimensional data. We developed an algorithm which is able to overcome these pitfalls in the present datasets and produce useful falls risk prediction. Class weights for k-nearest neighbors and support vector machines and balancing methods for logistic regression reduce the influence of imbalance; feature selectors like elastic nets allow us to deal with the curse of dimensionality.

Regarding the classification results, we found that accuracy is widely used in the falls risk assessment field although it does not represent the actual prediction performance when using datasets with imbalanced class sizes. Thus, accuracy should be replaced with Youden's J or the F1-score. The best method uses support vector machines, performing classification on features selected by elastic nets. This result is achieved with variables based on data from inertial sensors. These features generally enable a better classification than conventional variables, as also tested in this thesis. We could furthermore confirm that the underlying problem is of generalized linear kind.
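Why accuracy misleads on imbalanced classes, while Youden's J = TPR + TNR − 1 does not, can be seen directly from a confusion matrix. The following Python sketch uses invented counts (chosen only to roughly match the class ratio of our data; these are not actual thesis results):

```python
# Metrics from a binary confusion matrix. On an imbalanced set, accuracy
# can look good even when the minority class (fallers) is mostly missed;
# Youden's J and the F1-score expose that.

def metrics(tp, fn, tn, fp):
    tpr = tp / (tp + fn)                    # sensitivity: found fallers
    tnr = tn / (tn + fp)                    # specificity: found non-fallers
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    youden_j = tpr + tnr - 1
    f1 = 2 * precision * tpr / (precision + tpr)
    return accuracy, youden_j, f1

# Illustrative: 30 fallers vs 141 non-fallers, 18 fallers found correctly.
acc, j, f1 = metrics(tp=18, fn=12, tn=120, fp=21)
print(round(acc, 2), round(j, 2), round(f1, 2))   # → 0.81 0.45 0.52

# The trivial "always non-faller" classifier: tp=0, fp=0, tn=141, fn=30.
trivial_acc = 141 / 171              # ≈ 0.82, despite finding no faller
trivial_j = 0 / 30 + 141 / 141 - 1   # J = 0 reveals the uselessness
print(round(trivial_acc, 2), trivial_j)
```

The trivial classifier even beats the real one on accuracy while being worthless, which is exactly why we report Youden's J and the F1-score instead.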

Finally, we conducted a statistical analysis of the new variables. We effectively proved twelve actibelt variables valid or plausible. Six of them can be called new, since they are not directly related to any conventional test protocols.

    1.7. Outline

The structure of this thesis is oriented by our algorithm for testing different feature selection, balancing and classification methods on different datasets. The algorithm is schematically depicted in figure 1.2 and shall be used to give an outline of this work. At the top of the figure, one of our datasets, described in chapter 2, is put into the algorithm. In that chapter, we also present the test protocol for retrieving the data and analyze class imbalance, class overlapping and the validity of the new variables. The ratio of the class sizes is 1:4.7 in favor of the non-fallers; the degree of overlapping, which can also be called noise, is found to be very high on the set recorded by the inertial sensor. 18 of 54 variables are found to be valid or plausible. To analyze overlapping, we propose an improved measure in section 2.2; the validity tests are done with standard measures described in appendix A.2.

But back to the algorithm: in the preprocessing part, described in chapter 3, missing entries of the dataset are filled in. Next, all variables except the dependent one are standardized. The last step of preprocessing is feature selection. This was necessary due to the multitude of variables and shortage of subjects, to avoid overfitting and phenomena linked to the curse of dimensionality. Three methods for feature selection are applied, presented in chapter 4: Principal Components Analysis, sparse Partial Least Squares and Elastic Nets, of which Elastic Nets prove to be the most appropriate for our task and Principal Components Analysis the least. The model building phase, described in chapter 5, starts after that. It contains k-fold cross validation, the balancing method SMOTE and the learning of classification models. The user can choose from a set of classifiers which are introduced in that chapter. Out of these classifiers, Support Vector Machines with a linear kernel mostly surpass the performance of other algorithms like k-Nearest Neighbors, Logistic Regression and others. The last part of this chapter is the description of some metrics and challenges in classification.
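The chain just described (standardize, select features with elastic nets, train a class-weighted linear SVM, evaluate with stratified 10-fold cross-validation scored by Youden's J) can be sketched as follows. This is an illustrative Python/scikit-learn sketch on synthetic data, not the thesis' implementation (which was written in R); the parameter values (alpha, l1_ratio) and the synthetic dataset are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for dataset B: 171 subjects, 54 variables, ~1:4.7 imbalance.
X, y = make_classification(n_samples=171, n_features=54, n_informative=8,
                           weights=[0.82, 0.18], random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                          # standardize predictors
    ("select", SelectFromModel(ElasticNet(alpha=0.01, l1_ratio=0.5))),
    ("clf", LinearSVC(class_weight="balanced", max_iter=5000)),
])

js = []
for train, test in StratifiedKFold(n_splits=10, shuffle=True,
                                   random_state=0).split(X, y):
    pipe.fit(X[train], y[train])
    pred = pipe.predict(X[test])
    tpr = np.mean(pred[y[test] == 1] == 1)    # correctly found "fallers"
    tnr = np.mean(pred[y[test] == 0] == 0)    # correctly found "non-fallers"
    js.append(tpr + tnr - 1)                  # Youden's J per fold

mean_j = float(np.mean(js))
print(round(mean_j, 2))
```

Note that the selector and classifier sit inside the pipeline, so feature selection is refit on each training fold and never sees the test fold.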

After the model building and testing, the algorithm computes overall metrics like accuracy, true positive rate et cetera to describe the result of the prediction. The results are extensively presented and interpreted in chapter 6, in which we show that the acceleration data is helpful for falls risk prediction and constitutes an improvement over a conventional approach that was also investigated. We wanted to compare our results to similar approaches, but found suboptimal methods in many papers. A thorough discussion is given in section 1.4, the verdict of the comparison in the results section 6.6. In summary, our method has a prediction performance that can at least compete with comparable papers in the field, while overcoming some of their drawbacks.


    Figure 1.2.: Procedure


2. Datasets

This chapter gives an overview of the current state of falls risk prediction and the data we use for that task. The full dataset retrieved from the VPHOP project presented in 1.1 was divided into two variable-wise partially overlapping sets, depicted in figure 2.1. Dataset A contains classical clinical features (intrinsic factors) like age, Body Mass Index (BMI), illnesses and others, plus simple functional parameters like how long the person needed to walk 10 meters. All variables of this set are presented in table B.1. Parameters like these are already widely used for falls risk prediction in different methods like Tinetti's model [71]. The other set, dataset B, contains the mentioned conventional clinical features augmented with the gait data collected from an acceleration sensor called actibelt. The complete presentation of the variables of dataset B is given in table B.2. In figure 2.1, the subjects are stored in the rows of the matrices and the variables in the columns. The figure depicts that acceleration data is given for only 171 of all participants, while simple functional and clinical data is available for all participants. Note that this figure is not to scale.

Both datasets are presented thoroughly in the following sections, after a presentation of the basics of state-of-the-art falls risk prediction. One of the main goals of this work was to analyze whether the new approach using the sensory data can deliver overall better falls risk assessment in comparison to simple gait tests. We considered excluding test subjects based on criteria like their alcohol consumption, certain diseases and more, as for example König et al. did [43]. But we discarded that idea since we do not only conduct Performance Oriented Mobility Assessment (POMA), but want to do falls risk prediction. Exclusion narrows the generality of our results, so we want to avoid it where possible.

    On both datasets, the rate of fallers is at a rst look surprisingly low. Typically, aroundone third of community-dwelling people of age 65 years or older will fall at least onceevery year [21, 39, 55]. On the given clinical dataset A, only 17.44% (45 out of 277)fell during the one-year test period, the reason for this is probably the relatively youngtesting group, which had a mean age of 68.78 years at the beginning of the assessments,with 61 persons being younger than 65 years. This also overrides the fact that bothdatasets contain only women, who have a higher risk of falling [17]. The patients of the combined dataset B make up a part of all patients from dataset A, as in B are onlywomen, who have not only answered the questions about clinical data, but have alsoundergone the functional assessment tests. Out of all patients in this dataset, 17.54%(30 out of 171) are fallers, at a mean age of 68.02 years, with 42 women being youngerthan 65 years. These unbalanced classes led to difficulties with some classiers, therefore

Figure 2.1.: The entire dataset and the subsets A and B

we tried to restore balance using the synthetic minority over-sampling technique (SMOTE) [11]. The results of this data-level algorithm were not as good as dealing with imbalance on an algorithmic level, as kNN and SVM offer. This is presented in section 5 and will be used for the prediction results presented in chapter 6.
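The core of synthetic minority over-sampling, interpolating between a minority sample and one of its k nearest minority neighbours, can be sketched in a few lines of NumPy. This is our own minimal illustration of the data-level idea, not the implementation evaluated in this work:

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :min(k, n - 1)]
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                     # random minority sample ...
        m = nn[j, rng.integers(nn.shape[1])]    # ... and one of its neighbours
        synth[i] = X_min[j] + rng.random() * (X_min[m] - X_min[j])
    return synth

# usage: create 111 synthetic fallers to balance 30 fallers vs. 141 non-fallers
X_fallers = np.random.default_rng(42).normal(size=(30, 5))
X_new = smote(X_fallers, n_new=111)
print(X_new.shape)  # (111, 5)
```

Each synthetic point lies on the segment between two real minority points, so the oversampled set never leaves the region covered by the minority class.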

Initially, the performance of classifiers on both sets was evaluated by splitting the samples of both sets into fixed training and test sets to conduct cross-validation. The partitioning was given by an external trustee; its split size deviates considerably from the usual 2:1 separation [79], which suggests 67% of the samples in the training set and 33% in the test set. Here the provided splitting gave ratios of approximately 1:1.9 for set A and 1:1.7 for set B, in each case such that the test set is the bigger subset. We used the given split for prediction and found that it hindered meaningful prediction. This was not only because of the small training set, but also because some tendencies in training and test set are contradictory: for example, the fallers in the training set were younger than the non-fallers, which was reversed in the test set. So we discarded the given split and decided to use 10-fold cross validation, described in section 5.4, on both sets.
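A generic k-fold split as used for the 10-fold cross validation can be sketched as follows; the function name and the toy usage are our own:

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index arrays for k shuffled, roughly equal folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

# usage: across the 10 folds, every one of the 171 samples is tested exactly once
tested = []
for train, test in k_fold_indices(171, k=10):
    assert set(train).isdisjoint(test)  # no leakage between train and test
    tested.extend(test.tolist())
print(sorted(tested) == list(range(171)))  # True
```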

2.1. Falls Risk Factors and Falls Risk Prediction

The research community in this field tries to make a falls risk prediction based on fall risk factors, also called features or variables. These factors can be many different things, such as the age of the patient [32], whether the examined person suffers from a relevant disease


like impaired vision [32], results from certain physiological tests, or even extrinsic factors [24] that describe the environment of the subject's daily life. From these examples, we see that the variables can be categorical as well as continuous. They can be classified into three main groups, depicted in figure 2.2. The first two groups, intrinsic and functional variables, mainly describe the probability of recovering from a stumble-like situation. Extrinsic factors should represent the probability that a stumble or a loss of balance happens. The general approach in this field is, as depicted by this figure, to use the presence of falls risk factors to predict the risk of falling. This can be done in different ways, for example by a weighted addition of the factors, with the result of that sum determining the final fall risk. The initial goal was to directly apply this approach without any classifier model, using the odds ratios found in former research papers as weights for the sum. But these weights were not comparable, as they originated from strongly differing cohorts. These cohorts differed, for example, in mean age, gender, ethnicity and living circumstances, the latter mainly depending on whether the subjects dwelled in a community or in a retirement home or hospital. Because of these differences, the weighted addition approach was discarded. Instead, we applied the regression and classification methods introduced in chapter 5 to the datasets.
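For illustration, the discarded weighted-addition approach would have looked roughly like this; the factor names and weights below are placeholders, not odds ratios from the literature:

```python
# hypothetical weights per risk factor (placeholders, NOT literature odds ratios)
weights = {"history_of_falls": 2.5, "fear_of_falling": 1.5, "hypertension": 1.25}

def weighted_risk_score(subject, weights, threshold=3.0):
    """Sum the weights of all risk factors present in a subject;
    flag the subject as high-risk if the sum reaches a threshold."""
    score = sum(w for factor, w in weights.items() if subject.get(factor))
    return score, score >= threshold

score, high_risk = weighted_risk_score(
    {"history_of_falls": True, "fear_of_falling": True, "hypertension": False},
    weights)
print(score, high_risk)  # 4.0 True
```

The sketch makes the central weakness visible: the result depends entirely on weights that would have to be comparable across cohorts.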

2.2. Set A: Clinical Data

This dataset contains 17 variables for every one of the 277 women. A list of the variables, including their units, is given in appendix B.1. These so-called features can be retrieved by a questionnaire and a few practical tests. They are, among others, according to the American Geriatrics Society [1]: history of falls in the last year, age, depression, sleep disturbances, the use of an assistive device like a wheeled walker, arthritis and visual deficits. Volpato et al. [76] suggest that the Body Mass Index (BMI) is also a relevant factor, which is why we included BMI. Not all of the factors suggested by the AGS could be used; they are therefore not listed here and are depicted in gray in figure 2.2. This is because they were either not included in the given dataset or did not have a sufficient number of subjects who answered positively. The latter is the case for rheumatoid arthritis [66], from which only two test subjects suffered, and for the use of sedatives, which is according to Tinetti et al. [72] a very strong predictor for falls risk. Five women took sedatives, which is not enough to build a model using this factor. Furthermore, in Tinetti's study, 13 of 14 persons taking sedatives fell. On our dataset, this was one out of five women, which indicates fall patterns differing from the ones in Tinetti's work. Hypertension was also found [25] to be an important risk factor, as well as fear of falling [63]. The last clinical risk factor used here, pain in the ankle or foot, was among others suggested by the WHO [73].

In addition, the AGS mentions gait and balance deficits as fall risk factors. These two are represented in the given dataset by the time a subject needs to walk 10 meters, whether the


Figure 2.2 groups the risk factors: intrinsic factors (muscle weakness, age, BMI, history of falls, arthritis, depression, sleep disturbance, visual deficits, hypertension, rheumatoid arthritis, fear of falling, foot pain, impaired activities, cognition deficits, psychotropic medication), functional factors (gait deficit, balance deficit, mobility deficit, assistive devices, level of activity) and extrinsic factors (house exterior, floors, living alone, lighting, furniture, stairways, kitchen, bathroom, carpets, wires on the floor) feed into an intrinsic, a functional and an external fall risk score, which combine into the total fall risk score.

Figure 2.2.: Factors for falls risk prediction


subject can reach to an object over a certain distance, how difficult lifting and carrying an object is, how easy a woman considers washing herself and bending forward, and how good the self-estimated balance is. Balance can take on ten levels, 10 meaning perfect balance, 0 no balance at all. The previous four variables are ternary; their three levels represent the answers "without difficulty", "with some difficulty" and "unable or able only with help". Of these variables, 14 are discrete, with the majority, nine factors, being binary.

Statistics of the variables are given in table 2.1. Note that, with the exception of the group size, these are only actual variables which are later used for classification. For non-dichotomous variables, the mean and standard deviation are given in the unit previously presented. In the case of dichotomous variables, the number of yes-answers is depicted next to the percentage of these answers among all answers to give better insight. The last column comprises the statistics of the entire dataset A; the first two data columns split the dataset into fallers and non-fallers. Reaching, carrying, washing and bending are encoded for this table as "without difficulty" = 0, "with some difficulty" = 5 and "unable or able only with help" = 10. It can be clearly seen that most variables from this dataset tend to explain the dependent variable "falls" in a comprehensible way. Fallers are, for example, on average older and suffer from more diseases. However, the difference in mean values is mostly small compared to the standard deviation, which is a difficult setting for classification.
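The per-group columns of such a table (mean ± standard deviation for continuous variables, yes-count and percentage for dichotomous ones) can be reproduced along these lines; the tiny arrays are made-up stand-ins for the real variables:

```python
import numpy as np

# made-up stand-ins for the outcome and two variables of dataset A
fell = np.array([1, 0, 0, 1, 0, 0])           # faller = 1, non-faller = 0
age = np.array([71.0, 66.0, 68.0, 73.0, 65.0, 70.0])
hypertension = np.array([1, 0, 0, 1, 0, 1])   # dichotomous yes/no variable

for label, mask in (("Fallers", fell == 1), ("Non-Fallers", fell == 0)):
    # continuous variable: mean and standard deviation
    print(f"{label}: age {age[mask].mean():.2f} +- {age[mask].std(ddof=1):.2f}")
    # dichotomous variable: yes-count and percentage within the group
    yes = int(hypertension[mask].sum())
    print(f"{label}: hypertension {yes} ({100 * yes / mask.sum():.2f}%)")
```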

    Lack of the level of activity

Persons who are physically very active have an increased danger of falling, especially during those activities. Compared to that, physically fit patients with a moderate level of activity have a lower risk of falling. Very inactive persons have again a higher risk because their fitness is degraded, which has negative effects on their postural stability. Falls risk plotted against activity describes a U-shape [30] as in figure 2.3. This is just an illustration of discovered facts approximating the true graph, which is not exactly known; for example, it does not have to be symmetric. Furthermore, no quantitative values are defined for the axes yet. Our dataset does not include the level of physical activity of any subject, which is why we omitted seven fallers who obviously fell while performing an activity that can be connected to a generally high level of activity, for example skiing. We could not take their level of physical activity into account when setting up the classification models, so these samples would act as outliers: they are very likely to have properties suggesting a low fall risk, but they fell nevertheless. Thus, they would distort the risk prediction. After omitting these patients, the results slightly improved.

We also thought of performing the complementary action: discarding subjects who are obviously in bad condition, but did not fall. But we do not have any justification to exclude them apart from the data whose predictive value we want to examine. This procedure would thus prejudge our results, which is why we discarded it.


                          Fallers        Non-Fallers    All
Size of Group (%)         45 (16.25)     232 (83.75)    277 (100)
Age (± SD)                69.22 ± 5.51   68.70 ± 5.1    68.78 ± 5.18
History of Falls (%)      18 (40)        84 (36.21)     102 (36.82)
BMI (± SD)                24.20 ± 4.56   25.41 ± 3.57   25.22 ± 3.77
Depression (%)            9 (20.0)       32 (13.79)     41 (14.80)
Sleep Disturbance (%)     11 (24.44)     71 (30.60)     82 (29.60)
Arthritis (%)             1 (2.22)       8 (3.45)       9 (3.25)
Visual Deficits (%)       2 (4.44)       6 (2.59)       8 (2.89)
Hypertension (%)          19 (42.22)     39 (16.81)     58 (21.94)
Time 10 Meters (± SD)     9.93 ± 2.00    9.83 ± 2.44    9.84 ± 2.37
Balance (± SD)            6.73 ± 2.33    7.12 ± 2.25    7.06 ± 2.27
Fear of Falling (%)       24 (53.33)     90 (38.79)     114 (41.12)
Foot Pain (%)             14 (31.11)     53 (22.84)     67 (24.19)
Assistive Devices (%)     2 (4.33)       10 (4.31)      12 (4.33)
Reaching (± SD)           1.11 ± 2.36    0.71 ± 1.81    0.78 ± 1.91
Carrying (± SD)           3.33 ± 3.54    2.61 ± 3.45    2.73 ± 3.47
Washing (± SD)            0.33 ± 1.26    0.19 ± 0.97    0.22 ± 1.02
Bending (± SD)            1.11 ± 2.10    0.78 ± 1.93    0.83 ± 1.96

Table 2.1.: Statistics of Dataset A

Figure 2.3.: U-shaped graph for activity versus fall risk (fall risk is low at a moderate level of physical activity and high at low and high activity levels)


All other features are expected to have a linear or generalized linear dependency, which should be easy to comprehend. For instance, fall risk increases with ascending age, with the number of related diseases, or with a higher number of previous falls. Generalized linear dependencies are a generalization of strictly linear relations. In a strictly linear dependency, the dependent variable is a linear function of, for example, normally distributed predictor variables, and therefore has to be normally distributed itself. This is not a condition for generalized linear models, in which the distribution of the dependent variable may be any member of the exponential family, for example the Bernoulli distribution.
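A Bernoulli-distributed dependent variable with a logit link is exactly logistic regression; a minimal gradient-ascent sketch on toy data (not the thesis dataset) illustrates the idea:

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=3000):
    """Fit a Bernoulli GLM with logit link (logistic regression)
    by gradient ascent on the mean log-likelihood."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted P(y = 1 | x)
        w += lr * X.T @ (y - p) / len(y)        # log-likelihood gradient
    return w

# toy data: the probability of the event rises with the single predictor
rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1.0 / (1.0 + np.exp(-(2.0 * x - 0.5)))
y = (rng.random(200) < p_true).astype(float)
w = fit_logistic(x.reshape(-1, 1), y)
print(w.shape)  # (2,) -- intercept and slope; the slope estimate is positive
```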

    Test of the FRAX-tool

We tested the FRAX tool [22], which predicts hip fractures and general fractures, but not falls themselves. The tool uses twelve variables, of which a few, like age, are the same features we will later use for falls risk prediction. The expectation was that the prediction would give valuable information, because fractures are very often caused by falls: 90% of hip fractures in elderly women are a consequence of falls [31]. But this relationship is not symmetric, as only around 5% of falls cause a fracture [39]; the percentage of falls resulting in a hip fracture ranges between 1% and 14% [31]. Therefore, it should be feasible to use fracture risk as a very specific, but not very sensitive indicator of fall risk in this age group. The results had a very bad sensitivity and specificity, therefore this idea was discarded. This was a first indication of the high noise on the dataset. Noise means in this context largely overlapping distributions of the distinct classes, as in figure 2.4. But in seventeen dimensions, it is impossible to see directly whether the data is noisy. We investigated two methods to test the noise on the data, presented in the following section.

    Noise measurements

The first variant, suggested by Murphey et al. [52], is to measure the noise level on labeled data. The noise can also be thought of as the degree of overlap of the class distributions. The distributions of two classes are overlapping if there is an area in which samples from both classes lie. Quantifying the noise, respectively the degree of overlap, works as follows: one takes all given samples and calculates pairwise distances. For every sample, the class of its nearest neighbor is determined. If for most samples this class coincides with the class of the examined sample, an accurate classification should be possible. If the number of samples whose nearest neighbor belongs to the opposing class is high, classification is difficult and might produce high error rates. These findings are represented by the ratio ρ = M/N, where M is the number of samples with the nearest neighbor from the opposing class and N the number of all samples in the set. A higher ratio is equivalent to more noise on the data set. The highest noise of sets tested by Murphey et al. was ρ = 23.64%, which they described as a high noise level. For the reader's visual comparison: the noise level of the kNN example data in figure 5.3 is


ρ = 15.00%. The noise level on the clinical data is 26.71%, which can be regarded as very high, in particular when considering the imbalance of the data. Class imbalance is explained in section 3.4. We remember that in set A, the majority class makes up 83.75% of all samples. For a sample of this class, it is very probable that the nearest neighbor is of the same class, just because there are so many. The influence of this aspect increases with further imbalanced sets. Imagine a set with 99% of the samples in the majority class. This noise measure would give a seemingly good ratio of at worst ρ = 2%, in which case all samples of the minority class would lie closest to a sample of the majority class. This small ρ is misleading, because it does not reflect the actual, extremely high noise level.

To counter this drawback, we suggest a weighted noise measure: let N_min be the number of samples in the minority class and N_maj the number of samples in the majority class. For the minority class, compute the number M_min of samples which are closest to a sample of the majority class, and from this the ratio ρ_min = M_min / N_min. Calculate the class weight

    w_min = 1 − N_min / (N_min + N_maj),    (2.1)

which compensates the lower a priori probability by assigning a high weight to the minority class. Perform the analogous procedure for the majority class. The overall, new noise level is now given by the weighted sum of the ratios

    ρ̃ = ρ_min · w_min + ρ_maj · w_maj ∈ [0, 1],    (2.2)

which is easy to interpret: it represents the probability that a new sample, drawn with equal probability from one of the classes, lies next to a sample of the opposing class. It is a much better representation of the noise on the dataset for imbalanced classes, but it also works fine on balanced datasets. This method can easily be extended to datasets with multiple classes. Using this measure, the noise on data set

A is ρ̃ = 78.41%. This result can be properly compared to the noise level of the kNN example data from figure 5.3, which does not change compared to the standard measure proposed by Murphey et al. [52], since the classes are of equal size, so ρ = ρ̃ = 15.00%. Now the extra amount of noise in data set A compared to the kNN example data is striking. This impression will be endorsed by the visualization at the end of this section. A third alternative to the first two metrics is looking directly at the two ratios ρ_min and ρ_maj, which tell the amount of noise, respectively the difficulty of correct classification, for the distinct classes. This gives a more detailed view of the situation, but is not as compact as the other measures.
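Both the plain measure of Murphey et al. and the weighted variant of equations (2.1) and (2.2) follow from a single nearest-neighbour pass. Below is a small self-contained sketch on constructed data, where two minority points sit between majority points on a line:

```python
import numpy as np

def noise_levels(X, y):
    """Plain 1-NN noise ratio rho (Murphey et al.) and the weighted
    variant rho_tilde of equations (2.1) and (2.2)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # exclude the sample itself
    mismatch = y[np.argmin(d, axis=1)] != y   # nearest neighbour has other label
    rho = mismatch.mean()                     # rho = M / N
    rho_tilde = 0.0
    for c in np.unique(y):
        w_c = 1.0 - (y == c).mean()                 # eq. (2.1): class weight
        rho_tilde += w_c * mismatch[y == c].mean()  # eq. (2.2)
    return rho, rho_tilde

# constructed example: 98 majority points on a line, 2 minority points between them
X = np.column_stack([np.concatenate([np.arange(98.0), [20.4, 70.4]]),
                     np.zeros(100)])
y = np.concatenate([np.zeros(98), np.ones(2)])
rho, rho_tilde = noise_levels(X, y)
print(round(rho, 2), round(rho_tilde, 2))  # 0.06 0.98
```

As in the 99% example above, the plain ratio looks harmless while the weighted measure exposes that every minority point lies next to a majority point.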

A potential drawback of this metric, in particular of the second variant, is that it does not have to reflect the situation on the dataset perfectly. Imagine the majority class in a circle and a minority class of four points spread around that circle, very close to its edge. Classification is possible, while the proposed measure is ρ̃ > 0.5, indicating a bad


performance, which it probably is because of the small margins surrounding the decision border. In other words, the proposed metric predicts the accuracy of 1NN when choosing a new sample with uniform probability from the classes, which is certainly an indicator, but not a definite prediction of the final classification performance. To relate this metric to other real-world data, we tested it on the seismic bumps dataset 1, which contains information about the seismic energy of bumps and distinguishes states as hazardous or non-hazardous. The minority class of hazardous samples accounts for 6.1% of all samples, an even more severe imbalance than on our datasets. The metrics are ρ = 11.30% and ρ̃ = 82.27%, indicating that almost all samples of the minority class lie next to a sample of the majority class, comparable to our results on the dataset. The performance of classifiers applied to this set is, as expected, slightly lower than the performance of our tests on our datasets, with a Youden's J of 0.38 on the seismic bumps data, achieved with a classifier called q-ModLEM [64], and up to 0.51 on our data with an SVM classifier. Youden's J is explained in section 5.5.

Finally, we apply multidimensional scaling (MDS) [45], which represents noise visually. It projects the samples onto a low-dimensional space while approximating the distances between the samples as well as possible. For good depiction, we project onto a two-dimensional space. The result is depicted in figure 2.4; the blue cross class is obviously the majority class. For the present dataset, these are the non-fallers, while the red samples represent women who fell. The plot confirms the result of the noise measurement, with both classes having similar and largely overlapping distributions in the two-dimensional space. This plot has to be interpreted carefully, however, because it only approximates the true distances in the 17-dimensional space, not representing them ideally. The sum of the first two eigenvalues of the dissimilarity matrix accounts for only 48.62% of the sum of all eigenvalues; this ratio is called the Mardia fit measure [20]. A higher value indicates a good representation of the actual distances; a value equal to 100% would indicate perfect representation, in which case the samples would lie in a two-dimensional subspace. Having such a comparably low value, the algorithm has omitted more than half of the information during this projection. So despite this plot pointing towards non-separable classes, it could be possible that the classes are well separable in the full 17-dimensional space, which we cannot illustrate on paper. But this conjecture can be precluded because of the result of the first noise test, which confirms the intuition imposed by figure 2.4. t-stochastic neighbor embedding [74], another visualization method, confirmed this but did not give any new insight, which is why we will not pursue it any further.
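Classical MDS and the Mardia fit measure can be sketched via double-centering of the squared distance matrix; this is our own minimal version, not the implementation used for figure 2.4:

```python
import numpy as np

def classical_mds(X, dim=2):
    """Classical MDS: double-center the squared distance matrix and use the
    top eigenvectors; also return the Mardia fit (share of the first `dim`
    eigenvalues in the sum of all positive eigenvalues)."""
    d2 = np.square(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ d2 @ J                          # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    vals, vecs = vals[::-1], vecs[:, ::-1]         # largest eigenvalues first
    coords = vecs[:, :dim] * np.sqrt(np.maximum(vals[:dim], 0.0))
    fit = vals[:dim].sum() / vals[vals > 0].sum()  # Mardia fit measure
    return coords, fit

# 17-dimensional data whose variance lives almost entirely in two dimensions
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 5, 100), rng.normal(0, 3, 100),
                     rng.normal(0, 0.1, (100, 15))])
coords, fit = classical_mds(X)
print(coords.shape, fit > 0.9)  # (100, 2) True
```

On this nearly two-dimensional toy data the fit is close to 1; the 48.62% obtained on dataset A is what signals the loss of information in figure 2.4.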

Coupled with the imbalance between the two classes of fallers and non-fallers, meaningful classification becomes very difficult on this dataset.

1 https://archive.ics.uci.edu/ml/datasets/seismic-bumps - last visit: Jul 10, 2014


Figure 2.4.: MDS map of dataset A

2.3. Set B: actibelt Data

The competing dataset B consists of two parts. The first part comprises 12 clinical variables from set A, without the five simple functional variables time to walk 10 meters, reaching to an object, lifting and carrying an object, the ease of washing oneself and bending forward. The second part comprises the 54 variables retrieved by the actibelt sensor in 11 functional tests, which replace the simple functional variables and are expected to give more detailed information. Altogether, this makes 66 variables, which are depicted in table B.2. They are available for the 171 patients who have taken the functional assessment tests. Obviously, these are far fewer than the 277 subjects from dataset A. One has to keep in mind that having so few samples with so many variables hampers building a proper prediction model [52], which we discuss in section 5.6.2.

    The Device

The actibelt is a small device to capture human motion, developed at the Sylvia Lawry Centre for Multiple Sclerosis Research - The Human Motion Institute [14]. It is installed into a belt buckle and stores acceleration data while the patient completes numerous functional tests. The advantage of this device in comparison to stopwatch tests, clinicians' judgments, standardized scales or sophisticated force plates is that it unifies beneficial properties like very detailed and objective results, ease of use and


Table 2.2.: Comparison of fall risk assessment methods (rows: Stopwatch, Subjective tests, Camera System, actibelt, Force Plate; columns: Time, Details, Objectivity, Cost, Training)

inexpensiveness. A short overview of advantages and disadvantages of a selection of methods is depicted in table 2.2. In this table, "+" stands for good or given, "o" for medium or average and "-" for bad or not given. The rows stand for different fall risk assessment methodologies, the columns are their properties. Let us take a more thorough look at table 2.2. The first column, "Time", describes the time necessary to complete all tests contained in an assessment method; since clinicians' time is precious, the quicker the method is, the better. The next property, "Details", tells how detailed the resulting data is. For example, the outcome of the 10 meter walk test obtained with the stopwatch only provides the time needed to complete the test, whereas the actibelt outcome may provide information about gait asymmetry as well. The different methods are also distinguished by their price: methods using stopwatches and simple scales printed on paper are the cheapest, the actibelt sits in the middle, topped by more complex devices like camera systems and force plates sold for four-digit prices. Next up is "Training": it describes how much training a supervisor needs to conduct proper assessments; less training is comprehensibly desirable. Altogether, it is safe to say that acceleration sensors, in particular single waist-worn sensors, combine the most advantageous properties in comparison to the other assessment methods.

The reason for attaching the device to the test subject's belt near the navel is that at this position, the inertial sensor approximates the body's center of mass, according to Howcroft et al. [36]. Acceleration data from this position gives a good representation of the overall body movement. It is therefore the most common sensor location in inertial sensor studies [36]. Nevertheless, other sensor sites like ankles, knees and elbows have been studied as well [9].

    Test Protocols

The conducted tests can be split up into three different kinds: walk tests (1-4), rise tests (5, 6) and stance tests (7-11). In detail, the 11 tests for retrieving the actibelt data are:


    1. 10 meter walk test: the patient walks 10 meters at self-selected speed.

    2. Repetition of test 1.

3. Distracted 10 meter walk test: 10 meter walking at self-selected speed, while counting backwards from 100 in steps of 7 and stating the numbers out loud.

4. Tandem walk: walking along a 5 meter line at self-selected speed with arms folded across the chest. Tandem walk means walking with one foot directly leading the other.

5. Timed up and go test: rising from a hard-surfaced chair with arm and back supports and a seating height of 42 cm, walking three meters, turning around, walking back and sitting down again.

6. Chair rise test with arms folded across chest: again rising from a hard-surfaced chair with arm and back supports and a seating height of 42 cm. This time there is no walking; the test subject just stands up and sits down again five times, as fast as possible.

7. Romberg stance test: for a maximum of 10 seconds, the patient remains as still as possible with his/her feet together such that the ankles are touching. Patients may move their arms to control balance, with their eyes focusing on a point on the wall at eye height. If the patient loses balance before the 10 seconds are over, the time is recorded.

8. Semi-tandem stance: same protocol as test 7, but now the patient places the heel of one foot alongside the big toe of the other foot before the test starts. The patient may choose which leg to put in front.

9. Tandem stance: same protocol as test 7, with the only difference that the chosen foot is directly in front of the other.

10. One-legged stance on right leg: same protocol as test 7, now the patient is standing solely on the right leg.

11. One-legged stance on left leg: same protocol as test 7, now the patient is standing solely on the left leg.

See the dissertation of Cristina Soaz [65] for more in-depth information about the testing protocol.

    Extracted Variables

While conducting the tests, acceleration values were recorded in all three dimensions at a frequency of 100 Hertz. In the following, the dimensions are given from the test subject's point of view: X is the vertical dimension, Y is left and right of the patient


and Z is the forward/backward direction. From these values, different variables have been computed [65], which are now introduced with units, where applicable. A full table of variables of this dataset is given in appendix B.2. For the walk tests 1 to 3, these are:

Number of steps: algorithms can recognize steps from the acceleration data. After processing the raw data [65], the number of steps which a person needed to complete a test is stored.

Asymmetry in X-, Y- and Z-direction [%]: the pairwise cross-correlation coefficients between the waveforms of two subsequent steps are computed. The average of the coefficients over all steps is the final value. This can be interpreted as the average consistency or similarity of the steps.

Standard deviation of asymmetry: the previously calculated correlation coefficients form a distribution, from which the standard deviation is extracted.

    Mean speed [km/h]: the mean walking speed of a test subject.

Duration [s]: time needed to complete the test, from initiation of movement until the patient stops at the end line, calculated by comparing the body acceleration with an activity threshold.

Step length [m]: the post-processing code calculates the step length by dividing the distance by the number of steps.

Cadence [steps/min]: the number of steps a test subject takes per minute, which can easily be calculated as the number of steps divided by the duration in minutes.
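Given detected step time stamps, the last few variables follow directly; a sketch with made-up step data, while the actual step detection and the duration via an activity threshold are described in [65]:

```python
import numpy as np

def gait_metrics(step_times_s, distance_m):
    """Duration [s], step count, cadence [steps/min], step length [m] and
    mean speed [km/h] from detected step time stamps (a simplification:
    duration is taken from first to last step, not via an activity threshold)."""
    n_steps = len(step_times_s)
    duration = step_times_s[-1] - step_times_s[0]
    cadence = n_steps / (duration / 60.0)
    step_length = distance_m / n_steps
    speed_kmh = distance_m / duration * 3.6
    return duration, n_steps, cadence, step_length, speed_kmh

def step_asymmetry(step_waveforms):
    """Mean cross-correlation coefficient between subsequent step waveforms
    (average similarity of the steps)."""
    cc = [np.corrcoef(a, b)[0, 1]
          for a, b in zip(step_waveforms, step_waveforms[1:])]
    return float(np.mean(cc))

# made-up walk: 17 steps over 10 m, one step every 0.53 s
steps = np.arange(17) * 0.53
duration, n, cadence, length, speed = gait_metrics(steps, 10.0)
print(n, round(length, 3))  # 17 0.588
```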

If interested in the exact formulas and a discussion of the measurement process, the reader is advised to look up [65]. For the tandem walk test 4, three variables for each dimension are analyzed, summing up to nine variables for this test:

Peak-to-peak difference in acceleration in X-, Y- and Z-direction [g]: these variables represent the difference between the highest negative and positive acceleration in one of the three dimensions. The result is given relative to the gravitational acceleration.

Standard deviation of the acceleration in X-, Y- and Z-direction: the standard deviation of the distribution of all acceleration values over the test time.

Balance count in X-, Y- and Z-direction [%]: the percentage of time samples during which the acceleration is above a threshold, with a high value indicating the test subject's reaction to a loss of balance or clumsy movement.


Figure 2.5.: Example data for a one-legged stance [65] (left-right acceleration [g] plotted against backward-forward acceleration [g])

When conducting the two rise tests 5 and 6, the only feature measured is the duration in hundredths of a second. For the five stance tests 7 to 11, the acceleration in the plane spanned by the left-right and backward-forward axes from the perspective of the test subject is plotted, and an ellipse containing 95% of the plotted points is created, as exemplarily depicted in figure 2.5. For this ellipse, the two parameters eccentricity and area [m2/s4] are computed, according to formula A.1 for the eccentricity and formula A.2 for the area in appendix A.1, which are both standard formulas for calculating ellipse properties.

Both variables are comprehensible: a larger eccentricity, the property of a stretched ellipse, means sway in one direction, indicating worse balance skills or a balancing deficiency of the test subject. In contrast, approximately equal sway in every direction represents the normal noise of the actors and sensors of a human body. A greater area is equal to overall higher sway, which has been associated with the falls risk factors history of falling and poor near visual acuity [46]. The units are, as before, ratios of the gravitational acceleration.
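One common way to construct such a 95% ellipse is from the covariance matrix of the two acceleration components with a chi-squared scaling; the thesis' exact formulas are in appendix A.1, so the sketch below is only an approximation under that assumption:

```python
import numpy as np

CHI2_95_2DOF = 5.991  # 95% quantile of the chi-squared distribution with 2 dof

def sway_ellipse(acc_lr, acc_bf):
    """Eccentricity and area of the 95% covariance ellipse of the sway plane
    (a covariance-based approximation of the ellipse described in the text)."""
    cov = np.cov(np.vstack([acc_lr, acc_bf]))
    lam = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, largest first
    a, b = np.sqrt(CHI2_95_2DOF * lam)             # semi-major / semi-minor axis
    eccentricity = np.sqrt(1.0 - (b / a) ** 2)
    area = np.pi * a * b
    return eccentricity, area

# isotropic sway versus sway stretched in the left-right direction
rng = np.random.default_rng(0)
e_iso, area_iso = sway_ellipse(rng.normal(0, 0.02, 2000), rng.normal(0, 0.02, 2000))
e_dir, area_dir = sway_ellipse(rng.normal(0, 0.05, 2000), rng.normal(0, 0.01, 2000))
print(e_iso < e_dir)  # True: directional sway yields a more eccentric ellipse
```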

As for dataset A, we want to give a short presentation of the statistics of chosen variables, in table 2.3. This time, to achieve a shorter representation, we omit variables that give no additional information valuable for understanding. The number in front of the last 13 variables in table 2.3 indicates the test number. For variables with no description behind the variable name, the mean and standard deviation are given. The mean and absolute values in both columns seem to depend on the


condition faller or non-faller only to a small degree, and not always in a comprehensible way. We will find out whether algorithms are able to extract useful information from this data.

    Validity and Plausibility of the actibelt

    DataClinicians want to have a proof for the validity and plausibility of a new tool beforeactively using it. To perform this proof for features retrieved from the actibelt , wetest the relation of those acceleration-based variables to the simple functional datafrom dataset A. By this, we show that the tests go in line about the basic results,leaving the opportunity for more precise results of the new methodology open. If thereare signicant correlations between analog variables from actibelt and simple test,the inertial data can be considered valid. Furthermore, we perform plausibility tests.These demonstrate comprehensible connections between variables of the two sets. Forexample, older subjects should clearly need more time to complete walk or rise tests asthe tests 1 to 6. To show these connections, we compute different correlation values.Cohen et al. [13] dened the following thresholds for interpreting Pearson correlationvalues: a correlation is strong, if the Pearson correlation coefficient r , dened in formulaA.5, is greater than 0.5, moderate if 0.5 r 0.3 , small for 0.3 < r 0.1 andinsubstantial for r < 0.1 . Point biserial correlation, which we use later in this section,can also be interpreted in this way, as well as Kendalls tau and phi correlation. Usingthis segmentation, we dene variables as valid if they correlate strongly with an almostidentical or analog variable. For the plausibility tests, at least a moderate correlationis required. We chose to do this, because analog variables should contain very similardata, leading to a high correlation value, while data related per plausibility does not haveto be so strongly related. We dened the lower threshold to cut off insignicant andsuperuous small correlations that do not give additional information. By employing .3as lower threshold, a signicance level of = 0.05 is well approximated for 171 samples.The different correlation coefficients are necessary because of the different scales of thevariables. 
The scales and correlation coefficients are explained in appendix A.2. We test the significance of correlations with Bonferroni-corrected p-values at the significance level α = 0.05; the correction factor is the number of hypotheses. P-value correction is explained in A.3. Since we test each of the 54 actibelt variables against each of the 17 variables from set A, we set the correction factor to 54 · 17 = 918. In section 6.6, we also test all variables from both sets for significant correlation to prospective falls, leading to an additional 17 + 66 hypotheses, rendering the overall correction factor for this thesis 1001.
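As a concrete illustration of these conventions, the following sketch computes a Pearson coefficient, labels it with Cohen's thresholds, and applies the Bonferroni correction with the correction factor of 1001. The helper names are hypothetical; in practice the raw p-values would come from a statistics library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two ratio-scaled variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cohen_label(r):
    """Cohen et al.'s interpretation thresholds as used in this section."""
    r = abs(r)
    if r > 0.5:
        return "strong"
    if r >= 0.3:
        return "moderate"
    if r >= 0.1:
        return "small"
    return "insubstantial"

def bonferroni(p_raw, n_hypotheses=1001):
    """Bonferroni correction: multiply by the number of hypotheses, cap at 1."""
    return min(1.0, p_raw * n_hypotheses)
```

A corrected p-value below α = 0.05 would then count as significant.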

We start with the validity tests, in particular with the time a test subject needed to walk 10 meters, since this variable is present in both datasets. The time from the actibelt data is not directly measured via stopwatch, but automatically computed on the basis of the acceleration data, which is why it needs to be validated. Two ratio-scaled variables are correlated, therefore the Pearson correlation coefficient is applied. Using the clinical test as ground truth, it can be said that this variable from the first test of


  • 8/10/2019 Evaluation using Regression Analysis


    2. Datasets

                              Fallers          Non-Fallers      All
Size of Group (%)             30 (17.54)       141 (82.46)      171 (100)
Age                           68.67 ± 4.70     67.89 ± 4.62     68.02 ± 4.63
History of Falls (%)          14 (46.67)       60 (42.26)       74 (43.27)
BMI                           24.84 ± 4.12     25.12 ± 3.64     25.07 ± 3.72
Depression (%)                6 (20.0)         21 (14.89)       27 (15.79)
Sleep Disturbance (%)         7 (23.33)        51 (36.17)       58 (33.92)
Arthritis (%)                 0 (0)            7 (4.96)         7 (4.09)
Visual Deficits (%)           2 (6.67)         4 (2.84)         6 (3.51)
Hypertension (%)              12 (40.0)        19 (13.48)       31 (18.13)
1: Number of Steps            16.79 ± 1.58     16.86 ± 2.36     16.85 ± 2.24
1: X-Asymmetry                15.98 ± 5.21     16.78 ± 6.72     16.64 ± 6.47
1: SD of X-Asymmetry          6.74 ± 2.59      7.44 ± 3.06      7.31 ± 2.99
1: Duration                   9.04 ± 1.31      9.06 ± 2.13      9.06 ± 2.01
3: Number of Steps            18.67 ± 1.99     17.96 ± 2.96     18.08 ± 2.82
4: X-Peak to Peak             0.74 ± 0.37      0.79 ± 0.42      0.78 ± 0.41
4: SD of X-Acceleration       0.07 ± 0.02      0.07 ± 0.03      0.07 ± 0.03
4: X-Balance Count            1.50 ± 0.62      1.37 ± 0.45      1.39 ± 0.48
5: Duration                   10.58 ± 1.78     11.13 ± 3.03     11.03 ± 2.86
7: Eccentricity               0.65 ± 0.17      0.68 ± 0.16      0.68 ± 0.16
7: Area                       0.04 ± 0.02      0.05 ± 0.04      0.04 ± 0.03
9: Eccentricity               0.78 ± 0.12      0.77 ± 0.14      0.77 ± 0.14
9: Area                       0.42 ± 0.55      0.22 ± 0.26      0.25 ± 0.33

Table 2.3.: Some statistics of dataset B (mean ± standard deviation, or count (%))




    2.3. Set B : actibelt Data

our test protocol is valid, because the correlation value r(170) = 0.86 between the two, with p < 0.001 after Bonferroni correction, is very large. The same can be stated for the number of steps in the first test, which correlates strongly with the 10 meter time from the clinical data: r(170) = 0.75 with p < 0.001. It also holds for the mean speed (r(170) = −.79 with p < 0.001), the step length (r(170) = −.73 with p < 0.001) and the step cadence (r(170) = −.64 with p < 0.001) of the first test. Negative signs of r indicate a reciprocal relationship: for example, a subject who needs more time to complete the test will have a lower mean speed. The validity naturally holds for the second test, which is just a repetition of the first. The next feature from the clinical dataset for comparison is the nominally scaled self-estimated balance, for which only one significant Kendall's tau value was found, although the actibelt test protocol contains balance tests that should correlate strongly with it. The variable is the area of stance test 7, with a Kendall's tau of −.25 with p < 0.01. The negative correlation is comprehensible, since a subject that rates itself as highly balanced should have a smaller sway area in the balance tests. It is surprising that this appears only weakly for test 7 and for no other balance test, which is not enough for us to regard this variable as validated. The problem lies not in the inertial data, but in imprecise answers of the test subjects, who were obviously not able to classify their balancing skills properly. The most probable reason is an ill-posed question: the subjects should have been given descriptions of different balance levels, instead of rating themselves from 0 to 10 with descriptive references only for the minimum and maximum value.

To support this statement, self-estimated balance was correlated with some previously unused, but proven variables: while conducting the 11 tests according to the protocol, times were also measured in the conventional way with a stopwatch. In tests 5 and 6, these times were simply the timespan from start to end of a test; in the stance tests 7 to 11, times were recorded if the patient did not succeed in standing for 10 seconds and had to abort the test. These variables should be negatively correlated to self-estimated balance, but they are not: not a single significant Pearson correlation was found, which supports the previous assertion. These probably valuable variables are not used for prediction because they have only been recorded for 81, or approximately half, of the test subjects of dataset B. The final simple functional variables from the clinical dataset are reaching, washing and bending. To relate these ternary, ordinal variables correctly to the continuous variables computed from acceleration data, Kendall's tau is again suitable. Using this, no significant correlations after Bonferroni correction were found at the significance level α = 0.05. The reason for this result is probably again the inaccurate self-judgment of the test subjects. Overall, the variables of the 10 meter walk tests 1 to 3, with the exception of the asymmetry features, have been proven valid.

Let us continue with the plausibility tests, since the functional variables are now exhausted. For these tests, the remaining clinical data from set A is used. Starting with age, which can be related using Pearson's correlation, the following significant moderate correlations are found: age and the number of steps in tests 1 and 2 correlate with r(170) = 0.31 and r(170) = 0.34 with p < 0.001, respectively. Furthermore, the sway areas from the tests





7 to 9 and 11 are moderately correlated to the women's ages. Their correlation values range from r(170) = 0.33 to r(170) = 0.35, all of them with a significance of p < 0.001. Bonferroni correction was again applied to all p-values to control the family-wise error rate. Test 10, which consists of standing on the right leg, does not correlate significantly with age. The correlation between test 11, which is analogous to test 10 but uses the left leg, and age is significant, thus we will use test 10 for prediction in table 6.3 as well. Next, BMI correlates moderately (r(170) = −.31) and significantly (p < 0.05) with the step length of test 2. The number of steps in test 2 is just short of significantly correlated; both relations are comprehensible and constitute a strong argument for the usefulness of these variables of test 2. Other variables from dataset B which correlate weakly with age, like the number of steps, are already validated and are therefore skipped in the plausibility check. The remaining variables had insignificant correlations, which is why they are not mentioned here. The final variables from dataset A for testing plausibility are history of falls, the disease variables and the use of an assistive device. These are binary on a nominal scale, so Pearson correlation does not apply to them. For relating nominally and continuously scaled variables, the point-biserial correlation coefficient has been introduced, see appendix A.2. The variable history of falls is assigned the value zero for a person who did not fall within the previous year and the value one for a person who fell at least once. It is negatively correlated to the asymmetry in Y-direction in both the first and the second test. The correlations are moderate with r(170) = −.30 and p < 0.01. Negative correlation makes sense in this case, because the asymmetry values are the result of the auto-correlation of the waveforms of two subsequent steps. A high value suggests very consistent steps that do not differ in their pattern. This is the property of a healthy person, who has a lower risk of falling and therefore has more probably not fallen before. In contrast, if the auto-correlation between two steps is low, the person walks irregularly and may have a history of falling. The point-biserial correlations of the history of falls with other variables from dataset B are negligible. This is interesting, because the asymmetry in X- and Z-direction is also not significantly correlated, leading to the statement that the asymmetry in Y-direction is the best dimension for detecting previous falls. Remember that this dimension is the left/right direction from the perspective of the patient, which renders this correlation even more comprehensible: one would expect unstable walkers to sway mainly to the sides, not up and down or forwards and backwards, unless they suffer from a serious injury or disease at the time of the assessment. The next investigated variables are depression, sleep disturbance, hypertension, fear of falling and pain in the ankle or foot. For all of them, only one plausible and significant correlation could be found: the balance count in Z-direction of the tandem walk test 4, with a correlation value of r(170) = 0.35 and p < 0.001. There are also no consistent correlations for visual deficits and the use of assistive devices.

In summary, gait variables like durations and mean speed have been proven valid; furthermore, the asymmetry in Y-direction and the sway areas of the stance tests are shown to be plausible. No proof or deduction could be made for the other directions, the standard deviation of the asymmetries or any variable of test 4 except the balance





count in Z-direction. The eccentricity variables of the stance tests could not be validated, in contrast to the area variables, which were all validated except the area of test 10. This does not mean that eccentricity is useless; these variables could give additional information which the conventional dataset A does not contain.
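The point-biserial coefficient used for these binary-versus-continuous comparisons can be sketched as follows; it is numerically identical to the Pearson coefficient computed on the 0/1 coding of the binary variable. The function name is hypothetical.

```python
import math

def point_biserial(binary, x):
    """Point-biserial correlation between a 0/1 variable (e.g. history of
    falls) and a continuous variable (e.g. Y-asymmetry).  Uses the
    population standard deviation of x."""
    n = len(x)
    g1 = [v for b, v in zip(binary, x) if b == 1]
    g0 = [v for b, v in zip(binary, x) if b == 0]
    n1, n0 = len(g1), len(g0)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / n)  # population SD
    m1, m0 = sum(g1) / n1, sum(g0) / n0
    return (m1 - m0) / s * math.sqrt(n1 * n0 / n ** 2)
```

A negative value then indicates, as above, that fallers tend to have lower values of the continuous variable.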

The final correlation tests that should be conducted are internal tests on dataset B to determine dependencies within the dataset. Since these variables are all continuous, Pearson correlation can be employed. We did not conduct these tests in order not to further increase the correction factor: for the 54 variables, pairwise tests would have meant 54 · 53 / 2 = 1431 additional hypotheses, making some of the previous results insignificant. We conducted tests on subsets and now give only hints instead of fixed statements. Variables like mean speed, duration, number of steps, step length and step cadence in tests 1 to 3 are partially not only statistically, but even analytically related. We suggest using only analytically independent variables. Otherwise, variance-based dimension reduction methods like PCA and PLS, introduced in chapter 4, are skewed² because the contribution of the common underlying factor of these variables is inflated. The gait variables of tests 1 to 3 are our main point of criticism; we did not find any other problematic correlations within the datasets. We solve this problem by applying one of the feature selection methods from chapter 4, or alternatively by not using all of the variables of tests 1 and 2. For example, we select only duration and number of steps, between which no analytical relation is evident. The subset containing these variables, among others, is depicted in bold font in table B.2.
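The hypothesis count 54 · 53 / 2 = 1431 follows directly from the number of distinct variable pairs; the small sketch below reproduces it and flags entangled pairs (the threshold 0.5 mirrors the "strong" level from Cohen's thresholds, and the input dictionary of pairwise correlations is hypothetical).

```python
def n_pairwise_tests(n_variables):
    """Number of distinct pairs among n variables, i.e. n(n-1)/2."""
    return n_variables * (n_variables - 1) // 2

def strongly_related_pairs(pairwise_r, threshold=0.5):
    """Given a dict mapping (var_a, var_b) -> Pearson r, return the pairs
    whose absolute correlation exceeds the threshold."""
    return sorted((a, b) for (a, b), r in pairwise_r.items() if abs(r) > threshold)
```

For dataset B, `n_pairwise_tests(54)` gives the 1431 additional hypotheses mentioned above.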

    Noise Measurements

As already done for dataset A, we want to conduct noise measurements on dataset B as well. Murphy's ratio of 26.90% suggests that considerable noise is present in the dataset. As can be seen in table 2.3, this dataset is considerably imbalanced in terms of fallers and non-fallers, so we also apply the weighted noise measure, which again gives a much higher noise level of 71.33%. In summary, both datasets are similarly noisy.

To visualize the noise, MDS is applied a second time. The plot in figure 2.6 depicts the situation, confirming the intuition of largely overlapping class distributions. The Mardia measure is 33.36%.

² http://stats.stackexchange.com/questions/50537/ - last visit: Aug 19, 2014





Figure 2.6.: MDS map of dataset B (scatter plot of the two MDS dimensions X and Y)




3. Preprocessing

This chapter reviews methods which have to be applied prior to using classification or feature selection methods on a dataset. With the exception of outlier control, all methods presented in this chapter are used in the final implementation of our algorithm.

    3.1. Filling in Missing Values

On both datasets, a few entries are missing due to errors in the data collection process. These are one in set A (0.27% of all entries) and 53 in set B (4.72%). One could simply exclude the samples or variables with missing values, but we decided against that, because it would require omitting either a sample or a variable, and since we already have a shortage of samples, we want to investigate as many variables as possible. We therefore use low-rank matrix completion. The method approximates the given, incomplete matrix with a low-rank matrix and subsequently replaces the empty cells of the original matrix with the corresponding cells of the low-rank matrix. It assumes normally distributed variables. This is given for all missing entries of set B and for the missing value of set A. The influence of non-normally distributed variables on the correctness of the result is assumed to be negligible.
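One common way to realize such a completion is iterative SVD truncation ("hard impute"): alternate between a rank-r approximation of the matrix and re-imposing the observed entries. The sketch below illustrates the idea; `rank` and `n_iter` are illustrative choices, not the thesis' exact settings.

```python
import numpy as np

def lowrank_complete(X, rank=2, n_iter=300):
    """Fill missing entries (NaN) of X by alternating between a rank-r
    SVD approximation and re-imposing the observed entries."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                       # True where an entry is missing
    col_means = np.nanmean(X, axis=0)
    filled = np.where(mask, col_means, X)    # start from the column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s[rank:] = 0.0                       # truncate to a low-rank matrix
        approx = (U * s) @ Vt
        filled[mask] = approx[mask]          # update only the missing cells
    return filled
```

For an exactly rank-1 example such as [[1, 2, 4], [2, 4, 8], [3, 6, NaN]], the iteration with rank=1 recovers the missing entry as approximately 12, while all observed entries stay untouched.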

3.2. Standardization Versus Normalization

Distance-based classifiers like k-nearest-neighbors and support vector machines from chapter 5 need the input variables to be preprocessed when they have different units, so that the classifier does not erroneously place too much importance on a variable with high variance when the magnitude of the variance is merely caused by the unit of measurement. For example, a variable which contains a distance measured in meters has a much higher variance than another variable which contains the same distance measured in kilometers. In numbers: the variances differ by a factor of 10⁶, rendering the second variable insignificant in the eyes of distance-based classifiers.

Furthermore, standardization is necessary when applying principal component analysis, presented in section 4.1. First, PCA utilizes the directions of maximum variance in the dataset. If one variable has a significantly higher variance than the remaining ones, the first singular vector will point mainly in the direction of this variable, although it is not known how valuable the contribution of this feature is. This argument is equivalent to the distance argument for the classifiers, since the same





variable measured in a smaller unit has a higher variance. Therefore, to create a fair PCA, the variances of the variables are set to one or standardized to a reasonable extent. Second, the mean of each feature has to be zero, because non-zero-mean variables would distort the singular value decomposition, which is part of PCA. The first and largest singular value would be the Euclidean distance from the origin to the overall mean, and the corresponding singular vector would point to this mean value. But the singular vectors of PCA are meant to represent variance, so the mean is set to zero, such that there is no singular vector representing it.

There are many different methods of standardizing a dataset; the most widely used are normalization and standardization. Standardization sets the mean of every variable to zero and the respective standard deviation to one. To this end, we transform every feature x_orig,i into

    x_std,i = (x_orig,i − μ_i) / σ_i,    (3.1)

which readily is the standardized version. μ_i is the mean of x_orig,i and σ_i the corresponding standard deviation. Note that the scalar μ_i is subtracted from the vector x_orig,i, which is in this case a component-wise subtraction. The drawback of this method is that it assumes a normal distribution of the variable, which is certainly not the case for the dichotomous features. For normalization, a similar formula

    x_norm,i = (x_orig,i − min(x_orig,i)) / (max(x_orig,i) − min(x_orig,i))    (3.2)

is used. Normalization projects the variables to [0, 1]. We further apply a subsequent mean-centering: the projection is done by applying formula 3.2, then the mean is subtracted from every variable. The drawback of normalization is that the standard deviations are not equal after processing the data. This distorts the PCA; in our case, the highest and lowest standard deviations of the variables differ by a factor of over 10 after applying normalization to dataset B. This difference cannot be tolerated; therefore, we used only standardization, accepting that we violate the normality assumption for the binary variables. We assume that this has no deteriorating consequences for prediction.
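Formulas 3.1 and 3.2 with the subsequent mean-centering can be sketched in a few lines of plain Python; the population standard deviation is assumed here, and the function names are hypothetical.

```python
def standardize(x):
    """Formula 3.1: transform to zero mean and unit standard deviation."""
    n = len(x)
    mu = sum(x) / n
    sigma = (sum((v - mu) ** 2 for v in x) / n) ** 0.5
    return [(v - mu) / sigma for v in x]

def normalize(x):
    """Formula 3.2 with subsequent mean-centering: project to [0, 1],
    then subtract the mean of the scaled values."""
    lo, hi = min(x), max(x)
    scaled = [(v - lo) / (hi - lo) for v in x]
    m = sum(scaled) / len(scaled)
    return [v - m for v in scaled]
```

For example, `normalize([2.0, 4.0, 6.0])` yields [-0.5, 0.0, 0.5]: the values span an interval of length one and are centered at zero, but their standard deviation is not one.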

    3.3. Outliers

A further problem of normalization can be caused by outliers with extreme values in certain variables. By extreme, we mean several powers of ten bigger or smaller than the typical values of a feature. If normalization or standardization is applied, the data without the outlier lies in a small area, decreasing the discriminative power of that variable. One possibility of handling outliers is winsorizing [61], which clips all values outside, for example, the 95% interval of all values to the borders of that interval.
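A winsorizing step of this kind can be sketched as follows, using simple nearest-rank percentiles; the exact variant described in [61] may differ in how the percentile borders are computed.

```python
def winsorize(x, lower_pct=2.5, upper_pct=97.5):
    """Clip all values outside the [lower_pct, upper_pct] percentile
    interval to the borders of that interval (95% interval by default)."""
    s = sorted(x)
    n = len(s)

    def pct(p):
        # nearest-rank percentile on the sorted data
        k = max(0, min(n - 1, int(round(p / 100.0 * (n - 1)))))
        return s[k]

    lo, hi = pct(lower_pct), pct(upper_pct)
    return [min(max(v, lo), hi) for v in x]
```

An extreme outlier is thus pulled back to the upper border instead of being removed, so the sample count stays unchanged.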





There are normalization methods which utilize the median instead of the mean. This is again useful to suppress a distortion of the centering by heavy outliers. But using the median for normalizing dichotomous variables is not feasible, because it will simply take the value of either the smaller or the larger level. In summary, except for the removal of some outliers in subsection 2.2, no steps were taken to mitigate the influence of heavy outliers, simply because none were detected.

    3.4. Handling the Class Imbalance and Overlapping

The classes of a dataset are unbalanced if they comprise different numbers of samples. In a two-class problem, the class with more samples is called the majority class, the smaller one the minority class. Building a model on such a set works fine as long as the classes are clearly separated by the distributions of their samples. If the classes overlap, meaning that there are areas containing a mixture of samples from both classes in which no proper discrimination can be made, building good classification models becomes very difficult, since models tend to classify everything as being part of the majority class. As seen in chapter 2, the present datasets are unbalanced and contain strongly overlapping classes.

Kotsiantis et al. [44] reviewed different methods of dealing with this kind of dataset, of which the most promising, and therefore extensively discussed, are presented here. First, there are data-level methods like under- and over-sampling. Under-sampling omits samples of the majority class in an intelligent way, whereas over-sampling introduces new samples for the minority class, either by creating synthetic samples as an intelligent combination of existing samples from the smaller class or by cloning the available samples.
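The simplest of these data-level methods, cloning-based over-sampling, can be sketched as follows; the function name is hypothetical, and synthetic-sample methods such as SMOTE would combine neighbouring minority samples instead of cloning them.

```python
import random

def oversample(samples, labels, seed=0):
    """Clone minority-class samples at random until all classes contain
    as many samples as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        clones = group + [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(clones)
        out_labels.extend([y] * target)
    return out_samples, out_labels
```

Applied to a 4-versus-1 toy split, the minority sample is cloned three times, yielding four samples per class.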

There are also approaches on the algorithmic level, including distinct class weights for kNN and different class costs for SVM, both introduced in chapter 5.

Beyond that, a threshold-based method has been developed. Classifiers that compute a score representing the degree to which a sample belongs to a class can take advantage of this method. For example, a wrapper function would normally decide for the class with the higher score. Using the threshold method, the wrapper function can be told to extend the decision area in favor of the minority class. This can also be used for regression methods.
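Such a threshold-based wrapper can be sketched in one line: lowering the threshold below 0.5 extends the decision area in favor of the minority class. The function name is hypothetical, and the scores are assumed to be minority-class probabilities or similar values in [0, 1].

```python
def threshold_decide(scores, threshold=0.5):
    """Assign the minority class (label 1) whenever its score exceeds
    the threshold; threshold < 0.5 favors the minority class."""
    return [1 if s > threshold else 0 for s in scores]
```

With scores [0.2, 0.4, 0.6], the default threshold yields one minority decision, while lowering it to 0.35 yields two.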

One-class learning ignores one class, preferably the majority class, because this class already dominates the decision rules proposed by a non-suitable classifier. On the remaining class, outlier detection, for example, can be applied.

Cost-sensitive learning assigns a misclassification cost to every sample, optimizing for the overall lowest cost. Introducing unconstrained costs gives the user a lot of freedom to fit the algorithm to the circumstances, for example by assigning a high cost to misclassifying a sample of the minority class.

We mainly investigate the sampling methods and the algorithmic-level schemes, since this whole topic is not the central point of this thesis. As sampling method, we use
