
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY INDUSTRIAL ENGINEERING AND MANAGEMENT AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

A Deep Learning Approach to Predicting Diagnosis Code from Electronic Health Records

ELLINOR HÅKANSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master Degree Project in the Field of Technology Industrial Engineering and Management and the Main Field of Study Computer Science and Engineering
Date: November 18, 2018
Supervisor: Johan Gustavsson
Examiner: Viggo Kann
Swedish title: Djupinlärning för prediktion av diagnoskod utifrån elektroniska patientjournaler
School of Electrical Engineering and Computer Science


Abstract

Electronic Health Record (EHR) is an umbrella term encompassing demographics and health information of a patient from many different sources in a digital format. Deep learning has been used on EHRs in many successful studies and there is great potential in future implementations. In this study, diagnosis classification of EHRs with Multi-layer Perceptron (MLP) models is studied. Two MLPs with different architectures are constructed and run on both a modified version of the EHR dataset and the raw data. A Random Forest is used as baseline for comparison. The MLPs are not successful in beating the baseline, with the best-performing MLP having a classification accuracy of 48.1%, which is 13.7 percentage points lower than that of the baseline. The results indicate that when the dataset is small, this approach should not be chosen. However, the dataset is growing over time and thus there is potential for continued research in the future.

Keywords: EHR, diagnosis code, ICD-10, classification, deep learning, Multi-layer Perceptron.


Sammanfattning (Swedish Abstract)

Electronic health record (EHR) is an umbrella term used to describe a digital collection of demographic and medical data about a patient from various sources. There is great potential in applying deep learning to these records, and many successful studies have already been carried out in the field. In this study, diagnosis classification of electronic health records with Multi-layer Perceptron models is examined. Two MLP models with different architectures are presented. These are run both on an adapted version of the EHR dataset and on the raw EHR data. A Random Forest model is used as a baseline for comparison. The MLP models fail to outperform the baseline: the best MLP model yields a classification accuracy of 48.1%, which is 13.7 percentage points lower than that of the baseline. The results indicate that for a small dataset, deep learning should not be the method of choice for this type of problem. However, the dataset grows over time, which makes the area attractive for future studies.

Keywords: EHR, diagnosis code, ICD-10, classification, deep learning, Multi-layer Perceptron.


Acknowledgements

I would like to express my gratitude to Johan Gustavsson, my supervisor at KTH, for his valuable advice and research experience. I have no doubt that this report is a lot better thanks to you.

I would also like to thank Doctrin and my supervisor Sonja Petrovic Lundberg, for allowing me to do this project and for your help. Your enthusiasm for and expertise within machine learning is truly inspiring and your advice is always insightful and direct.

On a more personal note, I would like to thank Elin Lutz for keeping me company during so many of the hours I spent working on this thesis and for cheering me on when I needed it. I'm not sure I could have done it without you.

And to my parents, for their endless support: I am eternally grateful. Thank you for always believing in me.

Contents

1 Introduction
   1.1 Machine Learning in Medicine
   1.2 Deep Learning
   1.3 Deep Predictive Modeling on Electronic Health Records
   1.4 Problem definition
   1.5 Research Question
   1.6 Limitations

2 Background
   2.1 Classification
   2.2 Training and Evaluation
   2.3 Random Forest
   2.4 Artificial Neural Networks
   2.5 Backpropagation
   2.6 Variations of Artificial Neural Networks
      2.6.1 Multi-Layer Perceptrons
      2.6.2 Recurrent Neural Networks
      2.6.3 Convolutional Neural Networks
   2.7 Relevant works
   2.8 Summary

3 Methods
   3.1 Approach
   3.2 Processing of Data
   3.3 Network Architecture
      3.3.1 MLP-a
      3.3.2 MLP-b
   3.4 Training and Evaluation

4 Results
   4.1 Comparison models
   4.2 Performance on Dmin
   4.3 Confusion matrices for MLPs on Dmin
   4.4 Performance on Dalt
   4.5 Confusion matrices for MLPs on Dalt

5 Discussion
   5.1 Ethical Discussion
   5.2 Conclusions

Bibliography

Chapter 1

Introduction

The objective of this master degree project is to examine the possibilities of creating a decision support tool for clinicians, using machine learning methods and a dataset consisting of Electronic Health Records (EHRs) to perform disease prediction. Electronic Health Record is an umbrella term encompassing demographics and health information of a patient from many different sources in a digital format. Examples of sources include laboratory test results, radiology images, patient demographics, progress notes, diagnoses, allergies, immunization dates and medical history. EHRs electronically store patient health information, facilitating the possibility to track patients' health status over time.

In this introduction chapter, some examples of uses of machine learning and its applications in the field of medicine are given, as well as a presentation of the principal along with a problem definition. Finally, the research question of the project is presented and some limitations are discussed.

1.1 Machine Learning in Medicine

Machine learning is commonly defined as a field of computer science in which machines learn by processing data, instead of being explicitly programmed. Due to an upsurge in computational power in the last few years, increasingly complex problems can be solved using machine learning.

It seems the field of medicine has many traits that make it suitable for machine learning implementations [11]. One such trait is the abundance of data, far more than any clinician could ever digest, such as clinical trial results, medical images of skin conditions etc. Machine learning can be used to unveil meaningful patterns in the data that have previously gone unnoticed [18]. Medicine as a field has seen many experimental implementations of machine learning, often with promising results. There are huge gains to be made by using machine learning to make healthcare more accurate, efficient and accessible. A decision support tool that could aid clinicians when diagnosing patients may lead to faster and more accurate care interventions, which would mean both better quality of care for the patient and increased efficiency, avoiding wasted resources.

1.2 Deep Learning

Deep learning is a family of machine learning methods focused on learning data representation, rather than task-specific algorithms. Although deep learning models have the drawback of being computationally demanding to train, such models have the ability to find hidden patterns in high-dimensional data. Therefore, deep learning approaches have resulted in increased prediction accuracy for many of the machine learning applications where they have been implemented [23, 9, 28, 22]. Deep learning models have faced criticism for their "black box" nature, regarding the difficulty for a human to interpret the reasons behind the models' decisions. Suresh et al. [27] showed methods of increasing the interpretability of such models on an implementation that predicted clinical interventions in intensive care units.

In machine learning, the attributes of input data are called features. In traditional methods, preparing the features for usage in the model is a demanding process. The models are highly dependent on the format of and weight put on each feature, meaning it often requires domain expertise to prepare the data. With deep learning, the algorithm itself optimizes the features and the weights during the training of the model. This allows the model to find patterns that might not be obvious even to human domain experts, and is one of the explanations as to why deep learning is superior to traditional methods in many applications.


1.3 Deep Predictive Modeling on Electronic Health Records

Previous research has shown a lot of promising results when applying deep learning on medical data [6], and a lot of these applications make secondary use of EHRs [25]. For instance, the system Doctor AI was developed with a deep learning algorithm to predict future disease diagnosis along with a corresponding suitable medication intervention, based on paired observations of clinical events and time stamps from EHRs. Doctor AI achieved accuracies similar to those of practicing clinicians [7]. Weng et al. [28] experimented with several machine learning models to predict cardiovascular risk. These models' performances are compared to an established approach of cardiovascular risk assessment used in medicine, and all of the methods outperformed the established approach to some degree. The most successful method is a neural network that increased the prediction accuracy by 7.6%.

It is thus of interest to study whether deep learning, due to its notable pattern-finding ability, could be used to attain increased accuracy compared to the provided method for predicting patients' diagnosis class. Research concerning predictive deep learning on EHRs is used as a basis for this project.

1.4 Problem definition

The external principal of this degree project is Doctrin, a company developing an online tool for digital primary care visits. Their system gathers structured information about patients' health status with EHRs, in order to support health care professionals in making more informed decisions about subsequent care.

Doctrin is interested in diagnosis class prediction in order to investigate the possibilities of enabling automated decision support in their systems. An automated decision support system could be beneficial for clinicians in order to prevent human errors that occur due to factors such as stress, fatigue or failure to consider all the information logged in the EHR. Thus, it could lead to improved accuracy and efficiency for the company and improved quality of the medical care they provide.

Doctrin's system entails an application that allows patients to seek medical attention through an online questionnaire. The application is currently in use on several primary care agents' websites and can be used as an alternative to calling the agents. In the application, the patient is prompted to answer questions about their health status and symptoms. When a patient starts the procedure of filling out the questionnaire in Doctrin's application, they are prompted to state their reason for seeking care, choosing from a list of up to 200 possible ailments (depending on how many the primary care agent has chosen to use in the specific implementation of Doctrin's system). The list contains reasons such as cold or flu, acne, headache etc. Based on the reason chosen by the user, the following questionnaire will contain slightly different questions about the health status and symptoms of the patient. Today, the resulting EHR is used by doctors as a basis for giving treatment via an online interface or, when needed, to direct the patient to a physical examination at a health center. The EHRs derived from Doctrin's application are a subset of traditional EHRs, as they contain only patient demographics and health status.

A collection of these EHRs has been given a corresponding diagnosis code by a clinician, based on the health information given by the patient. The EHRs and their diagnosis codes served as the primary data source in this project, in an attempt to perform diagnosis code classification of the EHRs.

The principal has provided a machine learning model for diagnosis prediction to use for comparison. It is based on a traditional machine learning method called Random Forest [10, 3], which operates on a series of decision trees in order to perform classification.

1.5 Research Question

The research question examined in this project is:

"How well does a deep learning approach to predicting diagnosis code for EHRs perform compared to Random Forest?"


1.6 Limitations

A randomized and anonymized dataset containing 60 000 EHRs labelled with one of 38 possible diagnosis codes is provided by the principal.

Approximately 36% of the EHRs belong to a single one of the 38 diagnosis code classes, while the least common diagnosis has only 46 samples. This means the dataset is unbalanced, which needs to be handled in the preprocessing of the input data or in the model design. Failing to consider the unbalance in the dataset would likely lead to an overfitted model.

A property of deep learning is that it requires a larger amount of input data than traditional models do. It is hard to say what constitutes enough data, as it is dependent on the complexity of the problem. One of the most famous datasets used for deep learning, the MNIST dataset for classifying handwritten digits [17], contains 70 000 samples. However, there are only ten target classes and this project has 38, which is one indicator of this being a more complex problem. Additionally, the feature space of the MNIST dataset is smaller and more homogeneous than that of the EHR dataset used in this project. Even though there is a risk that the dataset in this project is too small to train a deep learning model to perform accurate classification, there is a lot of potential in said approach. The principal will also continue to gather data for the dataset, making any insights concerning this approach valuable for future work.

For confidentiality reasons, a free text response that is part of the EHRs is left out of the dataset. This may prove to be a limitation for the classification, as these fields hold a lot of the information used by the doctors to set a diagnosis, according to the principal.

Chapter 2

Background

2.1 Classification

A common machine learning task, and the one addressed in this project, is classification. Classification means assigning input samples to one of K classes based on the samples' attributes, or features. An algorithm with a set of adjustable parameters forms a machine learning model, which can output an approximation based on the input feature vector x. This can be expressed as mapping the input vector x to an output y:

\[ x \rightarrow y \]

where $x = [x_1, x_2, \ldots, x_i]$ is the feature vector of $i$ features and $y \in \{1, 2, \ldots, K\}$ is the class assignment. The output variable y is also commonly modeled as a probability distribution over the possible classes conditioned on the input vector x:

\[ P(y \mid x) \]

as opposed to y being the distinct class assigned to the sample by the model.

2.2 Training and Evaluation

Training a machine learning model is the process of introducing data from which the model can learn the mapping from input to output. The learning process is done either in a supervised or unsupervised manner. Supervised learning is used on datasets containing pairs of input vector x and the corresponding output label y. The training phase then entails finding the mapping between input and output pairs, so that the finished model is able to produce the correct output for unseen examples of x. In contrast, unsupervised learning does not demand knowing y. Instead, it learns from the structure of the input data. Unsupervised learning is commonly used for clustering problems or dimensionality reduction. However, as this thesis aims to solve a classification problem, only methods suitable for such tasks are discussed.

Before training, the data is usually split into three parts: training, validation and test set. After training the model using the training set, the validation set is used to assess the model's performance and to fine-tune the model parameters in order to optimize the mapping from input to output. Finally, the test set is used to do the evaluation of the model's performance. The validation set cannot be used for the final evaluation, as the model will be slightly overfitted to that set, thus producing unrealistically good results when tested on it. The test set, on the other hand, is kept isolated during the training and validation phase and can therefore give an indication of the model's ability to generalize to unseen data.

K-fold cross-validation is a training and evaluation method used to maximize the usage of data available for testing [15]. It is carried out by dividing the data into k parts and iterating the training process k times, using a different partition of the data as the test set in each iteration. When k iterations of training and testing have been done, the results are averaged.
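As an illustration, the sketch below performs 5-fold cross-validation with scikit-learn. The library, the placeholder data and the stand-in model are assumptions for illustration, not the code used in this project.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Toy data standing in for the EHR feature matrix and diagnosis labels.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 5, size=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X[train_idx], y[train_idx])       # train on k-1 partitions
    y_pred = model.predict(X[test_idx])         # test on the held-out partition
    fold_accuracies.append(accuracy_score(y[test_idx], y_pred))

# The k results are averaged into one performance estimate.
print(f"Mean accuracy over 5 folds: {np.mean(fold_accuracies):.3f}")
```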

Using the test set, the accuracy of the model, which is a very common performance measure in machine learning, can be evaluated. The model assigns an output label $y_{model}$ to each sample of the test set, based on the mapping it has learned. These labels are then compared to the labels $y_{real}$ of the test set. The accuracy is given as the proportion of the data where the model's output label is the same as the test set label, $y_{model} = y_{real}$.

Precision and recall are also evaluated on the test set. Both measurements are calculated per class and then averaged over all classes. Precision is a measurement of the fraction of relevant instances (true positives) among the predicted instances and is given by


\[ \text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \]

for each class. Recall is a measurement of the fraction of true positives that have been predicted over the total amount of relevant instances, calculated for each class as

\[ \text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \]

There is often a trade-off between precision and recall based on optimization choices for the algorithm. The F1 score is a weighted measure of both precision and recall, using the harmonic mean of the two:

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
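These macro-averaged metrics can be computed as in the following minimal sketch. The use of scikit-learn and the toy labels are assumptions for illustration; the thesis does not specify its tooling.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_real  = [0, 0, 1, 1, 2, 2, 2, 3]   # true diagnosis classes
y_model = [0, 1, 1, 1, 2, 0, 2, 3]   # model predictions

# average="macro" computes each metric per class and then averages over classes,
# matching the per-class-then-average procedure described above.
print("Accuracy :", accuracy_score(y_real, y_model))
print("Precision:", precision_score(y_real, y_model, average="macro"))
print("Recall   :", recall_score(y_real, y_model, average="macro"))
print("F1 score :", f1_score(y_real, y_model, average="macro"))
```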

2.3 Random Forest

The Random Forest is an ensemble method which operates on a series of decision trees to classify input. The algorithm was first presented by Ho [10]. Each decision tree in the series is trained on a random subset of the dataset. During training, decision trees sequentially split the subset using the feature that gives the most information about class membership, until each split only contains samples from one distinctive class. The resulting tree structure is used to classify new samples. During testing, the class prediction is chosen by majority vote among the decision trees belonging to the Random Forest. With a setup like this, each individual classifier (i.e. decision tree) can be weak; the resulting ensemble classifier will be strong.

Modern Random Forests, including this project's implementation of the algorithm, encompass a modification made by Breiman [3]. As opposed to Ho's algorithm, Breiman's version does not split the node on the most important feature at each split. Instead, the best feature out of a random subset of features is chosen. This is done to avoid correlation between the different decision trees that can otherwise occur. Correlation arises if one or a few features are very strong class predictors in a dataset and thus are chosen in many of the trees. With this modification a diversity of splitting features is attained, usually resulting in a stronger model. Random Forests run efficiently on large datasets, are good at preventing overfitting and work well with unbalanced datasets.
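For illustration, a minimal Random Forest sketch in scikit-learn follows. The baseline model was provided by the principal and its actual implementation is not known, so this is only an assumed equivalent run on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the EHR feature matrix with 38 diagnosis classes.
X, y = make_classification(n_samples=5000, n_features=100,
                           n_informative=40, n_classes=38, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# max_features="sqrt" implements Breiman's modification: each split considers
# only a random subset of features, decorrelating the individual trees.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt")
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```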

2.4 Artificial Neural Networks

Artificial neural networks (ANNs) in machine learning are built by creating networks of nodes, also called neurons, which are mathematical models of the neurons in the human brain. The neurons are modelled with three essential parts:

1. A set of weighted inputs that connect the neurons to each other

2. An adder that sums the input signals

3. An activation function that controls if the neuron fires

Figure 2.1: Artificial Neuron representation, retrieved from [20].

Figure 2.1 shows a set of input nodes on the left ($x_1, x_2, \ldots, x_m$). In a human brain, these inputs would be outputs from other neurons, but in the case of neural networks they correspond to the input vector x. The synaptic weights ($w_{k1}, w_{k2}, \ldots, w_{km}$) decide how strongly the corresponding input element $x_i$ will influence the following neuron in the network. The weight $w_{ki}$ and the input element $x_i$ are multiplied, and then the result is summed with the results of the other signals that connect to the kth neuron, using the adder function. The variable $b_k$ represents the neuron bias, a scalar which allows the neuron to shift its activation function along the x-axis in order to optimize the neuron output.

\[ h = \sum_{i=1}^{m} w_{ki} x_i + b_k \]

The value h of the adder function is then used as the input to the activation function, which decides if the signal is strong enough for the neuron to fire. A very simple example of an activation function is

\[ \sigma = g(h) = \begin{cases} 1 & \text{if } h > \theta \\ 0 & \text{if } h \leq \theta \end{cases} \]

which indicates that the neuron will fire (i.e. produce output 1) if the value of h is greater than some threshold value $\theta$ [20]. Neural networks are constructed by connecting sets of such neurons and training them to perform a particular task.
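A minimal NumPy sketch of this neuron model, assuming the three parts above (all names and values here are chosen for illustration):

```python
import numpy as np

def neuron(x, w, b, theta=0.0):
    """Artificial neuron: weighted sum (adder) followed by a threshold activation."""
    h = np.dot(w, x) + b               # adder: h = sum_i w_i * x_i + b
    return 1.0 if h > theta else 0.0   # fires (outputs 1) only above the threshold

x = np.array([0.5, -1.2, 3.0])    # input signals
w = np.array([0.8, 0.1, 0.4])     # synaptic weights
b = -0.5                          # bias shifts the activation along the x-axis
print(neuron(x, w, b))            # -> 1.0, since h = 0.98 > 0
```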

2.5 Backpropagation

Backpropagation is a common way of training neural networks. The goal of backpropagation is to minimize the network's cost function C(w, b), which is a measure of how well the model performs classification. The quadratic cost function is one of several possible cost functions and is defined as

\[ C(w, b) = \frac{1}{2n} \sum_{x} \| y_{real}(x) - y_{model}(x) \|^2 \]

where $y_{real}$ are the labels of the training set that the network is trying to approximate and $y_{model}$ is the current output of the model for the input x. n is the total number of training examples. The weights and biases of the model are denoted by w and b. Although not explicitly stated in this notation, the output $y_{model}$ is dependent on w, b and x. Since the input instances x cannot be changed, w and b must be tuned to minimize the cost C.

The backpropagation algorithm starts by randomly initializing the model parameters and then propagates forward to get $y_{model}$. Then, the cost function is calculated. Based on partial derivatives of the cost function, with the cost averaged over all the training instances x, the weights and biases of the network are updated. This procedure is repeated until the cost is minimized.
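As a worked illustration of this loop, the sketch below tunes a single linear neuron with the quadratic cost using hand-written gradient steps. It is a simplified stand-in for full multi-layer backpropagation, with all names and data assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # training inputs
y_real = X @ np.array([1.5, -2.0, 0.5]) + 1.0  # targets from a known mapping

w = rng.normal(size=3)                         # random initialization
b = 0.0
lr = 0.1                                       # learning rate

for epoch in range(100):
    y_model = X @ w + b                        # forward pass
    error = y_model - y_real
    cost = (error ** 2).sum() / (2 * len(X))   # quadratic cost C(w, b)
    grad_w = X.T @ error / len(X)              # dC/dw, averaged over samples
    grad_b = error.mean()                      # dC/db
    w -= lr * grad_w                           # update against the gradient
    b -= lr * grad_b

print(f"final cost: {cost:.5f}")               # approaches 0 as w, b are tuned
```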

2.6 Variations of Artificial Neural Networks

There are many different types of ANNs, and new ones are constantly being developed to better suit needs in various research fields. Each type has traits that make it more or less suitable for different applications. Three common types of neural networks are the Multi-layer Perceptron, the Recurrent Neural Network and the Convolutional Neural Network.

2.6.1 Multi-Layer Perceptrons

Multi-layer perceptrons (MLPs), also known as Feed-Forward Neural Networks (FFNNs), are a common type of ANN. Feed-forward refers to the flow of data between the layers and nodes in the model. It indicates that the data only moves in one direction instead of looping backwards in the model, as is the case with other types of networks.

Figure 2.2: Representation of a 4-layer Multi-Layer Perceptron with two hidden layers.

An MLP consists of an input layer, an output layer and one or multiple hidden layers. Every node in a layer $l_i$ is fully connected to each node in the next layer $l_{i+1}$. Each node in a hidden layer uses a non-linear activation function whose input is the weighted sum of the outputs from the previous layer. The activation function is traditionally the Sigmoid activation function

\[ \sigma(h) = \frac{1}{1 + e^{-h}} \]

or the Tanh (hyperbolic tangent) activation function

\[ \sigma(h) = \frac{e^h - e^{-h}}{e^h + e^{-h}} \]

A more modern variant [11] is the Rectified Linear Unit (ReLU) activation function,

\[ \sigma(h) = \max(h, 0) \]

which outputs 0 if the input is less than 0 and the raw input otherwise.
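The three activation functions side by side, in a short illustrative NumPy sketch:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))    # squashes output into (0, 1)

def tanh(h):
    return np.tanh(h)                  # squashes output into (-1, 1)

def relu(h):
    return np.maximum(h, 0.0)          # zero below 0, identity above

h = np.array([-2.0, 0.0, 2.0])
print(sigmoid(h))   # [0.119 0.5   0.881]
print(tanh(h))      # [-0.964  0.     0.964]
print(relu(h))      # [0. 0. 2.]
```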

2.6.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of neural network commonly implemented in applications built on sequential or longitudinal data (e.g. data collected at different time steps). An RNN handles such data by introducing a recurrent connection in its hidden layers. This means the model does not only consider the current input data, but also the neuron output at the previous time step. Thanks to this, the model can learn correlations between certain input elements.

Figure 2.3: Representation of a network with recurrent cells.

The model parameters are shared over all the time step evaluations, leading to increased efficiency in the form of lower computational costs.
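A minimal sketch of a basic recurrent cell unrolled over a sequence, with weight names and sizes chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 8
W_x = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # recurrent weights
b = np.zeros(n_hidden)

sequence = rng.normal(size=(10, n_in))  # 10 time steps of input
h = np.zeros(n_hidden)                  # hidden state starts empty
for x_t in sequence:
    # The same W_x, W_h, b are reused at every time step; the new state
    # depends on both the current input and the previous step's output.
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h.shape)  # (8,) -- final hidden state summarizing the sequence
```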


Gated RNNs are variants of RNNs that replace the interconnected hidden units of the basic network structure with cells containing an internal recurrence loop and a gate system for controlling information flow through the model. Long Short-Term Memory (LSTM) as well as Gated Recurrent Unit (GRU) are examples of such variants that are commonly used in practice. It has been shown that gated RNNs are better at handling longitudinal data that has many sequential dependencies than basic RNNs [11].

2.6.3 Convolutional Neural Networks

The popularity of Convolutional Neural Networks (CNNs) has increased a great deal in recent years, much because of the many successful implementations for image recognition and computer vision. The property of CNNs that makes them so suitable for image applications is the concept of local connectivity, which takes into account the spatial information of the input data. For instance, where the input is an image, it would account for the proximity between the pixels.

CNNs have a distinctive type of layer called the convolutional layer, consisting of a number of convolutional filters. Convolutional filters take multiple input signals and convolve them into one single output signal. Input signals can be either single- or multi-dimensional.

Convolutional layers are usually followed by a pooling layer. While the convolutional filters extract important features of the input signal, the pooling layer aggregates them.

2.7 Relevant works

There are many successful applications of the discussed deep learning methods in medicine. As mentioned above, CNNs have become most famous for their successful application on image data. Rajpurkar et al. [22] developed an algorithm for arrhythmia detection using a 34-layer CNN. Training such a network can be very computationally costly and time demanding. Here, layers are connected using shortcut connections, which allows for good propagation during the training of a deep network with a large number of layers. The sequence-to-sequence mapping is done from ECG samples (recordings of the electrical activity of the heart) to rhythm classes. The underlying dataset contained approximately 64,000 ECG records from 29,000 patients that had been hand annotated by experts. During testing, it is concluded that the model developed by Rajpurkar et al. [22] exceeded expert performance.

Rajpurkar et al. [23] used a CNN architecture with 121 layers as a basis for their CheXNet algorithm, which uses chest X-rays to detect pneumonia. The dataset consisted of more than 100,000 X-ray images annotated with 14 diseases by experts. Given an X-ray image as input, the model outputs the probability of pneumonia as well as the X-ray image with a heatmap indicating the areas that show signs of pneumonia. The test phase showed that CheXNet got a better F1 score (a measurement of performance in statistical analysis for binary classification) than practicing radiologists.

Although image applications are often the first to be mentioned with respect to CNNs, there are other successful applications. Cheng et al. [5] used a temporal matrix representation of EHR data (a 2D matrix with event on one axis and time stamp on the other) to perform phenotyping and chronic disease risk prediction. The matrix is used as input to the first layer of the 4-layer CNN. The second layer performed convolutional filtering to extract phenotypes. The remaining layers performed pooling and prediction. The model is tested on its ability to predict two chronic diseases, congestive heart failure and chronic obstructive pulmonary disease, with 1127 real cases vs. 3859 control cases and 477 real cases vs. 2385 control cases, respectively. The prediction window is set to 180 days and training data for over 319,000 patients over four years is used. Different variants of the model are tested, with all performing well over baseline.

Lipton et al. [19] were the first to employ LSTM, a subtype of RNN, on EHR data, with the goal of learning to classify 128 diagnoses from 13 frequently sampled clinical measurements. LSTM is chosen due to its ability to model data with sequential dependencies. It was previously known [4] that long-term dependencies hidden in the EHR data are important for prediction modeling, but that other state-of-the-art models failed to consider them. The final best performing model turned out to be an ensemble framework combining the LSTM with a three-layer MLP.

The DeepCare framework was developed by Pham et al. [21]. DeepCare is an end-to-end system that reads EHRs, infers present illness states and predicts future outcomes. It is also built with LSTM, to account for the long-term dependencies in patient health history when predicting disease progression, intervention recommendations and future risk. The framework is compared to state-of-the-art baseline classification methods and is shown to outperform them all.

One difficulty with using EHR data in machine learning applications is that a lot of the information is given as unstructured text from doctors' notes. Examples of such information are diagnoses, adverse drug events or medication. Extracting relevant medical information and temporal dependencies from such data is thus an important task for gaining semantic understanding of the information in EHRs, but can be challenging. The notes are often noisy and contain incomplete sentences, medical jargon, abbreviations etc. [12]. Jagannatha and Yu [12] and their follow-up study [13] view the extraction of information as a sequence labeling task. The studies vary somewhat in their implementation, with the follow-up study [13] having a slightly more sophisticated labeling system that divides labels into two major categories: Medical events (containing labels such as Adverse Drug Event or Indication) and Attributes (containing labels such as Severity or Frequency). The two studies experiment with quite a few different RNN-based architectures, compare them to then state-of-the-art models and show that all RNN-based models significantly outperformed them.

Choi et al. [8] found a standard MLP to give the best performance when predicting heart failure using a distributed representation. The MLP architecture is not built to model sequential dependencies and therefore might overlook important long-term dependencies in the EHR data. A representation learning approach inspired by Natural Language Processing's Skip-gram is used to map heterogeneous medical data to a low-dimensional space where similar concepts become clustered. Medical concept vectors are constructed in this manner. A feature of Skip-gram word vectors is that they support syntactically and semantically meaningful linear operations. Based on this, Choi et al. [8] assume that their medical concept vectors will support clinically meaningful vector additions, a feature which they use to create patient representation vectors by adding all medical concept vectors present in the patient history. These vectors are then used to train the MLP to predict heart failure, and the result is compared to the same model trained on the raw categorical data. The MLP trained on concept vectors is shown to outperform the MLP trained on the raw data.

2.8 Summary

State-of-the-art works show that deep learning is a method with much potential and many successful applications when performing machine learning tasks on EHRs. They emphasize the importance of choosing a deep learning model based on the underlying data as well as the desired output type (e.g. probability prediction, sequence-to-sequence mapping, classification etc.). As there are neither obvious spatial nor longitudinal patterns in the dataset used in this project and the goal is to perform classification, a classical MLP seems to be the logical model to choose for experiments. The hypothesis is that an MLP would be able to learn the complex mapping between the EHRs and the diagnosis code better than the Random Forest is able to.

Chapter 3

Methods

3.1 Approach

The models used for the experiments are two MLPs with different architectures, one Random Forest provided by the principal and one naive model.

The naive model makes the assumption that the class distribution in the given dataset is representative of the distribution in the real world, and always predicts the most common class in the dataset. The Random Forest and the naive model are used as baselines for comparison.

Since the dataset is small, the experiments also entail performing various permutations of the data to evaluate if that results in better performance. The Random Forest is run on a version of the dataset that has been engineered to suit that model, as is standard procedure with traditional models. Each MLP is run on two versions of the dataset, one with minimal alterations and one with various permutations.

Evaluation of each model is done using 5-fold cross-validation. The evaluation metrics used are accuracy, precision, recall and F1 score.

3.2 Processing of Data

The data consists of 69 141 samples. Each sample has 1209 features. Out of these 1209 features, 1208 are numerical, indicating different answers given by patients to questions in the questionnaire. These numerical values have been derived from different types of answers and consist of a set of ternary columns represented by numbers as 0 = No, 1 = Not sure, 2 = Yes, and a set of columns with ranked numeric values, some normalized and some not. One feature column named issues has categorical string values describing the reason for seeking care stated by the patient. Five unique values are present in this column, including no reason given. As the MLPs expect numerical input values, this column needs to be converted. Simply representing the reasons with different numeric values would risk giving preferential treatment to some reasons in the prediction model, as reasons represented by a higher numerical value would have a stronger impact on the model. Therefore a one-hot encoding is chosen to represent this feature, meaning that five new columns representing each unique issue are added to the dataset, with value 1 in the column corresponding to the reason for seeking care for a specific data sample and value 0 in the other columns. The labels consist of ICD-10 codes, an international diagnosis code system for diagnosis classification. In addition, one (self-explanatory) label is called "No diagnosis could be set".

The data is sparse, with 70.5% of the datapoints being missing values. A missing value indicates that the question has either not been shown to the patient, or the patient has chosen not to answer it. Both these cases are deemed to hold information valuable for the classification. The missing values are therefore converted to -1. The above alterations are deemed necessary for both the dataset with minimal alterations, Dmin, and the adapted dataset, Dalt. Several additional alterations are made in Dalt. As described in Figure 3.1a, class occurrences in the dataset are imbalanced, with one large majority class and some very small classes. Oversampling with the method SMOTE [2] is employed to grow the smaller classes, and random undersampling is used to decrease the size of the majority class. A sketch of these preprocessing steps is given below.
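This is a hedged sketch of the steps with pandas and imbalanced-learn. Only the column name issues and the -1 encoding come from the text; the file name, the label column name and the libraries are assumptions for illustration.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# df is assumed to hold the questionnaire answers plus the "issues" column
# and an "icd10" label column; the file name is hypothetical.
df = pd.read_csv("ehr_dataset.csv")

# One-hot encode the categorical reason for seeking care: one 0/1 column per
# unique issue, so no reason gets a larger numeric weight than another.
df = pd.get_dummies(df, columns=["issues"])

# Missing answers are informative (question hidden or skipped), so encode
# them as -1 instead of dropping them.
df = df.fillna(-1)

X, y = df.drop(columns=["icd10"]), df["icd10"]

# Dalt only: grow small classes with SMOTE, then shrink the majority class.
X_over, y_over = SMOTE().fit_resample(X, y)
X_alt, y_alt = RandomUnderSampler().fit_resample(X_over, y_over)
```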

Dalt shares many traits with the dataset adapted for the Random Forest, DRF. For instance, both datasets imply the importance of the missing values by giving them a numerical value that is not equal to 0. In DRF, missing values in ternary columns and in columns with ranked numeric values are assigned different values, whereas Dalt does not distinguish between the two. In both datasets values are normalized and ternary columns are converted to a one-hot encoding. DRF, as opposed to Dalt, is not balanced.

Figure 3.1: Class distribution of Dalt before (a) and after (b) over- and undersampling. Classes are unique ICD-10 codes in the dataset.

3.3 Network Architecture

The basic architecture of the MLPs is implemented with choices of cost function, activation function and optimization method. Parameters of each model, such as number of layers and nodes, dropout probability, batch size etc., are experimentally varied, and the parameters giving the best performance are used when evaluating and comparing the models.

In [14] a default learning rate of 0.001 is suggested as a starting point, which is employed in this project. The learning rate defines how fast we want to update the weights in our model during training. If the learning rate is too big, the model might miss the optimal solution, and if it is too small, we might need too many iterations to find the solution.

3.3.1 MLP-a

After tuning of the parameters, MLP-a is chosen as a 3-layer network with 30 neurons in the first hidden layer, 40 in the second and 15 in the last. The activation function chosen is the rectified linear unit, which outputs zero if the neuron input sum is less than zero and the raw input otherwise.

Table 3.1: Specification of MLP-a

Activation function: ReLU
Cost function: Mean Squared Error
Layer-1 width: 30
Layer-2 width: 40
Layer-3 width: 15
Learning rate: 0.002
Dropout keep rate: 0.7
Epochs: 450
Batch size: 300

3.3.2 MLP-b

MLP-b is chosen as a 3-layer network with 10 neurons in the first hidden layer, 20 in the middle and 15 in the last.

A softmax activation function is used. It has properties similar to the sigmoid activation function but is more suitable for multi-class classification implementations [16]. It takes an input vector and returns a compressed output vector with values ranging from 0 to 1. The values in the output vector sum to one.

The cost function is also a softmax function with cross entropy, which is an indicator of the distance between the predicted class probability distribution and the actual class probability distribution, and is commonly used as an alternative to the squared error.
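A small NumPy sketch of softmax and the cross-entropy it is paired with (illustrative only):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())             # shift for numerical stability
    return e / e.sum()                  # outputs in (0, 1) that sum to one

def cross_entropy(p_pred, true_class):
    return -np.log(p_pred[true_class])  # distance to the actual distribution

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())                       # [0.659 0.242 0.099] 1.0
print(cross_entropy(p, true_class=0))   # small when the true class gets high probability
```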

Table 3.2: Specification of MLP-b

Activation function: Softmax
Cost function: Softmax with cross entropy
Layer-1 width: 10
Layer-2 width: 20
Layer-3 width: 15
Learning rate: 0.009
Dropout keep rate: 0.8
Epochs: 400
Batch size: 300
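As an illustration of Tables 3.1 and 3.2, the sketch below reconstructs the two networks in Keras. The thesis does not state its framework, so this is an assumed equivalent: the softmax output layer and the input size (1208 numeric features plus 5 one-hot columns, i.e. 1213) are assumptions, and Keras's Dropout takes a drop rate, i.e. 1 minus the keep rate quoted in the tables.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(widths, hidden_activation, n_classes, keep_rate, lr, loss):
    # Input width (assumed 1208 numeric + 5 one-hot = 1213 features) is
    # inferred by Keras on the first call to fit().
    model = keras.Sequential()
    for w in widths:
        model.add(layers.Dense(w, activation=hidden_activation))
        model.add(layers.Dropout(1.0 - keep_rate))   # drop rate = 1 - keep rate
    model.add(layers.Dense(n_classes, activation="softmax"))  # output layer assumed
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss=loss, metrics=["accuracy"])
    return model

# MLP-a per Table 3.1; would be trained with fit(..., epochs=450, batch_size=300).
mlp_a = build_mlp([30, 40, 15], "relu", n_classes=38,
                  keep_rate=0.7, lr=0.002, loss="mean_squared_error")

# MLP-b per Table 3.2; would be trained with fit(..., epochs=400, batch_size=300).
mlp_b = build_mlp([10, 20, 15], "softmax", n_classes=38,
                  keep_rate=0.8, lr=0.009, loss="categorical_crossentropy")
```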


3.4 Training and Evaluation

Training of the MLPs is done using the Adam optimizer [14] as the optimization algorithm, due to recommendations in [24], an extensive review of gradient descent learning algorithms. It is a method based upon the classical stochastic gradient descent procedure, with the difference that the learning rate is not constant during the training. Instead, all weights in the network have adaptive learning rates that change as the learning unfolds.

Dropout [26] is applied during training. It is a method for randomly closing down nodes during iterations of the training, in order to avoid overfitting.

Evaluation of all the models is done using k-fold cross-validation, with the dataset randomly split into 80% training and 20% test sets for each fold. k is set to 5. A higher value for k, such as 10, might be preferable in order to decrease the bias of the performance results [15]. However, as the models are very time consuming to run, k is set to 5 for cost reasons.

Chapter 4

Results

In this chapter the results of the experiments are reported. Note that the performance metrics for the Random Forest model and the naive model are the same throughout all sections, as the Random Forest is only run on its own alteration of the dataset and the naive model is built on the assumption that the probability distribution in the given dataset represents the real distribution. Their performance metrics are described in the following section and then displayed in each subsequent section for easy comparison with the MLPs.

4.1 Comparison models

The Random Forest model achieved an accuracy of 61.8% on its own alteration of the dataset. The average precision is 63.4% and the average recall is 62.2%. The F1 score is then 62.8%. Employing a naive model and guessing only the majority class yields an accuracy of 38.8%, with a mere 0.94% for precision, 2.6% for recall and 1.4% for F1 score.



4.2 Performance on Dmin

Figure 4.1: Performance on Dmin averaged over 5-fold cross-validation. Error bars represent standard deviation.

The accuracy for the model MLP-a on the dataset with minimal alterations, Dmin, is 38.8%. The precision comes in at 31.3% and the recall at 27.3%. The F1 score is 29.2%. MLP-b has an accuracy of 42.1% when running on Dmin. Precision is 32.7%, recall 25.7% and F1 score 28.8%.

4.3 Confusion matrices for MLPs on Dmin

The confusion matrices describe the relation between ypred and yreal for the MLPs on Dmin, with yreal shown along the y-axis and ypred along the x-axis.

Figure 4.2: Confusion matrices for the MLPs operating on Dmin: (a) MLP-a, (b) MLP-b.

4.4 Performance on Dalt

Figure 4.3: Performance on Dalt averaged over 5-fold cross-validation. Error bars represent standard deviation.

When running MLP-a on Dalt, an accuracy of 44.26% is achieved, which is 5.47 percentage points higher than on Dmin. The model reaches a precision of 35.28% and a recall of 29.22%. This gives an F1 score of 31.97%. MLP-b gives an accuracy of 48.12% for Dalt, which is 6.05 percentage points better than when running on Dmin. The precision is 37.32%, the recall 28.54% and the F1 score 32.34%.

4.5 Confusion matrices for MLPs on Dalt

The confusion matrices describe the relation between ypred and yreal for the MLPs on Dalt, with yreal shown along the y-axis and ypred along the x-axis.

Figure 4.4: Confusion matrices for the MLPs operating on Dalt: (a) MLP-a, (b) MLP-b.

Chapter 5

Discussion

Two MLPs (MLP-a and MLP-b) are evaluated on their ability to perform prediction of the diagnosis code belonging to an EHR. They are compared against a baseline model and a naive model. As we can see, none of the deep learning approaches is able to beat the baseline model, the Random Forest. Both models surpass the performance of the naive model, especially regarding precision and recall, indicating that both models are able to learn from the dataset, although not well enough to beat the baseline.

The models are evaluated on two versions of the dataset: one with minimal alterations, called Dmin, and one with several alterations commonly used on machine learning datasets, called Dalt. Both models perform better on Dalt than on Dmin. As stated in the limitations section of this thesis, one of the risks of this project is that the size of the dataset might not be sufficient for the deep models to properly learn. The fact that performance on Dalt is better than on Dmin supports this. In a data-rich environment, neural networks are appreciated for their ability to learn without preceding feature engineering. But the complexity of the data, the model and the task all affect the amount of data needed to train the model. In Dalt, it seems the model is helped by the reduction of the complexity of the data (e.g. by reducing the feature space). This could also be an explanation for the baseline model's superior performance compared to the MLPs. Since the baseline model is a model of lower complexity than the MLPs, it is able to generalize better from a small dataset. However, a model of lower complexity also faces limitations trying to learn the full mapping x → y for a complex task. As we can see, the baseline model's performance (e.g. an accuracy of 61.8%) also has a lot of room for improvement.

As we can see from the confusion matrices for both models on Dmin, the models overestimate the probability of the class named j069. This is due to the models overfitting to the majority class in the training set. Dropout is applied to avoid overfitting, but when tuning the probability of a node being closed down during an iteration of the training, a trade-off has to be made. By increasing the probability of closing off, the risk of overfitting is lower, but it also makes the model harder to train, demanding more data (and additional epochs). With a probability that is too high and not enough data to learn from, the model will start making what seem like random guesses.

The models also overestimate the probability of the class No diagnosis set. This class is very bothersome to handle in the classification, as it can contain basically anything and is therefore very hard to learn.

MLP-b consistently performs better than MLP-a on both datasets. This could be due to the best width and depth of MLP-b being found to be a bit smaller than those of MLP-a. A shallower model with fewer nodes might be easier to train. It could also be due to the different cost and activation functions. A good combination of cost and activation function can result in a smoother cost curve, which is easier to optimize. Generally, smoother cost functions allow for higher learning rates. The best learning rate for MLP-a is found to be lower than that of MLP-b, which, by that logic, indicates that MLP-b's cost curve is easier to optimize.

Also mentioned in the limitations section is the exclusion of a free text response from the EHRs in the dataset. This project has assumed that missing values in the dataset carry meaning, i.e. that if a patient deemed a question irrelevant to answer or a doctor deemed the question irrelevant to show, that has meaning for the classification. If information is given in the free text response, some of the missing values might actually have a real value, just hidden in the free text and therefore not present in the dataset. If a lot of highly relevant information is given in the free text, as the principal stated, this might be a considerable error source in our classification. For future research, it would be very interesting to see what could be done when including the free text in the dataset, perhaps by using LSTMs to extract relevant keywords for classification. Such information might help to increase the distance between classes in the feature space, making them easier to separate into different classes.

A larger dataset might have made it easier to draw clear conclusions regarding the deep learning approach in this project. A possible improvement of the study would have been to look for viable datasets available for research. Adapting the baseline model to suit a sourced EHR dataset of larger size, so that the MLPs would have had sufficient data for training, might have made it possible to give a clearer answer to the research question regarding MLP performance on EHRs. Other types of neural networks (e.g. CNNs or RNNs) could also have been included in the study to give a broader representation of deep learning.

5.1 Ethical Discussion

The ethical aspect is always of high importance when introducing technological advances into people's lives, and perhaps even more so when it regards their health. Digitalized care visits are already a fact, and they come with both advantages and risks. Although digitalized care has led to increased efficiency and accessibility of care for some, it has also been argued [1] that these technical implementations have been made without proper scientific support, leading to a number of problems. One of a doctor's primary purposes is to make a diagnosis for the patient they are seeing. This is done by listening to the patient's story (anamnesis) and by performing a physical examination. Although it is possible to stream sound and video in a digitalized visit, it seems obvious that some elements of the physical examination are lost. There is also little to no relevant research on how the conversation regarding the patient's health is affected, according to [1]. Although technological transformations in the health care sector have the potential to have a very positive effect, it is important to base implementations on scientific evidence.


5.2 Conclusions

There have been many successful applications of deep learning approaches on EHRs and there is still much potential to realize. In this project, it is found that the deep learning models are able to learn from the small dataset of EHRs, although not well enough to surpass the performance of the baseline model. In a setting with little data, choosing a simpler method, such as Random Forest, and spending more time on feature engineering seems like a better decision. However, as the dataset continues to grow, further research on deep learning classification would be of great interest.

Bibliography

[1] Rolf Ahlzén et al. "Vetenskapligt stöd saknas för digitala diagnoser". In: Dagens Nyheter (Apr. 2018). URL: https://www.dn.se/debatt/vetenskapligt-stod-saknas-for-digitala-diagnoser/.

[2] Kevin W. Bowyer et al. "SMOTE: Synthetic Minority Over-sampling Technique". In: CoRR abs/1106.1813 (2011), pp. 321–357. DOI: 10.1613/jair.953. URL: http://arxiv.org/abs/1106.1813.

[3] Leo Breiman. "Random Forests". In: Machine Learning 45.1 (2001), pp. 5–32.

[4] V. Courtney Broaddus et al. Murray & Nadel's Textbook of Respiratory Medicine. 2015, p. 3047. ISBN: 9780323261937. URL: https://www.sciencedirect.com/science/book/9781455733835.

[5] Yu Cheng et al. "Risk Prediction with Electronic Health Records: A Deep Learning Approach". In: SIAM International Conference on Data Mining (2016), pp. 432–440. URL: http://epubs.siam.org/doi/pdf/10.1137/1.9781611974348.49.

[6] Travers Ching et al. "Opportunities and Obstacles for Deep Learning in Biology and Medicine". In: bioRxiv (2017), p. 142760. DOI: 10.1101/142760. URL: https://www.biorxiv.org/content/early/2017/05/28/142760.

[7] Edward Choi et al. "Doctor AI: Predicting Clinical Events via Recurrent Neural Networks". In: Proceedings of Machine Learning for Healthcare 2016, JMLR W&C Track 56 (2015), pp. 1–12. arXiv: 1511.05942. URL: http://arxiv.org/abs/1511.05942.

[8] Edward Choi et al. "Medical Concept Representation Learning from Electronic Health Records and its Application on Heart Failure Prediction". In: arXiv (2016), p. 45. arXiv: 1602.03686. URL: http://arxiv.org/abs/1602.03686.

[9] Andre Esteva et al. "Dermatologist-level classification of skin cancer with deep neural networks". In: Nature 542.7639 (2017), pp. 115–118. DOI: 10.1038/nature21056.

[10] Tin Kam Ho. "Random Decision Forests". In: Proceedings of the Third International Conference on Document Analysis and Recognition, Volume 1 (1995), p. 278.

[11] Andreas Holzinger. Machine Learning for Health Informatics: State-of-the-Art and Future Challenges. Graz: Springer Nature, 2016, p. 211. ISBN: 978-3-319-50477-3. DOI: 10.1007/978-3-319-50478-0.

[12] Abhyuday Jagannatha and Hong Yu. "Bidirectional Recurrent Neural Networks for Medical Event Detection in Electronic Health Records". In: (2016). arXiv: 1606.07953. URL: http://arxiv.org/abs/1606.07953.

[13] Abhyuday Jagannatha and Hong Yu. "Structured prediction models for RNN based sequence labeling in clinical text". In: (2016). arXiv: 1608.00612. URL: http://arxiv.org/abs/1608.00612.

[14] Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: (2014), pp. 1–15. arXiv: 1412.6980. URL: http://arxiv.org/abs/1412.6980.

[15] Ron Kohavi. "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection". In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (1995), pp. 1137–1143.

[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25 (NIPS 2012) (2012), pp. 1–9. URL: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[17] Yann LeCun et al. "Gradient-Based Learning Applied to Document Recognition". In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. DOI: 10.1109/5.726791.

[18] Znaonui Liang et al. "Deep learning for healthcare decision making with EMRs". In: Proceedings - 2014 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2014) (2014), pp. 556–559. DOI: 10.1109/BIBM.2014.6999219.

[19] Zachary C. Lipton et al. "Learning to Diagnose with LSTM Recurrent Neural Networks". In: (2015), pp. 1–18. arXiv: 1511.03677. URL: http://arxiv.org/abs/1511.03677.

[20] Stephen Marsland. Machine Learning: An Algorithmic Perspective, Second Edition. CRC Press, 2014. Chap. 3, pp. 39–70.

[21] Trang Pham et al. "DeepCare: A Deep Dynamic Memory Model for Predictive Medicine". In: (2017). arXiv: 1602.00357. URL: https://arxiv.org/pdf/1602.00357.pdf.

[22] Pranav Rajpurkar et al. "Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks". In: (2017). arXiv: 1707.01836. URL: http://arxiv.org/abs/1707.01836.

[23] Pranav Rajpurkar et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning". In: (2017), pp. 3–9. arXiv: 1711.05225. URL: http://arxiv.org/abs/1711.05225.

[24] Sebastian Ruder. "An overview of gradient descent optimization algorithms". In: (2016), pp. 1–14. arXiv: 1609.04747. URL: http://arxiv.org/abs/1609.04747.

[25] Benjamin Shickel et al. "Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis". 2017. DOI: 10.1109/JBHI.2017.2767063. URL: https://arxiv.org/pdf/1706.03446.pdf.

[26] Nitish Srivastava et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". In: Journal of Machine Learning Research 15 (2014), pp. 1929–1958.

[27] Harini Suresh et al. "Clinical Intervention Prediction and Understanding using Deep Networks". In: Machine Learning for Healthcare Conference (2017), pp. 1–16. arXiv: 1705.08498. URL: http://arxiv.org/abs/1705.08498.

[28] Stephen F. Weng et al. "Can machine-learning improve cardiovascular risk prediction using routine clinical data?" In: PLoS ONE 12.4 (2017), pp. 1–14. DOI: 10.1371/journal.pone.0174944.
