4IZ451 - Knowledge Discovery in Databases
Project 2 – meningoencephalitis diagnosis University of Economics in Prague
Oliver Gensky (xgeno00) 12-16-2017
Project 2 – meningoencephalitis diagnosis Oliver Genský
1 of 16
Table of Contents
1. Introduction ................................................................................................................................ 2
2. “Business” understanding ........................................................................................................... 2
2.1. Background ............................................................................................................................. 2
2.1.1. Meningitis ........................................................................................................................... 2
2.1.2. Process modeling ................................................................................................................ 2
2.2. Objectives................................................................................................................................ 3
3. Data understanding .................................................................................................................... 3
3.1. Data ......................................................................................................................................... 3
3.2. Target variables ....................................................................................................................... 3
3.3. Input variables ........................................................................................................................ 5
4. Data preparation ......................................................................................................................... 6
4.1. Treatment of outliers and missing values ............................................................................... 6
4.2. Variable transformations ........................................................................................................ 6
4.3. Sampling and data partitioning ............................................................................................... 6
5. Modeling ..................................................................................................................................... 7
5.1. Candidate models ................................................................................................................... 7
5.1.1. Decision Trees ..................................................................................................................... 7
5.1.2. Neural Networks ................................................................................................................. 8
5.2. Model selection approach ...................................................................................................... 8
5.3. Final model .............................................................................................................................. 9
5.3.1. Overall predictive accuracy ............................................................................................... 10
5.3.2. Observed versus predicted target values ......................................................................... 11
5.3.3. Improvement over baseline .............................................................................................. 12
6. Discussion .................................................................................................................................. 12
6.1. Assessment of model performance ...................................................................................... 12
6.2. Contribution to the solution of the problem ........................................................................ 12
6.3. Deployment recommendations ............................................................................................ 13
1. Introduction
The purpose of this work is to use data mining methods and tools to meet the given objectives in the medical field of brain diseases. The analysis of the medical data must follow the CRISP-DM methodology. The methodology provides a data mining process model that describes the approaches data mining experts commonly use to tackle problems. (wikipedia community, n.d.) More about this methodology can be found in the appendix.
2. “Business” understanding
Before any data manipulation is initiated, we need to understand the terms used in the particular business (field) and get to know its concepts and processes as they relate to our data mining task.
2.1. Background
First, it is necessary to define how the terms and processes in the assignment were understood and how they relate to the given objectives.
2.1.1. Meningitis
Meningitis belongs among the neurological infectious diseases. Bacteria or viruses invade the dura (the sheet covering the brain), causing severe inflammation of the dura. When the brain itself is inflamed, the patient is diagnosed with "meningoencephalitis". When bacteria form an abscess in the brain, the patient is diagnosed with "brain abscess". (Tsumoto, 2000)
2.1.2. Process modeling
The process of diagnosing and treating a patient starts with the patient being admitted and ends with the patient being discharged. First, the doctor gathers basic information about the patient's present history and health status and then performs a physical examination. After these two steps, the doctor is ready to draw up a first diagnosis. In medicine it is standard to always make a differential diagnosis, using additional information about the patient to support or discard the first assumption. In our case, biological samples of the patient's body are used to produce laboratory examination findings, and a second diagnosis is proposed based on them. Both diagnostic steps of the process are described in more detail in the appendix. With the two diagnoses elaborated, the doctor decides on a suitable therapy for the patient. In the data we received, the therapy and the patient's status after discharge are represented by the last few columns. A diagram of this process is shown below.
Picture 1 - process of examining and treating the patient (source: Author, in draw.io)
2.2. Objectives
1.) Please find factors important for diagnosis (DIAG and DIAG2)
2.) Please find factors for detection of bacteria or virus (CULT_FIND and CULTURE)
3.) Please find factors for predicting prognosis / predict prognosis (C_COURSE and COURSE)
The original author of the assignment proposed three different tasks. In this work, only the last one, finding factors for prognosis, will be elaborated. The author left us freedom in approaching the task, saying: "Any knowledge extraction is welcome!". The most important predictors of the patient's final condition can be identified at different stages of the process the patient goes through. As the process proceeds, the doctor, and likewise our data mining tools, has more and more information about the patient's health condition. It is therefore assumed that proceeding through the process will improve our prediction results; nevertheless, the models are still expected to give doctors some additional information even after the first data-gathering step, the physical examination at admission.
These three stages will be considered for analysis:
• After the physical examination – stage1
• After the laboratory examination – stage2
• After the therapy – stage3
3. Data understanding
3.1. Data
We received a table of 140 rows, where one row represents the health record of one patient and stores all information about that patient gathered throughout the whole process. The table was received in .csv format and subsequently imported into SAS libraries as three SAS tables using the IMPORT and SAVE DATA nodes in Enterprise Miner. Before import, the table was split into three tables, each representing all data gathered up to a given stage. The table of the last stage equals the received table, as at that stage we already have all information from the finished process. The received table contains 38 variables in total. In this work, COURSE (grouped) will be used as the target variable. The first split has 18 predictors (present history + physical examination) and the second split has 14 additional ones, resulting in 32 predictors (split 1 + laboratory examination). The data could also have been imported in one piece; the problem is that filtering columns for modeling would then be very time-consuming, because SAS EM sorts attributes alphabetically, which makes filtering a rather difficult task.
Picture 2 - source data - graphically divided into 3 stages as obtained in the process (source: Author, in Excel)
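The stage-wise split described above can be sketched outside SAS for illustration. This is a hedged sketch: the column groups and the target name `COURSE_G` are assumptions based on Table 1; the actual split was done before the SAS EM import.

```python
# Hypothetical column groups mirroring the three process stages (Table 1).
STAGE1_COLS = ["AGE", "SEX", "COLD", "HEADACHE", "FEVER", "NAUSEA", "LOC",
               "SEIZURE", "ONSET", "BT", "STIFF", "KERNIG", "LASEGUE",
               "GSC", "LOC_DAT", "FOCAL"]
STAGE2_EXTRA = ["WBC", "CRP", "ESR", "CT_FIND", "EEG_WAVE", "EEG_FOCUS",
                "CSF_CELL", "Cell_Poly", "Cell_Mono", "CSF_PRO", "CSF_GLU",
                "CULTURE_FIND", "CULTURE"]

def split_by_stage(table, target="COURSE_G"):
    """table: dict mapping column name -> list of values.
    Returns one table per stage, each keeping only the data known so far."""
    def take(cols):
        return {c: table[c] for c in cols + [target]}
    stage1 = take(STAGE1_COLS)
    stage2 = take(STAGE1_COLS + STAGE2_EXTRA)
    stage3 = dict(table)  # full table: everything gathered by discharge
    return stage1, stage2, stage3
```

Each stage table always carries the target column, so the same modeling flow can be pointed at any of the three.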
3.2. Target variables
The original variable "C_COURSE" represents the patient's clinical course at discharge. It can take 11 different values per record, and predicting a variable like this (with multiple values) is not a common data mining task. The grouped course variable already transforms it into a binary variable. The value "positive" means that the patient at discharge is dead or not completely healthy; "negative" is the opposite, meaning the patient was successfully treated. The distribution of this class variable is significantly uneven: 117 negatives to 23 positives. A proper operation might be to resample the data to even out the class frequencies. Down-sampling would leave us with only a few records, while up-sampling is not an implemented function in Enterprise Miner. The problem can be partially solved by setting the decision matrix so that it favors hitting "positive" values. After the model is built, we can also move the cut-off value to reach a preferred true negative rate or true positive rate. For example, the default cut-off of 0.5 can be moved to 0.4, which ensures a higher TPR, in our case more "positives" hit.
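The cut-off mechanics can be illustrated with a minimal sketch (the probabilities below are invented for illustration): lowering the threshold from the default 0.5 to 0.4 turns borderline cases into predicted positives, raising the true positive rate at the price of more false positives.

```python
def classify(prob_positive, cutoff=0.5):
    """Label a case 'p' when its predicted probability reaches the cut-off."""
    return ["p" if p >= cutoff else "n" for p in prob_positive]

# Hypothetical predicted probabilities of the "positive" class:
probs = [0.10, 0.42, 0.48, 0.55, 0.80]
print(classify(probs, cutoff=0.5))  # ['n', 'n', 'n', 'p', 'p']
print(classify(probs, cutoff=0.4))  # ['n', 'p', 'p', 'p', 'p']
```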
C_COURSE – original variable
• negative: no symptoms
• EEG_abnormal: the patient had abnormality of EEG
• CT_abnormal: abnormality of CT
• frontal_sign: frontal sign is observed.
• attention: loss of attention is observed.
• aphasia: the patient cannot speak.
• amnesia: retrograde amnesia
• ataxia: motor disturbance is observed.
• epilepsy: the patient suffered from epilepsy after discharge.
• memory_loss: memory disturbance.
• dead: death
COURSE(Grouped): Grouped attribute of C_COURSE
• n: negative – 117 cases (83,6%)
• p: positive – 23 cases (16,4%)
Picture 3 - Bar plot showing distribution of target variable (source: Author, in SAS EM)
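The grouping already present in the data can be written as a one-line rule, with only "negative" mapping to "n". The breakdown of the positive cases below is hypothetical; only the 117/23 totals come from the report.

```python
from collections import Counter

def group_course(c_course):
    """Collapse the 11-valued C_COURSE into the binary COURSE variable."""
    return "n" if c_course == "negative" else "p"

# Hypothetical tally reproducing the reported 117:23 distribution:
labels = ["negative"] * 117 + ["dead"] * 5 + ["epilepsy"] * 18
counts = Counter(group_course(v) for v in labels)
```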
3.3. Input variables
Table 1 - input variables available for target prediction/classification (Source: Author, in MS Excel)
Stage | Source | Attribute number | Attribute | Data type | Possible Values
Stage1
Personal data
1 AGE numerical [10.000 ; 84.000]
2 SEX categorical M (82), F (58)
Diagnosis 3 DIAG categorical
ABSCESS (9), BACTERIA (24), BACTE(E) (8), TB(E) (1), VIRUS(E) (30), VIRUS (68)
4 DIAG2 categorical BACTERIA (42), VIRUS (98)
Present history
5 COLD numerical [0.000 ; 35.000]
6 HEADACHE numerical [0.000 ; 63.000]
7 FEVER numerical [0.000 ; 63.000]
8 NAUSEA numerical [0.000 ; 32.000]
9 LOC numerical [0.000 ; 26.000]
10 SEIZURE numerical [0.000 ; 6.000]
11 ONSET categorical SUBACUTE (7), ACUTE (130), CHRONIC (1), RECURR (2)
Physical examination
12 BT numerical [35.500 ; 40.200]
13 STIFF numerical [0.000 ; 5.000]
14 KERNIG categorical [0.000 ; 1.000]
15 LASEGUE categorical [0.000 ; 1.000]
16 GSC numerical [9.000 ; 15.000]
17 LOC_DAT categorical - (98), + (42)
18 FOCAL categorical - (105), + (35)
Stage 2
Laboratory examination
19 WBC numerical [1070.000 ; 90009.000]
20 CRP numerical [0.000 ; 31.000]
21 ESR numerical [0.000 ; 60.000]
22 CT_FIND categorical abnormal (39), normal (101)
23 EEG_WAVE categorical abnormal (117), normal (23)
24 EEG_FOCUS categorical - (104), + (36)
25 CSF_CELL numerical [0.000 ; 63350.000]
26 Cell_Poly numerical [0.000 ; 61520.000]
27 Cell_Mono numerical [0.000 ; 7840.000]
28 CSF_PRO numerical [0.000 ; 474.000]
29 CSF_GLU numerical [0.000 ; 520.000]
30 CULTURE_FIND categorical F (107), T (33)
31 CULTURE categorical
- (107), neisseria (2), strepto (9), staphylo (2), tb (1), influenza (1), measles (1), pi(B) (6), varicella (3), rubella (2), adeno (1), herpes (5)
Stage 3
Therapy and course
32 THERAPY2 categorical
multiple (10), ABPC+CZX (12), FMOX+AMK (1), ABPC (3), ope (2), Dara_P (1), ABPC+FMOX (4), LMOX (1), PCG (1), ABPC+LMOX (2), PIPC+CTX (1), no_therapy (58), ABPC+CTX (2), INH+RFP (1), ABPC+CEX (1), Zobirax (25), ARA_A (11), INH (1), globulin (3)
33 CSF_CELL3 numerical [8.000 ; 4860.000]
34 CSF_CELL7 numerical [0.000 ; 2137.000]
The diagnosis attributes were not suitable as input variables due to their dependence on other variables; the DIAG attribute can also be biased by the doctor's personal judgment. An attribute with almost the same added value as DIAG2 can be created by subtracting Cell_Mono from Cell_Poly (Cell_Poly - Cell_Mono) and treating a positive result as a BACTERIAL disease and a negative result as a VIRAL disease. Except for the "Therapy" and "Risk" attributes, all other attributes are objective measures of the patient's physical state. The THERAPY attribute represents the approach chosen for the patient's treatment; it is therefore objective and can also be taken into account. RISK belongs among the final-state information about the patient and is obtained at the same time as the COURSE values, so it makes no sense to use it for predicting the value of COURSE.
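The derived attribute discussed above amounts to a sign test, which matches the differential-diagnosis rule in the appendix: polynuclear dominance suggests a bacterial cause, mononuclear dominance a viral one. A minimal sketch:

```python
def poly_mono_diagnosis(cell_poly, cell_mono):
    """Approximate DIAG2 from the sign of Cell_Poly - Cell_Mono."""
    return "BACTERIA" if cell_poly - cell_mono > 0 else "VIRUS"

print(poly_mono_diagnosis(61520, 300))  # BACTERIA
print(poly_mono_diagnosis(120, 7840))   # VIRUS
```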
4. Data preparation
4.1. Treatment of outliers and missing values
No outliers were observed in the obtained dataset. The chosen modeling methods can handle the few missing values in the CSF_CELL3 attribute.
4.2. Variable transformations
• In the case of interval inputs, if skewed, a transformation towards a more normal distribution could be considered when calculating parametric models. This is not needed for decision trees, which are insensitive to skewed distributions of predictors. All transformations will be done in EM before calculating the specific models. Models will also be run without transformations and compared with the transformed ones, to check whether the transformation did not make the results worse.
• Creating a POLY-MONO attribute was considered; nevertheless, as shown in other work, it adds the same value for the purposes of our task as DIAG2. For this reason DIAG2 will be kept and used from stage 2.
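As an example of such a transformation, strongly right-skewed interval inputs like WBC or CSF_CELL (Table 1 shows ranges spanning several orders of magnitude) could be log-transformed; this is a hedged sketch, not the exact transformation applied in EM. `log1p` handles the zero values present in the data.

```python
import math

def log_transform(values):
    """log(1 + x), a common remedy for right skew that tolerates zeros."""
    return [math.log1p(v) for v in values]

print(log_transform([0.0, 1070.0, 90009.0]))
```

Because the map is monotone, rank-based methods such as decision trees produce the same splits before and after it.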
4.3. Sampling and data partitioning
• In all models except decision trees, a standard 70/30 partitioning will be used: a 70% training set and a 30% validation set. The decision trees must also have a test sample, as the pruned subtrees are chosen based on their performance on the validation sample.
• Where possible, k-fold cross-validation will be used during training, as the number of observations is low.
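A 70/30 partition of this data benefits from stratification: with only 23 positives in 140 records, a plain random split could easily starve the validation set of events. This is an illustrative sketch, not the SAS EM implementation (the Data Partition node offers a similar stratified method).

```python
import random

def stratified_split(labels, test_frac=0.30, seed=42):
    """Return (train_idx, valid_idx), splitting each class 70/30 separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, valid = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * (1 - test_frac)))
        train.extend(idxs[:cut])
        valid.extend(idxs[cut:])
    return sorted(train), sorted(valid)

labels = ["n"] * 117 + ["p"] * 23
train, valid = stratified_split(labels)
```

On this class distribution the split yields 98 training and 42 validation cases, with 7 of the 23 positives kept for validation.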
5. Modeling
5.1. Candidate models
5.1.1. Decision Trees
To have a variety for comparison, all three splitting criteria (ProbChiSq, Gini, Entropy) are chosen for both a binary tree and a multiway tree. Multiway trees are set to a maximum of 5 branches. This is done in every stage, leaving us with 18 (6x3) models in total. Decisions (as rules) were used because of the higher value of finding the few "positive" cases. Cross-validation is a much-appreciated function here, as the dataset is small. The settings of the models are shown below.
Table 2 - tuning parameters of Decision Trees (Source: Author, in MS Excel)
Model | Splitting rules | Split search | Subtree | Cross Validation
Stage1
Tree1 - binary | NTC: ProbChisq, Sig. Level: 0.2, Max. Branch: 2 | Use decisions: Yes, Use priors: No | Method: Largest, Measure: Decision | Yes - 10
Tree2 - binary | NTC: Gini, Sig. Level: 0.2, Max. Branch: 2 | Use decisions: Yes, Use priors: No | Method: Largest, Measure: Decision | Yes - 10
Tree3 - binary | NTC: Entropy, Sig. Level: 0.2, Max. Branch: 2 | Use decisions: Yes, Use priors: No | Method: Largest, Measure: Decision | Yes - 10
Tree4 - multiway | NTC: ProbChisq, Sig. Level: 0.2, Max. Branch: 5 | Use decisions: Yes, Use priors: No | Method: Largest, Measure: Decision | Yes - 10
Tree5 - multiway | NTC: Gini, Sig. Level: 0.2, Max. Branch: 5 | Use decisions: Yes, Use priors: No | Method: Largest, Measure: Decision | Yes - 10
Tree6 - multiway | NTC: Entropy, Sig. Level: 0.2, Max. Branch: 5 | Use decisions: Yes, Use priors: No | Method: Largest, Measure: Decision | Yes - 10
Stage2
Tree1 to Tree6 | settings identical to Stage1
Stage3
Tree1 to Tree6 | settings identical to Stage1
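The Gini and Entropy criteria from Table 2 can be illustrated by computing both impurity measures for the root node of our data (117 negatives, 23 positives); a split is chosen to maximise the decrease of the selected measure, while ProbChiSq instead ranks splits by a chi-square test. This is a generic illustration, not SAS EM's internal code.

```python
import math

def gini(counts):
    """Gini impurity of a node with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy of a node, in bits."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

root = [117, 23]
print(round(gini(root), 3))     # 0.275
print(round(entropy(root), 3))  # 0.644
```

A candidate split is scored by the weighted impurity of its child nodes; the lower that weighted impurity falls below the root values above, the better the split.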
5.1.2. Neural Networks
Neural network results tend to be best when the target variable is relatively evenly distributed in the learning set. (Scholderer, 2017) Our dataset is the opposite: the target variable is distributed unevenly, 84:16. Furthermore, we possess only a small dataset with not that many variables. Although neural networks are known as a powerful modeling tool when there are not too many input variables, they are effective only when the amount of data is sufficient. (Scholderer, 2017)
Two different models will be built for each stage, resulting in 6 models in total. Each stage will have one model with 2 hidden units and one model with 5 hidden units. The interval input variables go through binning, and afterwards dummy variables are created from all categorical variables. This way the neural network promises to work better. The networks were also run without this data transformation, but the results turned out to be unacceptable. Tuning parameters can be seen below.
Table 3 - tuning parameters of neural networks (Source: Author, in MS Excel)
Model Tuning parameters
Stage1 Neural1 Hidden units: 2, Model selection criterion:Profit/Loss, archit.: MLP
Neural2 Hidden units: 5, Model selection criterion:Profit/Loss, archit.: MLP
Stage2 Neural1 Hidden units: 2, Model selection criterion:Profit/Loss, archit.: MLP
Neural2 Hidden units: 5, Model selection criterion:Profit/Loss, archit.: MLP
Stage3 Neural1 Hidden units: 2, Model selection criterion:Profit/Loss, archit.: MLP
Neural2 Hidden units: 5, Model selection criterion:Profit/Loss, archit.: MLP
The combination, activation and error functions of the neural networks are left at their defaults; they are auto-picked by the software based on the other model settings and the input variables. (SAS Institute Inc., 2016)
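The preprocessing described above, binning interval inputs and dummy-coding categorical inputs, can be sketched as follows. This is a rough stand-in for the corresponding SAS EM transformation nodes, with equal-width binning assumed.

```python
def bin_equal_width(values, n_bins=4):
    """Equal-width binning of an interval input into bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def one_hot(values):
    """Dummy (one-hot) variables from a categorical input."""
    levels = sorted(set(values))
    return [[1 if v == lev else 0 for lev in levels] for v in values]

print(bin_equal_width([35.5, 37.0, 38.5, 40.2]))   # body-temperature example
print(one_hot(["ACUTE", "SUBACUTE", "ACUTE"]))
```

After these steps, every input is a small integer or a 0/1 indicator, which keeps the network's inputs on comparable scales.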
5.2. Model selection approach
Misclassification rate was chosen as the selection criterion for finding the best-fitting model. The model will also be chosen with respect to average profit values. Average profit is made a secondary criterion, as this value can later be improved by moving the cut-off value, which is set to the default of 0.5. The values of this measure in the table below are misclassification rates on validation data for the neural networks and on test data for the decision trees. We can clearly see across the stages that any additional information helped the neural networks improve their performance, i.e. lowered their misclassification rate. The trees were also enriched by the additional inputs in the second stage; the third-stage inputs did not help and in some cases even worsened the results. None of the third-stage inputs was chosen as a splitting criterion in any of the 6 DT models. Although the neural networks classified better and better as the process proceeded, they did not improve on the average profit criterion. In the second stage, the best results, taking misclassification rate and average squared error into account, are achieved by the binary Gini tree. Its average profit is also solid compared to the other models, so it can be considered the winning model of the second stage. In the first stage, the binary ProbChiSq tree won on all chosen criteria. Stage 3 will not have a winning model chosen, as it did not give much additional information for predicting the class variable.
Table 4 - assessment of proposed models across 3 different stages (Source: Author, in MS Excel)
Model / STAGE1 | Misclass. Rate | Average Squared Error | ROC index | Average Profit/Loss
Neural1 - hu2 0,256 0,224 0,578 0,930
Neural2 - hu5 0,256 0,195 0,645 0,930
Tree1 -binary - prob 0,172 0,140 0,580 1,020
Tree2 -binary - gini 0,310 0,212 0,485 0,960
Tree3 -binary - entropy 0,310 0,212 0,485 0,960
Tree4 -multiway - prob 0,172 0,171 0,435 0,793
Tree5 -multiway - gini 0,379 0,196 0,450 0,828
Tree6 -multiway - entropy 0,379 0,196 0,450 0,828
Model / STAGE2 | Misclass. Rate | Average Squared Error | ROC index | Average Profit/Loss
Neural1 - hu2 0,186 0,138 0,750 0,900
Neural2 - hu5 0,186 0,140 0,706 1,090
Tree1 -binary - prob 0,172 0,168 0,525 1,000
Tree2 -binary - gini 0,138 0,122 0,580 1,100
Tree3 -binary - entropy 0,138 0,126 0,535 1,000
Tree4 -multiway - prob 0,240 0,143 0,535 1,060
Tree5 -multiway - gini 0,206 0,144 0,620 1,380
Tree6 -multiway - entropy 0,206 0,144 0,620 1,380
Model / STAGE3 | Misclass. Rate | Average Squared Error | ROC index | Average Profit/Loss
Neural1 - hu2 0,163 0,130 0,722 1,000
Neural2 - hu5 0,163 0,144 0,798 1,000
Tree1 -binary - prob 0,241 0,183 0,525 1,000
Tree2 -binary - gini 0,138 0,122 0,580 1,100
Tree3 -binary - entropy 0,206 0,137 0,630 1,340
Tree4 -multiway - prob 0,241 0,156 0,550 1,030
Tree5 -multiway - gini 0,206 0,144 0,620 1,380
Tree6 -multiway - entropy 0,206 0,144 0,620 1,380
5.3. Final model
The Gini binary tree for the second stage and the ProbChiSq binary tree for the first stage were chosen as the best candidates. In the picture below, the splitting attributes of the ProbChiSq binary tree and their hierarchy can be seen:
Picture 4 - ProbChiSq Binary Tree - visualization (Source: author, in SAS EM)
5.3.1. Overall predictive accuracy
If the "Gini binary tree" from the second stage had the task of choosing the half of the dataset in which as many events as possible should appear, its choice would capture 63.48% of all positive cases; 36.52% of the events would be left in the second half of the dataset. Without a model and predictors (random choice), only half of all events would be found in a randomly halved dataset. For the "ProbChiSq binary tree" the proportion would be 67.06% to 32.94%.
Model performance (binary tree, Gini, stage 2) is shown below:
Picture 5 - Cumulative % Captured response - model Gini Binary Tree (stage2) (Source: Autor, in SAS EM)
Model performance (binary tree, ProbChiSq, stage 1) is shown below:
Picture 6 - Cumulative % Captured response - model ProbChiSq Binary Tree (stage1) (Source: Autor, in SAS EM)
5.3.2. Observed versus predicted target values
Of the total of 5 positive cases in the validation data, the model would predict 2 correctly. That is 7.14% of all 28 cases in the validation dataset and 40% of the positive cases. 3 positive cases would be incorrectly classified as negatives (10.71% of the total).
67.86% of all cases, which is 19 cases, would be negatives predicted correctly (82.61% of the negative cases). 14.29% of all data would be negatives classified as positives.
Picture 7 - Comparison of classification charts - model ProbChiSq Binary Tree (stage1) - train and validate (Source: Author, in SAS EM)
The accuracy, sensitivity, specificity and related measures achieved on the validation set are given in the table below:
Table 5 - Measures of predicting binary target in Validation Data - model ProbChiSq Binary Tree (stage1) (Source: Author, in MS Excel)
Measure Value Derivations
Sensitivity 0.4000 TPR = TP / (TP + FN)
Specificity 0.8261 SPC = TN / (FP + TN)
Precision 0.3333 PPV = TP / (TP + FP)
Negative Predictive Value 0.8636 NPV = TN / (TN + FN)
False Positive Rate 0.1739 FPR = FP / (FP + TN)
False Discovery Rate 0.6667 FDR = FP / (FP + TP)
False Negative Rate 0.6000 FNR = FN / (FN + TP)
Accuracy 0.7500 ACC = (TP + TN) / (P + N)
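The derivations in Table 5 can be reproduced from the confusion matrix implied by the previous section (TP=2, FN=3, TN=19, FP=4 on the 28 validation cases). A small sketch:

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive Table-5 style measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # TPR
        "specificity": tn / (fp + tn),   # SPC
        "precision":   tp / (tp + fp),   # PPV
        "npv":         tn / (tn + fn),
        "fpr":         fp / (fp + tn),
        "fdr":         fp / (fp + tp),
        "fnr":         fn / (fn + tp),
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

m = binary_metrics(tp=2, fp=4, tn=19, fn=3)
print(round(m["sensitivity"], 4), round(m["accuracy"], 4))  # 0.4 0.75
```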
5.3.3. Improvement over baseline
The plot below shows the cumulative ratio of the percent captured responses within each decile to the baseline percent response. (SAS Institute Inc., 2016) The baseline is a ratio of 1.0. If the model checked half of the cases (a depth of 50), we would hit 1.34 times more positive cases than if we searched randomly. This can also be read from the plot displayed below.
Picture 8 - Cumulative lift - ProbChiSq Binary Tree (stage1), selected model (Source: Author, in SAS EM)
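Cumulative lift itself is simple to compute, and a toy sketch makes the reading of the plot concrete: among cases ranked by predicted probability, take the share of all positives captured down to a given depth and divide by that depth. 1.0 is the random baseline; the selected model reaches about 1.34 at a depth of 50%. The ranking below is invented for illustration.

```python
def cumulative_lift(ranked_labels, depth_frac):
    """ranked_labels: 0/1 outcomes sorted by descending predicted probability.
    Returns the lift at the given depth fraction (1.0 = random baseline)."""
    k = int(round(len(ranked_labels) * depth_frac))
    captured = sum(ranked_labels[:k])
    return (captured / sum(ranked_labels)) / depth_frac

# Toy ranking (1 = positive), hypothetical scores already sorted descending:
ranked = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(cumulative_lift(ranked, 0.5))  # 2.0
```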
6. Discussion
6.1. Assessment of model performance
The limited data treatment performed here suggests that better performance could be expected even from the same models if various data treatment methods and a variety of model adjustments were applied (e.g., methods of input reduction were not used). As we worked with a very small sample of data, the data mining methods could not fully demonstrate their capabilities. Nevertheless, even on this tiny dataset, our mining tool was able to create models that can, in the early stage of patient admission, predict the patient's "positive/negative" status at discharge. The models evidently outperform simple guessing with knowledge of the class distribution. Besides the not entirely convincing performance, another problem that could be solved with a larger dataset is model stability.
6.2. Contribution to the solution of the problem
With the knowledge obtained in this work, we may assume that the patient's final status (not just positive/negative, but also the resulting disease, treatment consequences and so forth) might be predicted already in the early stages of the patient's admission. By adjusting the cut-off value, our models can reach such predictive qualities that every sought-after case is found. In our case, we could capture all the cases where the patient will have resulting health issues after treatment. All we need to do is decrease the cut-off value to 0.2, as shown in the picture below; that way we reach a true positive rate of 100%. This also results in more false positives being classified, but that is a sacrifice which needs to be made. The plot shows how the true negative rate (brown line) decreases.
Picture 9 - manipulating the cut-off value (Source: Author, in SAS EM)
6.3. Deployment recommendations
After deploying these kinds of models, they need to be recalculated on newly gathered data every once in a while. The models should be used only as an additional tool within the doctor's professional working procedures.
List of Pictures
Picture 1 - process of examining and treating the patient (source: Author, in draw.io) .................................................. 2
Picture 2 - source data - graphically divided into 3 stages as obtained in the process (source: Author, in Excel) ........... 3
Picture 3 - Bar plot showing distribution of target variable (source: Author, in SAS EM) ................................................ 4
Picture 4 - ProbChiSq Binary Tree - visualization (Source: Author, in SAS EM) ............................................................... 10
Picture 5 - Cumulative % Captured response - model Gini Binary Tree (stage2) (Source: Author, in SAS EM) ............... 10
Picture 6 - Cumulative % Captured response - model ProbChiSq Binary Tree (stage1) (Source: Author, in SAS EM) ..... 11
Picture 7 - Comparison of classification charts - model ProbChiSq Binary Tree (stage1) - train and validate (Source: Author, in SAS EM) .... 11
Picture 8 - Cumulative lift - ProbChiSq Binary Tree (stage1), selected model (Source: Author, in SAS EM) ................... 12
Picture 9 - manipulating the cut-off value (Source: Author, in SAS EM) .......................................................................... 13
List of Tables
Table 1 - input variables available for target prediction/classification (Source: Author, in MS Excel) ............................... 5
Table 2 - tuning parameters of Decision Trees (Source: Author, in MS Excel) .................................................................... 7
Table 3 - tuning parameters of neural networks (Source: Author, in MS Excel) ................................................................. 8
Table 4 - assessment of proposed models across 3 different stages (Source: Author, in Excel) ......................................... 9
Table 5 - Measures of predicting binary target in Validation Data - model ProbChiSq Binary Tree (stage1) (Source: Author, in MS Excel) .... 12
References
Berka, P. & Kocka, T., 2003. Meningoencephalitis Data Analysis Based on the CRISP-DM Methodology. Prague: University of Economics, Prague.
SAS Institute Inc., 2016. SAS Enterprise Miner Reference Help. s.l.: s.n.
Scholderer, J., 2017. Lecture 13 - Neural networks. Aarhus: BSS, Aarhus University.
Tsumoto, S., 2000. Guide to the meningoencephalitis Diagnosis Data Set. [Online] Available at: http://www.ar.sanken.osaka-u.ac.jp/pub/washio/jkdd/menin.htm [Accessed 29 December 2017].
wikipedia community, n.d. Cross-industry standard process for data mining. [Online] Available at: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining [Accessed 29 December 2017].
Appendix
Diagnosis approaches:
Diagnosis 1: The following symptoms are checked:
• high fever (present history data)
• severe headache (present history data)
• nausea (present history data)
• vomit (no info in data)
• neck stiffness (physical examination data)
• Kernig sign (physical examination data)
• Lasegue sign (physical examination data)
Diagnosis 2 (differential): The differential diagnosis is made as follows:
1.) Check the cell count in cerebrospinal fluid (CSF).
2.) If polynuclear cells are dominant, bacterial meningitis is diagnosed.
3.) If mononuclear cells are dominant, viral meningitis is diagnosed.
4.) For diagnosis of brain abscess, CT is used to confirm the diagnosis.
CRISP-DM
CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a European Commission funded project for defining a standard process model for carrying out data mining projects. CRISP-DM addresses the needs of all levels of users in deploying data mining technology to solve business problems. The project's aim is to define and validate a data mining process that is generally applicable in diverse industry sectors. According to CRISP-DM, the life cycle of a data mining project consists of six phases, shown in Fig. 1. We followed these phases during our work with the meningoencephalitis data. (Berka & Kocka, 2003)
Appendix Picture 1 - phases of CRISP-DM
Diagrams
Appendix Picture 2 - diagram of DTs (Source: Author, in SAS EM)
Appendix Picture 3 - diagram of NNs (Source: Author, in SAS EM)