
Breast Cancer Diagnostics with Bayesian Networks

Reevaluating the Wisconsin Breast Cancer Database with BayesiaLab

Stefan Conrady, [email protected]

Dr. Lionel Jouffe, [email protected]

March 5, 2011

Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting


Table of Contents

Introduction
    About the Authors
        Stefan Conrady
        Lionel Jouffe

Case Study & Tutorial
    Background
    Wisconsin Breast Cancer Database
    Notation
    Data Import
    Unsupervised Learning
    Model 1: Markov Blanket
    Model 1 Performance
    Model 2: Augmented Markov Blanket
    Model 2 Performance
    Structural Coefficient
    Conclusion

Model Application
    Interactive Inference
    Target Interpretation Tree

Summary

References

Contact Information
    Conrady Applied Science, LLC
    Bayesia SAS
    Copyright

www.conradyscience.com | www.bayesia.com


Introduction

Data classification is one of the most common tasks in the field of statistical analysis, and countless methods have been developed for this purpose over time. A common approach is to develop a model based on known historical data, i.e. where the class membership of a record is known, and to use this generalization to predict the class membership for a new set of observations.

Applications of data classification permeate virtually all fields of study, including social sciences, engineering, biology, etc. In the medical field, classification problems often appear in the context of disease identification, i.e. making a diagnosis about a patient’s condition. The medical sciences have a long history of developing a large body of knowledge, which links observable symptoms with known types of illnesses. It is the physician’s task to use the available medical knowledge to make inferences based on the patient’s symptoms, i.e. to classify the medical condition, in order to enable appropriate treatment.

Over the last two decades, so-called medical expert systems have emerged, which are meant to support physicians in their diagnostic work. Given the sheer amount of medical knowledge in existence today, it should not be surprising that significant benefits are expected from such machine-based support in terms of medical reasoning and inference.

In this context, several papers by Wolberg, Street, Heisey and Mangasarian became much-cited examples. They proposed an automated method for the classification of Fine Needle Aspirates¹ through image processing and machine learning, with the objective of achieving a greater accuracy in distinguishing between malignant and benign cells for the diagnosis of breast cancer. At the time of their study, the practice of visual inspection of FNA yielded an inconsistent diagnostic accuracy. The proposed new approach would increase this accuracy reliably to over 95%. This research was quickly translated into clinical practice and has since been applied with continued success.

As part of their studies in the late 1980s and 1990s, the research team generated what became known as the Wisconsin Breast Cancer Database, which contains measurements of hundreds of FNA samples and the associated diagnoses. This database has been extensively studied, especially outside the medical field. Statisticians and computer scientists have proposed a wide range of techniques for this classification problem and have continuously raised the benchmark for predictive performance.

Our objective with this paper is to present Bayesian networks as a very practical framework for working with this kind of classification problem. Furthermore, we intend to demonstrate how the BayesiaLab software can create, quickly and simply, a Bayesian network model that is on par performance-wise with virtually all existing models. Also, while most of our previous white papers focused on marketing science applications, we hope that this case study from the medical field can demonstrate the universal applicability of Bayesian networks.

We speculate that our modeling approach with Bayesian networks (as the framework) and BayesiaLab (as the software tool) achieves 99% of the performance of the best conceivable, custom-developed model, while only requiring 10% of the development time. This allows researchers to focus more on the subject matter of their studies, because they are less distracted by the technicalities of traditional statistical tools. As a result, Bayesian networks and BayesiaLab are important innovations for accelerating research and pursuing translational science.

¹ Fine needle aspiration (FNA) is a percutaneous (“through the skin”) procedure that uses a fine gauge needle (22 or 25 gauge) and a syringe to sample fluid from a breast cyst or remove clusters of cells from a solid mass. With FNA, the cellular material taken from the breast is usually sent to the pathology laboratory for analysis.

About the Authors

Stefan Conrady

Stefan Conrady is the co-founder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America. Stefan Conrady has extensive management experience in the fields of product planning, marketing science and advanced analytics. Prior to establishing his own firm, he headed the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe

Dr. Lionel Jouffe is co-founder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999 and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.

Case Study & Tutorial

Background

To provide context for this study, we quote Mangasarian, Street and Wolberg (1994), who conducted the original research related to breast cancer diagnosis with digital image processing and machine learning:

Most breast cancers are detected by the patient as a lump in the breast. The majority of breast lumps are benign, so it is the physician’s responsibility to diagnose breast cancer, that is, to distinguish benign lumps from malignant ones. There are three available methods for diagnosing breast cancer: mammography, FNA with visual interpretation and surgical biopsy. The reported sensitivity, i.e. ability to correctly diagnose cancer when the disease is present, of mammography varies from 68% to 79%, of FNA with visual interpretation from 65% to 98%, and of surgical biopsy close to 100%. Therefore mammography lacks sensitivity, FNA sensitivity varies widely, and surgical biopsy, although accurate, is invasive, time consuming and costly. The goal of the diagnostic aspect of our research is to develop a relatively objective system that diagnoses FNAs with an accuracy that approaches the best achieved visually.

Wisconsin Breast Cancer Database

This breast cancer database was created through the clinical work of Dr. William H. Wolberg at the University of Wisconsin Hospitals in Madison. As of 1992, Dr. Wolberg had collected 699 instances of patient diagnoses in this database, consisting of two classes: 458 benign cases (65.5%) and 241 malignant cases (34.5%).

The following eleven attributes² are included in the database:

1. Sample code number

2. Clump Thickness (1 - 10)

3. Uniformity of Cell Size (1 - 10)

4. Uniformity of Cell Shape (1 - 10)

5. Marginal Adhesion (1 - 10)

6. Single Epithelial Cell Size (1 - 10)

7. Bare Nuclei (1 - 10)

8. Bland Chromatin (1 - 10)

9. Normal Nucleoli (1 - 10)

10. Mitoses (1 - 10)

11. Class (benign/malignant)

Attributes 2 through 10 were computed from digital images of fine needle aspirates (FNA) of breast masses. These features describe the characteristics of the cell nuclei in the image. The class membership was established via subsequent biopsies or via long-term monitoring of the tumor.

² Upon exclusion of the row identifier, this database is ideally suited for the evaluation version of BayesiaLab, which is limited to ten nodes.

We will not go into detail here regarding the definition of the attributes and their measurement. Rather, we refer the reader to papers referenced in the bibliography.

The Wisconsin Breast Cancer Database is available to any interested researcher from the UC Irvine Machine Learning Repository.³ We use this database in its original format without any further transformation, so our results can be directly compared to dozens of methods that have been developed since the original study.

Notation

To clearly distinguish between natural language, software-specific functions and study-specific variable names, the following notation is used:

• BayesiaLab-specific functions, keywords, commands, etc., are shown in bold type.

• Attribute/variable/node names are capitalized and italicized.

Data Import

Our modeling process begins with importing the database, which is available in a CSV format, into BayesiaLab. The Data Import Wizard guides the analyst through the required steps.

In the first dialogue box of the Data Import Wizard, we can click on Define Typing and specify that we wish to set aside a test set of the database. Following common practice, we will randomly select 20% of the 699 records as test data; the remaining 80% will serve as our training data set.
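For readers who want to picture the 80/20 split outside BayesiaLab, the following Python sketch randomly assigns record indices to a training set and a test set. The function name, the seed and the truncating rounding rule are assumptions made for illustration; the rounding is chosen so that the sizes match the 560/139 split reported in this paper, and BayesiaLab's own sampling procedure may differ.

```python
import random

def split_indices(n_records, test_fraction=0.2, seed=0):
    """Randomly split record indices into a training and a test set."""
    rng = random.Random(seed)                 # fixed seed (assumption) for reproducibility
    n_test = int(n_records * test_fraction)   # truncates: 699 * 0.2 -> 139
    indices = list(range(n_records))
    rng.shuffle(indices)
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test

train_idx, test_idx = split_indices(699)
print(len(train_idx), len(test_idx))
```

Every record lands in exactly one of the two sets, so the test set never leaks into model learning.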

³ UC Irvine Machine Learning Repository website: http://archive.ics.uci.edu/ml/

In the next step, the Data Import Wizard will suggest the data type for each variable (or attribute⁴). Attributes 2 through 10 are identified as continuous variables and Class is read as a discrete variable. Only for the first variable, Sample code, the analyst has to specify Row Identifier, so it is not mistaken for a continuous predictor variable.

For the import process of this study, the most important step is the selection of the discretization algorithm. As we know that the exclusive objective is classification, we will choose the Decision Tree algorithm, which will discretize each variable for an optimum information gain with respect to the target variable Class.
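The internals of BayesiaLab's Decision Tree discretization are not described in this paper. The Python sketch below illustrates only the underlying idea of choosing a cut point that maximizes information gain with respect to a class label. It finds a single binary threshold, whereas the study uses three intervals per variable, and the toy values and labels are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    """Candidate thresholds are the observed values, except the largest one."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# Made-up scores on the 1-10 scale with made-up class labels
values = [1, 1, 2, 3, 3, 5, 7, 8, 10, 10]
labels = ["benign"] * 5 + ["malignant"] * 5
print(best_threshold(values, labels))
```

Applying the same search again within each resulting interval would yield further cut points, which is one way a tree-based discretizer can arrive at three or more intervals.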

Bayesian networks are entirely non-parametric, probabilistic models and for their estimation they require a certain minimum of observations. To help us with the selection of discretization levels, we use the heuristic of five observations per parameter and probability cell. Given that we have a relatively small database with only 560 observations,⁵ three discretization intervals for each variable appear to be an appropriate choice. If we used a higher number of discretization levels, we would most likely need more observations for the reliable estimation of the parameters.
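To make the heuristic concrete, the short Python sketch below counts the cells of a node's conditional probability table and multiplies by five. Reading "probability cell" as one entry per combination of child state and parent states is our assumption, as are the function name and the example node configurations.

```python
from math import prod

def min_observations(child_states, parent_states, per_cell=5):
    """Observations suggested by the 'five per probability cell' heuristic.

    child_states  -- number of states of the node
    parent_states -- list with the number of states of each parent
    """
    cells = child_states * prod(parent_states)  # size of the conditional probability table
    return per_cell * cells

# A predictor with 3 intervals and one binary parent (e.g. Class)
print(min_observations(3, [2]))
# The same predictor left at 10 levels
print(min_observations(10, [2]))
# A node with 3 intervals and three parents of 3 intervals each
print(min_observations(3, [3, 3, 3]))
# The same node with 10 levels everywhere
print(min_observations(10, [10, 10, 10]))
```

With three intervals, even a node with three parents stays below the 560 training cases, whereas ten levels per variable would call for tens of thousands of observations.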

⁴ “Attribute” and “variable” are used interchangeably throughout the paper.

⁵ 560 cases are in the training set (80%) and 139 are in the test set (20%).

Upon clicking Finish, we will immediately see a representation of the newly imported database in the form of a fully unconnected Bayesian network. Each variable is now represented as a blue node in the graph panel of BayesiaLab.

The question mark symbol, which is associated with the Bare Nuclei node, indicates that there are missing values for this variable. Hovering over the question mark with the mouse pointer while pressing the “i” key will show the number of missing values.

Unsupervised Learning

When working with BayesiaLab, it is recommended to always perform Unsupervised Learning first on any newly imported database. This is the case even when the exclusive objective is predictive modeling, for which Supervised Learning will later be the main tool.

Learning>Association Discovering>EQ will initiate the EQ algorithm, which, in this case, is suitable for the initial review of the database. For larger databases with significantly more variables, the Maximum Weight Spanning Tree is a very fast algorithm and can be used first instead.

The analyst can visually review the learned network structure and compare it to his or her domain knowledge. This quickly provides a “sanity check” for the database and the variables and it may highlight any inconsistencies.


Furthermore, one can also display the Pearson correlation between the nodes, by selecting Analysis>Graphic>Pearson’s Correlation and clicking the Display Arc Comment button in the toolbar.

For instance, a potentially incorrect sign of a correlation would be noticed immediately by the analyst as the arcs are color-coded. Red and blue arcs indicate negative and positive Pearson correlations respectively.

Model 1: Markov Blanket

Now that all data is stored within BayesiaLab (and reviewed through the Unsupervised Learning step), we can proceed to the modeling stage. Given our objective of predicting the state (benign versus malignant) of the variable Class, we will define it as the Target Variable by right-clicking on the node and selecting Set as Target Variable from the contextual menu. We need to specify this explicitly, so the subsequent Supervised Learning algorithm can use Class as the dependent variable. The supervised learning algorithms are then available under Learning>Target Node Characterization.


In most cases, the Markov Blanket algorithm is a good starting point for any predictive model. This algorithm is extremely fast and can even be applied to databases with thousands of variables and millions of records, although database size is not a concern in this particular study.

The Markov Blanket for a node A is the set of nodes composed of A’s parents, its children, and its children’s other parents (=spouses).

The Markov Blanket of the node A contains all the variables which, if we know their states, will shield the node A from the rest of the network. This means that the Markov Blanket of a node is the only knowledge needed to predict the behavior of that node A. Learning a Markov Blanket selects relevant predictor variables, which is particularly helpful when there is a large number of variables in the database. (In fact, this can also serve as a highly efficient variable selection method in preparation for other types of modeling, outside the Bayesian network framework.)
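The definition above translates directly into a few lines of Python. The sketch below reads the Markov Blanket off a known graph structure; it does not learn a structure from data, which is what BayesiaLab's algorithm does. The node names and the example graph are hypothetical.

```python
def markov_blanket(parents, node):
    """Markov Blanket of `node` in a directed acyclic graph.

    parents -- dict mapping every node to the set of its parent nodes
    """
    children = {n for n, ps in parents.items() if node in ps}
    # Spouses: the other parents of the node's children
    spouses = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | spouses

# Hypothetical graph: A -> B -> T -> C <- D, and C -> E
parents = {
    "A": set(),
    "B": {"A"},
    "T": {"B"},
    "C": {"T", "D"},
    "D": set(),
    "E": {"C"},
}
print(sorted(markov_blanket(parents, "T")))
```

In this example, A and E lie outside the blanket of T: once B, C and D are known, they add no further information about T.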

Upon Markov Blanket learning for our database, the resulting Bayesian network looks as follows:


This suggests that Class has a direct probabilistic relationship with all variables except Marginal Adhesion and Single Epithelial Cell Size, which are disconnected. The lack of their connection with the Target indicates that these nodes are independent of the Target given the nodes in the Markov Blanket.

For a better visual interpretation, we will apply the Force Directed Layout algorithm and obtain a view with Class at its center. Both unconnected variables are shown at the bottom of the graph.

Beyond distinguishing between predictors (connected nodes) and non-predictors (disconnected nodes), we can further examine the relationship with the Target Node Class by highlighting the Mutual Information of the arcs connecting the nodes. This function is accessible within the Validation Mode via Analysis>Graphic>Arcs’ Mutual Information.


The thickness of the arcs is now proportional to the Mutual Information, i.e. the strength of the relationship between the nodes. Intuitively, Mutual Information measures the information that X and Y share: it measures how much knowing one of these variables reduces our uncertainty about the other. For example, if X and Y are independent, then knowing X does not provide any information about Y and vice versa, so their Mutual Information is zero. At the other extreme, if X and Y are identical then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa.

Formal Definition of Mutual Information

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)$$
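As a small numerical check of this definition, the following Python sketch computes the Mutual Information from a joint probability table and reproduces the two extreme cases described above. The choice of base-2 logarithms (so that the result is in bits) and the toy distributions are assumptions for illustration.

```python
from math import log2

def mutual_information(joint):
    """Mutual Information in bits.

    joint -- dict mapping (x, y) to the joint probability p(x, y)
    """
    px, py = {}, {}
    for (x, y), p in joint.items():      # marginals p(x) and p(y)
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0                          # cells with p(x, y) = 0 contribute nothing
    )

# X and Y independent: knowing one says nothing about the other
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# X and Y identical: knowing one determines the other
identical = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))
print(mutual_information(identical))
```

For two identical binary variables with equally likely states, the shared information is the full entropy of either variable, i.e. one bit.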


We can also show the values of the Mutual Information on the graph by clicking on Display Arc Comments.

In the top part of the comment box attached to each arc, the Mutual Information of the arc is shown. Below, expressed as a percentage and highlighted in blue, we see the relative Mutual Information in the direction of the arc (parent node ➔ child node). And, at the bottom, we have the relative Mutual Information in the opposite direction of the arc (child node ➔ parent node).

Model 1 Performance

As we are not equipped with specific domain knowledge about the variables, we will not further interpret these relationships but rather run an initial test for Network Performance: we want to know how well this Markov Blanket model can predict the states of the Class variable, i.e. benign versus malignant. This test is available via Analysis>Network Performance>Targeted.

Using our previously defined test set for validating our model, we obtain the following, rather encouraging results:

Markov Blanket - Test Set


Of the 87 benign cases of the test set, 96.5% were correctly identified (true negatives), which corresponds to a false positive rate of 3.5%. More importantly though, of the 52 malignant cases, 100% were identified correctly (true positives) with no false negatives. This yields a total precision of 97.8%.
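The arithmetic behind these figures can be retraced with a few lines of Python. The case counts of 84 true negatives and 3 false positives are not stated in the text; they are inferred here from the reported percentages and should be read as an assumption. Note that the "total precision" reported by the software is the share of all correctly classified cases, which is more commonly called accuracy.

```python
# Confusion-matrix counts for the test set (n = 139).
# tn and fp are inferred from the reported percentages (assumption).
tn, fp = 84, 3    # benign cases: correctly identified / flagged as malignant
tp, fn = 52, 0    # malignant cases: correctly identified / missed

true_negative_rate = tn / (tn + fp)
false_positive_rate = fp / (tn + fp)
true_positive_rate = tp / (tp + fn)
total_precision = (tp + tn) / (tp + tn + fp + fn)

print(f"true negative rate:  {true_negative_rate:.2%}")
print(f"false positive rate: {false_positive_rate:.2%}")
print(f"true positive rate:  {true_positive_rate:.2%}")
print(f"total precision:     {total_precision:.2%}")
```

The two benign-case rates agree with the 96.5% and 3.5% quoted above to within rounding.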

Analogous to the original papers on this topic, we will also perform a K-Fold Cross Validation, which will iteratively select different test and training sets, and, based on those, learn and test the model. The Cross Validation can be performed via Tools>Cross Validation>Targeted.

We choose 10 samples, i.e. 10 iterations with 69 cases as test samples and 630 training cases.
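The iteration scheme can be sketched in Python as follows. How BayesiaLab assigns records to folds is not stated here, so the shuffling, the seed and the function name are assumptions. Since 699 is not divisible by 10, an even partition of the records gives test folds of 69 or 70 cases.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train, test) index lists for K-Fold Cross Validation."""
    rng = random.Random(seed)             # fixed seed (assumption) for reproducibility
    indices = list(range(n_records))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds whose sizes differ by at most one
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(699))
for train, test in splits:
    # In a full analysis, the model would be learned on `train` and scored on `test`.
    print(len(train), len(test))
```

Each record serves as a test case exactly once, so the pooled result covers all 699 cases.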

The results from the Cross Validation confirm the good performance of this model. The overall precision is 96.7%, with a false negative rate of 2.9%.


Markov Blanket - Cross Validation

At this point we might be tempted to conclude our analysis, as our Markov Blanket modeling is already performing at a level comparable to the most sophisticated (and complex) models ever developed from this database. More remarkable though is the minimal effort that was required for creating our model with the Supervised Learning algorithms in BayesiaLab. Even a new user of BayesiaLab would be expected to replicate the above steps in less than 30 minutes.

Model 2: Augmented Markov Blanket

BayesiaLab offers an extension to the Markov Blanket algorithm, named Augmented Markov Blanket, which performs an unsupervised learning algorithm on those nodes that were previously selected by the Markov Blanket learning. This makes it possible to identify influence paths between the predictor variables and can potentially help improve the prediction performance.

This sequence of algorithms can be started via Learning>Target Node Characterization>Augmented Markov Blanket.


As can be expected, the resulting network is somewhat more complex than the standard Markov Blanket.

The additional arcs (compared to the Markov Blanket network) are highlighted with green markers.

Model 2 Performance

With this Augmented Markov Blanket network we now proceed to performance evaluations, analogous to the Markov Blanket model. Initially, we evaluate the performance on the test set.


Augmented Markov Blanket - Test Set

To complete the evaluation of this model, we will also perform a K-Fold Cross Validation.

Augmented Markov Blanket - Cross-Validation


Despite the greater complexity of the model, we only see a marginal improvement in overall precision.

Structural Coefficient

Up to this point, we have not addressed the Structural Coefficient (SC), which is the only adjustable parameter for all the learning algorithms in BayesiaLab. This parameter is available to manage network complexity.

By default, this Structural Coefficient is set to 1, which reliably prevents the learning algorithms from overfitting the model to the data. In studies with relatively few observations, the analyst’s judgment is needed for determining a potential downward adjustment of this parameter. On the other hand, when data sets are very large, increasing the parameter to values higher than 1 will help manage the network complexity.

Given the fairly simple network structure of Model 1, complexity was of no concern. Model 2 is more complex, but still very manageable. The question is, could a more complex network provide greater precision without overfitting? To answer this question, we will perform the Structural Coefficient Analysis, which generates several metrics that help in making a trade-off between complexity and precision. The function Tools>Cross Validation>Structural Coefficient Analysis starts this process.

We are prompted to specify the range of the Structural Coefficient to be examined and the number of iterations. The Number of Iterations determines the interval steps to be taken within the specified range of the Structural Coefficient. Given the relatively light computational load, we choose 50 iterations. With more complex models, we might be more conservative, as each iteration re-learns and re-evaluates the network. Furthermore, we select Compute Structure/Target’s Precision Ratio to compute our target metric.

The resulting report will show us how the network structure changes as a function of the Structural Coefficient. This can be interpreted as the degree of confidence the analyst should have in any particular arc in the structure.


Clicking Graphs will show a synthesized network, consisting of all structures generated during the iterative learning process.


The reference structure is represented by black arcs, which show the original network learned prior to the start of the Structural Coefficient Analysis. The blue-colored arcs are not contained in the reference structure, but they appear in networks that have been learned as a function of the different Structural Coefficients (SC). The thickness of the arcs is proportional to the frequency of individual arcs existing in the learned networks.

More important for us, however, is determining the correct level of network complexity for a reliable and accurate prediction performance while avoiding overfitting the data. We can plot several different metrics in this context by clicking Curve.

Structure/Target’s Precision Ratio is the most relevant metric in our case and the corresponding plot is shown below. This first plot shows the metric computed for the whole database.

Typically, the “elbow” of the L-shaped curve identifies a suitable value for the Structural Coefficient (SC). More formally, we would look for the point on the curve where the second derivative is maximized. With a visual inspection, an SC value of around 0.4 appears to be a good candidate for that point. The portion of the curve where SC values approach 0 shows the characteristic pattern of overfitting, which is to be avoided.
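The "maximum second derivative" criterion can be approximated numerically with a central finite difference. In the Python sketch below, the curve values are made up to mimic an L-shaped plot and are not taken from the BayesiaLab report; the function name and the assumption of evenly spaced SC values are ours as well.

```python
def elbow(xs, ys):
    """Return the x value at which the discrete second derivative of ys is largest.

    Uses a central finite difference and assumes evenly spaced xs.
    """
    best_i, best_d2 = None, float("-inf")
    for i in range(1, len(xs) - 1):
        h = xs[i] - xs[i - 1]
        d2 = (ys[i - 1] - 2 * ys[i] + ys[i + 1]) / (h * h)
        if d2 > best_d2:
            best_i, best_d2 = i, d2
    return xs[best_i]

# Made-up, L-shaped values of a complexity/precision ratio over a grid of SC values
sc_values = [i / 10 for i in range(1, 11)]          # 0.1, 0.2, ..., 1.0
ratio = [50.0, 30.0, 14.0, 4.0, 3.5, 3.1, 2.8, 2.6, 2.5, 2.4]

print(elbow(sc_values, ratio))
```

On noisy curves the second difference can be erratic, which is one reason a visual inspection, as used in this study, remains a sensible cross-check.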

In order to further validate this interpretation, we will also compute the same metric for the training/test database.


This graph has the same properties as the previous one and suggests a similar SC value. As a result, we can have some confidence in this new value for the Structural Coefficient.

We will also plot the Target’s Precision alone as a function of the SC. On the surface, the curve resembles an L-shape, too, but the curve moves only within roughly 1 percentage point, i.e. between 97% and 98%. For practical purposes, this means that the curve is virtually flat.


As a result, the Structure/Target’s Precision Ratio, i.e. $\frac{\text{Structure}}{\text{Target's Precision}}$, is primarily a function of the numerator, i.e. Structure, as the denominator, Target’s Precision, is nearly constant across a wide range of SC values, as per the graph above.

The joint interpretation of Target’s Precision and Structure/Target’s Precision Ratio indicates that little can be gained by lowering the SC, but that there is a definite risk of overfitting.

Nevertheless, we relearn the network with an SC of 0.4, generating, as expected, a more complex network, which is displayed below.

The performance of the model (with SC=0.4) on the test set appears to be virtually the same, and the result from the K-Fold Cross Validation is not materially different from the previous performance with SC=1.

Augmented Markov Blanket (SC=0.4) - Test Set

Augmented Markov Blanket (SC=0.4) - Cross-Validation


Conclusion

The models reviewed, Markov Blanket and Augmented Markov Blanket (SC=0.4 and SC=1), have performed at virtually indistinguishable levels in terms of classification performance. The greater complexity of either Augmented Markov Blanket specification did not yield the expected precision gain. Precision and false negatives are shown as the key metrics in the summary table below.

Summary

                                     Test Set (n=139)               Cross Validation (n=699)
                                     Precision   False Negatives    Precision   False Negatives
Markov Blanket                       97.84%      0                  96.71%      7
Augmented Markov Blanket (SC=1)      97.84%      1                  97.14%      5
Augmented Markov Blanket (SC=0.4)    98.56%      0                  96.71%      7

In this situation, the choice of model should be determined by the most parsimonious specification. This provides the best prospect of good generalization of the model beyond the samples observed in this study. The originally specified Markov Blanket model will thus be recommended as the model of choice.

Reestimating these models with more observations could potentially change this conclusion and might more clearly differentiate the classification performance. For now, however, we select the Markov Blanket model and it will serve as the basis for the next section of this paper, Model Application.


Model Application

Interactive Inference

Without further discussion of the merits of each model specification, we will now show how the learned Markov Blanket model can be applied in practice. For instance, we can use BayesiaLab to review the individual classification predictions made based on the model. This feature is called Interactive Inference, which can be accessed via Inference>Adaptive Inference.

This will bring up Monitors for all variables in the Monitor Panel, and the navigation bar above allows scrolling through each record of the test set. Record #0 can be seen below with all the associated observations highlighted in green. Given the observations shown, the model predicts a 99.76% probability that the cells from this FNA sample are malignant (the Monitor is highlighted in red).
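To give a feel for how observations move the probability of malignancy, the Python sketch below applies Bayes' rule under a deliberately simplified assumption: each observed variable depends on Class only. The learned Markov Blanket network need not have this structure, and all conditional probabilities and state names below are made up for illustration; only the prior is taken from the class proportions reported earlier in this paper.

```python
def posterior(prior, likelihoods, evidence):
    """Posterior class probabilities via Bayes' rule.

    prior       -- dict: class -> P(class)
    likelihoods -- dict: variable -> class -> state -> P(state | class)
    evidence    -- dict: variable -> observed state
    Assumes the observed variables are independent given the class.
    """
    unnormalized = {}
    for cls, p in prior.items():
        for var, state in evidence.items():
            p *= likelihoods[var][cls][state]
        unnormalized[cls] = p
    total = sum(unnormalized.values())
    return {cls: p / total for cls, p in unnormalized.items()}

# Prior from the class proportions in the database
prior = {"benign": 0.655, "malignant": 0.345}

# Hypothetical conditional probabilities over three discretized states
likelihoods = {
    "Uniformity of Cell Size": {
        "benign": {"low": 0.90, "mid": 0.08, "high": 0.02},
        "malignant": {"low": 0.10, "mid": 0.30, "high": 0.60},
    },
    "Bare Nuclei": {
        "benign": {"low": 0.87, "mid": 0.10, "high": 0.03},
        "malignant": {"low": 0.10, "mid": 0.20, "high": 0.70},
    },
}

evidence = {"Uniformity of Cell Size": "high", "Bare Nuclei": "high"}
result = posterior(prior, likelihoods, evidence)
print(result)
```

With no evidence the function returns the prior, and each additional observation sharpens the prediction in the same way the Monitors update when evidence is set.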

For reference, we will also show record #22, which is classified as benign.


Most cases are rather clear-cut, as above, with record #19 being one of the few exceptions. Here, the probability of malignancy is 73%.

Target Interpretation Tree

In situations where only individual cases are under review by a pathologist (rather than a batch of cases from a database), BayesiaLab can also express the model in the form of a Target Interpretation Tree. It is a kind of decision tree, which prescribes in which sequence evidence should be sought for gaining the maximum amount of information towards a diagnosis. As can be seen in the tree diagram, Uniformity of Cell Size provides the highest information gain. Upon obtaining this piece of evidence, Uniformity of Cell Shape will bring the highest information gain among the remaining variables. Due to the size of a complete Target Interpretation Tree, only three levels of evidence are shown in the following diagram.


In our particular example, this may not be relevant, as all pieces of evidence, i.e. all observations regarding the FNA, are obtained simultaneously. However, in the context of other diagnostic methods, such as mammography and surgical biopsy, a tree-based decision structure can help prioritize the sequence of exams, given the evidence obtained up to that point.
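The idea of seeking first the evidence with the highest information gain can be sketched by ranking variables by their mutual information with the target. This is only an illustration under stated assumptions: the source does not spell out BayesiaLab's exact computation, the joint probabilities below are invented, each feature is reduced to two states, and the names `JOINTS`, `mutual_information` and `rank_by_information_gain` are hypothetical.

```python
from math import log2

# Hypothetical joint distributions P(feature state, diagnosis); the values are
# invented for illustration and are not derived from the Wisconsin data.
JOINTS = {
    "uniformity_of_cell_size": {
        "low": {"benign": 0.62, "malignant": 0.07},
        "high": {"benign": 0.03, "malignant": 0.28},
    },
    "mitoses": {
        "low": {"benign": 0.60, "malignant": 0.20},
        "high": {"benign": 0.05, "malignant": 0.15},
    },
}


def mutual_information(joint):
    """Return I(X; Y) in bits, where joint[x][y] = P(X = x, Y = y)."""
    p_x = {x: sum(row.values()) for x, row in joint.items()}
    p_y = {}
    for row in joint.values():
        for y, p in row.items():
            p_y[y] = p_y.get(y, 0.0) + p
    return sum(
        p * log2(p / (p_x[x] * p_y[y]))
        for x, row in joint.items()
        for y, p in row.items()
        if p > 0
    )


def rank_by_information_gain(joints):
    """Order features from most to least informative about the diagnosis."""
    return sorted(joints, key=lambda name: mutual_information(joints[name]), reverse=True)


if __name__ == "__main__":
    for name in rank_by_information_gain(JOINTS):
        print(f"{name}: {mutual_information(JOINTS[name]):.3f} bits")
```

This covers only the first level of such a tree. For deeper levels, the same ranking would be repeated on the distributions conditioned on the evidence already obtained, which is why the preferred next variable can differ from branch to branch.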

Summary

By using Bayesian networks as the framework, we have shown a practical new modeling approach based on the widely studied Wisconsin Breast Cancer Database. Our prediction accuracy is comparable with the results of all known studies on this topic.

With BayesiaLab as the software tool, modeling with Bayesian networks becomes accessible to a very broad range of analysts and researchers, including non-statisticians. The speed of modeling, analysis and subsequent implementation makes BayesiaLab a suitable tool in many areas of research and especially for translational science.

References

Abdrabou, E. A.M.L, and A. E.B.M Salem. “A Breast Cancer Classifier based on a Combination of Case-Based Reasoning and Ontology Approach.”

El-Sebakhy, E. A, K. A Faisal, T. Helmy, F. Azzedin, and A. Al-Suhaim. “Evaluation of breast cancer tumor classification with unconstrained functional networks classifier.” In the 4th ACS/IEEE International Conf. on Computer Systems and Applications, 281–287, 2006.

Hung, M. S, M. Shanker, and M. Y Hu. “Estimating breast cancer risks using neural networks.” Journal of the Operational Research Society 53, no. 2 (2002): 222–231.

Karabatak, M., and M. C Ince. “An expert system for detection of breast cancer based on association rules and neural network.” Expert Systems with Applications 36, no. 2 (2009): 3465–3469.

Mangasarian, Olvi L, W. Nick Street, and William H Wolberg. “Breast cancer diagnosis and prognosis via linear programming.” Operations Research 43 (1995): 570–577.

Mu, T., and A. K Nandi. “Breast cancer diagnosis from fine-needle aspiration using supervised compact hyperspheres and establishment of confidence of malignancy.”

Wolberg, W. H, W. N Street, D. M Heisey, and O. L Mangasarian. “Computer-derived nuclear features distinguish malignant from benign breast cytology.” Human Pathology 26, no. 7 (1995): 792–796.

Wolberg, William H, W. Nick Street, and O. L Mangasarian. “Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates.” http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.2109.

Wolberg, William H, W. Nick Street, and Olvi L Mangasarian. “Breast Cytology Diagnosis Via Digital Image Analysis” (1993). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.9894.

Contact Information

Conrady Applied Science, LLC

312 Hamlet’s End Way

Franklin, TN 37067

USA

+1 888-386-8383

[email protected]

www.conradyscience.com

Bayesia SAS

6, rue Léonard de Vinci

BP 119

53001 Laval Cedex

France

+33(0)2 43 49 75 69

[email protected]

www.bayesia.com

Copyright

© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.

Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:

• You may print or download this document for your personal and noncommercial use only.

• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia SAS as the source of the material.

• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.
