maxent: statistical explanation (elithet. al....

14/09/55

1

MaxEnt:MaxEnt:A Statistical Explanation

(Elith et. al. 2011)

Nantachai Pongpattananurak

Department of Conservation

Introduction

• MaxEnt is a program for modelling species distributions from presence only speciesdistributions from presence‐only species records.

• a new statistical explanation of MaxEnt which shows that the model minimizes the relative entropy between two probability densities (one estimated from the presence data and(one estimated from the presence data and one, from the landscape) defined in covariate space.

14/09/55

2

Introduction

• Species distribution models (SDMs) estimate the relationship between species records at sites and the environmental and/or spatial characteristics of those sites (Franklin, 2009).

• MaxEnt’s predictive performance is consistently competitive with the highest performing methods (Elith et al., 2006).

• Published examples cover diverse aims (findingcorrelates of species occurrences mapping currentcorrelates of species occurrences, mapping current distributions, and predicting to new times and places) across many ecological, evolutionary, conservation and biosecurity applications (Table 1).

• Source: Elith et al. 2010

14/09/55

3

WHAT IS SPECIAL ABOUT THEPRESENCE‐ONLY CASE?

• It is complex because of the interplay of data quality, modellingmethod and scale of analysismethod and scale of analysis.

• Presence‐only data release us from the problems of unreliable absence records.

• However, the presence records are also imprinted by many of the factors affecting absences. If a species is absent from an environmentally suitable area because, say, past disturbances h d l l ti ti th i l f th t b ill bhave caused local extinctions, the signal of that absence will be found in the distribution of presence records: there will be no presence records in the disturbed area.

WHAT IS SPECIAL ABOUT THEPRESENCE‐ONLY CASE?

• Regardless of whether absences are used in modelling, the pattern in the presence records will suggest the area is unsuitable, and the model will be affected by this patterning.

• If the detectability of a particular species varies from site to site, then not only does this result in some false absences in presence‐absence data, it also affects the pattern of presences in presence‐only data.

• This thinking means that we will approach the description of the g pp ppresence‐only modelling problem as one that is trying to model the same quantity that is modelled with presence‐absence data, that is, the probability of presence of a species.

14/09/55

4

Introduction

• Let’s assume that the data be available to the modeller are presence‐only, i.e., a set of locations within L, the landscape of interest where the species has been observedinterest, where the species has been observed.

• Let y = 1 (presence), y = 0 (absence), z = a vector of environmental covariates, and background = all locations within L (or a random sample thereof).

• Assume the environmental variables or covariates z (representing environmental conditions) are available landscape wide.

• Define f(z) to be the probability density of covariates across L, • f1(z) to be the probability density of covariates across locations ( ) p y y

within L where the species is present, and • f0(z) where the species is absent • (densities – or probability density functions –describe the relative

likelihood of random variables over their range and can be univariate or multivariate).

Introduction

• we wish to estimate is the probability of presence of the species, conditioned on environment:

)|1P (

• Strictly presence‐only data only allow us to model f1(z), which on its own cannot approximate probability of presence.

• Presence/background data allows us to model both f1(z) and f(z), and this gets to within a constant of

• Pr(y = 1|z), because Bayes’ rule gives:

).|1Pr( zy

• Usually, Pr(y = 1) is unkonwn the prevalence of the species (proportion of occupied sites) in the landscape.

)(

)1Pr(*)()|1Pr( 1

zf

yzfzy

14/09/55

5

Introduction

Issues of presence/absence data• Prevalence is not identifiable from presence‐only data.Prevalence is not identifiable from presence only data. • Due to issues of detection probability, even presence‐absence data may not yield a good estimate of prevalence.

• Sample selection bias (whereby some areas in the landscape are sampled more intensively than others) has a much stronger effect on presence‐only models h b d lthan on presence‐absence models.

• For presence‐absence models, sample selection bias affects both presence and absence records, and the effect of the bias cancels out.

Introduction

Issues of presence/absence data• Presence or absence of a species is dependent on the time frame and

spatial scalespatial scale• However, with presence‐only data, we typically have occurrence data that

do not have any associated temporal or spatial scale.• The record is usually simply a record of the species at a location, with no

information on search area or time.• With presence‐absence data, the definition of the response variable

should naturally be consistent with the sampling method. For example, if the available data are surveys of 1‐m2 quadrats, then y = 1 should correspond to the species being present in a 1‐m2 quadrat.

• With presence‐only data, the available data do not usually describe the survey method, so we has considerable leeway in defining the response variable. A common approach is to implicitly assume a sampling unit of size equal to the grain size of available environmental data.

14/09/55

6

Introduction

In summary• With presence and background data, we can model the p g ,

same quantity as with presence‐absence data, up to the constant Pr(y = 1).

• If presenceabsence survey data are available, it is advisable to use a presence‐absence modelling method, since in that case the models are less susceptible to problems of sample selection bias

• Presence‐absence data give us much better information gabout prevalence than presence‐only, because –even though there may be some difficulties because of imperfect detection – they solve the major problem of nonidentifiability.

Explanation of MaxEnt

14/09/55

7

Covariates and features

• Most ecologists call the independent variables in a model the covariates, predictors or inputs. I SDM h i l d i l f h l• In SDMs, these include environmental factors that are relevant to habitat suitability.

• Since species’ responses to these tend to be complex, it is usually desirable to fit nonlinear functions.

• In regression this can be achieved by applying transformations to the covariates – for instance, creating basis functions for polynomials and splines, including piecewise linear functions.

• Complex models are fitted as linear combinations of these basis pfunctions in methods including GLMs and GAMs.

• In machine learning, basis functions and other transformations of available data are termed “features”

• Features are an expanded set of transformations of the original covariates.

Covariates and features

• The MaxEnt fitted function is usually defined over many features, meaning that in most models there will be more features than covariates.

• Six feature classes include 1)linear 2)product 3)quadratic 4)hinge• Six feature classes include 1)linear, 2)product, 3)quadratic, 4)hinge, 5)threshold and 6)categorical.

• Products are products of all possible pair‐wise combinations of covariates, allowing simple interactions to be fitted.

• Threshold features allow a ‘‘step’’ in the fitted function.• Hinge features are similar except they allow a change in gradient of the

response. • Many threshold or hinge features can be fitted for one covariate, giving a

potentially complex function. potentially complex function.• Hinge features (which are basis functions for piecewise linear splines), if

used alone, allow a model rather like a generalized additive model (GAM): an additive model, with nonlinear fitted functions of varying complexity but without the sudden steps of the threshold features.

• MaxEnt’s default is to allow all feature types.

14/09/55

8

Overview

• MaxEnt focuses on comparing probability density in the covariate space.

Overview• From Equation

f1(z) = the conditional density of the covariates at the presence sites,

)(

)1Pr(*)()|1Pr( 1

zf

yzfzy

f(z) = the marginal (i.e., unconditional) density of covariates across the study area

Pr(y = 1) = the prevalence (we usually don’t have this knowledge)

Pr(y=1|z) = the probability of occurrenc

• If we known f1(z) and f(z), we only nedd to know Pr(y=1) to calculate Pr(y=1|z)

• MaxEnt estimates the ratio f1(z)/f(z) (MaxEnt’s ‘‘raw’’ output). This is the core of the MaxEnt model output.

• Because the required information on prevalence is not available for calculating Pr(y = 1|z) k d h b i l d ( d M E ’ ‘‘l i i ’’ ), a workaround has been implemented (termed MaxEnt’s ‘‘logistic’’ output).

• This treats the log of the output: ƞ(z) = log(f1(z)/f(z)) as a logit score, and calibrates the intercept so that the implied probability of presence at sites with ‘‘typical’’ conditions for the species (i.e., where ƞ(z) = the average value of ƞ(z) under f1) is a parameter Ƭ.

• Knowledge of Ƭ would solve the nonidentifiability of prevalence, and in the absence of that knowledge MaxEnt arbitrarily sets Ƭ to equal 0.5.

• This logistic transformation is monotone (order preserving) with the raw output.

14/09/55

9

The landscape and species records

• The landscape of interest (L) is a geographic area suggested and defined by the ecologist (e.g., geographic boundaries or an understanding of how far the focal species could have dispersed).

• L1 is the subset of L where the species is present.

• The distribution of covariates in L is conveyed by a finite sample – a collection of points from L with associated covariates, called a “background sample”.

• By default, MaxEnt randomly samples 10,000 background locations from covariate grids, but the background data points can also be specified.

• Background sample does not take any account of the presence locations –a sample of L, and could by chance include presence locations.

• A random background sample implies the sample of presence records is also a random sample from L1.

Description of the model

• Estimate the ratio f1(z)/f(z). • Estimate of f1(z) that is consistent with the occurrence ( )

data; many such distributions are possible, but it chooses the one that is closest to f(z).

• Minimizing distance from f(z) is sensible, because f(z) is a null model for f1(z): without any occurrence data, we would have no reason to expect the species to prefer any particular environmental conditions over any others,

• So we could do no better than predict that the species p poccupies environmental conditions proportionally to their availability in the landscape.

• In MaxEnt, this distance from f(z) is taken to be the relative entropy of f1(z) with respect to f(z).

14/09/55

10

Description of the model• Using background data informs the model about f(z), the

density of covariates in the region, and provides the basis for comparison with the density of covariates occupied by thecomparison with the density of covariates occupied by the species f1(z).

• Constraints are imposed so that the solution is one that reflects information from the presence records.

• For example, if one covariate is summer rainfall, then constraints ensure that the mean summer rainfall for the estimate of f1(z) is close to its mean across the locations with observed presence . p

The heart of MaxEnt: the species distribution is estimated by minimizing the distance between f1(z) and f(z) subject to constraining the mean summer rainfall estimated by f1 to be close to the mean across presence locations.

Appendix 2

14/09/55

11

¶(x) = f1(z) = Pr(x|y=1) =Pr(z|y=1) The “raw” distribution: the probability given the species is present, that is found at pixel x

• Maximizing the entropy of the raw• Maximizing the entropy of the raw distribution is equivalent to minimizing the relative entropy of f1(z) relative to f(z), so the two formulations are equivalent.

)()1|(ˆ

)1|Pr()(

xqyxp

yxx

e xz )(

Q

Where z(x) = a vector of environmental variables

defined over the sampling space of pixelsβ a vector of weights (coefficients)β = a vector of weights (coefficients)qβ(x) = a probability distribution over xQβ = a normalization constant that ensures

qβ(x) sum to 1

14/09/55

12

f1(z) = Pr(Z=z|y=1)

= f(z) eα+βz

ZL ])([#

α for a normalizing constant ensuring that f1(z) integrates (sums) to 1

where L

zxZLxzf

])(:[#)(

f(z) the number of x in L such that Z(x)=z, divided by the number of elements (pixels) in (p )the landscape L

The null model for the raw distribution was the uniform distribution over the landscape

Description of the Model

• ConstraintsMaxEnt fits the model on f t th t t f ti f thfeatures that are transformations of the covariates allowing complex relationships to be modeled.

• The constraints are extended from being constraints on the means of covariates to being constraints on the means of the features.

14/09/55

13


• Let the vector of features be h(z) and the vector of coefficient beβcoefficient beβ

• Minimizing relative entropy results in a Gibbs distribution which is an exponential‐family model:

h ( ) β*h( )

)(1 )( zezfzf

where η(z) = α+β*h(z)

α = a normalizing constant ensures that f1z integrates (sums) to 1.


• The target of a MaxEnt model is eƞ(z) (i.e., f ( )/f( ))f1(z)/f(z)).

• It is a log‐linear model similar in form to a GLM and depends on both the presence sample and the background samples that are used in forming the estimate.g

14/09/55

14

Mechanics of the solution

• MaxEnt needs to find β that will result in the constraints being satisfied but not match up them so closely that it

fi d d d l i h li i d li ioverfits and produces a model with limited generalization.• MaxEnt sets an error bound (λ); or maximum allowed

deviation from the sample (empirical) feature means.• It rescales all features to have the range 0 – 1.• Then, λj is calculated for each feature.• λj reflects the variation in sample values for that feature,

adjusted by a tuned (preset) parameter for the featureadjusted by a tuned (preset) parameter for the feature class.

λ = tuning parameter (regularization)


• MaxEnt could estimate feature error bounds only from the data.

• MaxEnt uses feature class‐specific tuned parameters based on large international dataset.

• The dataset covers 226 species, 6 regions of the world, sample sizes ranging from 2 to 5822, and 11‐13 predictor per region.

• It is possible that the tuning may not work well for very diff d if hdifferent datasets, e.g., if there are many more predictors.

• The tuned parameters can be changed by the user if desired.

14/09/55

15


hS jj

][2

mj

λj = the regularization parameter for feature hjS2[hj] = feature’s variancem = presence sitesλ = tuning parameter


• Conceptually, λj corresponds to the width of the confidence interval, and therefore it takes the form of the standard error multiplied by λ according to the desired confidence levelmultiplied by λ according to the desired confidence level.

• λ in eq 3 allows regularization, i.e., smoothing the distribution call L1‐regularization.

• This is a way of reducing the β (penalizing them) to values that balance fit and complexity allowing both accurate prediction and generality.

• The fit of the model is measured at the occurrence sites using a log likelihood.

• A highly complex model will have a high log likelihood, but may not generalize well.

• The aim of regularization is to trade off the fit.

14/09/55

16

A criterion to find an optimal model

n

zm

iezf )( ))(ln(1

max j

jji

i ezfm 11

,))(ln(max

subject to where

z = the feature vector for occurrence point I of m sitesj = I,…,n featuresȠ(zi)= α+β*h(z)

L

z dzezf i 1)( )(

Ƞ( i) α β h( )

•MaxEnt fits a penalized maximum likelihood model closely related to other penalties for complexity such as Akaike’s Information Criterion.•Maximizing the penalized log likelihood is equivalent to minimizing the relative entropy subject to the error bound constraints.

prob ln(prob) exp(ln(prob)) -2*LN(prob)0 00000001 18 4 0 00000001 36 80.00000001 -18.4 0.00000001 36.8

0.1 -2.3 0.1 4.60.2 -1.6 0.2 3.20.3 -1.2 0.3 2.40.4 -0.9 0.4 1.80.5 -0.7 0.5 1.40.6 -0.5 0.6 1.00.7 -0.4 0.7 0.70.8 -0.2 0.8 0.40.9 -0.1 0.9 0.21.0 0.0 1.0 0.0

14/09/55

17

Feature Selection

Feature Selection

14/09/55

18

MaxEnt’s Outputs

• Raw Output– Maxent assigns a non‐negative probability to each pixel in the study area. Because these

probabilities must sum to 1, each probability is typically extremely small.

– The primary output of Maxent is the exponential function.

– Raw values are also scale‐dependent, in the sense that using more background data results in smaller raw values, since they must sum to one over a larger number of background points.

• Cumulative Output– The value assigned to a pixel is the sum of the probabilities of that pixel and all other

pixels with equal or lower probability, multiplied by 100 to give a percentage.

d f l i l di h d l d b bili– Interpreted as: If we resample pixels according to the modeled Maxent probability distribution, then c % of the resampled pixels will have cumulative value of c or less.

– Thus, if the Maxent distribution is a close approximation of the probability distribution that represents reality, the binary model obtained by setting a threshold of c will have approximately c % omission of test localities.

• Logistic Output

MaxEnt’s Logistic Output

• Attempt to estimate of the probability that the species is present given the environmentspecies is present, given the environment, Pr(y=1|z)

• A post transformation of the MaxEnt raw output that make certain assumptions about prevalence and sampling effort.

R d l i ti t t t i ll• Raw and logistic outputs are monotonically related in terms of suitability sites (yield identical AUC)

14/09/55

19


• From where , )(

)1Pr(*)()|1Pr( 1

zf

yzfzy

)(1 )( zezfzf

a simple approach to estimate Pr(y=1|z) would be to multiply eƞ(z) by a constant that estimate prevalence.

• The disadvantage is eƞ(z) can be large, implying that we may get an estimate of Pr(y=1|z) that

d 1exceed 1.• To avoid this problems and to side‐step Pr(y=1), MaxEnt’s logistic output transform the model from an exponential model to a logistic model.


From a “minimax” or robust Bayes approach (see more details in Appendix 3),

where, ƞ(z) = the linear score of the features, α+β*h(z)

th l ti t f M E t’ ti t

rz

rz

e

ezy

)(

)(

1)|1Pr(

r = the relative entropy of MaxEnt’s estimate of f1(z) from f(z)

Ƭ = the probability of presence at site with typical conditions for the species (the prevalence)

14/09/55

20

MaxEnt’s Logistic OutputFrom a “minimax” or robust Bayes approach (see more details in Appendix 3),

•r is the average of ƞ(z) under f1(z), so sites with ƞ(z) = r have Pr(y=1|z)= Ƭ.•In unsuitable areas, the denominator is close to 1‐ Ƭ, so the

rz

rz

e

ezy

)(

)(

1)|1Pr(

result is a linear scaling of raw output.•In suitable areas, the effect of the denominator is to bound output below 1.•The default value for Ƭ is arbitrarily set to 0.5.•If Ƭ is available, we can set up Ƭ by our own (only availabe in the latest MaxEnt Version)

Implication for modelling

Some thoughts when applying MaxEnt• Maxent relies on an unbiased samplep

– Collecting a comprehensive set of presence records (cleaned for duplicate and errors)

– Dealing with biased species data– Providing background data with similar biases to those on the

presence data, e.g. using site surveyed for other species or using a bias grid that indicates the biases in the survey data.

– Projection issuesTh l d h ld i l d h f ll i l f– The landscape should include the full environmental range of the species and exclude areas that have not been searched

– Excluding areas from the background sample through use of masks

14/09/55

21


• Feature types– Restrict the modle to simple features if few sample are p pavailable

– Linear is always used; quadratic with at least 10 samples; hinge with at least 15; threshold and product with at least 80

– Reduce the candidate predictor set using ecological understanding of the speices

– Hinge feature make linear and threshold features gredundant

– Hinge feature is similar to GAM – Excluding product features creates an additive model that is easier to interpret


• Regularization (L‐1 regularization)Inbuilt method– Inbuilt method

– Dealing with feature selection (relegating some coefficients to zero) from pre‐select variables

– Tolerance to correlated variables (advantage)

– Expert pre‐selection of a candidate set is a good idea

– Selecting proximal variables must be used for different regions or climates

– When smoother models are required, incresingregularization parameters

14/09/55

22


• Logistic output

– If comparing model for different species, some care is needed in use of the logistic outputs because probability of presence is only defined relative to a given level of sampling effort, which a default is assumed to be one that results in a 50% chance of observing the species in suitable areaschance of observing the species in suitable areas.

Case Study 1 & 2

14/09/55

23

Case Study 1

We use species data from the Banksia Atlas (Taylor &Hopper, 1988; Yates et al., 2010), with 361 records forB. prionotes from the 4631 sites across the SouthWest Australia

( )Floristic Region (SWAFR) that were surveyed for Banksia andfor which we had complete environmental data. The atlas is theresult of a community science project, and records could eitherbe interpreted as presence‐only or presence‐absence data,depending on what assumptions are made about the searchpatterns of contributors. Here we treat them as presence‐onlydata, but use the full set of locations as one ‘‘background’’treatment. To demonstrate the effect of this choice, twoalternative backgrounds (i.e., landscape definitions) wereevaluated: a sample of 10,000 sites within the SWAFR (Yateset al., 2010; and Fig. 2) and a sample of 20,000 sites across theg ) pwhole of Australia. The larger number of sites across Australiawas used to ensure good representation of all environments,based on previous tests of the effects of background sample sizeon model structure for these predictors (J. Elith, unpubl. data).Because the covariate data for this study are unprojected, thesesamples were weighted according to cell area (see methods inAppendix S4) but otherwise random.

Case Study 1

Models were fitted and projected to both current and futureclimates (Fig. 3) using only hinge features, with default

(regularization parameters (see Appendix S5 for model details, and for a comparison with models fitted with all feature types).We fitted all models on the full data sets but also used 10‐fold cross‐validation to estimate errors around fitted functions and predictive performance on held‐out data. The latter is a good test for each model but – given the different backgrounds – not comparable across models. Note also that the AUC in this case is calculated on presence vs. background data (Phillips et al., 2006). To compare the models on consistent data we also divided thecompare the models on consistent data, we also divided the atlas data into training and testing sets for a manual 5‐fold cross‐validation, testing each model on identical withheld data via two test statistics (area under the receiver operating characteristic curve (AUC), and correlation, COR;details in Appendix S4). Example code for running suchanalyses are available online (Appendix S4).

14/09/55

24

Case Study 1

Case Study 1The Effect of Model Clamping

ClamingThe process by which features are constrained to remain within the range of values in the training data.

• This identifies locations where predictions are uncertain because of the method of extrapolation, by showing where clamping substantially affects the predicted value.

• Extreme care should be taken whenever extrapolating outside the training,

• so new calculations (‘‘MESS maps’’, i.e., ( p , ,multivariate environmental similarity surfaces) display differences between the training and prediction environments.

• In this case compared with environments at the atlas sites, the northern parts of the SWAFR will experience novel climates in 2070 (model 1).

14/09/55

25

Case Study 1Cross Validation

• Cross‐validation is a straightforward, quick and useful method for resampling data for training and testing models. In k‐fold cross‐validation (Kohavi, 1995; Hastie et al., 2009)

• the data are divided into a small number (k, usually 5 or10) of mutually exclusive subsets. Model performance is assessed by successively removing each subset, so you have one subset omitted and k‐1 retained.

• You then fit your model on the retained data, and predict to the omitted data. You cycle through all the possibilities k times (the "folds"; see box below).

• By the end every site has been used in model fitting k‐1 times (but in k‐1 different combinations with other sites; see box below),

• and each site has been used for evaluation just once, and only when the model was not trained on that sitetrained on that site.

• Any evaluation will be performed on held‐out data. Cross‐validation is a widely used resampling method, and is fast and performs well. (Hastie et al., 2009)

• Resampling methods that test on withheld data are appropriate tests of a model's predictive performance on data from within the same general set of environmental conditions as those used to fit the data – e.g., within the same geographic region, or the same time.

Case Study 1

MaxEnt uses the cross‐validation in 2 ways.

1) It records the response curve and the predictions fitted under each of the k different training sets, and uses those to give an indication of variability in the results.

2) It also uses the predictions to the omitted (with‐held) sites for estimating test performance (e.g. for the reported test AUC).

For AUC, MaxEnt not only needs predictions to sites with observed occurrence, but also to background sites. For background, it uses the same background data for training and evaluation of all replicate runs.

14/09/55

26

Case Study 1Manual five‐fold cross validation• The different models (1 to 3, see main manuscript) use the same presence sites but different

backgrounds for training. Therefore, the AUC's estimated in MaxEnt's cross‐validation (AUCMaxent) are not comparable, because they use different background data sets.

• For evaluation we want a consistent dataset across models but also one that was not usedFor evaluation we want a consistent dataset across models, but also one that was not used for model training. We could do one split of the training data, and just have one training set and one test set, but cross‐validation is more thorough, tending to give a lower variance estimate than split samples (Hastie et al. 2009).

• Hence we used a "manual" five‐fold cross‐validation (i.e. one where we pre‐specified the data sets, and calculated the evaluation statistics outside of MaxEnt; AUCManual) to ensure consistent evaluation data sets across all four models.

• We divided the presence records into 5 mutually exclusive subsets (using the same ideas as presented in the cross‐validation section earlier in this appendix). For each training run (each fold), 4 of these were combined into a training set, and the 5th set aside as the data to which the model was predicted.

• We also need absence or background data. In this case we used the atlas sites at which Banksia prionotes was not recorded, and treated them as absences. So, the "absence" atlas sites were also divided into 5 subsets, and each combined with a presence subset to give 5 independent evaluation data sets. Note that for model 1, which uses atlas sites as background, we ensured that the "testing“ (evaluation) sites for any given fold were not used as background in the training data, to guarantee that each model's training set was independent of the testing set. In other words, for model 1 we had different subsets of background data for each cross‐validation fold.

Case Study 1

Model Background = 1 the current distribution of the speciesModel Background =2 the distribution only in Southwest AustraliaModel Background=3 across the continent of Australia

14/09/55

27

Case Study 2This analysis predicts the current distribution of Gadopsis bispinosus, the two‐spinedblackfish, in rivers of south‐eastern Australia. In the preamble we make a case that withIn the preamble, we make a case that with presence and background data, we can model the same quantity as with presence‐absence data, up to the constant Pr(y = 1). One implication of that is that we should be able to use the same types of data, including fine‐scale, detailed information, to model ecological relationships – i.e., we need not be restricted to coarse grid cells and basicbe restricted to coarse grid cells and basic climate variables. Here, we use detailed ecological information at the river segment scale to model the distribution of a native fish species. To our knowledge, it is the firstexample using MaxEnt with vector (river segment) data.

Case Study 2

14/09/55

28

Next thing we need to do!

Bias GridCreating a bias grid for Maximum Entropy Modelling (MaxEnt)Creating a bias grid for Maximum Entropy Modelling (MaxEnt)• One limitation of presence‐only data is the effect of

sample selection bias, where some areas in the landscape are sampled more intensively than others (Phillips et al. 2009) – MaxEnt requires an unbiased sample and this applies to all SDMs.

• some areas more intensively surveyed for target species than others and the predictions by MaxEnt were lessthan others and the predictions by MaxEnt were less biologically meaningful if this bias was not accounted for.

• So a bias grid was created to upweight presence‐only datapoints with fewer neighbors in the geographic landscape.

References

• Steven J. Phillips, Miroslav Dudík, Robert E. Schapire.A maximum entropy approach to species distribution modeling.In Proceedings of the Twenty First International Conference onIn Proceedings of the Twenty‐First International Conference on Machine Learning, pages 655‐662, 2004.

• Steven J. Phillips, Robert P. Anderson, Robert E. Schapire.Maximum entropy modeling of species geographic distributions.Ecological Modelling, 190:231‐259, 2006.(datasets used in this paper are available below)

• Jane Elith Steven J Phillips Trevor Hastie Miroslav Dudík Yung En• Jane Elith, Steven J. Phillips, Trevor Hastie, Miroslav Dudík, Yung En Chee, Colin J. Yates.A statistical explanation of MaxEnt for ecologists.Diversity and Distributions, 17:43‐57, 2011.

14/09/55

29

Tutorial on Maxent

Based on: S. Phillips 2012. A Brife Tutorial on Maxent.

1

Introduction to Logistic Regression

301572

GIS for watershed and Environmental Management

2

Statistical Analyses

Time series analysistimecontinuous

Binary logistic regressioncategory and/or continuousBinary

Survival analysiscategory and/or continuousa time at death

Logistic model (binomial errors)category and/or continuousproportion

Log-linear model (Poisson errors)category and/or continuouscount

Analysis of Covariance (ANCOVA)category and/or continuouscontinuous

Regressioncontinuouscontinuous

Analysis of Variacne (ANOVA)categorycontinuous

Contigency table-count data

Statistical AnanlysisExplanatory variableResponse variable

Type of Data

3

Understanding Logaritms

For a positive number n, the logarithm of n is the power to which some number b (the base of the logarithm) must be raised to yield n

if , then nb x xnblog

Log Rules:

1) logb(mn) = logb(m) + logb(n)

2) logb(m/n) = logb(m) – logb(n)

3) logb(mn) = n · logb(m)

4

Use of logarithms in quantitative analysis

• Transform a nonlinear model into a linear form so the researcher can use regression techniques and diagnostics.

• When the values of particular independent variable are very large, particularly when the values of the other independent variables are small by comparison, taking the log of large values to “shrink” them to a more manageable size will keep them from influencing the results of analysis.

5

(Y) (binary data)

(X)

Multiple regression analysis Discriminat Analysis

Introduction

6

Introduction

Binary Logistic Regression

( )

Discriminant

Ananlysis

Discriminant Analysis

Logistic Regression (Hosmer and

Lemeshow 2000)

7

Logistic Regression

(Myocardial

Infraction)

Lab

8

Logistic Regression Model

( )

(category) (continuous)

log Odds

9


dichotomy binary( , )

multiple linear regression

/

linear regression model 1

10




Multivariate Normal Distribution


Discriminant Scores

11


Logistic Regression

•

•

• log odd

ratio (

)

12


• logistic regression

13


B0 B1 =

X =

e = base of the natural logarithms (2.718)

Prob(event) =

Prob (event) = )( 101

1XBBe

14


Prob (event) = Ze1

1

Z

Z = pp XBXBXBB ...22110

p =

15

1 logistic regression

Z S

(cumulative

probability)

(nonlinear)

0 1 Z

x

Y

0

1

Prob (event) = Ze1

1

16

Predicting Node Involvement

(lymph nodes)

17

Predicting Node Involvement

Note: logistic regression

Peduzzu et al. (1966)

10

18

Coefficients for the Logistic ModelVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a

B S.E. Wald df Sig. Exp(B) Lower Upper

95.0% C.I.for EXP(B)

Variable(s) entered on step 1: acid, age, xray, grade, stage.a.

( column B)

acid, age, xray, grade, stage

0 1 1 xray

xray 1 stage

1 grade

19

Coefficients for the Logistic ModelVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




parameters logistic regression

Prob (event) = Ze1

1

Z = 0.0618 + 0.0243 (acid) – 0.0693 (age) + 2.0453 (xray) + 0.7614 (grade) + 1.5641 (stage)20

Calculating a Predicted ProbabilityVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




66

48 0 Z

Z = 0.0618 + 0.0243 (48) – 0.0693 (66) = -3.346

Prob (event) = )46.3(1

1

e= 0.0340

21

Calculating a Predicted ProbabilityVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Warning: (mean response)

22

Estimating the CoefficientsVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Linear Regression least-square

method

Logistic Regression Maximum-likelihood

(iterative algorithm)

(converge) threshold

23

Estimating the CoefficientsVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Warning: Logistic Regression

(

)

(constant)

(Schlesselman, 1982)24

Interpreting the Logistic Coefficients

Multiple Linear Regression

Logistic Regression Multiple Linear

Regression

Odds, Logits Odd Ratios

25

Calculating Odds

Odds

Odds

Odds = nodes)(negativeProb

nodes)(positiveProb

Warning: Odds

26

Calculating Logit

Logistic log of the odds logit

Multiple

Regression

( ) log of odds


log odds

( interaction terms

)

pp XBXBB ...event)(noProb

(event)Problog 110

27

Calculating LogitVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




grade 0.76

“ grade 0 1

log odds 0.76”

28

Calculating Change in OddsVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Odds 60 acid

62 X-ray 1 stage grad 0

Prob (malignant nodes) =Ze1

1= 0.37

Z = 0.0618 + 0.0243 (62) – 0.0693 (60) + 2.0453 (1) = -0.54

( = 0.63)

29

Calculating Change in OddsVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




odds

Odds = event)(noProb

(event)Prob =

0.37-1

0.37= 0.59

log odds -0.53

Odds 60 acid

62 X-ray 1 stage grad 0

30

Calculating LogitVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




grade = 1 , 0.554

Odds = 1.24 log odds = 0.22

grade = 1 log odds 0.75 (=0.22 - (-0.53))

log odds grade = 0 1 (log odds = 0.22 -0.53 )

0.75 grade

Odds Ratio grade 0 1 1.24/0.59 =

2.1 ( Exp(b) 1)

31

Calculating the Odds RatioVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Logistic odds

Exp(Bi) Odds i

Odds Ratio ( i

interactions)

pppp XBXBBXBXBXBB eeee ...event)(noProb

(event)Prob11022110 ...

32


.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Bi Odds Ratio 1 Odds

Bi Odds Ratio 1 Odds

Bi 0 Odds Ratio 1 Odds

33


.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Odd Ratio grade Odds 2.1 grade 0 1

95% Odds 0.47

9.7

Odds 1

grade Odds

( B 0 Odds

1)34


.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




Tip: ,

Odds Ratio

Odds Ratio C eCB B

35

Testing Hypotheses about the Model

0 Logistic Regression

Likelihood-ratio Test

(the sum of squared residuals) Multiple Linear Regression

Likelihood-ratio Test

Likelihood

-2 log of the likelihood (-2LL)

likelihood

-2LL 36

Likelihood-Ratio Test Likelihood

(

R2 Multiple Regression) -2LL

Likelihood Ratio Logistic

Regression

Likelihood Ratio = -2LL(reduced model) - (-2LL(full model))

37

Likelihood-Ratio TestLikelihood-ratio statistic

Chi-square df = k (

)

Likelihood ratio

Ha:

Ha: 0)

( full model)

3 age, race sex

( reduced model) age

race sex 0

38

Likelihood-Ratio Test

Warning: Liklihood-ratio Test

reduced model full model reduced model

(nested) full model

39

Are All Coefficients Zero?

0 -2LL full

model reduced model

Intercept

40

Constant-Only Model

Logistic Regression Intercept -2LL 70.25

intercept

Iteration Historya,b,c

70.253 -.491

70.252 -.501

70.252 -.501

Iteration1

2

3

Step0

-2 Loglikelihood Constant

Coefficients

41

Constant-Only Model

Logistic Regression Intercept -2LL 70.25

intercept


70.253 -.491

70.252 -.501

70.252 -.501

Iteration1

2

3

Step0


Coefficients

Tip: SPSS -2LL “iteration

history”42

Full Model

-2LL = 48.126

Model Summary

48.126a .341 .465

Step1

-2 Loglikelihood

Cox & SnellR Square

NagelkerkeR Square

Estimation terminated at iteration number 5 becauseparameter estimates changed by less than .001.

a.

43

Model Change

Model Summary

48.126a .341 .465

Step1

-2 Loglikelihood

Cox & SnellR Square

NagelkerkeR Square


a.

Likelihood Ratio

0

-2LL

-2LL 70.252

48.126


70.253 -.491

70.252 -.501

70.252 -.501

Iteration1

2

3

Step0


Coefficients

Constant is included in the model.a.

Initial -2 Log Likelihood: 70.252b.


c.

44

Model Change

22.126 column Chi-square df

Chi-square 6 (

) 1 ( 1

Reduced Model)

(p-value < 0.001)

0

Omnibus Tests of Model Coefficients

22.126 5 .000

22.126 5 .000

22.126 5 .000

Step

Block

Model

Step 1Chi-square df Sig.

-2LL

45

Evaluating Sequential Changes

Chi-square Step, Block Model


22.126 5 .000

22.126 5 .000

22.126 5 .000

Step

Block

Model


-2LL

46



22.126 5 .000

22.126 5 .000

22.126 5 .000

Step

Block

Model


-2LL

Model chi-square: -2LL

-2LL (

likelihood

) Model Chi-square

0

( F test Multiple Regression)

47



22.126 5 .000

22.126 5 .000

22.126 5 .000

Step

Block

Model


-2LL

Block chi-square: -2LL

reduced model

dialog box Block

blocks

block Block Chi-square

Model Chi-square block Block

Chi-square 48



22.126 5 .000

22.126 5 .000

22.126 5 .000

Step

Block

Model


-2LL

Step chi-square: -2LL

0 Step Chi-square

F-test Stepwise Multiple

Regression

49



22.126 5 .000

22.126 5 .000

22.126 5 .000

Step

Block

Model


-2LL

2

5

Model Chi-square Block Chi-square Step Chi-square

forward backward

Chi-square

50

Summary Measures

R2 Multiple Regression

R2

R2

Logistic Regression

R2 Multiple Regression Model

2

Logistic Regression Cox and Snell R2

Nagelkerke R2

51

Summary Measures

Cox and Snell R2 Nagelkerke R2

R2 Multiple Regression Logistic

Regression (Least

Square Method)

Model Summary

48.126a .341 .465

Step1

-2 Loglikelihood

Cox & SnellR Square

NagelkerkeR Square


a.

-2LL 48.126

52

Cox and Snell R2

= Likelihood

= likelihood

N =

1

Cox and Snell R2

R2 = 1 - N

BL

L/2

)(

)0(

)0(L

)(BL

53

Nagelkerke R2

R2 = MAXR

R2

2

NMAX LR /22 )]0([1

Nagelkerke (1991) Cox and Snell R2

1 Standardize

Cox and Snell R2

Nagelkerke R2 47%

Logistic Regression

Model Summary

48.126a .341 .465

Step1

-2 Loglikelihood

Cox & SnellR Square

NagelkerkeR Square


a.

54

Summary Measures

Tip: Logistic

Linear Regression Logistic Regression

R2 Logistic Model

55

Tests for a Coefficient

Multiple Linear Regression

Regression 0

R2


SPSS 3 Wald Statistic, the

Likelihood-ratio Test, Score Statistic

56

Wald Statistic 0

Wald Statistic

Chi-square

Wald = 2

..ES

B

B =

S.E. =

df = 1 Wald Statistics

(category data) df 1 Wald

Statistics df 1

57

Wald Statistic

age -0.0693

0.058 ( S.E. 15-2)

Wald Statistics 1.43 (-0.0693/0.058)2

Wald Statistic Chi-

square 15-2 Sig

xray stage 0

0.05

Variables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




58

Wald Statistic

Warning:

Wald Statistic

(absolute)

Wald Statistic

0

Wald Statistic

log-likelihood (Hauck and Donner 1997)

59



4.432 1 .035

4.432 1 .035

22.126 5 .000

Step

Block

Model


0

the likelihood-ratio test

Full Model

Reduced Model Full Model

Chi-square likelihood-ratio test

stage 0

Chi-square 4.43 4.048

Wald StatisticVariables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




60


Step Block

stage 0 Model Chi-square

stage

0

Sig (p-value) 0.5


4.432 1 .035

4.432 1 .035

22.126 5 .000

Step

Block

Model


Variables in the Equation

.024 .013 3.423 1 .064 1.025 .999 1.051

-.069 .058 1.432 1 .231 .933 .833 1.045

2.045 .807 6.421 1 .011 7.732 1.589 37.615

.761 .771 .976 1 .323 2.141 .473 9.700

1.564 .774 4.084 1 .043 4.778 1.048 21.783

.062 3.460 .000 1 .986 1.064

acid

age

xray

grade

stage

Constant

Step1

a




61

Score Statistic

Rao’s efficient score statistic

0

Maximum Likelihood

SPSS

Variables not in the Equation

3.117 1 .077

1.094 1 .296

11.283 1 .001

4.075 1 .044

7.438 1 .006

19.451 5 .002

acid

age

xray

grade

stage

Variables

Overall Statistics

Step0

Score df Sig.

62

Score Statistic

Score Statistic

0

Output Logistic Regression

Overall Statistics log-likelihood

statistic Likelihood-ratio Statistic, Wald Statistic Rao’s Efficient Score

statistic

(Rao, 1973)

Variables not in the Equation

3.117 1 .077

1.094 1 .296

11.283 1 .001

4.075 1 .044

7.438 1 .006

19.451 5 .002

acid

age

xray

grade

stage

Variables

Overall Statistics

Step0

Score df Sig.

Overview of Diagnostic TestingOverview of Diagnostic TestingAnd PrecisionAnd Precision

OutlineOutline

1) Diagnostic Testing

- General Concepts, Sensitivity, Specificity- Receiver Operating Characteristic (ROC) Curves- Area Under (ROC) Curve - AUC- Positive / Negative Predictive Value- Positive / Negative Likelihood Ratio- Implementation

2) Precision / Sample Size Considerations2) Precision / Sample Size Considerations

- Independent Samples Comparing (+) Proportions- Paired Samples

Diagnostic Testing:Diagnostic Testing:

General Concepts, Sensitivity, SpecificityGeneral Concepts, Sensitivity, Specificity

Underlying Goal of Diagnostic TestingUnderlying Goal of Diagnostic Testing

To accurately identify patients:

- disease vs. not-diseased (healthy)disease vs. not diseased (healthy)- case vs. control- normal vs. abnormal- binary comparisons

Through some diagnostic test:

blood s gar- blood sugar- Biomarkers- CT- MRI

Diagnostic Test Results Diagnostic Test Results Often on a Continuous ScaleOften on a Continuous Scale

Pts withPts withPts without thePts without the Pts with Pts with disease (D)disease (D)

Pts without the Pts without the disease (H)disease (H)

Diagnostic Test Result

Consider a CutConsider a Cut--point point

Call these patients “negative” (-) Call these patients “positive” (+)


without the disease (H)with the disease (D)

with disease

no disease

total

CutCut--point Gives Rise to a point Gives Rise to a 2 2 x x 2 2 tabletable

Consider 1000 patients: 500 with a disease, 500 without

test + 400

50 450

test - 100 450 550

t t l 500 500 1000total 500 500 1000

with disease

no disease

total

CutCut--point Gives Rise to a point Gives Rise to a 2 2 x x 2 2 tabletable

Consider 1000 patients: 500 with a disease, 500 without

test + True Positive

False Positive

450

test - False Negative

True Negative

550

t t l 500 500 1000total 500 500 1000

4 4 possible types of diagnostic testing classifications: possible types of diagnostic testing classifications: 2 2 correct, correct, 2 2 incorrectincorrect


Correct Classification: True Positive (TP) Correct Classification: True Positive (TP)

True Positives



with no

Definition: Sensitivity = TP / (TP + FN)= probability of a “+” test among diseased

Correct Classification: True Positive (TP) Correct Classification: True Positive (TP)

with disease

no disease total

test + 400 50 450

test - 100 450 550

total 500 500 1000

Example: Sensitivity = 400 / 500 = 80%

Correct Classification: True Negative (TN) Correct Classification: True Negative (TN)


True negatives



with no

Definition: Specificity = TN / (FP + TN)= probability of a “-” test among healthy

Correct Classification: True Negative (TN) Correct Classification: True Negative (TN)

with disease

no disease total

test + 400 50 450

test - 100 450 550

total 500 500 1000

Example: Specificity = 450 / 500 = 90%

Incorrect Classification: False Positive Incorrect Classification: False Positive


False Positives



with no

Definition: “False Positive Fraction” = FP / (FP + TN)= probability of “+” test among healthy

Incorrect Classification: False Positive Incorrect Classification: False Positive

with disease

no disease total

test + 400 50 450

test - 100 450 550

total 500 500 1000

Example: FPF = 50 / 500 = 10%

Incorrect Classification: False Negative Incorrect Classification: False Negative


False negatives



with no

Definition: “False Negative Fraction” = FN / (FN + TP)= probability of “-” test among diseased

Incorrect Classification: False Negative Incorrect Classification: False Negative

with disease

no disease total

test + 400 50 450

test - 100 450 550

total 500 500 1000

Example: FNF = 100 / 500 = 20%

We Choose the Cut-point

“ “ -- ““ “ “ ++ ““



Deciding on a cut-point is generally a balance of:

(2) Minimizing the probability of incorrect classifications

(1) Maximizing the probability of correct classifications

i.e. sensitivity and specificity

(2) Minimizing the probability of incorrect classifications

i.e. false positive fraction and false negative fraction

(3) Whether a high sensitivity is most important:

i.e. screening for AIDS virus at a blood blank

(4) Or whether a high specificity is most important:

i.e. testing for “presence of death” prior to autopsy

Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:

Receiver Operating Characteristic Receiver Operating Characteristic (ROC) Curve(ROC) Curve

100%

ROC Curve: Sensitivity against ROC Curve: Sensitivity against 1 1 -- SpecificitySpecificity

Consider a range of cut-points:

Sen

sitiv

ity

of cut points:

0%

1 - Specificity0% 100%

Consider a range of cut-points:

100%

ROC Curve: Sensitivity against ROC Curve: Sensitivity against 1 1 -- SpecificitySpecificity

of cut points:

Sen

sitiv

ity

0% 100%

0%

1 - Specificity

Best Test: Worst test:

100% 100%

Se

nsi

tivity

0%

1 - Specificity0%

100%

Se

nsi

tivity

0%

1 - Specificity0%

100%

Complete overlap of test results between diseased and non-diseased patients

No overlap of test results between diseased and non-diseased patients

100%100%

A good test: A less good test:

Se

nsi

tivity

0%

1 - Specificity0%

100%

Se

nsi

tivity

0%

1 - Specificity0%

100%

A Measure of Overall Usefulness of a Test:A Measure of Overall Usefulness of a Test:

AUC = Area Under (ROC) CurveAUC = Area Under (ROC) Curve

vity

100%

ity

100%

AUC of Four ROC Curves

100%

Se

nsi

tiv

0%

1 - Specificity0%

100%

Se

nsi

tiv

0%

1 - Specificity0%

100%

100%100%

50%

Se

nsi

tivity

0%

1 - Specificity0%

100%

Se

nsi

tivity

0%

1 - Specificity0%

100%

90%65%

AUC: Interpretation

Randomly select a diseased patient and get a score of Y.

Now, randomly select a healthy patient and get a score of X.

then,

AUC = Probability that Y is bigger than X (assume larger test values associated with disease)

Rough AUC Guidelines: Swets, J.A. (1988)

0.50 - 0.60 - Not So Good Science, 1285 - 1993

0.60 - 0.75 - Fair0.75 - 0.90 - Good0.90 - 0.97 - Very Good0.97 - 1.00 - Excellent


Why Do We Care About Sensitivity?Why Do We Care About Sensitivity?Why Do We Care About Specificity?Why Do We Care About Specificity?

Ultimately, we are truly interested knowingUltimately, we are truly interested knowingwhether or not a positive test => disease or whether or not a positive test => disease or whether or not a positive test => disease, or whether or not a positive test => disease, or whether or not a negative test => healthy. whether or not a negative test => healthy.

Positive Predictive ValuePositive Predictive Value

Positive Predictive Value (PPV) Positive Predictive Value (PPV)

(sensitivity) x (prevalence)(sensitivity) x (prevalence)(se s t ty) (p e a e ce)(se s t ty) (p e a e ce)= = ----------------------------------------------------------------------------------------------------------------------------------------------------------------------

(sensitivity)x(prevalence) + ((sensitivity)x(prevalence) + (11--specificity)x(specificity)x(1 1 -- prevalence)prevalence)

= probability of disease given a positive test= probability of disease given a positive test

Prevalence = Prevalence = prevalence of disease among population prevalence of disease among population who will be who will be testedtested..

=>=> Depending on how/where the test is used,Depending on how/where the test is used,prevalence can change the PPV of your test!prevalence can change the PPV of your test!

PPV vs. PrevalencePPV vs. Prevalence



Negative Predictive ValueNegative Predictive Value

Negative Predictive Value (NVP) Negative Predictive Value (NVP)

(specificity) x ((specificity) x (1 1 -- prevalence)prevalence)= = ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

((11--sensitivity)x(prevalence) + (specificity)x(sensitivity)x(prevalence) + (specificity)x(1 1 -- prevalence)prevalence)

= probability of not having disease given a negative test = probability of not having disease given a negative test

Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:

Positive / Negative Likelihood RatiosPositive / Negative Likelihood Ratios

Positive Likelihood RatioPositive Likelihood Ratio

Positive Likelihood Ratio:

sensitivityLR+ = ------------------------

In our example:

0.8= ------------ = 8.0

Indicates:- How much odds of disease is increased if test is positive- A ratio of something that is desirable (true positives)

divided by something undesirable (false positives)

G l G id li

1 - specificity 1 - 0.9

General Guidelines:1 => Test is Useless1 - 2 => Rarely important change in pre- to post test odds2 - 5 => Small Change5 - 10 => Moderate Change>10 => Large Change

Negative Likelihood RatioNegative Likelihood Ratio

Negative Likelihood Ratio:

1 - sensitivityLR- = ------------------------

In our example:

1 - 0.8= ------------ = 0.22

Indicates:- How much odds of disease is decreased if test is negative- A ratio of something that is undesirable (false negatives)

divided by something desirable (true negatives)

G l G id li

specificity 0.9

General Guidelines:0.5 - 1 => Rarely important change in

pre- to post test odds0.2 - 0.5 => Small Change0.1 - 0.2 => Moderate Change0 - 0.1 => Large change in pre- to post test odds


ImplementationImplementation

ImplementationImplementation

An excellent web tool is available from Dept. of RadiologyJohns Hopkins U.

www jrocfit orgwww.jrocfit.org

Citation:Eng, J. (n.d.). ROC analysis: web-based calculator for ROC curves. Retrieved 10/2006, from http://www.jrocfit.org.

Other packages include:

Stata - ROC curves, AUC, testsExcel - ROC curves with manual data manipulationSPSS - ROC curves, AUC, tests

Implementation Implementation -- A Quick ExampleA Quick Example

Question of Interest:

Does the length of an abnormal T2 signal (median nerveDoes the length of an abnormal T2 signal (median nerve measured via MRI) predict who will benefit from surgery among patients with carpal tunnel syndrome.

Data:

65 patients ho “benefit” from s rger65 patients who “benefit” from surgery76 patients who did not “benefit” from surgery

141 MRI measures of median nerve:T2 abnormal signal length (mm.)



Data entered into www.jrocfit.org:

Benefit Signal Length…0 27.70 10.00 22.70 14.41 17 81 17.81 46.41 7.01 20.4…


Output from www.jrocfit.org:

Measure of Overall “Diagnosticity”:Measure of Overall Diagnosticity :AUC = 0.71 (95% CI: 0.63 - 0.79)

Precision:Precision:

Sample Size ConsiderationsSample Size Considerations

Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Two Independent SamplesTwo Independent Samples

Pretend we’d like to compare a new imaging technologyto the gold standard.to the gold standard.

- In other patients, the gold standard was shown to positivelyidentify abnormalities 80% (P1) of the time.

- You believe that the new imaging technology will do betterthan 80% -- possibly positively identifying 85%, 90%, or even 95% of abnormalities (P2).

- How many patients do you need to show that your newimaging technology is better? (P2 - P1 > 0)

Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Two Independent SamplesTwo Independent Samples

Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Paired SamplesPaired Samples

NOW pretend we’d like to compare this new imaging technology to the gold standard, BUT on the same patients.

- Again, the gold standard was shown to positivelyidentify abnormalities 80% (P1) of the time.

- Assume that the new imaging technology positive identifies 90% of abnormalities (P2).

NOW how many patients do you need to show that the- NOW, how many patients do you need to show that the new imaging technology is better? (P2 - P1 > 0)

Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Paired SamplesPaired Samples

maxent: statistical explanation (elithet. al....

Documents