maxent: statistical explanation (elithet. al....
TRANSCRIPT
14/09/55
1
MaxEnt:MaxEnt:A Statistical Explanation
(Elith et. al. 2011)
Nantachai Pongpattananurak
Department of Conservation
Introduction
• MaxEnt is a program for modelling species distributions from presence only speciesdistributions from presence‐only species records.
• a new statistical explanation of MaxEnt which shows that the model minimizes the relative entropy between two probability densities (one estimated from the presence data and(one estimated from the presence data and one, from the landscape) defined in covariate space.
14/09/55
2
Introduction
• Species distribution models (SDMs) estimate the relationship between species records at sites and the environmental and/or spatial characteristics of those sites (Franklin, 2009).
• MaxEnt’s predictive performance is consistently competitive with the highest performing methods (Elith et al., 2006).
• Published examples cover diverse aims (findingcorrelates of species occurrences mapping currentcorrelates of species occurrences, mapping current distributions, and predicting to new times and places) across many ecological, evolutionary, conservation and biosecurity applications (Table 1).
• Source: Elith et al. 2010
14/09/55
3
WHAT IS SPECIAL ABOUT THEPRESENCE‐ONLY CASE?
• It is complex because of the interplay of data quality, modellingmethod and scale of analysismethod and scale of analysis.
• Presence‐only data release us from the problems of unreliable absence records.
• However, the presence records are also imprinted by many of the factors affecting absences. If a species is absent from an environmentally suitable area because, say, past disturbances h d l l ti ti th i l f th t b ill bhave caused local extinctions, the signal of that absence will be found in the distribution of presence records: there will be no presence records in the disturbed area.
WHAT IS SPECIAL ABOUT THEPRESENCE‐ONLY CASE?
• Regardless of whether absences are used in modelling, the pattern in the presence records will suggest the area is unsuitable, and the model will be affected by this patterning.
• If the detectability of a particular species varies from site to site, then not only does this result in some false absences in presence‐absence data, it also affects the pattern of presences in presence‐only data.
• This thinking means that we will approach the description of the g pp ppresence‐only modelling problem as one that is trying to model the same quantity that is modelled with presence‐absence data, that is, the probability of presence of a species.
14/09/55
4
Introduction
• Let’s assume that the data be available to the modeller are presence‐only, i.e., a set of locations within L, the landscape of interest where the species has been observedinterest, where the species has been observed.
• Let y = 1 (presence), y = 0 (absence), z = a vector of environmental covariates, and background = all locations within L (or a random sample thereof).
• Assume the environmental variables or covariates z (representing environmental conditions) are available landscape wide.
• Define f(z) to be the probability density of covariates across L, • f1(z) to be the probability density of covariates across locations ( ) p y y
within L where the species is present, and • f0(z) where the species is absent • (densities – or probability density functions –describe the relative
likelihood of random variables over their range and can be univariate or multivariate).
Introduction
• we wish to estimate is the probability of presence of the species, conditioned on environment:
)|1P (
• Strictly presence‐only data only allow us to model f1(z), which on its own cannot approximate probability of presence.
• Presence/background data allows us to model both f1(z) and f(z), and this gets to within a constant of
• Pr(y = 1|z), because Bayes’ rule gives:
).|1Pr( zy
• Usually, Pr(y = 1) is unkonwn the prevalence of the species (proportion of occupied sites) in the landscape.
)(
)1Pr(*)()|1Pr( 1
zf
yzfzy
14/09/55
5
Introduction
Issues of presence/absence data• Prevalence is not identifiable from presence‐only data.Prevalence is not identifiable from presence only data. • Due to issues of detection probability, even presence‐absence data may not yield a good estimate of prevalence.
• Sample selection bias (whereby some areas in the landscape are sampled more intensively than others) has a much stronger effect on presence‐only models h b d lthan on presence‐absence models.
• For presence‐absence models, sample selection bias affects both presence and absence records, and the effect of the bias cancels out.
Introduction
Issues of presence/absence data• Presence or absence of a species is dependent on the time frame and
spatial scalespatial scale• However, with presence‐only data, we typically have occurrence data that
do not have any associated temporal or spatial scale.• The record is usually simply a record of the species at a location, with no
information on search area or time.• With presence‐absence data, the definition of the response variable
should naturally be consistent with the sampling method. For example, if the available data are surveys of 1‐m2 quadrats, then y = 1 should correspond to the species being present in a 1‐m2 quadrat.
• With presence‐only data, the available data do not usually describe the survey method, so we has considerable leeway in defining the response variable. A common approach is to implicitly assume a sampling unit of size equal to the grain size of available environmental data.
14/09/55
6
Introduction
In summary• With presence and background data, we can model the p g ,
same quantity as with presence‐absence data, up to the constant Pr(y = 1).
• If presenceabsence survey data are available, it is advisable to use a presence‐absence modelling method, since in that case the models are less susceptible to problems of sample selection bias
• Presence‐absence data give us much better information gabout prevalence than presence‐only, because –even though there may be some difficulties because of imperfect detection – they solve the major problem of nonidentifiability.
Explanation of MaxEnt
14/09/55
7
Covariates and features
• Most ecologists call the independent variables in a model the covariates, predictors or inputs. I SDM h i l d i l f h l• In SDMs, these include environmental factors that are relevant to habitat suitability.
• Since species’ responses to these tend to be complex, it is usually desirable to fit nonlinear functions.
• In regression this can be achieved by applying transformations to the covariates – for instance, creating basis functions for polynomials and splines, including piecewise linear functions.
• Complex models are fitted as linear combinations of these basis pfunctions in methods including GLMs and GAMs.
• In machine learning, basis functions and other transformations of available data are termed “features”
• Features are an expanded set of transformations of the original covariates.
Covariates and features
• The MaxEnt fitted function is usually defined over many features, meaning that in most models there will be more features than covariates.
• Six feature classes include 1)linear 2)product 3)quadratic 4)hinge• Six feature classes include 1)linear, 2)product, 3)quadratic, 4)hinge, 5)threshold and 6)categorical.
• Products are products of all possible pair‐wise combinations of covariates, allowing simple interactions to be fitted.
• Threshold features allow a ‘‘step’’ in the fitted function.• Hinge features are similar except they allow a change in gradient of the
response. • Many threshold or hinge features can be fitted for one covariate, giving a
potentially complex function. potentially complex function.• Hinge features (which are basis functions for piecewise linear splines), if
used alone, allow a model rather like a generalized additive model (GAM): an additive model, with nonlinear fitted functions of varying complexity but without the sudden steps of the threshold features.
• MaxEnt’s default is to allow all feature types.
14/09/55
8
Overview
• MaxEnt focuses on comparing probability density in the covariate space.
Overview• From Equation
f1(z) = the conditional density of the covariates at the presence sites,
)(
)1Pr(*)()|1Pr( 1
zf
yzfzy
f(z) = the marginal (i.e., unconditional) density of covariates across the study area
Pr(y = 1) = the prevalence (we usually don’t have this knowledge)
Pr(y=1|z) = the probability of occurrenc
• If we known f1(z) and f(z), we only nedd to know Pr(y=1) to calculate Pr(y=1|z)
• MaxEnt estimates the ratio f1(z)/f(z) (MaxEnt’s ‘‘raw’’ output). This is the core of the MaxEnt model output.
• Because the required information on prevalence is not available for calculating Pr(y = 1|z) k d h b i l d ( d M E ’ ‘‘l i i ’’ ), a workaround has been implemented (termed MaxEnt’s ‘‘logistic’’ output).
• This treats the log of the output: ƞ(z) = log(f1(z)/f(z)) as a logit score, and calibrates the intercept so that the implied probability of presence at sites with ‘‘typical’’ conditions for the species (i.e., where ƞ(z) = the average value of ƞ(z) under f1) is a parameter Ƭ.
• Knowledge of Ƭ would solve the nonidentifiability of prevalence, and in the absence of that knowledge MaxEnt arbitrarily sets Ƭ to equal 0.5.
• This logistic transformation is monotone (order preserving) with the raw output.
14/09/55
9
The landscape and species records
• The landscape of interest (L) is a geographic area suggested and defined by the ecologist (e.g., geographic boundaries or an understanding of how far the focal species could have dispersed).
• L1 is the subset of L where the species is present.
• The distribution of covariates in L is conveyed by a finite sample – a collection of points from L with associated covariates, called a “background sample”.
• By default, MaxEnt randomly samples 10,000 background locations from covariate grids, but the background data points can also be specified.
• Background sample does not take any account of the presence locations –a sample of L, and could by chance include presence locations.
• A random background sample implies the sample of presence records is also a random sample from L1.
Description of the model
• Estimate the ratio f1(z)/f(z). • Estimate of f1(z) that is consistent with the occurrence ( )
data; many such distributions are possible, but it chooses the one that is closest to f(z).
• Minimizing distance from f(z) is sensible, because f(z) is a null model for f1(z): without any occurrence data, we would have no reason to expect the species to prefer any particular environmental conditions over any others,
• So we could do no better than predict that the species p poccupies environmental conditions proportionally to their availability in the landscape.
• In MaxEnt, this distance from f(z) is taken to be the relative entropy of f1(z) with respect to f(z).
14/09/55
10
Description of the model• Using background data informs the model about f(z), the
density of covariates in the region, and provides the basis for comparison with the density of covariates occupied by thecomparison with the density of covariates occupied by the species f1(z).
• Constraints are imposed so that the solution is one that reflects information from the presence records.
• For example, if one covariate is summer rainfall, then constraints ensure that the mean summer rainfall for the estimate of f1(z) is close to its mean across the locations with observed presence . p
The heart of MaxEnt: the species distribution is estimated by minimizing the distance between f1(z) and f(z) subject to constraining the mean summer rainfall estimated by f1 to be close to the mean across presence locations.
Appendix 2
14/09/55
11
¶(x) = f1(z) = Pr(x|y=1) =Pr(z|y=1) The “raw” distribution: the probability given the species is present, that is found at pixel x
• Maximizing the entropy of the raw• Maximizing the entropy of the raw distribution is equivalent to minimizing the relative entropy of f1(z) relative to f(z), so the two formulations are equivalent.
)()1|(ˆ
)1|Pr()(
xqyxp
yxx
e xz )(
Q
Where z(x) = a vector of environmental variables
defined over the sampling space of pixelsβ a vector of weights (coefficients)β = a vector of weights (coefficients)qβ(x) = a probability distribution over xQβ = a normalization constant that ensures
qβ(x) sum to 1
14/09/55
12
f1(z) = Pr(Z=z|y=1)
= f(z) eα+βz
ZL ])([#
α for a normalizing constant ensuring that f1(z) integrates (sums) to 1
where L
zxZLxzf
])(:[#)(
f(z) the number of x in L such that Z(x)=z, divided by the number of elements (pixels) in (p )the landscape L
The null model for the raw distribution was the uniform distribution over the landscape
Description of the Model
• ConstraintsMaxEnt fits the model on f t th t t f ti f thfeatures that are transformations of the covariates allowing complex relationships to be modeled.
• The constraints are extended from being constraints on the means of covariates to being constraints on the means of the features.
14/09/55
13
Description of the Model
• Let the vector of features be h(z) and the vector of coefficient beβcoefficient beβ
• Minimizing relative entropy results in a Gibbs distribution which is an exponential‐family model:
h ( ) β*h( )
)(1 )( zezfzf
where η(z) = α+β*h(z)
α = a normalizing constant ensures that f1z integrates (sums) to 1.
Description of the Model
• The target of a MaxEnt model is eƞ(z) (i.e., f ( )/f( ))f1(z)/f(z)).
• It is a log‐linear model similar in form to a GLM and depends on both the presence sample and the background samples that are used in forming the estimate.g
14/09/55
14
Mechanics of the solution
• MaxEnt needs to find β that will result in the constraints being satisfied but not match up them so closely that it
fi d d d l i h li i d li ioverfits and produces a model with limited generalization.• MaxEnt sets an error bound (λ); or maximum allowed
deviation from the sample (empirical) feature means.• It rescales all features to have the range 0 – 1.• Then, λj is calculated for each feature.• λj reflects the variation in sample values for that feature,
adjusted by a tuned (preset) parameter for the featureadjusted by a tuned (preset) parameter for the feature class.
λ = tuning parameter (regularization)
Mechanics of the solution
• MaxEnt could estimate feature error bounds only from the data.
• MaxEnt uses feature class‐specific tuned parameters based on large international dataset.
• The dataset covers 226 species, 6 regions of the world, sample sizes ranging from 2 to 5822, and 11‐13 predictor per region.
• It is possible that the tuning may not work well for very diff d if hdifferent datasets, e.g., if there are many more predictors.
• The tuned parameters can be changed by the user if desired.
14/09/55
15
Mechanics of the solution
hS jj
][2
mj
λj = the regularization parameter for feature hjS2[hj] = feature’s variancem = presence sitesλ = tuning parameter
Mechanics of the solution
• Conceptually, λj corresponds to the width of the confidence interval, and therefore it takes the form of the standard error multiplied by λ according to the desired confidence levelmultiplied by λ according to the desired confidence level.
• λ in eq 3 allows regularization, i.e., smoothing the distribution call L1‐regularization.
• This is a way of reducing the β (penalizing them) to values that balance fit and complexity allowing both accurate prediction and generality.
• The fit of the model is measured at the occurrence sites using a log likelihood.
• A highly complex model will have a high log likelihood, but may not generalize well.
• The aim of regularization is to trade off the fit.
14/09/55
16
A criterion to find an optimal model
n
zm
iezf )( ))(ln(1
max j
jji
i ezfm 11
,))(ln(max
subject to where
z = the feature vector for occurrence point I of m sitesj = I,…,n featuresȠ(zi)= α+β*h(z)
L
z dzezf i 1)( )(
Ƞ( i) α β h( )
•MaxEnt fits a penalized maximum likelihood model closely related to other penalties for complexity such as Akaike’s Information Criterion.•Maximizing the penalized log likelihood is equivalent to minimizing the relative entropy subject to the error bound constraints.
prob ln(prob) exp(ln(prob)) -2*LN(prob)0 00000001 18 4 0 00000001 36 80.00000001 -18.4 0.00000001 36.8
0.1 -2.3 0.1 4.60.2 -1.6 0.2 3.20.3 -1.2 0.3 2.40.4 -0.9 0.4 1.80.5 -0.7 0.5 1.40.6 -0.5 0.6 1.00.7 -0.4 0.7 0.70.8 -0.2 0.8 0.40.9 -0.1 0.9 0.21.0 0.0 1.0 0.0
14/09/55
17
Feature Selection
Feature Selection
14/09/55
18
MaxEnt’s Outputs
• Raw Output– Maxent assigns a non‐negative probability to each pixel in the study area. Because these
probabilities must sum to 1, each probability is typically extremely small.
– The primary output of Maxent is the exponential function.
– Raw values are also scale‐dependent, in the sense that using more background data results in smaller raw values, since they must sum to one over a larger number of background points.
• Cumulative Output– The value assigned to a pixel is the sum of the probabilities of that pixel and all other
pixels with equal or lower probability, multiplied by 100 to give a percentage.
d f l i l di h d l d b bili– Interpreted as: If we resample pixels according to the modeled Maxent probability distribution, then c % of the resampled pixels will have cumulative value of c or less.
– Thus, if the Maxent distribution is a close approximation of the probability distribution that represents reality, the binary model obtained by setting a threshold of c will have approximately c % omission of test localities.
• Logistic Output
MaxEnt’s Logistic Output
• Attempt to estimate of the probability that the species is present given the environmentspecies is present, given the environment, Pr(y=1|z)
• A post transformation of the MaxEnt raw output that make certain assumptions about prevalence and sampling effort.
R d l i ti t t t i ll• Raw and logistic outputs are monotonically related in terms of suitability sites (yield identical AUC)
14/09/55
19
MaxEnt’s Logistic Output
• From where , )(
)1Pr(*)()|1Pr( 1
zf
yzfzy
)(1 )( zezfzf
a simple approach to estimate Pr(y=1|z) would be to multiply eƞ(z) by a constant that estimate prevalence.
• The disadvantage is eƞ(z) can be large, implying that we may get an estimate of Pr(y=1|z) that
d 1exceed 1.• To avoid this problems and to side‐step Pr(y=1), MaxEnt’s logistic output transform the model from an exponential model to a logistic model.
MaxEnt’s Logistic Output
From a “minimax” or robust Bayes approach (see more details in Appendix 3),
where, ƞ(z) = the linear score of the features, α+β*h(z)
th l ti t f M E t’ ti t
rz
rz
e
ezy
)(
)(
1)|1Pr(
r = the relative entropy of MaxEnt’s estimate of f1(z) from f(z)
Ƭ = the probability of presence at site with typical conditions for the species (the prevalence)
14/09/55
20
MaxEnt’s Logistic OutputFrom a “minimax” or robust Bayes approach (see more details in Appendix 3),
•r is the average of ƞ(z) under f1(z), so sites with ƞ(z) = r have Pr(y=1|z)= Ƭ.•In unsuitable areas, the denominator is close to 1‐ Ƭ, so the
rz
rz
e
ezy
)(
)(
1)|1Pr(
result is a linear scaling of raw output.•In suitable areas, the effect of the denominator is to bound output below 1.•The default value for Ƭ is arbitrarily set to 0.5.•If Ƭ is available, we can set up Ƭ by our own (only availabe in the latest MaxEnt Version)
Implication for modelling
Some thoughts when applying MaxEnt• Maxent relies on an unbiased samplep
– Collecting a comprehensive set of presence records (cleaned for duplicate and errors)
– Dealing with biased species data– Providing background data with similar biases to those on the
presence data, e.g. using site surveyed for other species or using a bias grid that indicates the biases in the survey data.
– Projection issuesTh l d h ld i l d h f ll i l f– The landscape should include the full environmental range of the species and exclude areas that have not been searched
– Excluding areas from the background sample through use of masks
14/09/55
21
Implication for modelling
• Feature types– Restrict the modle to simple features if few sample are p pavailable
– Linear is always used; quadratic with at least 10 samples; hinge with at least 15; threshold and product with at least 80
– Reduce the candidate predictor set using ecological understanding of the speices
– Hinge feature make linear and threshold features gredundant
– Hinge feature is similar to GAM – Excluding product features creates an additive model that is easier to interpret
Implication for modelling
• Regularization (L‐1 regularization)Inbuilt method– Inbuilt method
– Dealing with feature selection (relegating some coefficients to zero) from pre‐select variables
– Tolerance to correlated variables (advantage)
– Expert pre‐selection of a candidate set is a good idea
– Selecting proximal variables must be used for different regions or climates
– When smoother models are required, incresingregularization parameters
14/09/55
22
Implication for modelling
• Logistic output
– If comparing model for different species, some care is needed in use of the logistic outputs because probability of presence is only defined relative to a given level of sampling effort, which a default is assumed to be one that results in a 50% chance of observing the species in suitable areaschance of observing the species in suitable areas.
Case Study 1 & 2
14/09/55
23
Case Study 1
We use species data from the Banksia Atlas (Taylor &Hopper, 1988; Yates et al., 2010), with 361 records forB. prionotes from the 4631 sites across the SouthWest Australia
( )Floristic Region (SWAFR) that were surveyed for Banksia andfor which we had complete environmental data. The atlas is theresult of a community science project, and records could eitherbe interpreted as presence‐only or presence‐absence data,depending on what assumptions are made about the searchpatterns of contributors. Here we treat them as presence‐onlydata, but use the full set of locations as one ‘‘background’’treatment. To demonstrate the effect of this choice, twoalternative backgrounds (i.e., landscape definitions) wereevaluated: a sample of 10,000 sites within the SWAFR (Yateset al., 2010; and Fig. 2) and a sample of 20,000 sites across theg ) pwhole of Australia. The larger number of sites across Australiawas used to ensure good representation of all environments,based on previous tests of the effects of background sample sizeon model structure for these predictors (J. Elith, unpubl. data).Because the covariate data for this study are unprojected, thesesamples were weighted according to cell area (see methods inAppendix S4) but otherwise random.
Case Study 1
Models were fitted and projected to both current and futureclimates (Fig. 3) using only hinge features, with default
(regularization parameters (see Appendix S5 for model details, and for a comparison with models fitted with all feature types).We fitted all models on the full data sets but also used 10‐fold cross‐validation to estimate errors around fitted functions and predictive performance on held‐out data. The latter is a good test for each model but – given the different backgrounds – not comparable across models. Note also that the AUC in this case is calculated on presence vs. background data (Phillips et al., 2006). To compare the models on consistent data we also divided thecompare the models on consistent data, we also divided the atlas data into training and testing sets for a manual 5‐fold cross‐validation, testing each model on identical withheld data via two test statistics (area under the receiver operating characteristic curve (AUC), and correlation, COR;details in Appendix S4). Example code for running suchanalyses are available online (Appendix S4).
14/09/55
24
Case Study 1
Case Study 1The Effect of Model Clamping
ClamingThe process by which features are constrained to remain within the range of values in the training data.
• This identifies locations where predictions are uncertain because of the method of extrapolation, by showing where clamping substantially affects the predicted value.
• Extreme care should be taken whenever extrapolating outside the training,
• so new calculations (‘‘MESS maps’’, i.e., ( p , ,multivariate environmental similarity surfaces) display differences between the training and prediction environments.
• In this case compared with environments at the atlas sites, the northern parts of the SWAFR will experience novel climates in 2070 (model 1).
14/09/55
25
Case Study 1Cross Validation
• Cross‐validation is a straightforward, quick and useful method for resampling data for training and testing models. In k‐fold cross‐validation (Kohavi, 1995; Hastie et al., 2009)
• the data are divided into a small number (k, usually 5 or10) of mutually exclusive subsets. Model performance is assessed by successively removing each subset, so you have one subset omitted and k‐1 retained.
• You then fit your model on the retained data, and predict to the omitted data. You cycle through all the possibilities k times (the "folds"; see box below).
• By the end every site has been used in model fitting k‐1 times (but in k‐1 different combinations with other sites; see box below),
• and each site has been used for evaluation just once, and only when the model was not trained on that sitetrained on that site.
• Any evaluation will be performed on held‐out data. Cross‐validation is a widely used resampling method, and is fast and performs well. (Hastie et al., 2009)
• Resampling methods that test on withheld data are appropriate tests of a model's predictive performance on data from within the same general set of environmental conditions as those used to fit the data – e.g., within the same geographic region, or the same time.
Case Study 1
MaxEnt uses the cross‐validation in 2 ways.
1) It records the response curve and the predictions fitted under each of the k different training sets, and uses those to give an indication of variability in the results.
2) It also uses the predictions to the omitted (with‐held) sites for estimating test performance (e.g. for the reported test AUC).
For AUC, MaxEnt not only needs predictions to sites with observed occurrence, but also to background sites. For background, it uses the same background data for training and evaluation of all replicate runs.
14/09/55
26
Case Study 1Manual five‐fold cross validation• The different models (1 to 3, see main manuscript) use the same presence sites but different
backgrounds for training. Therefore, the AUC's estimated in MaxEnt's cross‐validation (AUCMaxent) are not comparable, because they use different background data sets.
• For evaluation we want a consistent dataset across models but also one that was not usedFor evaluation we want a consistent dataset across models, but also one that was not used for model training. We could do one split of the training data, and just have one training set and one test set, but cross‐validation is more thorough, tending to give a lower variance estimate than split samples (Hastie et al. 2009).
• Hence we used a "manual" five‐fold cross‐validation (i.e. one where we pre‐specified the data sets, and calculated the evaluation statistics outside of MaxEnt; AUCManual) to ensure consistent evaluation data sets across all four models.
• We divided the presence records into 5 mutually exclusive subsets (using the same ideas as presented in the cross‐validation section earlier in this appendix). For each training run (each fold), 4 of these were combined into a training set, and the 5th set aside as the data to which the model was predicted.
• We also need absence or background data. In this case we used the atlas sites at which Banksia prionotes was not recorded, and treated them as absences. So, the "absence" atlas sites were also divided into 5 subsets, and each combined with a presence subset to give 5 independent evaluation data sets. Note that for model 1, which uses atlas sites as background, we ensured that the "testing“ (evaluation) sites for any given fold were not used as background in the training data, to guarantee that each model's training set was independent of the testing set. In other words, for model 1 we had different subsets of background data for each cross‐validation fold.
Case Study 1
Model Background = 1 the current distribution of the speciesModel Background =2 the distribution only in Southwest AustraliaModel Background=3 across the continent of Australia
14/09/55
27
Case Study 2This analysis predicts the current distribution of Gadopsis bispinosus, the two‐spinedblackfish, in rivers of south‐eastern Australia. In the preamble we make a case that withIn the preamble, we make a case that with presence and background data, we can model the same quantity as with presence‐absence data, up to the constant Pr(y = 1). One implication of that is that we should be able to use the same types of data, including fine‐scale, detailed information, to model ecological relationships – i.e., we need not be restricted to coarse grid cells and basicbe restricted to coarse grid cells and basic climate variables. Here, we use detailed ecological information at the river segment scale to model the distribution of a native fish species. To our knowledge, it is the firstexample using MaxEnt with vector (river segment) data.
Case Study 2
14/09/55
28
Next thing we need to do!
Bias GridCreating a bias grid for Maximum Entropy Modelling (MaxEnt)Creating a bias grid for Maximum Entropy Modelling (MaxEnt)• One limitation of presence‐only data is the effect of
sample selection bias, where some areas in the landscape are sampled more intensively than others (Phillips et al. 2009) – MaxEnt requires an unbiased sample and this applies to all SDMs.
• some areas more intensively surveyed for target species than others and the predictions by MaxEnt were lessthan others and the predictions by MaxEnt were less biologically meaningful if this bias was not accounted for.
• So a bias grid was created to upweight presence‐only datapoints with fewer neighbors in the geographic landscape.
References
• Steven J. Phillips, Miroslav Dudík, Robert E. Schapire.A maximum entropy approach to species distribution modeling.In Proceedings of the Twenty First International Conference onIn Proceedings of the Twenty‐First International Conference on Machine Learning, pages 655‐662, 2004.
• Steven J. Phillips, Robert P. Anderson, Robert E. Schapire.Maximum entropy modeling of species geographic distributions.Ecological Modelling, 190:231‐259, 2006.(datasets used in this paper are available below)
• Jane Elith Steven J Phillips Trevor Hastie Miroslav Dudík Yung En• Jane Elith, Steven J. Phillips, Trevor Hastie, Miroslav Dudík, Yung En Chee, Colin J. Yates.A statistical explanation of MaxEnt for ecologists.Diversity and Distributions, 17:43‐57, 2011.
14/09/55
29
Tutorial on Maxent
Based on: S. Phillips 2012. A Brife Tutorial on Maxent.
1
Introduction to Logistic Regression
301572
GIS for watershed and Environmental Management
2
Statistical Analyses
Time series analysistimecontinuous
Binary logistic regressioncategory and/or continuousBinary
Survival analysiscategory and/or continuousa time at death
Logistic model (binomial errors)category and/or continuousproportion
Log-linear model (Poisson errors)category and/or continuouscount
Analysis of Covariance (ANCOVA)category and/or continuouscontinuous
Regressioncontinuouscontinuous
Analysis of Variacne (ANOVA)categorycontinuous
Contigency table-count data
Statistical AnanlysisExplanatory variableResponse variable
Type of Data
3
Understanding Logaritms
For a positive number n, the logarithm of n is the power to which some number b (the base of the logarithm) must be raised to yield n
if , then nb x xnblog
Log Rules:
1) logb(mn) = logb(m) + logb(n)
2) logb(m/n) = logb(m) – logb(n)
3) logb(mn) = n · logb(m)
4
Use of logarithms in quantitative analysis
• Transform a nonlinear model into a linear form so the researcher can use regression techniques and diagnostics.
• When the values of particular independent variable are very large, particularly when the values of the other independent variables are small by comparison, taking the log of large values to “shrink” them to a more manageable size will keep them from influencing the results of analysis.
5
(Y) (binary data)
(X)
Multiple regression analysis Discriminat Analysis
Introduction
6
Introduction
Binary Logistic Regression
( )
Discriminant
Ananlysis
Discriminant Analysis
Logistic Regression (Hosmer and
Lemeshow 2000)
7
Logistic Regression
(Myocardial
Infraction)
Lab
8
Logistic Regression Model
( )
(category) (continuous)
log Odds
9
Logistic Regression Model
dichotomy binary( , )
multiple linear regression
/
linear regression model 1
10
Logistic Regression Model
Discriminant Analysis
Discriminant Analysis
Multivariate Normal Distribution
Discriminant Analysis
Discriminant Scores
11
Logistic Regression Model
Logistic Regression
•
•
• log odd
ratio (
)
12
Logistic Regression Model
• logistic regression
13
Logistic Regression Model
B0 B1 =
X =
e = base of the natural logarithms (2.718)
Prob(event) =
Prob (event) = )( 101
1XBBe
14
Logistic Regression Model
Prob (event) = Ze1
1
Z
Z = pp XBXBXBB ...22110
p =
15
1 logistic regression
Z S
(cumulative
probability)
(nonlinear)
0 1 Z
x
Y
0
1
Prob (event) = Ze1
1
16
Predicting Node Involvement
(lymph nodes)
17
Predicting Node Involvement
Note: logistic regression
Peduzzu et al. (1966)
10
18
Coefficients for the Logistic ModelVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
( column B)
acid, age, xray, grade, stage
0 1 1 xray
xray 1 stage
1 grade
19
Coefficients for the Logistic ModelVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
parameters logistic regression
Prob (event) = Ze1
1
Z = 0.0618 + 0.0243 (acid) – 0.0693 (age) + 2.0453 (xray) + 0.7614 (grade) + 1.5641 (stage)20
Calculating a Predicted ProbabilityVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
66
48 0 Z
Z = 0.0618 + 0.0243 (48) – 0.0693 (66) = -3.346
Prob (event) = )46.3(1
1
e= 0.0340
21
Calculating a Predicted ProbabilityVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Warning: (mean response)
22
Estimating the CoefficientsVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Linear Regression least-square
method
Logistic Regression Maximum-likelihood
(iterative algorithm)
(converge) threshold
23
Estimating the CoefficientsVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Warning: Logistic Regression
(
)
(constant)
(Schlesselman, 1982)24
Interpreting the Logistic Coefficients
Multiple Linear Regression
Logistic Regression Multiple Linear
Regression
Odds, Logits Odd Ratios
25
Calculating Odds
Odds
Odds
Odds = nodes)(negativeProb
nodes)(positiveProb
Warning: Odds
26
Calculating Logit
Logistic log of the odds logit
Multiple
Regression
( ) log of odds
Logistic Regression Model
log odds
( interaction terms
)
pp XBXBB ...event)(noProb
(event)Problog 110
27
Calculating LogitVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
grade 0.76
“ grade 0 1
log odds 0.76”
28
Calculating Change in OddsVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Odds 60 acid
62 X-ray 1 stage grad 0
Prob (malignant nodes) =Ze1
1= 0.37
Z = 0.0618 + 0.0243 (62) – 0.0693 (60) + 2.0453 (1) = -0.54
( = 0.63)
29
Calculating Change in OddsVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
odds
Odds = event)(noProb
(event)Prob =
0.37-1
0.37= 0.59
log odds -0.53
Odds 60 acid
62 X-ray 1 stage grad 0
30
Calculating LogitVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
grade = 1 , 0.554
Odds = 1.24 log odds = 0.22
grade = 1 log odds 0.75 (=0.22 - (-0.53))
log odds grade = 0 1 (log odds = 0.22 -0.53 )
0.75 grade
Odds Ratio grade 0 1 1.24/0.59 =
2.1 ( Exp(b) 1)
31
Calculating the Odds RatioVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Logistic odds
Exp(Bi) Odds i
Odds Ratio ( i
interactions)
pppp XBXBBXBXBXBB eeee ...event)(noProb
(event)Prob11022110 ...
32
Calculating the Odds RatioVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Bi Odds Ratio 1 Odds
Bi Odds Ratio 1 Odds
Bi 0 Odds Ratio 1 Odds
33
Calculating the Odds RatioVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Odd Ratio grade Odds 2.1 grade 0 1
95% Odds 0.47
9.7
Odds 1
grade Odds
( B 0 Odds
1)34
Calculating the Odds RatioVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
Tip: ,
Odds Ratio
Odds Ratio C eCB B
35
Testing Hypotheses about the Model
0 Logistic Regression
Likelihood-ratio Test
(the sum of squared residuals) Multiple Linear Regression
Likelihood-ratio Test
Likelihood
-2 log of the likelihood (-2LL)
likelihood
-2LL 36
Likelihood-Ratio Test Likelihood
(
R2 Multiple Regression) -2LL
Likelihood Ratio Logistic
Regression
Likelihood Ratio = -2LL(reduced model) - (-2LL(full model))
37
Likelihood-Ratio TestLikelihood-ratio statistic
Chi-square df = k (
)
Likelihood ratio
Ha:
Ha: 0)
( full model)
3 age, race sex
( reduced model) age
race sex 0
38
Likelihood-Ratio Test
Warning: Liklihood-ratio Test
reduced model full model reduced model
(nested) full model
39
Are All Coefficients Zero?
0 -2LL full
model reduced model
Intercept
40
Constant-Only Model
Logistic Regression Intercept -2LL 70.25
intercept
Iteration Historya,b,c
70.253 -.491
70.252 -.501
70.252 -.501
Iteration1
2
3
Step0
-2 Loglikelihood Constant
Coefficients
41
Constant-Only Model
Logistic Regression Intercept -2LL 70.25
intercept
Iteration Historya,b,c
70.253 -.491
70.252 -.501
70.252 -.501
Iteration1
2
3
Step0
-2 Loglikelihood Constant
Coefficients
Tip: SPSS -2LL “iteration
history”42
Full Model
-2LL = 48.126
Model Summary
48.126a .341 .465
Step1
-2 Loglikelihood
Cox & SnellR Square
NagelkerkeR Square
Estimation terminated at iteration number 5 becauseparameter estimates changed by less than .001.
a.
43
Model Change
Model Summary
48.126a .341 .465
Step1
-2 Loglikelihood
Cox & SnellR Square
NagelkerkeR Square
Estimation terminated at iteration number 5 becauseparameter estimates changed by less than .001.
a.
Likelihood Ratio
0
-2LL
-2LL 70.252
48.126
Iteration Historya,b,c
70.253 -.491
70.252 -.501
70.252 -.501
Iteration1
2
3
Step0
-2 Loglikelihood Constant
Coefficients
Constant is included in the model.a.
Initial -2 Log Likelihood: 70.252b.
Estimation terminated at iteration number 3 becauseparameter estimates changed by less than .001.
c.
44
Model Change
22.126 column Chi-square df
Chi-square 6 (
) 1 ( 1
Reduced Model)
(p-value < 0.001)
0
Omnibus Tests of Model Coefficients
22.126 5 .000
22.126 5 .000
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
-2LL
45
Evaluating Sequential Changes
Chi-square Step, Block Model
Omnibus Tests of Model Coefficients
22.126 5 .000
22.126 5 .000
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
-2LL
46
Evaluating Sequential Changes
Omnibus Tests of Model Coefficients
22.126 5 .000
22.126 5 .000
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
-2LL
Model chi-square: -2LL
-2LL (
likelihood
) Model Chi-square
0
( F test Multiple Regression)
47
Evaluating Sequential Changes
Omnibus Tests of Model Coefficients
22.126 5 .000
22.126 5 .000
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
-2LL
Block chi-square: -2LL
reduced model
dialog box Block
blocks
block Block Chi-square
Model Chi-square block Block
Chi-square 48
Evaluating Sequential Changes
Omnibus Tests of Model Coefficients
22.126 5 .000
22.126 5 .000
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
-2LL
Step chi-square: -2LL
0 Step Chi-square
F-test Stepwise Multiple
Regression
49
Evaluating Sequential Changes
Omnibus Tests of Model Coefficients
22.126 5 .000
22.126 5 .000
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
-2LL
2
5
Model Chi-square Block Chi-square Step Chi-square
forward backward
Chi-square
50
Summary Measures
R2 Multiple Regression
R2
R2
Logistic Regression
R2 Multiple Regression Model
2
Logistic Regression Cox and Snell R2
Nagelkerke R2
51
Summary Measures
Cox and Snell R2 Nagelkerke R2
R2 Multiple Regression Logistic
Regression (Least
Square Method)
Model Summary
48.126a .341 .465
Step1
-2 Loglikelihood
Cox & SnellR Square
NagelkerkeR Square
Estimation terminated at iteration number 5 becauseparameter estimates changed by less than .001.
a.
-2LL 48.126
52
Cox and Snell R2
= Likelihood
= likelihood
N =
1
Cox and Snell R2
R2 = 1 - N
BL
L/2
)(
)0(
)0(L
)(BL
53
Nagelkerke R2
R2 = MAXR
R2
2
NMAX LR /22 )]0([1
Nagelkerke (1991) Cox and Snell R2
1 Standardize
Cox and Snell R2
Nagelkerke R2 47%
Logistic Regression
Model Summary
48.126a .341 .465
Step1
-2 Loglikelihood
Cox & SnellR Square
NagelkerkeR Square
Estimation terminated at iteration number 5 becauseparameter estimates changed by less than .001.
a.
54
Summary Measures
Tip: Logistic
Linear Regression Logistic Regression
R2 Logistic Model
55
Tests for a Coefficient
Multiple Linear Regression
Regression 0
R2
Logistic Regression Model
SPSS 3 Wald Statistic, the
Likelihood-ratio Test, Score Statistic
56
Wald Statistic 0
Wald Statistic
Chi-square
Wald = 2
..ES
B
B =
S.E. =
df = 1 Wald Statistics
(category data) df 1 Wald
Statistics df 1
57
Wald Statistic
age -0.0693
0.058 ( S.E. 15-2)
Wald Statistics 1.43 (-0.0693/0.058)2
Wald Statistic Chi-
square 15-2 Sig
xray stage 0
0.05
Variables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
58
Wald Statistic
Warning:
Wald Statistic
(absolute)
Wald Statistic
0
Wald Statistic
log-likelihood (Hauck and Donner 1997)
59
Likelihood-Ratio Test
Omnibus Tests of Model Coefficients
4.432 1 .035
4.432 1 .035
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
0
the likelihood-ratio test
Full Model
Reduced Model Full Model
Chi-square likelihood-ratio test
stage 0
Chi-square 4.43 4.048
Wald StatisticVariables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
60
Likelihood-Ratio Test
Step Block
stage 0 Model Chi-square
stage
0
Sig (p-value) 0.5
Omnibus Tests of Model Coefficients
4.432 1 .035
4.432 1 .035
22.126 5 .000
Step
Block
Model
Step 1Chi-square df Sig.
Variables in the Equation
.024 .013 3.423 1 .064 1.025 .999 1.051
-.069 .058 1.432 1 .231 .933 .833 1.045
2.045 .807 6.421 1 .011 7.732 1.589 37.615
.761 .771 .976 1 .323 2.141 .473 9.700
1.564 .774 4.084 1 .043 4.778 1.048 21.783
.062 3.460 .000 1 .986 1.064
acid
age
xray
grade
stage
Constant
Step1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0% C.I.for EXP(B)
Variable(s) entered on step 1: acid, age, xray, grade, stage.a.
61
Score Statistic
Rao’s efficient score statistic
0
Maximum Likelihood
SPSS
Variables not in the Equation
3.117 1 .077
1.094 1 .296
11.283 1 .001
4.075 1 .044
7.438 1 .006
19.451 5 .002
acid
age
xray
grade
stage
Variables
Overall Statistics
Step0
Score df Sig.
62
Score Statistic
Score Statistic
0
Output Logistic Regression
Overall Statistics log-likelihood
statistic Likelihood-ratio Statistic, Wald Statistic Rao’s Efficient Score
statistic
(Rao, 1973)
Variables not in the Equation
3.117 1 .077
1.094 1 .296
11.283 1 .001
4.075 1 .044
7.438 1 .006
19.451 5 .002
acid
age
xray
grade
stage
Variables
Overall Statistics
Step0
Score df Sig.
Overview of Diagnostic TestingOverview of Diagnostic TestingAnd PrecisionAnd Precision
OutlineOutline
1) Diagnostic Testing
- General Concepts, Sensitivity, Specificity- Receiver Operating Characteristic (ROC) Curves- Area Under (ROC) Curve - AUC- Positive / Negative Predictive Value- Positive / Negative Likelihood Ratio- Implementation
2) Precision / Sample Size Considerations2) Precision / Sample Size Considerations
- Independent Samples Comparing (+) Proportions- Paired Samples
Diagnostic Testing:Diagnostic Testing:
General Concepts, Sensitivity, SpecificityGeneral Concepts, Sensitivity, Specificity
Underlying Goal of Diagnostic TestingUnderlying Goal of Diagnostic Testing
To accurately identify patients:
- disease vs. not-diseased (healthy)disease vs. not diseased (healthy)- case vs. control- normal vs. abnormal- binary comparisons
Through some diagnostic test:
blood s gar- blood sugar- Biomarkers- CT- MRI
Diagnostic Test Results Diagnostic Test Results Often on a Continuous ScaleOften on a Continuous Scale
Pts withPts withPts without thePts without the Pts with Pts with disease (D)disease (D)
Pts without the Pts without the disease (H)disease (H)
Diagnostic Test Result
Consider a CutConsider a Cut--point point
Call these patients “negative” (-) Call these patients “positive” (+)
Diagnostic Test Result
without the disease (H)with the disease (D)
with disease
no disease
total
CutCut--point Gives Rise to a point Gives Rise to a 2 2 x x 2 2 tabletable
Consider 1000 patients: 500 with a disease, 500 without
test + 400
50 450
test - 100 450 550
t t l 500 500 1000total 500 500 1000
with disease
no disease
total
CutCut--point Gives Rise to a point Gives Rise to a 2 2 x x 2 2 tabletable
Consider 1000 patients: 500 with a disease, 500 without
test + True Positive
False Positive
450
test - False Negative
True Negative
550
t t l 500 500 1000total 500 500 1000
4 4 possible types of diagnostic testing classifications: possible types of diagnostic testing classifications: 2 2 correct, correct, 2 2 incorrectincorrect
Call these patients “negative” (-) Call these patients “positive” (+)
Correct Classification: True Positive (TP) Correct Classification: True Positive (TP)
True Positives
without the disease (H)with the disease (D)
Diagnostic Test Result
with no
Definition: Sensitivity = TP / (TP + FN)= probability of a “+” test among diseased
Correct Classification: True Positive (TP) Correct Classification: True Positive (TP)
with disease
no disease total
test + 400 50 450
test - 100 450 550
total 500 500 1000
Example: Sensitivity = 400 / 500 = 80%
Correct Classification: True Negative (TN) Correct Classification: True Negative (TN)
Call these patients “negative” (-) Call these patients “positive” (+)
True negatives
Diagnostic Test Result
without the disease (H)with the disease (D)
with no
Definition: Specificity = TN / (FP + TN)= probability of a “-” test among healthy
Correct Classification: True Negative (TN) Correct Classification: True Negative (TN)
with disease
no disease total
test + 400 50 450
test - 100 450 550
total 500 500 1000
Example: Specificity = 450 / 500 = 90%
Incorrect Classification: False Positive Incorrect Classification: False Positive
Call these patients “negative” (-) Call these patients “positive” (+)
False Positives
Diagnostic Test Result
without the disease (H)with the disease (D)
with no
Definition: “False Positive Fraction” = FP / (FP + TN)= probability of “+” test among healthy
Incorrect Classification: False Positive Incorrect Classification: False Positive
with disease
no disease total
test + 400 50 450
test - 100 450 550
total 500 500 1000
Example: FPF = 50 / 500 = 10%
Incorrect Classification: False Negative Incorrect Classification: False Negative
Call these patients “negative” (-) Call these patients “positive” (+)
False negatives
Diagnostic Test Result
without the disease (H)with the disease (D)
with no
Definition: “False Negative Fraction” = FN / (FN + TP)= probability of “-” test among diseased
Incorrect Classification: False Negative Incorrect Classification: False Negative
with disease
no disease total
test + 400 50 450
test - 100 450 550
total 500 500 1000
Example: FNF = 100 / 500 = 20%
We Choose the Cut-point
“ “ -- ““ “ “ ++ ““
Diagnostic Test Result
without the disease (H)with the disease (D)
Deciding on a cut-point is generally a balance of:
(2) Minimizing the probability of incorrect classifications
(1) Maximizing the probability of correct classifications
i.e. sensitivity and specificity
(2) Minimizing the probability of incorrect classifications
i.e. false positive fraction and false negative fraction
(3) Whether a high sensitivity is most important:
i.e. screening for AIDS virus at a blood blank
(4) Or whether a high specificity is most important:
i.e. testing for “presence of death” prior to autopsy
Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:
Receiver Operating Characteristic Receiver Operating Characteristic (ROC) Curve(ROC) Curve
100%
ROC Curve: Sensitivity against ROC Curve: Sensitivity against 1 1 -- SpecificitySpecificity
Consider a range of cut-points:
Sen
sitiv
ity
of cut points:
0%
1 - Specificity0% 100%
Consider a range of cut-points:
100%
ROC Curve: Sensitivity against ROC Curve: Sensitivity against 1 1 -- SpecificitySpecificity
of cut points:
Sen
sitiv
ity
0% 100%
0%
1 - Specificity
Best Test: Worst test:
100% 100%
Se
nsi
tivity
0%
1 - Specificity0%
100%
Se
nsi
tivity
0%
1 - Specificity0%
100%
Complete overlap of test results between diseased and non-diseased patients
No overlap of test results between diseased and non-diseased patients
100%100%
A good test: A less good test:
Se
nsi
tivity
0%
1 - Specificity0%
100%
Se
nsi
tivity
0%
1 - Specificity0%
100%
A Measure of Overall Usefulness of a Test:A Measure of Overall Usefulness of a Test:
AUC = Area Under (ROC) CurveAUC = Area Under (ROC) Curve
vity
100%
ity
100%
AUC of Four ROC Curves
100%
Se
nsi
tiv
0%
1 - Specificity0%
100%
Se
nsi
tiv
0%
1 - Specificity0%
100%
100%100%
50%
Se
nsi
tivity
0%
1 - Specificity0%
100%
Se
nsi
tivity
0%
1 - Specificity0%
100%
90%65%
AUC: Interpretation
Randomly select a diseased patient and get a score of Y.
Now, randomly select a healthy patient and get a score of X.
then,
AUC = Probability that Y is bigger than X (assume larger test values associated with disease)
Rough AUC Guidelines: Swets, J.A. (1988)
0.50 - 0.60 - Not So Good Science, 1285 - 1993
0.60 - 0.75 - Fair0.75 - 0.90 - Good0.90 - 0.97 - Very Good0.97 - 1.00 - Excellent
Diagnostic Testing:Diagnostic Testing:
Why Do We Care About Sensitivity?Why Do We Care About Sensitivity?Why Do We Care About Specificity?Why Do We Care About Specificity?
Ultimately, we are truly interested knowingUltimately, we are truly interested knowingwhether or not a positive test => disease or whether or not a positive test => disease or whether or not a positive test => disease, or whether or not a positive test => disease, or whether or not a negative test => healthy. whether or not a negative test => healthy.
Positive Predictive ValuePositive Predictive Value
Positive Predictive Value (PPV) Positive Predictive Value (PPV)
(sensitivity) x (prevalence)(sensitivity) x (prevalence)(se s t ty) (p e a e ce)(se s t ty) (p e a e ce)= = ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
(sensitivity)x(prevalence) + ((sensitivity)x(prevalence) + (11--specificity)x(specificity)x(1 1 -- prevalence)prevalence)
= probability of disease given a positive test= probability of disease given a positive test
Prevalence = Prevalence = prevalence of disease among population prevalence of disease among population who will be who will be testedtested..
=>=> Depending on how/where the test is used,Depending on how/where the test is used,prevalence can change the PPV of your test!prevalence can change the PPV of your test!
PPV vs. PrevalencePPV vs. Prevalence
PPV vs. PrevalencePPV vs. Prevalence
PPV vs. PrevalencePPV vs. Prevalence
Negative Predictive ValueNegative Predictive Value
Negative Predictive Value (NVP) Negative Predictive Value (NVP)
(specificity) x ((specificity) x (1 1 -- prevalence)prevalence)= = ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
((11--sensitivity)x(prevalence) + (specificity)x(sensitivity)x(prevalence) + (specificity)x(1 1 -- prevalence)prevalence)
= probability of not having disease given a negative test = probability of not having disease given a negative test
Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:Diagnostic Testing:
Positive / Negative Likelihood RatiosPositive / Negative Likelihood Ratios
Positive Likelihood RatioPositive Likelihood Ratio
Positive Likelihood Ratio:
sensitivityLR+ = ------------------------
In our example:
0.8= ------------ = 8.0
Indicates:- How much odds of disease is increased if test is positive- A ratio of something that is desirable (true positives)
divided by something undesirable (false positives)
G l G id li
1 - specificity 1 - 0.9
General Guidelines:1 => Test is Useless1 - 2 => Rarely important change in pre- to post test odds2 - 5 => Small Change5 - 10 => Moderate Change>10 => Large Change
Negative Likelihood RatioNegative Likelihood Ratio
Negative Likelihood Ratio:
1 - sensitivityLR- = ------------------------
In our example:
1 - 0.8= ------------ = 0.22
Indicates:- How much odds of disease is decreased if test is negative- A ratio of something that is undesirable (false negatives)
divided by something desirable (true negatives)
G l G id li
specificity 0.9
General Guidelines:0.5 - 1 => Rarely important change in
pre- to post test odds0.2 - 0.5 => Small Change0.1 - 0.2 => Moderate Change0 - 0.1 => Large change in pre- to post test odds
Diagnostic Testing:Diagnostic Testing:
ImplementationImplementation
ImplementationImplementation
An excellent web tool is available from Dept. of RadiologyJohns Hopkins U.
www jrocfit orgwww.jrocfit.org
Citation:Eng, J. (n.d.). ROC analysis: web-based calculator for ROC curves. Retrieved 10/2006, from http://www.jrocfit.org.
Other packages include:
Stata - ROC curves, AUC, testsExcel - ROC curves with manual data manipulationSPSS - ROC curves, AUC, tests
Implementation Implementation -- A Quick ExampleA Quick Example
Question of Interest:
Does the length of an abnormal T2 signal (median nerveDoes the length of an abnormal T2 signal (median nerve measured via MRI) predict who will benefit from surgery among patients with carpal tunnel syndrome.
Data:
65 patients ho “benefit” from s rger65 patients who “benefit” from surgery76 patients who did not “benefit” from surgery
141 MRI measures of median nerve:T2 abnormal signal length (mm.)
Implementation Implementation -- A Quick ExampleA Quick Example
Implementation Implementation -- A Quick ExampleA Quick Example
Data entered into www.jrocfit.org:
Benefit Signal Length…0 27.70 10.00 22.70 14.41 17 81 17.81 46.41 7.01 20.4…
Implementation Implementation -- A Quick ExampleA Quick Example
Output from www.jrocfit.org:
Measure of Overall “Diagnosticity”:Measure of Overall Diagnosticity :AUC = 0.71 (95% CI: 0.63 - 0.79)
Precision:Precision:
Sample Size ConsiderationsSample Size Considerations
Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Two Independent SamplesTwo Independent Samples
Pretend we’d like to compare a new imaging technologyto the gold standard.to the gold standard.
- In other patients, the gold standard was shown to positivelyidentify abnormalities 80% (P1) of the time.
- You believe that the new imaging technology will do betterthan 80% -- possibly positively identifying 85%, 90%, or even 95% of abnormalities (P2).
- How many patients do you need to show that your newimaging technology is better? (P2 - P1 > 0)
Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Two Independent SamplesTwo Independent Samples
Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Paired SamplesPaired Samples
NOW pretend we’d like to compare this new imaging technology to the gold standard, BUT on the same patients.
- Again, the gold standard was shown to positivelyidentify abnormalities 80% (P1) of the time.
- Assume that the new imaging technology positive identifies 90% of abnormalities (P2).
NOW how many patients do you need to show that the- NOW, how many patients do you need to show that the new imaging technology is better? (P2 - P1 > 0)
Comparing Proportions with Positive Test:Comparing Proportions with Positive Test:Paired SamplesPaired Samples