multi-step polynomial regression method to model and forecast malaria incidence
TRANSCRIPT
-
8/14/2019 Multi-step polynomial regression method to model and forecast malaria incidence
1/26
Predicting Malaria Incidence in Chennai
By
Chandrajit Chatterjee
M Sc (I), Statistics
University of Madras, Chennai
-
Introduction to malaria
Malaria is a communicable disease caused by parasitic infection (mostly by the parasite Plasmodium falciparum).
It causes 300-500 million cases of infection around the world and kills 1.5 to 2.7 million people each year.
Drug resistance of the parasite and the lack of a single, universally accepted control measure around the world have aggravated the global situation of malaria incidence.
The premier cause of concern is the high incidence rate of the disease in children below 5
-
Four species of pathogens cause the major part of malaria around the world, namely Plasmodium falciparum, P. vivax, P. malariae and P. ovale.
In India and other warmer parts of the world the falciparum species is far more predominant than its counterparts (death is caused by cerebral malaria and renal failure).
The parasite is transmitted from person to person by mosquitoes of the genus Anopheles.
Out of the 422 existing Anopheles species worldwide, only 40 are important from
-
MALARIA IN INDIA
Malaria is prevalent in all parts of the country except
in areas 5000 feet above sea-level.
In India, malaria has been consistently high since 1993, except for the period between 1995 and 1999 (Fig. 1).
Fig 1: THE STATE OF MALARIA IN INDIA FROM 1961 TO 2006
-
Malaria control: need of a model
A complete orientation towards eradication requires understanding the causative factors, their degree of influence and the disease transmission dynamics over a horizon; that is where the need for a model arises.
New age modeling probably began with Harvey in the 17th century, when he used quantitative reasoning to prove the circulation of blood.
There are 2 broad methods of modeling: one using abstract differential equations (mathematical modeling) and the other using data on previous incidences and related causes (data-based or statistical modeling).
Our aim:
We aim at establishing a simple method that bypasses the
-
The Data:
Malaria incidence data were obtained from the Corporation of Chennai.
(I) The first data set (large: over 30 time points) consists of the monthly slide positivity rate (SPR) of malaria (all types) in Chennai over 37 months from Jan 2002 to Jan 2005.
(II) The second data set (small: less than 30 time points) consists of deaths due to Plasmodium vivax distributed over the 10 zones of Chennai city for the 12 months of the year 2006, and the population in these months for all zones.
Climatic data were obtained for the same time points from the following websites:
www.waterportal-india.org
www.wunderground.com
www.imd.ernet.in
www.worldweather.com
-
Contd.
The population of Chennai city for the period between Jan 2002 and Jan 2005 was derived from census data from 1901 to 2001, taken from the website www.gisd.tn.nic.in/census-paper1,census-paper2.
A third order polynomial was fitted to this data to impute the population at the required time points, as in Fig 2.
Fig 2: IMPUTATION OF POPULATION (Dashed-observed; Straight line-predicted).
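The cubic imputation of population can be sketched as below. This is only a minimal illustration, assuming NumPy is available; the census counts are made-up placeholder numbers (not the Chennai census figures), and the names `census_years`, `census_pop` and `impute_population` are hypothetical.

```python
import numpy as np

# Hypothetical decennial census counts (illustrative numbers only, NOT the
# Chennai census figures used in the slides).
census_years = np.array([1951, 1961, 1971, 1981, 1991, 2001], dtype=float)
census_pop = np.array([1.4e6, 1.7e6, 2.5e6, 3.3e6, 3.8e6, 4.3e6])

# Centre and scale the years so the cubic fit is numerically well conditioned.
year0, scale = census_years.mean(), census_years.std()
coeffs = np.polyfit((census_years - year0) / scale, census_pop, deg=3)

def impute_population(year_fraction):
    """Evaluate the fitted third order polynomial at a (possibly fractional) year."""
    return np.polyval(coeffs, (year_fraction - year0) / scale)

# Monthly time points from Jan 2002 to Jan 2005 (37 months) as year fractions.
months = 2002 + np.arange(37) / 12.0
monthly_pop = impute_population(months)
```

Because the required months lie after the last census year, the polynomial is being extrapolated here, so the imputed values are only as reliable as the cubic trend itself.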
-
The method in brief
Our methodology consists of the following steps:
Selection of variables
Identification of relationship between the model variables
Initial identification of the model
Model refinement
Prediction and forecasting with the help of the model equation
Testing the correctness of prediction with standard methods
Tools of analysis: Microsoft Excel 2000 & 2007, SPSS for Windows 11.0, MATLAB 7.5.0.
-
Variable selection and identification of relationship
The procedure begins with variable selection:
The first order models are considered first, i.e. simple linear regression between the dependent variable and each factor individually; we select the first variable purely on the basis of the coefficient of determination*.
In the next step we consider a regression model of order 2 with the first variable already in the model and, based on the partial t-statistic**, choose the second variable. We go on increasing the order with this procedure until all variables are exhausted or no other variable satisfies the criterion, whichever is earlier.
* Coefficient of determination: a measure of goodness of regression fit, denoted as R-square.
** Partial t-statistic: a t test to determine the influence of a particular variable in a multi-variable model.
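The selection procedure above can be sketched roughly as follows. This is a minimal sketch assuming NumPy; the slides do not state the cut-off applied to the partial t-statistic, so `t_threshold=2.0` is an assumed value, and `ols` and `forward_select` are hypothetical helper names.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit with intercept; returns coefficients, R^2 and t-statistics."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2, t_stats

def forward_select(candidates, y, t_threshold=2.0):
    """candidates: dict name -> 1-D array. t_threshold is an assumed cut-off."""
    # Step 1: pick the variable whose simple linear regression has the highest R^2.
    first = max(candidates, key=lambda k: ols(candidates[k][:, None], y)[1])
    selected = [first]
    remaining = [k for k in candidates if k != first]
    while remaining:
        # Step 2: partial t-statistic of each remaining variable,
        # given the variables already in the model.
        scores = {}
        for name in remaining:
            X = np.column_stack([candidates[k] for k in selected + [name]])
            scores[name] = abs(ols(X, y)[2][-1])
        best = max(scores, key=scores.get)
        if scores[best] < t_threshold:
            break  # no other variable satisfies the criterion
        selected.append(best)
        remaining.remove(best)
    return selected
```

The function returns the variable names in the order they entered the model, which is the order the later refinement step would work through.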
-
Discussions - Analysis I - Slide Positivity Rates:
After variable selection we have:
Rainfall
Maximum temperature
Minimum humidity
Population.
The optimum relations between the dependent and the selected independent variables are then determined from a study of the following scatters.
(In this method we chose trial and error to determine the functional forms of the individual relationships.)
The model refinement methodology was then run to attain the final model.
-
FIG 3: SCATTER BETWEEN log(SPR) AND log(Max Temp). A 4th order polynomial explains the relationship between the variables in Fig 3 with an R-square of 49%.
FIG 4: SCATTER BETWEEN log(SPR) AND log(Rainfall). A 3rd order polynomial explains 18% of the variability between the two.
FIG 5: SCATTER BETWEEN log(SPR) AND Population
FIG 6: SCATTER BETWEEN log(SPR) AND Min Temp
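The trial-and-error search for a functional form can be illustrated by fitting polynomials of increasing order to one scatter and comparing their R-square values. The sketch below assumes NumPy and uses synthetic stand-in data, not the Chennai series; `poly_r2` is a hypothetical helper name.

```python
import numpy as np

def poly_r2(x, y, order):
    """R^2 of a least-squares polynomial of the given order fitted to (x, y)."""
    xc = x - np.mean(x)                      # centring keeps the fit well conditioned
    coeffs = np.polyfit(xc, y, deg=order)
    fitted = np.polyval(coeffs, xc)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (hypothetical, not the data behind the figures).
rng = np.random.default_rng(0)
x = np.linspace(3.3, 3.7, 36)                # e.g. a log-transformed climate variable
y = 2.0 - 5.0 * (x - 3.5) ** 2 + rng.normal(0.0, 0.02, x.size)

r2_by_order = {order: poly_r2(x, y, order) for order in range(1, 5)}
```

Since plain R-square can only rise as the order increases, a jump followed by a plateau is the informative pattern when reading `r2_by_order`.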
-
Model construction and refinement
We deviate here from other similar works in the field in that we consider a non-linear model as it is, while keeping the methodology simple and not having to pre-define the functional form of the model, since it is determined in the process.
In the model refinement stage of our multi-step regression procedure we have considered each functional form of an independent variable as a separate independent variable.
When R2 is not a plausible option for comparing the predictions made, we have taken the adjusted R2* as the criterion of comparison.
We then construct and refine the model based on the principle of progressive improvement of the residual sum of squares in the stepwise induction of variables, as depicted in the next flowchart.
* Adjusted R2 is a refinement over R2 in that it uses the unbiased estimators for the sums of squares used in R2; it is calculated as 1-[(1-R2)*(n-1)/(n-p-1)], where n is the number of data points and p is the number of independent variables in the model.
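The adjusted R2 formula in the footnote is easy to check numerically. The short sketch below uses the figures the slides report for Analysis II (R2 of 82.565% over 10 time points); taking p = 5, the number of terms in that analysis's final equation, is an assumption about how p was counted. `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Worked check with the Analysis II figures reported in the slides:
# R^2 = 82.565 %, n = 10 time points, p = 5 terms (assumed count).
print(round(adjusted_r2(0.82565, 10, 5), 4))  # 0.6077
```

Under that assumption the result agrees with the 60.77% adjusted R2 quoted for Analysis II.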
-
Flowchart of model refinement:
Start with the model containing the initial variable (say Xi), the one with the highest R-square.
Induce Xi^2. Is R2 increasing? YES: accept Xi^2. NO: remove Xi^2.
Induce all higher orders of Xi (one by one), as defined by its relation with the dependent variable, dependent on R2.
Induce Xj. Is R2 increasing? YES: accept Xj. NO: remove Xj.
Induce all higher orders of Xj (one by one) as before, dependent on R2.
Are all variables exhausted? NO: induce a new variable and its function-forms. YES: STOP inducing.
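The refinement loop of the flowchart can be sketched in a few lines. This is a loose sketch assuming NumPy, not the authors' implementation: plain R2 never decreases when a term is added to a least-squares fit, so the sketch compares adjusted R2 instead (the criterion the slides name for such comparisons), and it tries every power of each variable regardless of whether the lower power was kept. The names `r2_of`, `adjusted` and `refine` are hypothetical.

```python
import numpy as np

def r2_of(columns, y):
    """R^2 of an ordinary least-squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def adjusted(r2, n, p):
    """Adjusted R^2 for n data points and p terms."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def refine(variables, max_orders, y):
    """variables: dict name -> 1-D array, in the order given by variable selection.
    max_orders: dict name -> highest power suggested by that variable's scatter.
    Returns the accepted (name, power) terms."""
    n = len(y)
    accepted, columns, best = [], [], -np.inf
    for name, x in variables.items():
        for power in range(1, max_orders[name] + 1):
            trial = columns + [x ** power]
            score = adjusted(r2_of(trial, y), n, len(trial))
            if score > best:        # "Is R2 increasing?" YES: accept the term
                accepted.append((name, power))
                columns, best = trial, score
            # NO: the trial term is simply not kept ("remove")
    return accepted
```

Each accepted power is treated as its own regressor, which mirrors the slides' statement that every functional form of a variable is considered a separate independent variable.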
-
In this particular analysis the final model after refinement is:
\[
\begin{aligned}
\log(\mathrm{SPR}) ={} & -26.492 + 5.343\times10^{-21}\,(\mathrm{POPULATION})^{3} \\
& + 22.008\,\log(\mathrm{MAX\ TEMP}) - 1.652\,[\log(\mathrm{MAX\ TEMP})]^{4} \\
& + 1.097\times10^{-2}\,(\mathrm{MIN\ TEMP}) - 5.24\times10^{-7}\,(\mathrm{MIN\ TEMP})^{3} \\
& - 5.03\times10^{-2}\,\log(\mathrm{RAINFALL}) + 9.128\times10^{-2}\,[\log(\mathrm{RAINFALL})]^{2} - 2.45\times10^{-2}\,[\log(\mathrm{RAINFALL})]^{3}
\end{aligned}
\]
NOTE: One has to get back SPR values by exponential calculation.
Results:
The model gave an R2 of 44.16% and an adjusted R2 of 42.51%, which says that the model will not vary much in its degree of prediction when applied to a new data set of a similar kind.
ANOVA indicates an F value of 2.720 (F(sig) = 0.024), which is significant at the 5% significance level, indicating a good explanatory nature of the regression.
95% confidence intervals over all 36 time points contain the predicted responses, indicating significantly correct response prediction (process slide 17).
The forecasted value of Jan 05 also falls in the 95% C.I. of the response, indicating a fairly good degree of forecast by the model.
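The fitted equation for log(SPR) and the back-transformation mentioned in the NOTE can be evaluated directly. The sketch below copies the coefficients from the slide; the base of the logarithm and the units of the inputs are not stated in the slides, so the natural log is assumed here and the inputs are assumed to be in whatever units the original fit used. The function names are hypothetical.

```python
import math

def predict_log_spr(population, max_temp, min_temp, rainfall):
    """log(SPR) from the fitted equation; log is ASSUMED to be the natural log."""
    log_tmax = math.log(max_temp)
    log_rain = math.log(rainfall)          # requires rainfall > 0
    return (-26.492
            + 5.343e-21 * population ** 3
            + 22.008 * log_tmax
            - 1.652 * log_tmax ** 4
            + 1.097e-02 * min_temp
            - 5.24e-07 * min_temp ** 3
            - 5.03e-02 * log_rain
            + 9.128e-02 * log_rain ** 2
            - 2.45e-02 * log_rain ** 3)

def predict_spr(population, max_temp, min_temp, rainfall):
    """Back-transform to SPR by exponentiation, as the NOTE on the slide says."""
    return math.exp(predict_log_spr(population, max_temp, min_temp, rainfall))
```

A month with zero recorded rainfall cannot be passed through `math.log`; how the original analysis handled such months is not stated, so the sketch leaves that case to the caller.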
-
FIG 7: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED VALUES
Legends:
Dashed series with circular markers-predicted values
Straight line with square markers-observed values
Green circle-forecasted value
Light blue square-observed
The error bars are constructed according to 95% confidence intervals of predicted values.
-
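Testing the model predictions:
Let a general regression model be:
\[ Y_i = \beta_0 + \sum_{j=1}^{n} X_{ij}\,\beta_j + \varepsilon_i, \qquad i = 1(1)m,\ j = 1(1)n \]
n = number of independent variables
m = number of data points
\[ S = \sum_i r_i^2 \] where r_i are the residuals.
Then by the Gauss-Markov GLM,
\[ \beta_{est} = (X^T X)^{-1} X^T Y \quad \text{and} \quad \sigma^2_{est} = S/(m-n) \]
Given independent variable values X_{dj} for a given time point d, and writing
\[ Z = (1, x_{d1}, \ldots, x_{dn})^T \]
the 100(1-\alpha)% C.I. for a predicted response at level \alpha is
\[ Z^T \beta_{est} \pm t_{\alpha/2;\,(m-n)} \cdot \sigma_{est} \cdot \sqrt{1 + Z^T (X^T X)^{-1} Z} \]
where X = ((X_{ij})), j = 1(1)n; i = 1(1)m, is the matrix of independent variables.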
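The interval for a predicted response can be computed with a few matrix operations. The sketch below assumes NumPy and a design matrix `X` whose first column is all ones, so that it matches the vector Z = (1, x_d1, ..., x_dn); the degrees of freedom are taken as m minus the number of columns of `X`, whereas the slide writes m-n. The t quantile is passed in by the caller (for example from a t table) to keep the sketch dependency-free. `prediction_interval` is a hypothetical name.

```python
import numpy as np

def prediction_interval(X, y, z, t_crit):
    """X: (m, k) design matrix whose first column is all ones.
    z: length-k vector (1, x_d1, ..., x_dn) for the time point of interest.
    t_crit: t quantile t_{alpha/2; m-k}, supplied by the caller."""
    m, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                   # beta_est = (X'X)^-1 X'Y
    resid = y - X @ beta
    sigma = np.sqrt(resid @ resid / (m - k))   # sigma_est^2 = S / (m - k)
    centre = z @ beta                          # predicted response Z' beta_est
    half = t_crit * sigma * np.sqrt(1.0 + z @ xtx_inv @ z)
    return centre - half, centre + half
```

An observed or forecasted value is then checked simply by testing whether it lies between the two returned bounds.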
-
Analysis II - Total P.V. deaths in Chennai city
We had data on 10 months of total deaths due to Plasmodium vivax, scaled by total population (hence our dependent variable) in Chennai, and we wanted to see the credibility of the model's forecasts for the next 2 time points.
We demonstrate here that our method can work equally well for smaller data sets for which normality may not be readily assumed (as it could be in the earlier case of the data set with 36 time points, which had good chances of tending towards normality).
In this analysis, with the same process, variable selection yielded:
Minimum Temperature
Maximum Temperature
Maximum Humidity
The initial relationships were studied through scatters and then the model was refined by the multi-step procedure.
-
Fig 8: A 6th order polynomial explains 85.8% of the relationship between P.V. deaths (scaled) and Min Temp.
Fig 9: A 4th order polynomial explains 30.6% of the relationship between scaled P.V. deaths and Max Temp.
Fig 10: A 3rd order polynomial tells 61.8% of the relationship
-
The final model equation was:
\[
\begin{aligned}
(\text{Scaled})\ \mathrm{PV\ deaths} ={} & 2.309\times10^{-4} + 8.313\times10^{-6}\,(\text{min temp}) + 7.375\times10^{-13}\,(\text{min temp})^{6} \\
& - 8.72\times10^{-7}\,(\text{max hum}) - 5.81\times10^{-11}\,(\text{max hum})^{3} - 2.98\times10^{-10}\,(\text{max temp})^{4}
\end{aligned}
\]
NOTE: After the predictions for scaled P.V. deaths are obtained from the model, one needs to multiply the values by the population total to obtain total P.V. deaths.
Results:
The model gave a coefficient of determination of 82.565% and an adjusted R2 of 60.77%, which as before implies the goodness of the model in application to a new data set.
ANOVA indicates an F value of 3.78 for the regression against a critical value of 0.11 at the 5% level, thereby signaling a good regressive nature of the model.
The 95% C.I. over the 10 time points all include the predicted P.V. deaths (except 1 value), and the observed values also lie in the C.I., thereby validating the confidence intervals at level 0.05.
The 95% C.I. contains both the forecasted values of November and
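As a small arithmetic check on the ANOVA line, the overall F value of a regression can be recovered from R2 with the standard identity F = (R2/p) / ((1-R2)/(n-p-1)). The sketch below plugs in the Analysis II figures from the slide; taking p = 5 (the number of terms in the final equation) and n = 10 is an assumption about how the analysis counted them, and `f_statistic` is a hypothetical name.

```python
def f_statistic(r2, n, p):
    """Overall regression F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# Analysis II figures from the slide: R^2 = 82.565 %, n = 10, p = 5 (assumed).
print(round(f_statistic(0.82565, 10, 5), 2))  # 3.79
```

Under those assumptions the value agrees with the reported F of 3.78 up to rounding.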
-
FIG 11: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED VALUES.
Legends:
Dashed series with circular markers-predicted values
Straight line with square markers-observed values
Green circle-forecasted value
Blue square-observed values in forecasting horizon
The error bars are constructed according to 95% confidence intervals of predicted values.
-
A special study of individual zones
In order to analyze the zones separately we would need to construct a model for each zone separately. However, that is cumbersome and may be redundant.
We hence did a Factor Analysis (FA) on the deaths due to P.V. in the ten zones (scaled by population) together with the climatic factors, with Principal Component Analysis as the extraction method. The data set was normalized before the analysis, whose purpose is to reduce the large analytical data set of intercorrelated variables into a smaller and interpretable set of factors.
Subsequently we did a hierarchical Cluster Analysis to reconfirm the results from FA.
The clusters of variables obtained from this were used to frame models with our methodology, and the predictions were tested through the same method as before to yield fits as shown later.
The Dendrogram in cluster analysis looks as below from which two
-
Where
HT - highest temperature,
TL - lowest temperature,
LH - lowest humidity,
HH - highest humidity,
TR - total rainfall,
Z1-Z10 - scaled values of P.V. deaths over the 10 zones of Chennai city.
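The normalization and principal-component extraction used in the zone-level study can be sketched as follows. This is only an illustration assuming NumPy: it performs PCA on the correlation matrix of the 15 variables listed above (5 climatic factors and Z1-Z10), does not reproduce any factor rotation or the hierarchical clustering step, and `principal_components` is a hypothetical name.

```python
import numpy as np

def principal_components(data):
    """data: (observations, variables). Returns the eigenvalues (descending) and
    loadings of the correlation matrix, i.e. PCA on standardized variables."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)   # normalize columns
    corr = (z.T @ z) / (len(z) - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings: eigenvectors scaled by the square root of their eigenvalue.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, loadings
```

With 12 monthly observations and 15 variables the correlation matrix is rank deficient, so the trailing eigenvalues are numerically zero; variables that load strongly on the same leading component are the natural candidates for a common cluster.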
-
CLUSTER 1-TOT DEATHS OF ZONES 2,3,5,6,7,8-FITTED
Dashed series with circular markers-predicted
Straight line with square markers-observed
Error bars constructed on the basis of 95% C.I. of predicted values
-
CLUSTER 2-DEATHS OF ZONE 1-FITTED
Dashed series with circular markers-predicted
Straight line with square markers-observed
Error bars constructed on the basis of 95% C.I. of predicted values
-
CONCLUSION:
The method yielded good fits to both large and small data sets, with equally apt predictions, and for two different formats of disease incidence: one with SPR values and the other with deaths.
The response predictions of our model, for the majority of cases, lie in the 95% C.I. of the response predictions; it should be kept in mind that the tests we applied to the predicted responses have their basis in the most general format of linear models, the Gauss-Markov models.
The method may well be applicable to still larger data sets with several other variables, such as socio-economic factors and geographical influences, and one may follow the method to establish a regression model for future predictions of malaria incidence. As we have also shown, using methods like FA and cluster analysis and then adopting our method yields predictions of equal caliber as before.