Multi-step polynomial regression method to model and forecast malaria incidence

Upload: chandrajit85843

Post on 30-May-2018


  • 8/14/2019 Multi-step polynomial regression method to model and forecast malaria incidence

    Predicting malaria incidence in Chennai

    By

    Chandrajit Chatterjee

    M Sc (I), Statistics

    University of Madras, Chennai

    Introduction to malaria

    Malaria is a communicable disease caused by parasitic infection (mostly by the parasite Plasmodium falciparum).

    It causes 300-500 million cases of infection around the world and kills 1.5 to 2.7 million people each year.

    Drug resistance in the parasite and the absence of a single, universally accepted control measure have aggravated the global situation of malaria incidence.

    The primary cause of concern is the high incidence rate of the disease in children below 5


    Four species of pathogens cause the major part of malaria around the world, namely Plasmodium falciparum, P. vivax, P. malariae and P. ovale.

    In India and other warmer parts of the world the falciparum species is far more predominant than its counterparts (death is caused by cerebral malaria and renal failure).

    The parasite is transmitted from person to person by mosquitoes of the genus Anopheles.

    Out of the 422 existing Anopheles species worldwide, only 40 are important from

    MALARIA IN INDIA

    Malaria is prevalent in all parts of the country except in areas 5000 feet above sea level.

    In India, malaria has been on a constant high since 1993, except for the period between 1995 and 1999 (Fig. 1).

    Fig 1: THE STATE OF MALARIA IN INDIA FROM 1961 TO 2006


    Malaria control: need of a model

    A complete orientation towards eradication requires understanding the causative factors, their degree of influence and the disease transmission dynamics over a horizon; that is where the need for a model arises.

    New age modeling probably began with Harvey in the 17th century, when he used quantitative reasoning to prove the circulation of blood.

    There are 2 broad methods of modeling: one using abstract differential equations (mathematical modeling) and the other using data on previous incidences and related causes (data-based or statistical modeling).

    Our aim:

    We aim at establishing a simple method that bypasses the


    The Data:

    Malaria incidence data from the Corporation of Chennai.

    (I) The first data set (large: over 30 time points) consists of the monthly slide positivity rate of malaria (all types) in Chennai over 37 months from Jan 2002 to Jan 2005.

    (II) The second data set (small: less than 30 time points) consists of deaths due to Plasmodium vivax distributed over the 10 zones of Chennai city for the 12 months of the year 2006, and the population of these months for all zones.

    Climatic data were obtained for the same time points from the following websites:

    www.waterportal-india.org

    www.wunderground.com

    www.imd.ernet.in

    www.worldweather.com


    Contd.

    The population of Chennai city for the period between Jan 2002 and Jan 2005 was derived from census data for 1901 to 2001 from the website www.gisd.tn.nic.in/census-paper1,census-paper2.

    A third order polynomial was fitted to this data to impute the population at the required time points, as in Fig 2.

    Fig 2: IMPUTATION OF POPULATION (Dashed-observed; Straight line-predicted).
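The imputation step can be sketched as follows. This is only an illustration of fitting a third order polynomial to decennial census counts and reading off monthly values; the population figures below are hypothetical placeholders (not the actual Chennai census data), and the rescaled time axis `t` is an assumption made to keep the fit numerically stable.

```python
import numpy as np

# Census years 1901..2001 (decennial), as described on the slide.
census_years = np.arange(1901, 2002, 10, dtype=float)

# HYPOTHETICAL placeholder populations in millions, NOT the real census counts.
census_pop = np.array([0.5, 0.6, 0.7, 0.85, 1.1, 1.5, 1.9, 2.6, 3.3, 3.8, 4.3])

def to_t(year):
    """Rescale calendar years to decades since 1901 (assumed, for conditioning)."""
    return (year - 1901.0) / 10.0

# Fit a third order polynomial: 4 coefficients, highest power first.
coeffs = np.polyfit(to_t(census_years), census_pop, 3)

# Monthly time points from Jan 2002 to Jan 2005 inclusive (37 months).
months = 2002.0 + np.arange(37) / 12.0
imputed_pop = np.polyval(coeffs, to_t(months))
```

Evaluating beyond 2001 is an extrapolation of the fitted curve, so the imputed values depend strongly on the polynomial order chosen.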

    The method in brief

    Our methodology consists of the following steps:

    Selection of variables

    Identification of relationship between the model variables

    Initial identification of the model

    Model refinement

    Prediction and forecasting with the help of the model equation

    Testing the correctness of prediction with standard methods

    Tools of analysis: Microsoft Excel 2000 & 2007, SPSS for Windows 11.0, MATLAB 7.5.0.

    Variable selection and identification of relationship

    The procedure begins with variable selection:

    First-order models are considered first, i.e. simple linear regressions between the dependent variable and each factor individually; the first variable is selected purely on the basis of the coefficient of determination*.

    In the next step we consider a regression model of order 2 with the first variable already in the model and choose the second variable based on the partial t-statistic**. We go on increasing the order with this procedure until all variables are exhausted or no other variable satisfies the criterion, whichever is earlier.

    * Coefficient of determination: a measure of goodness of regression fit, denoted as R-square.

    ** Partial t-statistic: a t test to determine the influence of a particular variable in a multi-variable model.
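A minimal sketch of this selection procedure, under stated assumptions: the slide does not give the cut-off for the partial t-statistic, so the threshold `t_threshold=2.0` is hypothetical, as are the function names and the synthetic data used to exercise them.

```python
import numpy as np

def ols(y, X):
    """Least-squares fit; returns coefficients, standard errors and R^2."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, se, r2

def forward_select(y, factors, t_threshold=2.0):
    """factors: dict name -> 1-D array. Returns names in order of selection."""
    ones = np.ones(len(y))
    # Step 1: the single factor whose simple regression has the highest R^2.
    first = max(factors,
                key=lambda v: ols(y, np.column_stack([ones, factors[v]]))[2])
    selected = [first]
    remaining = [v for v in factors if v != first]
    # Later steps: add the candidate with the largest |partial t|, if it
    # exceeds the (assumed) threshold; stop when none qualifies.
    while remaining:
        best_name, best_t = None, t_threshold
        for v in remaining:
            cols = [factors[s] for s in selected] + [factors[v]]
            beta, se, _ = ols(y, np.column_stack([ones] + cols))
            t_stat = abs(beta[-1] / se[-1])
            if t_stat > best_t:
                best_name, best_t = v, t_stat
        if best_name is None:
            break
        selected.append(best_name)
        remaining.remove(best_name)
    return selected

# Synthetic check: y depends strongly on "a", moderately on "b", not on "c".
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 40))
y = 3.0 * a + 1.5 * b + rng.normal(scale=0.1, size=40)
order = forward_select(y, {"a": a, "b": b, "c": c})
```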

    Discussions - Analysis I - Slide Positivity Rates:

    After variable selection we have:

    Rainfall

    Maximum temperature

    Minimum humidity

    Population.

    The optimum relations between the dependent variable and the selected independent variables are then determined from a study of the following scatters.

    (In this method, the functional forms of the individual relationships were determined by trial and error.)

    The model refinement procedure was then run to arrive at the final model.


    FIG 3: SCATTER BETWEEN log(SPR) AND log(Max Temp)

    FIG 4: SCATTER BETWEEN log(SPR) AND log(Rainfall)

    A 4th order polynomial explains the relationship between the variables above with an R square of 49%. A 3rd order polynomial explains 18% of variability between the two in fig 3.

    FIG 5: SCATTER BETWEEN log(SPR) AND Population

    FIG 6: SCATTER BETWEEN log(SPR) AND Min Temp

    Model construction and refinement

    We deviate here from similar work in the field in that we consider a non-linear model as it is, while keeping the methodology simple and without having to pre-define the functional form of the model, which is determined in the process.

    In the model refinement step of our multi-step regression procedure, we have considered each functional form of an independent variable as a separate independent variable.

    When R2 is not a plausible option for comparing the predictions made, we have taken the adjusted R2* as the criterion of comparison.

    We then construct and refine the model based on the principle of progressive improvement of the residual sum of squares in the stepwise induction of variables, as depicted in the next flowchart.

    * Adjusted R2 is a refinement over R2 in that it uses the unbiased estimators for the sums of squares used in R2. It is calculated as 1 - [(1 - R2)*(n - 1)/(n - p - 1)], where n is the number of data points and p is the number of independent variables in the model.
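As a worked check of the adjusted R2 formula, the short sketch below plugs in the Analysis II figures reported later in the deck (R2 of 82.565%). It assumes n = 10 time points and p = 5 model terms, which is how the Analysis II data and final equation read; the function name is illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Analysis II: R^2 = 0.82565, assumed n = 10 data points and p = 5 terms.
value = adjusted_r2(0.82565, n=10, p=5)
print(round(100 * value, 2))  # 60.77
```

Under these assumptions the result agrees with the adjusted R2 of 60.77% quoted for Analysis II.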


    Start: model with the initial variable (say Xi), the one with the highest R-square.

    Induce Xi^2. Is R2 increasing? YES: accept Xi^2. NO: remove Xi^2.

    Induce all higher orders of Xi (one by one) as defined by its relation with the dependent variable, accepting or removing each depending on R2.

    Induce Xj. Is R2 increasing? YES: accept Xj. NO: remove Xj.

    Induce all higher orders of Xj (one by one) as before, depending on R2.

    Are all variables exhausted? YES: STOP inducing. NO: induce a new variable and its function-forms.
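The flowchart logic can be sketched roughly as below. Two points are assumptions rather than the authors' stated procedure: plain R2 never decreases when a term is added to a least-squares fit, so this sketch uses the adjusted R2 as the acceptance criterion; and the highest power tried for each variable (`max_order`) is supplied by the user, standing in for what the scatter plots suggest. All names and the synthetic data are illustrative.

```python
import numpy as np

def adjusted_r2(y, columns):
    """Adjusted R^2 of an OLS fit of y on the given columns plus an intercept."""
    n, p = len(y), len(columns)
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def refine(y, variables, max_order):
    """variables: dict name -> array, in the order given by variable selection.
    max_order: dict name -> highest power to try for that variable.
    Returns the accepted (name, power) terms and the final adjusted R^2."""
    accepted, best = [], -np.inf
    for name in variables:
        for power in range(1, max_order[name] + 1):
            trial = accepted + [(name, power)]
            cols = [variables[v] ** k for v, k in trial]
            score = adjusted_r2(y, cols)
            if score > best:          # criterion improved: accept the term
                accepted, best = trial, score
            # otherwise the term is removed again (simply not kept)
    return accepted, best

# Synthetic example: y depends on x1 and x1^3; x2 is irrelevant.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 2, 40)
x2 = rng.uniform(0, 2, 40)
y = 1 + 2 * x1 + 0.5 * x1 ** 3 + rng.normal(0, 0.1, 40)
terms, score = refine(y, {"x1": x1, "x2": x2}, {"x1": 3, "x2": 2})
```

Because each power is treated as its own regressor, the fit stays linear in the coefficients even though the model is non-linear in the original variables.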


    In this particular analysis the final model after refinement is:

    $$
    \begin{aligned}
    \log(\mathrm{SPR}) = {} & -26.492 + 5.343\times10^{-21}\,(\mathrm{POPULATION})^{3} \\
    & + 22.008\,\log(\mathrm{MAX\ TEMP}) - 1.652\,[\log(\mathrm{MAX\ TEMP})]^{4} \\
    & + 1.097\times10^{-2}\,(\mathrm{MIN\ TEMP}) - 5.24\times10^{-7}\,(\mathrm{MIN\ TEMP})^{3} \\
    & - 5.03\times10^{-2}\,\log(\mathrm{RAINFALL}) + 9.128\times10^{-2}\,[\log(\mathrm{RAINFALL})]^{2} - 2.45\times10^{-2}\,[\log(\mathrm{RAINFALL})]^{3}
    \end{aligned}
    $$

    NOTE: One has to get back SPR values by exponential calculation.

    Results:

    The model gave an R2 of 44.16% and an adjusted R2 of 42.51%, which suggests that the model will not vary much in its degree of prediction when applied to a new data set of a similar kind.

    ANOVA indicates an F value of 2.720 (F(sig) = 0.024), which is significant at the 5% significance level, indicating a good explanatory nature of the regression.

    The 95% confidence intervals over all 36 time points contain the predicted responses, indicating significantly correct response prediction (process slide 17).

    The forecasted value of Jan 05 also falls in the 95% C.I. of response, indicating a fairly good degree of forecast by the model.
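To make the back-transformation in the NOTE concrete, here is a small sketch that evaluates the fitted Analysis I equation and recovers SPR. The slide does not state the base of the logarithm or the units of the inputs, so base-10 logs are an assumption here (with natural logs one would use `math.log` and `math.exp` instead), and the input values in the example call are hypothetical.

```python
import math

def predict_spr(population, max_temp, min_temp, rainfall):
    """Evaluate the Analysis I model and back-transform log(SPR) to SPR.

    ASSUMPTION: 'log' is taken as base 10. max_temp and rainfall must be
    strictly positive, since their logarithms enter the model.
    """
    log_tmax = math.log10(max_temp)
    log_rain = math.log10(rainfall)
    log_spr = (
        -26.492
        + 5.343e-21 * population ** 3
        + 22.008 * log_tmax
        - 1.652 * log_tmax ** 4
        + 1.097e-02 * min_temp
        - 5.24e-07 * min_temp ** 3
        - 5.03e-02 * log_rain
        + 9.128e-02 * log_rain ** 2
        - 2.45e-02 * log_rain ** 3
    )
    return log_spr, 10 ** log_spr

# Hypothetical inputs, for illustration only.
log_spr, spr = predict_spr(population=4.5e6, max_temp=35.0,
                           min_temp=25.0, rainfall=100.0)
```

A month with zero recorded rainfall cannot be passed in directly, because the logarithm is undefined there; how the original analysis handled such months is not stated.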


    FIG 7: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED VALUES

    Legends:

    Dashed series with circular markers-predicted values

    Straight line with square markers-observed values

    Green circle-forecasted value

    Light blue square-observed

    The error bars are constructed according to 95% confidence intervals of predicted values.


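    Testing the model predictions:

    Let a general regression model be:

    $$Y_i = \beta_0 + \sum_{j} X_{ij}\,\beta_j + \varepsilon_i; \quad i = 1(1)m,\ j = 1(1)n$$

    n = number of independent variables

    m = number of data points

    $S = \sum_{i} r_i^2$, where $r_i$ are the residuals

    Then by Gauss Markov GLM

    $$\hat{\beta} = (X^T X)^{-1} X^T Y \quad \text{and} \quad \hat{\sigma}^2 = S/(m-n)$$

    Given independent variable values $X_{dj}$ for a given time point d, and writing $Z = (1, x_{d1}, \dots, x_{dn})$,

    the $100(1-\alpha)\%$ C.I. for a predicted response at level $\alpha$ is

    $$\left[\, Z^T\hat{\beta} \pm t_{\alpha/2;\,(m-n)}\; \hat{\sigma}\, \sqrt{1 + Z^T (X^T X)^{-1} Z} \,\right]$$

    where $X = ((X_{ij}))$, j = 1(1)n, i = 1(1)m, is the matrix of independent variables.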
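The interval on this slide can be computed along the following lines. This is a sketch under assumptions: the design matrix is taken to include a leading column of ones so that it matches Z = (1, x_d1, ..., x_dn), the degrees of freedom are taken as the number of rows minus the number of columns of that matrix, and the t quantile is passed in by the caller rather than looked up. The function name and the toy data are illustrative.

```python
import numpy as np

def prediction_interval(X, y, z, t_crit):
    """Interval for a predicted response at a new point.

    X      : (m, k) design matrix, first column all ones (assumed)
    y      : (m,) observed responses
    z      : (k,) regressor vector for the new point, leading entry 1
    t_crit : t quantile t_{alpha/2} with m - k degrees of freedom
    Returns (lower, centre, upper).
    """
    m, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                  # least-squares estimate
    resid = y - X @ beta
    sigma = np.sqrt(resid @ resid / (m - k))  # residual standard error
    centre = z @ beta
    half = t_crit * sigma * np.sqrt(1.0 + z @ xtx_inv @ z)
    return centre - half, centre, centre + half

# Toy check on noise-free data y = 1 + 2x: the interval collapses onto
# the prediction because the residuals are (numerically) zero.
x = np.arange(6.0)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
lower, centre, upper = prediction_interval(X, y, np.array([1.0, 5.0]),
                                           t_crit=2.776)  # t_{0.025; 4}
```

With real data the residuals are non-zero and the half-width grows with the leverage term, so points far from the bulk of the observed regressors get wider intervals.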

    Analysis II - Total P.V. deaths in Chennai city

    We had data on 10 months of total deaths due to Plasmodium vivax in Chennai, scaled by total population (hence our dependent variable), and we wanted to see the credibility of the model's forecasts for the next 2 time points.

    We demonstrate here that our method can work equally well for smaller data sets for which normality may not be readily assumed (unlike the earlier data set with 36 time points, which had a good chance of tending towards normality).

    In this analysis, with the same process, variable selection yielded:

    Minimum Temperature

    Maximum Temperature

    Maximum Humidity

    The initial relationships were studied through scatters and then the model was refined by the multi-step procedure.


    Fig 8: A 6th order polynomial explains 85.8% between P.V. deaths (scaled) and Min Temp.

    Fig 9: A 4th order polynomial explains 30.6% between scaled P.V. deaths and Max Temp.

    Fig 10: A 3rd order polynomial tells 61.8% of the relationship


    The final model equation was:

    $$
    \begin{aligned}
    (\text{Scaled})\ \mathrm{PV\ deaths} = {} & 2.309\times10^{-4} + 8.313\times10^{-6}\,(\text{min temp}) + 7.375\times10^{-13}\,(\text{min temp})^{6} \\
    & - 8.72\times10^{-7}\,(\text{max hum}) - 5.81\times10^{-11}\,(\text{max hum})^{3} - 2.98\times10^{-10}\,(\text{max temp})^{4}
    \end{aligned}
    $$

    NOTE: After the predictions for scaled P.V. deaths are obtained from the model, one needs to multiply the values by the total population to obtain total P.V. deaths.

    Results:

    The model gave a coefficient of determination of 82.565% and an adjusted R2 of 60.77%, which as before implies the goodness of the model in application to a new data set.

    ANOVA indicates a regression F value of 3.78 with a significance of 0.11 (not a critical value); since 0.11 > 0.05, the regression is not significant at the 5% level, so the fit should be read with caution for this small data set.

    The 95% C.I. over the 10 time points all include the predicted P.V. deaths (except 1 value), and the observed values lie in the C.I. as well, thereby validating the confidence intervals at level 0.05.

    The 95% C.I. contains both the forecasted values of November and


    FIG 11: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED VALUES.

    Legends:

    Dashed series with circular markers-predicted values

    Straight line with square markers-observed values

    Green circle-forecasted value

    Blue square-observed values in forecasting horizon

    The error bars are constructed according to 95% confidence intervals of predicted values.

    A special study of individual zones

    In order to analyze the zones separately we would need to construct a model for each zone. However, that is cumbersome and may be redundant.

    We hence did a Factor Analysis on the ten zones' data of deaths (scaled by population) due to P.V. together with the climatic factors, with Principal Component Analysis as the extraction method. The data set was normalized before the analysis, whose purpose is to reduce the large data set of intercorrelated variables to a smaller and interpretable set of factors.
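The extraction step can be illustrated with a plain principal component analysis on standardized data. This sketch covers only the PCA extraction, not factor rotation or the SPSS settings actually used, and the data matrix is random placeholder data shaped like the study (12 months by 15 variables: 10 zones plus 5 climatic factors), not the real observations.

```python
import numpy as np

def pca(data):
    """PCA on standardized columns.

    data : (time points, variables) array
    Returns the explained-variance ratio of each component and the
    loading vectors (one row per component).
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # normalize
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return explained, vt

# PLACEHOLDER data: 12 months x 15 variables (Z1-Z10 plus 5 climate series).
rng = np.random.default_rng(0)
data = rng.normal(size=(12, 15))
explained, loadings = pca(data)
```

Variables that load heavily on the same leading component are candidates for being grouped together, which is what the subsequent cluster analysis is used to reconfirm.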

    Subsequently we did a hierarchical Cluster Analysis to reconfirm the results from FA.

    The clusters of variables obtained from this were used to frame models with our methodology, and the predictions were tested through the same method as before to yield the fits shown later.

    The Dendrogram in cluster analysis looks as below from which two


    Where

    HT - highest temperature,

    TL - lowest temperature,

    LH - lowest humidity,

    HH - highest humidity,

    TR - total rainfall,

    Z1-Z10 - scaled values of P.V. deaths over the 10 zones of Chennai city.


    CLUSTER 1-TOT DEATHS OF ZONES 2,3,5,6,7,8-FITTED

    Dashed series with circular markers-predicted

    Straight line with square markers-observed

    Error bars constructed on the basis of 95% C.I. of predicted values

    CLUSTER 2-DEATHS OF ZONE 1-FITTED

    Dashed series with circular markers-predicted

    Straight line with square markers-observed

    Error bars constructed on the basis of 95% C.I. of predicted values

    CONCLUSION:

    The method yielded good fits to data sets both large and small, with equally apt predictions, and for two different formats of incidence of the disease: one with SPR values and the other with deaths.

    The response predictions of our model lie, in the majority of cases, in the 95% C.I. of the response predictions; the tests we applied to the predicted responses have their basis in the most general format of linear models, the Gauss Markov models.

    The method may well be applicable to still larger data sets with several other variables such as socio-economic factors, geographical influences, etc., and one may follow the method to establish a regression model for future predictions of malaria incidence. As we have also shown, using methods like FA and cluster analysis and then adopting our method yields predictions of equal caliber.
