
ORIGINAL PAPER

Mathematical models of climate evolution in Dobrudja

Alina Bărbulescu & Elena Băutu

Received: 28 May 2009 / Accepted: 31 May 2009 / Published online: 12 July 2009 © Springer-Verlag 2009

Abstract The understanding of the processes that occur in climate change evolution and of their spatial and temporal variations is of major importance in environmental sciences. Modeling these processes is the first step in the prediction of weather change. In this context, this paper presents the results of statistical investigations of monthly and annual meteorological data collected between 1961 and 2007 in Dobrudja (a region situated in the South–East of Romania between the Black Sea and the lower Danube River) and the models obtained using time series analysis and gene expression programming. Using two fundamentally different approaches, we provide a comprehensive analysis of temperature variability in Dobrudja, which may be significant in understanding the processes that govern climate changes in the region.

1 Introduction

Recent studies concerning temperature variations indicate an increase of 0.3°C to 0.7°C in surface air temperature since 1865 (Jones et al. 1986, 1999; Jones 1988). Climate modeling results showed an increase in annual temperature in Europe of 0.1°C to 0.4°C per decade over the twenty-first century, based on a range of scenarios and models, in which the biggest warming rate will be registered in Southern and Northeast Europe (IPCC 2008). One of the most affected Romanian areas is Dobrudja, where an increase of 4°C to 4.5°C is expected by the end of the century (Maftei and Bărbulescu 2008). In this context, the determination of a scenario for the temperature evolution is of major importance.

The objective of the current study is to determine models for the temperature variation. The complexity of modeling meteorological time series derives from the diversity of phenomena that affect the climate in general; therefore, the problem has been of substantial interest in the literature (Aksoy et al. 2008; Charles et al. 2004; Bărbulescu and Băutu 2009).

Time series modeling methods belong to two broad classes: classical statistical methods and modern heuristic methods. Autoregressive methods and exponential smoothing are classical approaches (De Gooijera and Hyndman 2006), while most modern methods rely on neural networks or evolutionary computation (Wagner et al. 2007). Classical approaches to the time series modeling problem are generally based on the idea of a deterministic world (De Gooijera and Hyndman 2006). However, meteorological time series exhibit highly nonlinear dynamic behavior; therefore, they cannot be described using deterministic approaches (Aksoy et al. 2008; Charles et al. 2004). Since classical methods usually fail to capture this behavior, we work with a combination of statistical tests and an artificial intelligence modeling method.

In this study, two distinct approaches are considered for this purpose: the classical Box–Jenkins methodology and a modern heuristic approach, gene expression programming (GEP). We use statistical tests and procedures to detect change points in the time series, and subsequently, we model the identified subseries by means of classical ARMA models and by means of models evolved by GEP.

A. Bărbulescu (*) · E. Băutu
Faculty of Mathematics and Informatics, Ovidius University of Constanta, Constanta, Romania
e-mail: [email protected]
URL: http://alina.ilinc.ro/ID-262/index.html

E. Băutu
e-mail: [email protected]
URL: http://csam.univ-ovidius.ro/~ebautu

Theor Appl Climatol (2010) 100:29–44
DOI 10.1007/s00704-009-0160-7

The paper is structured as follows: in the first section, some considerations regarding the studied time series are provided. We proceed with a presentation of the methodologies used to analyze the series. We briefly present the mathematical background surrounding the ARMA models. We continue with the basic ideas of the evolutionary technique used, GEP. Section 4 contains the results of the statistical analysis and the different models obtained. The final section concludes the paper with a discussion of our results and possible directions of future research.

2 Input data

Two types of series are studied:

• The mean monthly temperatures collected at the Sulina and Tulcea meteorological stations (situated in the Danube Delta) in the period January 1961–July 2007; they will be denoted, respectively, by Sul_M and Tul_M.

• The mean annual temperatures at seven meteorological stations (Sulina, Tulcea, Jurilovca, Constanta, Mangalia, Corugea, and Medgidia) between 1961 and 2005 (Maftei and Bărbulescu 2008). The series will be denoted, respectively, by Sul_A, Tul_A, Jur_A, Con_A, Man_A, Cor_A, and Med_A.

The locations of the stations are presented in Fig. 1, and details on them are given in Maftei and Bărbulescu (2008).

3 Methodology

3.1 Mathematical background and statistical tests

We recall some basic notions concerning time series, which will be used in this article (Brockwell and Davies 2002; Gourieroux and Monfort 1990; Jones et al. 1999).

A time series model for the observed data $(x_t)$ is a specification of the joint distributions of a sequence of random variables $(X_t)$ of which $(x_t)$ is postulated to be a realization.

Let $(X_t)$ be a time series with the expectation $E(X_t) < \infty$. The covariance function of $(X_t)$ is defined by:

\[ \gamma_X(r, s) = \mathrm{Cov}(X_r, X_s) = E\left[ (X_r - E(X_r)) (X_s - E(X_s)) \right] \]

for all integers $r$ and $s$.

The autocovariance function of $(X_t)$ at lag $h$ ($h \in \mathbb{N}^*$) is $\gamma_X(h) = \mathrm{Cov}(X_{t+h}, X_t)$ and the autocorrelation function of $(X_t)$ at lag $h$ is $\rho_X(h) = \gamma_X(h) / \gamma_X(0)$.

If $x_1, \ldots, x_n$ are observations of a time series, the empirical autocorrelation function, ACF, is:

\[ \hat{\rho}(h) = \frac{\sum_{t=1}^{n-|h|} (x_t - \bar{x}) (x_{t+|h|} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} \]

Fig. 1 Meteorological stations


and the partial autocorrelation function is:

\[ \tau(h) = \frac{\mathrm{Cov}(X_t - X_t^*, X_{t-h} - X_{t-h}^*)}{\sigma^2(X_t - X_t^*)}, \quad h \in \mathbb{N}^* \]

where $X_t^*$ ($X_{t-h}^*$) is the affine regression of $X_t$ ($X_{t-h}$) with respect to $X_{t-1}, \ldots, X_{t-h+1}$.

The empirical partial autocorrelation function (partial ACF) is defined by analogy for the observations $x_1, \ldots, x_n$ of the time series.
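The empirical ACF defined above can be computed directly from its ratio-of-sums form. The following is a minimal Python sketch, not the authors' code; the function name `empirical_acf` and the toy input are illustrative assumptions.

```python
def empirical_acf(x, max_lag):
    """Empirical autocorrelation for lags 0..max_lag (ratio-of-sums definition)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]          # deviations from the sample mean
    denom = sum(v * v for v in d)      # sum over t of (x_t - mean)^2
    return [
        sum(d[t] * d[t + h] for t in range(n - h)) / denom
        for h in range(max_lag + 1)
    ]


# Toy series (hypothetical data, not the Dobrudja temperatures).
acf = empirical_acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
```

At lag 0 the numerator equals the denominator, so the first value is always 1; plotting the remaining values against the lag gives the kind of ACF chart used later to choose a model type.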

The random variables $X$ and $Y$ with $E(X) < \infty$, $E(Y) < \infty$ are called uncorrelated if their covariance is 0. A sequence $(X_t)$ of uncorrelated random variables, each with mean 0 and variance $\sigma^2$, is called a white noise.

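Let us consider the operators defined by:

\[ B(X_t) = X_{t-1}, \]

\[ \Phi(B) = 1 - \varphi_1 B - \ldots - \varphi_p B^p, \quad \varphi_p \neq 0, \]

\[ \Theta(B) = 1 - \theta_1 B - \ldots - \theta_q B^q, \quad \theta_q \neq 0, \]

\[ \Delta^d(X_t) = (1 - B)^d X_t. \]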

The process $(X_t)$ is said to be an ARIMA(p, d, q) process if $\Phi(B) \Delta^d X_t = \Theta(B) \varepsilon_t$, where the absolute values of the roots of $\Phi$ and $\Theta$ are greater than 1 and $(\varepsilon_t)$ is a white noise. An ARIMA(p, d, q) process is called an AR(p) process if d=0=q. An ARIMA(p, d, q) process is called an MA(q) process if d=0=p. The ARIMA(p, d, q) process is called an ARMA(p, q) process if d=0.
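To make the sign convention of these operators concrete, the sketch below simulates an ARMA(1, 1) process, for which the definition reduces to $X_t = \varphi_1 X_{t-1} + \varepsilon_t - \theta_1 \varepsilon_{t-1}$. It is an illustrative Python sketch under stated assumptions: Gaussian white noise, zero initial conditions, and a hypothetical function name `simulate_arma11`.

```python
import random


def simulate_arma11(phi, theta, n, sigma=1.0, seed=0):
    """Simulate X_t = phi * X_{t-1} + eps_t - theta * eps_{t-1}.

    Assumptions: Gaussian white noise with standard deviation sigma,
    and X_0 = eps_0 = 0 as starting values.
    """
    rng = random.Random(seed)
    x_prev, eps_prev = 0.0, 0.0
    series = []
    for _ in range(n):
        eps = rng.gauss(0.0, sigma)
        x = phi * x_prev + eps - theta * eps_prev
        series.append(x)
        x_prev, eps_prev = x, eps
    return series


# Arbitrary illustrative coefficients with |phi| < 1 and |theta| < 1.
sample = simulate_arma11(phi=0.5, theta=0.3, n=200, seed=42)
```

Setting `theta=0` gives an AR(1) process and setting `phi=0` gives an MA(1) process, matching the special cases listed above.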

In order to determine the process type, the form of the ACF and partial ACF charts of the process may be analyzed. For example, the ACF of an AR(p) process is an exponentially decreasing or a damped sine wave oscillation. The partial ACF of an AR(p) process vanishes for all h>p.

Research shows that hydrometeorological series share characteristics such as nonstationarity, long-range dependence, data perturbations, and the absence of normality, which make the building of mathematical models difficult.

In order to determine the model type, the first step is the data analysis. To perform part of it, the ETREF program was developed (Maftei et al. 2007). Two modules were added: one that determines the long-range dependence of the time series, and another that computes the Box dimension of a phenomenon occurrence (Bărbulescu and Ciobanu 2007).

The meteorological time series under study are influenced by many factors in the environment; therefore, it is highly likely that the processes that govern their behavior are not described by only one mathematical equation. The usual approach when considering dynamic processes is to try to identify the points when major changes occur in the underlying model of the time series. The points where the probability law of a time series changes are also called structural breaks (for short, breaks) or change points. The problem is well known and is usually referred to as the change point problem. In this study, we use statistical tests to determine the existence of a break in a time series or to test some hypothesis on it. Most break tests allow the detection of a change in the mean of a time series, while others detect changes in the distribution function.

Fig. 2 GP individual that encodes the expression $\sin(x_{t-3}) \, x_{t-2} + (x_{t-1} + 2) * 5$

Fig. 3 GEP chromosome

The methods that will be used to detect a break are: the Pettitt test (Pettitt 1979), the Buishard “U” test (Buishard 1984), the Lee and Heghinian test (Lee and Heghinian 1977), Hubert’s segmentation procedure (Hubert and Carbonnel 1993), and change point analysis (Taylor 2000).

The following procedures and tests will also be used:

• the empirical autocorrelation function and the Durbin–Watson test, to determine whether autocorrelation exists in the data series (Seskin 2007);

• the Kolmogorov–Smirnov and Jarque–Bera tests or the Q–Q plot, for normality (Seskin 2007);

• the Bartlett test, for homoscedasticity (Baltagi 2008).
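As an illustration of how a rank-based break test of the kind listed above operates, here is a minimal Python sketch of the Pettitt statistic with its usual approximate significance level, $p \approx 2\exp(-6K^2/(n^3+n^2))$. It is a didactic version (cubic time, no tie correction), not the implementation used in the study; the function name and the toy data are assumptions.

```python
import math


def pettitt_test(x):
    """Return (split, K, p) for the Pettitt change point test.

    split is the number of observations before the most likely break,
    K = max_t |U_t|, and p is the usual approximate significance level.
    """
    n = len(x)

    def sign(a):
        return (a > 0) - (a < 0)

    best_k, best_split = 0, None
    for t in range(1, n):  # candidate break after the first t observations
        u = sum(sign(x[i] - x[j]) for i in range(t) for j in range(t, n))
        if abs(u) > best_k:
            best_k, best_split = abs(u), t
    p = min(1.0, 2.0 * math.exp(-6.0 * best_k ** 2 / (n ** 3 + n ** 2)))
    return best_split, best_k, p


# Hypothetical series with an obvious shift in the mean after 5 values.
split, k, p = pettitt_test([1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 5.1, 4.9, 5.2, 5.0])
```

A small `p` leads to rejecting the null hypothesis of no break, and `split` then marks the estimated break moment.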

3.2 Gene expression programming

GEP was introduced as a method of automatic programming in the seminal paper (Ferreira 2001). Evolutionary computation techniques proved their use in many real-life optimization problems (Michalewicz et al. 2006). GEP is an important member of the family of evolutionary computation techniques, alongside genetic algorithms, evolutionary strategies, genetic programming (GP), and evolutionary programming (Baeck et al. 1997).

Evolutionary computation relies on Darwin’s explanation for the diversity of species and individuals. Evolution and the natural selection principle have been the starting points of this branch of artificial intelligence and have proven extremely useful in problems where classical methods failed. The key point is the representation of the solutions of a problem as individuals that can evolve over generations and improve by interacting with the other individuals and through the application of genetic operations.

The basic features of an evolutionary algorithm are: the creation of a population of candidate solutions of the problem at hand (individuals), the existence of some criteria to evaluate the quality (fitness) of the candidate solutions, and some mechanisms to vary the individuals over time. The variation mechanisms are inspired by genetics, and the most used ones are mutation and crossover. The population of potential solutions evolves over time. The fitter an individual is, the greater its chances are to pass its features on to the next generations. The transmission of features takes place either when the individual survives, unchanged or with slight modifications, or by means of its offspring. In genetics, slight modifications that appear in the genetic code of an individual are called mutations. The operation by which offspring inherit features from their parents is known as crossover.

Genetic algorithms (GAs) are inspired by nature, where individuals compete to survive. They adapt to their environment during their lifetime, so that the best-adapted individuals have the greatest chances to survive until reaching reproductive age and to reproduce, transmitting their features to future generations. GAs work with a population of individuals; each individual is a candidate solution to the problem.

Fig. 4 Sulina: monthly series (Sul_M); x-axis: time (month), y-axis: temperature (°C)

Fig. 5 Tulcea: monthly series (Tul_M); x-axis: time (month), y-axis: temperature (°C)

The classical GA uses a string of bits to encode a solution. The bit string represents the genotype of the actual solution that it encodes, which represents the phenotype. The optimization problem is the environment to which the individuals adapt, and the fitness of each individual is a measure of how close it is to the real solution. Candidate solutions evolve over a number of generations and change, improving over time, by means of genetic operators.

The classical genetic operators are mutation and crossover. Classic GA mutation works by flipping a bit in an individual, and crossover works by constructing two offspring from two selected individuals by swapping some segments of genetic code between them. The genetic operators work on the genotype, but their application affects, most importantly, the phenotype of the individual. In the majority of GA applications, there exists a clear delimitation between the genotype and the phenotype of individuals (Baeck et al. 1997).

GP is an extension of GAs in which individuals are computer programs or complex compositions of mathematical functions (Koza 1992). It arose from the desire to create algorithms that would automatically create computer programs that solve problems. As opposed to GAs, which are mainly used to optimize functions, GP is oriented towards identifying the model that produced a given set of data. GP individuals encode computer programs or mathematical functions expressed as complex compositions of functions and variables or constants. These kinds of abstractions (computer programs, functions) can be encoded as different types of data structures. The classic GP approach uses individuals expressed as trees that encode the parse trees of the expressions that represent the candidate solutions of the problem. No constraint is imposed on GP with respect to the shape or the size of the individuals. An individual may use any number of symbols, in any shape, as long as it is a valid mathematical expression. Evidently, the physical limitations of computers imply that, when actually run, the algorithm must use some sort of control over the size of the solutions. An example of a GP individual is presented in Fig. 2.

Fig. 6 ACF of Sul_M

Fig. 7 Partial ACF of Sul_M

The paradigm is the same as in GA, yet the implications of the change of representation are crucial. GP genetic operators are applied on parse trees of mathematical expressions (as in Fig. 2); therefore, specific operators are defined. Mutation works by replacing a symbol with another symbol of the same arity: a variable or a constant may be replaced with either a variable or a constant, while a function may be replaced with a function of the same arity. Crossover operators swap subtrees between GP individuals.

In GP, there is no delimitation between the phenotype and the genotype; therefore, the genetic operators act directly on the phenotype. This is contrary to what happens in nature, where the phenotype of an organism is the set of all its features and the genotype is the genetic code of that individual. The fact that there is no separation between the genotype and the phenotype limits the search power of the genetic operators (Ferreira 2006).

We employ a GEP algorithm in this paper to discover the model that best fits a given time series. GEP is a variant of GP in the sense that GEP individuals are linear strings that a mapping function decodes into complex mathematical expressions like the one in Fig. 2. The GEP codification is inspired by GAs.

3.2.1 Representation

GEP individuals are strings of symbols of fixed size. By symbols, we understand mathematical functions (e.g., arithmetic operators like +, −, *, /, trigonometric functions, exponential, logarithmic, etc.), constants, or variables. The set of symbols at the algorithm’s disposal is a parameter of the algorithm. Although linear in form, GEP individuals encode nonlinear expressions; in our case, compositions of mathematical functions with other functions and variables. A GEP gene represents a mathematical expression encoded as a linear string of symbols. In GEP, individuals are composed of one or more genes of equal length. The number of genes is constant throughout the population over all generations and is given as a parameter of the algorithm, as is the gene size. When decoding a GEP chromosome, the expressions encoded by the genes are linked by means of a linking function. The most used linking function is addition, but other functions may also be used.

Fig. 8 Histogram of residual in the model for Sul_A

Fig. 9 Sul_M: the forecast of temperatures for the last 120 months

Every gene encodes a mathematical expression represented as an expression tree. Ferreira proposed a special syntax for GEP genes that ensures the validity of the decoding process (Ferreira 2001). A GEP gene has two parts, named “head” and “tail.” The tail is constrained to contain only constants or variables, whereas the head may contain any symbol. Also, if we denote the head’s size by h and the tail’s size by t, the relation t=h(n−1)+1 must hold, where n represents the maximum arity of the functional symbols used by the algorithm. This rule guarantees that each GEP gene decodes into a correct expression tree, i.e., a correct mathematical function. Each GEP gene contains an active code (the portion of symbols that actually participates in the decoding process) and symbols that have no effect in the decoding process.
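The head/tail relation can be checked numerically. The short Python sketch below simply evaluates t=h(n−1)+1; the function name `tail_size` is an illustrative assumption.

```python
def tail_size(head_size, max_arity):
    """Tail length required by the GEP rule t = h * (n - 1) + 1."""
    return head_size * (max_arity - 1) + 1


# With only unary and binary functions (maximum arity 2):
t_small = tail_size(head_size=3, max_arity=2)   # head of three symbols
t_large = tail_size(head_size=5, max_arity=2)   # head of five symbols
```

The rule covers the worst case: if every head position holds a function of maximum arity, the tail still supplies enough terminals to close the expression tree.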

For the time series modeling problem, a solution is a model that best fits the time series. The problem to be solved concerns the identification of the right type of model (e.g., the form of the model) and its coefficients. Another important aspect is to decide how many previous data points are used by the model, i.e., the “window size.” One must also decide how the past data used by the model is sampled from the original time series.

In this study, we denote the window size by w and we sample the past data at a sampling lag k=1. For example, if we consider the window size of 3 and the lag size of 1, the model will predict the value at a moment t using the previous three values in the time series, $x_{t-1}, x_{t-2}, x_{t-3}$. As a result, any function of three parameters is a possible candidate solution to our problem. The candidate functions may be complex compositions of common functions, numerical constants, and variables.
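The windowing described above amounts to turning the series into (inputs, target) pairs. The Python sketch below shows one way to do this; the helper name `make_windows` is hypothetical, and the treatment of lags k>1 (taking every k-th past value) is an assumption, since the study only uses k=1.

```python
def make_windows(series, w, k=1):
    """Build (inputs, target) pairs for a window size w and sampling lag k.

    For each usable moment t, inputs = [x[t-k], x[t-2k], ..., x[t-w*k]]
    and target = x[t].
    """
    pairs = []
    for t in range(w * k, len(series)):
        inputs = [series[t - i * k] for i in range(1, w + 1)]
        pairs.append((inputs, series[t]))
    return pairs


# Toy series (hypothetical values): with w=3 and k=1, the first usable
# target is the fourth value, predicted from the three values before it.
pairs = make_windows([10, 11, 12, 13, 14], w=3)
```

Each candidate function is then evaluated on the `inputs` of every pair and compared with the corresponding `target`.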

Figure 3 presents a possible GEP individual for the time series modeling problem. The head size is three, the maximum number of parameters is two (since only arithmetic operators are used), and therefore, the tail size is four. As can be seen, the first gene encodes the expression $x_{t-3} * x_{t-2}$. The tail that starts with the symbol 2 has no active code. The second gene encodes the expression $x_{t-1} * 2 + 5$. The last two symbols in the tail are inactive (they do not appear in the decoded expression).

Fig. 10 Tul_M: the forecast of temperatures for the last 120 months

Fig. 11 GEP model for Sul_M


The decoding of a gene into an expression tree makes use of the arity of each symbol (variables and numerical constants are considered to have arity 0). To obtain the expression, we associate each functional symbol with the parameters necessary for its evaluation. In the case of the first gene from Fig. 3, the first symbol (*) requires two parameters (the second and the third symbols, the variables $x_{t-3}$ and $x_{t-2}$). The rest of the symbols are not used, since the decoding process ends when all functional symbols have received parameters so that they decode to valid expressions. For the second gene, the first symbol (+) requires two parameters (* and 5). The second symbol (*) requires two parameters ($x_{t-1}$ and 2). Therefore, the gene decodes to the following expression: $x_{t-1} * 2 + 5$.
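The decoding procedure just described can be sketched as a breadth-first pass over the gene, in which each functional symbol takes its parameters from the next unread symbols. The Python code below is an illustrative sketch, not the ECJ gep package; the names `decode_gene` and `to_infix` and the shorthand `x1`, `x2`, `x3` for $x_{t-1}$, $x_{t-2}$, $x_{t-3}$ are assumptions.

```python
from collections import deque

# Arity of each functional symbol; any other symbol is a terminal (arity 0).
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "sin": 1}


def decode_gene(gene):
    """Decode a gene (list of symbols) into a tree of [symbol, children]."""
    root = [gene[0], []]
    queue = deque([root])
    pos = 1  # index of the next unread symbol
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [gene[pos], []]
            pos += 1
            node[1].append(child)
            queue.append(child)
    return root  # symbols from index pos onward are inactive


def to_infix(node):
    """Render a decoded tree as a fully parenthesized expression string."""
    symbol, children = node
    if not children:
        return symbol
    if len(children) == 1:
        return f"{symbol}({to_infix(children[0])})"
    left, right = children
    return f"({to_infix(left)} {symbol} {to_infix(right)})"


# The two genes of the example individual, written with the shorthand above.
gene_1 = ["*", "x3", "x2", "2", "x1", "3", "2"]
gene_2 = ["+", "*", "5", "x1", "2", "x2", "4"]
expr_1 = to_infix(decode_gene(gene_1))
expr_2 = to_infix(decode_gene(gene_2))
```

Because the head/tail rule guarantees enough terminals, `pos` never runs past the end of a valid gene, and the unread suffix is exactly the inactive code.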

GEP individuals combine the conceptual simplicity of the GA linear encoding with the complexity of the expressions encoded by GP individuals. The codification draws a clear separation between the genotype and the phenotype, allowing the algorithm to perform a thorough search in the candidate solution space (Ferreira 2006).

3.2.2 Genetic operators

The GEP algorithm imitates the evolutionary process in nature and uses it in order to obtain knowledge expressed as formulas from data. The GEP population consists of individuals that are mathematical expressions of functions. The population evolves over a number of generations (set as a parameter of the algorithm). In each generation, a series of genetic operators act upon the population, introducing diversity or enhancing the search process.

GEP has various operators (Ferreira 2006), most of them inspired by the ones used by GAs. The standard GEP unary operators are mutation, IS transposition, RIS transposition, and gene transposition.

The mutation operator can change any symbol in the chromosome, while respecting the head/tail rule: in the tail, only variables and constants are allowed. For example, given the gene below (with the head size h=3), let us assume that the second symbol is selected to be mutated. Given that it is placed in the head of the gene, it may be replaced by any other symbol, so let us assume that it gets replaced with the functional symbol sin. So, the gene:

\[ *\; x_{t-3}\; x_{t-2}\; 2\; x_{t-1}\; 3\; 2 \]

that encodes the expression $x_{t-3} * x_{t-2}$ becomes:

\[ *\; \sin\; x_{t-2}\; 2\; x_{t-1}\; 3\; 2 \]

and the encoded expression becomes $\sin(2) * x_{t-2}$. The effect of the operator on the phenotype of the individual is obvious.
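The head/tail constraint on mutation can be sketched as follows. This is a minimal Python illustration, not the operator from the ECJ gep package; the symbol sets and the function name `mutate` are hypothetical, and `x1`, `x2`, `x3` stand for $x_{t-1}$, $x_{t-2}$, $x_{t-3}$.

```python
import random

# Hypothetical symbol sets chosen for illustration only.
FUNCTIONS = ["+", "-", "*", "/", "sin"]
TERMINALS = ["x1", "x2", "x3", "2", "3", "5"]


def mutate(gene, head_size, position, rng):
    """Return a copy of gene with the symbol at position replaced.

    Head positions may receive any symbol; tail positions may only
    receive terminals, so the gene always stays decodable.
    """
    allowed = FUNCTIONS + TERMINALS if position < head_size else TERMINALS
    mutated = list(gene)
    mutated[position] = rng.choice(allowed)
    return mutated


rng = random.Random(0)
gene = ["*", "x3", "x2", "2", "x1", "3", "2"]
child = mutate(gene, head_size=3, position=1, rng=rng)
```

The gene length never changes, which is what lets every mutated gene decode into a valid expression tree.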

Fig. 12 GEP model for Tul_M

Fig. 13 Sulina (Sul_A) and Tulcea (Tul_A) annual series; x-axis: time (year), y-axis: temperature (°C)

Fig. 14 Medgidia (Med_A), Corugea (Cor_A), Jurilovca (Jur_A), Constanta (Con_A), and Mangalia (Man_A) annual series; x-axis: time (year), y-axis: temperature (°C)


The transposition operators copy a sequence of symbols called a transposon (or insertion sequence) and insert it into some location. The transposon gets duplicated in this process inside that individual. The GEP transposition operators differ among themselves in the way they select the insertion point and the insertion sequence. The IS transposition operator selects a random sequence from the chromosome and inserts a copy of it in a random position in the head of a gene (except the first position) without altering the tail or the other genes. For example, in the individual represented in Fig. 3, if the transposition operator selects the sequence $*\; 5\; x_{t-1}$ and selects as insertion point the second position of the first gene, then the first gene becomes:

\[ *\; *\; 5\; x_{t-1}\; x_{t-1}\; 3\; 2. \]

The expression encoded by this gene is $x_{t-1}^2 * 5$.

The RIS transposition operator is similar to IS transposition, except that it is forced to select a transposon that starts with a functional symbol. The gene transposition operator uses an entire gene as a transposon, moving it to the beginning of the chromosome (i.e., the transposon is deleted at the place of origin). This means that the respective gene does not get duplicated. The linking operation we use is addition, and it is commutative; therefore, the application of this operator on a GEP chromosome has no effect on the expression encoded by it. In our experiments, we do not use this operator.

The standard GEP binary operators are: one-point crossover, two-point crossover, and gene crossover. In one-point crossover, one random location is chosen as the cutting point. The genetic material downstream of it is exchanged between the two parent chromosomes. For example, let us assume that the parents are the following two individuals:

\[ *\; x_{t-3}\; x_{t-2}\; 2\; x_{t-1}\; 3\; 2\; +\; *\; 5\; x_{t-1}\; 2\; x_{t-2}\; 4, \]

\[ +\; x_{t-1}\; 2\; 5\; x_{t-2}\; 3\; x_{t-3}\; \sin\; x_{t-3}\; +\; 2\; x_t\; 3\; 4. \]

Each contains two genes with a head size of three. In the first parent, the first gene starts with * and the second gene starts with +; in the second parent, the first gene starts with + and the second gene starts with sin. If the cutting point is selected after the fourth symbol in the chromosome, then the offspring of these two parents are obtained by joining the first part of the first parent (its first four symbols) with the second part of the second parent (its remaining symbols), and the first part of the second parent with the second part of the first parent:

\[ *\; x_{t-3}\; x_{t-2}\; 2\; x_{t-2}\; 3\; x_{t-3}\; \sin\; x_{t-3}\; +\; 2\; x_t\; 3\; 4, \]

\[ +\; x_{t-1}\; 2\; 5\; x_{t-1}\; 3\; 2\; +\; *\; 5\; x_{t-1}\; 2\; x_{t-2}\; 4. \]
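The one-point crossover example can be reproduced with simple list slicing. The Python sketch below is illustrative only; the function name is an assumption, and `x1`, `x2`, `x3`, `xt` are shorthand for $x_{t-1}$, $x_{t-2}$, $x_{t-3}$, $x_t$.

```python
def one_point_crossover(parent_a, parent_b, cut):
    """Exchange the material after position cut between two chromosomes."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


# The two parents of the example, one token per symbol.
parent_a = "* x3 x2 2 x1 3 2 + * 5 x1 2 x2 4".split()
parent_b = "+ x1 2 5 x2 3 x3 sin x3 + 2 xt 3 4".split()

# Cutting point after the fourth symbol.
child_a, child_b = one_point_crossover(parent_a, parent_b, cut=4)
```

Because both parents share the same gene layout, every position keeps its head or tail role after the swap, so the offspring remain valid chromosomes.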

Fig. 15 ACF of Sul_A

Fig. 16 ACF of Tul_A


In two-point crossover, two random locations are chosen as crossover points and the genetic material between them is exchanged between the two parent chromosomes. In gene crossover, entire genes are exchanged between two parent chromosomes. The exchanged genes are randomly chosen and occupy the same position in the parent chromosomes.

3.2.3 Selection

At each generation, individuals are evaluated with respect to a performance measure. The performance of an individual is assessed based on the error of the expression encoded by the individual versus the data set, and it is assigned a fitness value as a result.

Consider the realization of a time series of volume n: $x_1, x_2, \ldots, x_n$.

We compute the fitness of an individual as the mean squared error (MSE):

\[ \mathrm{MSE}(ind) = \frac{1}{n} \sum_{t=1}^{n} (x_t - \hat{x}_t)^2 \]

where $\hat{x}_t$ represents the value predicted by the function encoded by the individual ind for the moment t of the time series. The smaller the fitness value, the better the individual is.
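The fitness computation is a direct transcription of the MSE formula. A minimal Python sketch follows; the function name and the toy values are illustrative assumptions.

```python
def mse(observed, predicted):
    """Mean squared error between observed values and model predictions."""
    n = len(observed)
    return sum((x - x_hat) ** 2 for x, x_hat in zip(observed, predicted)) / n


# Hypothetical observed values and predictions.
fitness = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In practice the first w values of a series have no full window of past data, so an implementation would typically average over the moments for which a prediction exists.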

The selection operator simulates the process of natural selection. It uses the fitness of individuals to assign higher chances of surviving into the next generation to the individuals that are better adapted to their environment, i.e., those that have better fitness values and thus encode better solutions for the problem at hand.

The selection method used in our experiments is roulette wheel selection combined with elitist survival of the best individual from one generation to the next. This way, every individual’s chances to survive are directly proportional to its fitness, and the best individual in a generation is guaranteed to have at least one copy in the next generation.
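Roulette wheel selection with elitism can be sketched as below. Since the fitness here is an error to be minimized, the wheel needs weights that grow as the MSE shrinks; the mapping 1/(1+MSE) used in this Python sketch is an assumption made for illustration, not necessarily the mapping applied in the experiments, and the function name is hypothetical.

```python
import random


def select_next_generation(population, mse_values, rng):
    """Roulette wheel selection with elitist survival of the best individual.

    Assumption: the selection weight of an individual is 1 / (1 + MSE),
    so that smaller errors receive larger slices of the wheel.
    """
    weights = [1.0 / (1.0 + m) for m in mse_values]
    best = min(range(len(population)), key=lambda i: mse_values[i])
    survivors = [population[best]]  # elitism: the best is always kept
    survivors += rng.choices(population, weights=weights, k=len(population) - 1)
    return survivors


rng = random.Random(1)
next_gen = select_next_generation(["a", "b", "c", "d"], [2.0, 0.5, 3.0, 1.0], rng)
```

The population size stays constant, and the individual with the smallest MSE always occupies the first slot of the next generation.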

The algorithm goes on until a maximum number of generations is reached. The solution indicated by the algorithm is the individual that has the smallest MSE encountered throughout all generations. Alternative termination criteria are available, as reported in the literature (Ferreira 2006).

Summarizing, the basic stages of the GEP algorithm used are:

1. Creation of a random initial population of individuals that encode candidate solutions.

2. Evaluation of the population (compute the MSE) and assignment of fitness values to individuals.

3. Application of the genetic operators.

4. Application of the selection operator.

5. If the termination criterion is not met, then go back to step 2.

Fig. 17 Sul_A (1961–1997) model; observed and calculated series; x-axis: time (year), y-axis: temperature (°C)

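Table 1 Results of break tests

| Station | Buishard 95% | Buishard 90% | Pettitt 95% | Pettitt 90% | Pettitt: break year | Lee and Heghinian: break year | Hubert: break year |
|---|---|---|---|---|---|---|---|
| Tulcea | No | No | No | No | 1988 | 1997 | 1997 |
| Sulina | No | No | Yes | No | 1997 | 1997 | 1997 |
| Jurilovca | Yes | No | Yes | Yes | – | 1998 | 1998 |
| Constanta | No | No | Yes | No | 1988 | 1997 | 1997 |
| Mangalia | No | No | Yes | Yes | – | 1997 | 1997 |
| Corugea | No | No | Yes | No | 1997 | 1997 | 1997 |
| Medgidia | No | No | Yes | Yes | – | 1997 | 1997 |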

Table 2 Kolmogorov–Smirnov test for residuals in the MA(3) model for Sul_A (1961–1997)

| Statistic | Resid_1991–1997 |
|---|---|
| N | 37 |
| Normal parameters (a, b): Mean | −0.000888 |
| Normal parameters (a, b): SD | 0.5935501 |
| Kolmogorov–Smirnov Z | 0.569 |
| Asymp. Sig. (two-tailed) | 0.902 |


4 Results

4.1 Models for monthly data

4.1.1 ARIMA models

In this section, we present the results obtained for the series Sul_M and Tul_M. Their charts are shown in Figs. 4 and 5.

The test results showed that the series are not normally distributed, are correlated and homoscedastic, and that no break point stood out.

• In order to determine a model for Sul_M, the ACF (Fig. 6) and partial ACF (Fig. 7) of this series were analyzed.

Figs. 4 and 6 reveal a slow damping of the ACF values and the presence of a seasonal component with period 12 in the data series. So, we consider the series obtained by applying a seasonal difference of order 12 and subtracting the mean. For it, the following model was built:

\[ X_t = 0.3626 \, X_{t-1} + \varepsilon_t - 1.005 \, \varepsilon_{t-1} \]

where $(\varepsilon_t)$ is the residual with the variance of 2.17.

In Fig. 8, the residual chart is represented. The values of the Box–Ljung and MacLeod–Li statistics and the corresponding p values (0.382 and 0.245, respectively) lead us to accept the hypothesis that the residual is a Gaussian white noise. The values of the Jarque–Bera statistic confirm these assertions.
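The preprocessing applied before fitting the monthly model (seasonal differencing of order 12 followed by mean subtraction) can be sketched in a few lines of Python. The function names and the synthetic series are illustrative assumptions, not the study's data or code.

```python
def seasonal_difference(series, period=12):
    """Return y_t = x_t - x_{t-period} for every t with a full period behind it."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def subtract_mean(series):
    """Center a series by subtracting its sample mean."""
    mean = sum(series) / len(series)
    return [value - mean for value in series]


# Synthetic monthly series: a 12-month pattern plus a linear trend.
raw = [(t % 12) + 0.5 * t for t in range(36)]
prepared = subtract_mean(seasonal_difference(raw, period=12))
```

For this synthetic input, the seasonal pattern cancels and the trend turns into a constant, so the centered result is identically zero; on real temperatures, what remains is the series to which the ARMA model is fitted.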

Using this model, we forecast the temperatures of the last 120 months using the previous 450 data points. The result was good, as can be seen in Fig. 9.

• The same procedure was followed to find a good model for Tul_M. After applying a seasonal difference of order 12 and subtracting the mean, an ARMA(1, 12) model was determined using the Akaike selection criterion. It has the equation:

\[ X_t = 0.2643 \, X_{t-1} + \varepsilon_t - 0.9998 \, \varepsilon_{t-1} \]

where $(\varepsilon_t)$ is a white noise with the variance of 2.878.

The forecast obtained with this model appears to be close to the recorded values (Fig. 10).

4.1.2 GEP models

We performed experiments using the gep package developed for the evolutionary computation software ECJ (see footnote 1). In all experiments, we performed 50 independent runs for each setup. Each chromosome consisted of five genes in all GEP runs, linked in the final model by the addition operator. The head size of a gene was set to five symbols; therefore, the tail size was set to six symbols. In each run, the population consisted of 200 individuals and the algorithm evolved the population over 500 generations. The operator rates used the default values defined in ECJ, which are set as recommended in the literature (Ferreira 2006).

The function set used by the algorithm consisted of {+, −, *, /, sin}, where division is implemented as a Koza-style protected operator (Koza 1992).
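A protected division operator keeps evolved expressions well defined even when a denominator evaluates to zero. The Python sketch below follows the convention commonly attributed to Koza, returning 1 in that case; whether the gep package uses exactly this return value is an assumption.

```python
def protected_div(a, b):
    """Division that returns 1.0 when the denominator is zero.

    Assumption: the Koza-style convention of returning 1 on division by zero.
    """
    return 1.0 if b == 0 else a / b


ordinary = protected_div(6.0, 3.0)    # behaves like normal division
guarded = protected_div(6.0, 0.0)     # zero denominator handled without error
```

Without such a guard, a single candidate expression that divides by zero on one data point would abort the fitness evaluation of the whole population.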

Finding the optimum window size is an optimization problem by itself, and there exists no precise algorithm to compute it. Since this is not the main purpose of our article, we do not employ a special algorithm to decide on a specific window size. Instead, we take a brute-force approach: we perform experiments for all window sizes in the interval $w \in [1, 6]$. Throughout the experiments, we consider the lag k=1.

The nature of the search process employed by GEP allows it to automatically identify the variables that are most useful to estimate future values among the past w input variables. For example, the window size w=5 and the lag k=1 mean that the model may use the most recent five past values $x_{t-1}, x_{t-2}, x_{t-3}, x_{t-4}, x_{t-5}$. It is possible that the function identified by GEP as the model that best fits the data does not use some of these variables.

The models obtained are compared with respect to the MSE values obtained on the original data set. In the results, we report the best model (i.e., the model with the smallest MSE value) found in all 50 runs for each window size considered.

For the monthly data, inspired by the best autoregressive models obtained, we also performed experiments for the window size of 12. As expected, the models obtained were more accurate than those obtained for all other window sizes for both Sulina and Tulcea stations. The corresponding charts are presented in Figs. 11 and 12.

1 ECJ is an open-source evolutionary computation research system developed in Java at George Mason University’s Evolutionary Computation Laboratory and available at http://cs.gmu.edu/~eclab/projects/ecj/.

Fig. 18 Tul_A (1998–2007) model; series: after_98, shift_fit, AR1; x-axis: time (year), y-axis: temperature (°C)



In both cases, the residual was Gaussian and homoscedastic. For Sul_M, the residual was correlated with the variance of 3.18, but for Tul_M, it was independent with the variance of 5.37.

4.2 Models for annual data

4.2.1 ARIMA models

In this section, we present the results obtained for the series of average annual temperatures at the Sulina and Tulcea stations, making comparisons with those obtained for the other five stations: Jurilovca (in the Danube Delta), Constanta and Mangalia (on the Black Sea coast), and Medgidia and Corugea (in the center of Dobrudja). Since the analysis of mean temperature variation was done in detail in Maftei and Bărbulescu (2008), we restrict ourselves to presenting the charts of temperature evolution (Figs. 13 and 14).

It can be seen that there are some series with the same evolution pattern, so we expect to obtain models of the same type. After applying the Kolmogorov–Smirnov test, the normality hypothesis was accepted for all the annual series. The study of the ACF of the series leads us to accept the hypothesis that the data are not correlated. In Figs. 15 and 16, we present, for example, the charts of the ACF of Cor_A and Tul_A.
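As an illustration of the ACF check, the following minimal sketch computes the sample autocorrelation function and flags the lags that fall outside the usual approximate 95% band of ±1.96/√n for uncorrelated data. The function names are hypothetical, and the series is synthetic noise standing in for a 47-year annual record; it is not the station data.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    x = np.asarray(series, dtype=float)
    d = x - x.mean()
    denom = np.sum(d * d)
    return np.array([np.sum(d[lag:] * d[:len(d) - lag]) / denom
                     for lag in range(max_lag + 1)])

def lags_outside_band(series, max_lag):
    """Lags whose sample autocorrelation exceeds the +/-1.96/sqrt(n) band."""
    bound = 1.96 / np.sqrt(len(series))
    acf = sample_acf(series, max_lag)
    return [lag for lag in range(1, max_lag + 1) if abs(acf[lag]) > bound]

rng = np.random.default_rng(0)
noise = rng.normal(size=47)  # synthetic stand-in, assumed length of 47 years
acf = sample_acf(noise, 10)
print(lags_outside_band(noise, 10))
```

For uncorrelated data, only an occasional lag is expected to fall outside the band, whereas a trending series typically violates it already at lag 1.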

The results of break tests are listed in Table 1. “Yes” means that the null hypothesis (there is no break in the time series) is accepted, and “no” means that it is rejected, in which case, the break moment is also mentioned.

Since the tests’ results are contradictory, but the majority of them lead us to reject the null hypothesis, the models have been determined for the entire series and for the subseries before and after the break moment.
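To make the idea of a break test concrete, here is a minimal sketch of one widely used non-parametric test, the Pettitt test (Pettitt 1979 in the reference list). It is shown under our own assumptions: the function name is hypothetical, the p-value uses the standard approximation 2·exp(−6K²/(n³+n²)), and the toy series is artificial rather than one of the Dobrudja series.

```python
import numpy as np

def pettitt_test(series):
    """Return (break_after, K, approximate p-value) for a single shift in level."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    # s[i, j] = sign(x[i] - x[j]) for every pair of observations
    s = np.sign(x[:, None] - x[None, :])
    # U_t compares the first t observations with the remaining n - t
    u = np.array([s[:t, t:].sum() for t in range(1, n)])
    idx = int(np.argmax(np.abs(u)))
    k_stat = float(abs(u[idx]))
    p_value = min(1.0, 2.0 * float(np.exp(-6.0 * k_stat**2 / (n**3 + n**2))))
    return idx + 1, k_stat, p_value  # idx + 1 observations precede the break

# Artificial series with a level shift after the 10th value
series = [10.0] * 10 + [12.0] * 10
break_after, k_stat, p_value = pettitt_test(series)
print(break_after, k_stat, p_value)
```

The returned position counts the observations before the break, so for a series starting in 1961 a value of 37 would place the break after 1997.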

The best models obtained using the Box–Jenkins methods were:

1. Sul_A:

(a) For Sul_A (1961–2007), after the mean subtraction: a Gaussian white noise with the variance of 0.488.

(b) For Sul_A (1961–1997), after the mean subtraction:

$$X_t = \varepsilon_t - 0.3783\,\varepsilon_{t-3}$$

where $(\varepsilon_t)$ is a Gaussian white noise with the variance of 0.348 (Fig. 17).

Table 3 Best models for Jur_A, Con_A, and Man_A

| Series | Transformation | Period: 1965–2005 | Period: before break | Period: after break |
| --- | --- | --- | --- | --- |
| Jur_A | Mean subtraction | Gaussian white noise; variance=0.444 | Gaussian white noise; variance=0.393 | |
| Con_A | Mean subtraction | Gaussian white noise; variance=0.491 | | |
| Man_A | Mean subtraction | Gaussian white noise; variance=0.464 | | |

Fig. 19 MA(1) model for Cor_A dif 1 (1965–2005) (x-axis: time (year); y-axis: temperature (°C); legend: Series_Dif1, MA1_Model)


In Table 2, the results of the Kolmogorov–Smirnov test on the residual of the previous MA(3) model are presented.

(c) For Sul_A (1998–2007):

$$X_t = 0.9977\,X_{t-1} + \varepsilon_t$$

where $(\varepsilon_t)$ is a Gaussian white noise with the residual variance of 1.094.

2. Tul_A:

(a) For Tul_A (1961–2007), after the subtraction of the mean: a Gaussian white noise with the variance of 0.516.

(b) For Tul_A (1961–1997), after the subtraction of the mean: a Gaussian white noise with the variance of 0.422.

(c) For Tul_A (1998–2007):

$$X_t = 0.9944\,X_{t-1} + \varepsilon_t$$

where $(\varepsilon_t)$ is a Gaussian white noise (Fig. 18).

For the other series located in the Danube Delta and on the Black Sea coast, the models obtained after some transformation are given in Table 3.

3. Cor_A:

(a) For Cor_A (1965–2005), after considering a first-order difference and the mean subtraction:

$$X_t = \varepsilon_t - 0.8976\,\varepsilon_{t-1}$$

where $(\varepsilon_t)$ is a Gaussian white noise with the variance of 0.534 (Fig. 19).

(b) For Cor_A (1965–1997), after considering a first-order difference and the mean subtraction:

$$X_t = \varepsilon_t - 0.9902\,\varepsilon_{t-1}$$

where $(\varepsilon_t)$ is a Gaussian white noise with the variance of 0.464 (Fig. 20).

(c) For Cor_A (1998–2005), after the mean subtraction: a Gaussian white noise with the variance of 0.2073.

4. Med_A:

(a) For Med_A (1965–2005):

$$Y_t - 0.833\,Y_{t-1} = 9.7725 + \varepsilon_t, \quad t = 2, \dots, 41$$

where $(\varepsilon_t)$ is the residual, which is independent and normally distributed.

(b) For Med_A (1965–1997):

$$(1 - 0.997B)(1 - B)X_t = \varepsilon_t + 0.909\,\varepsilon_{t-1}, \quad t = 2, \dots, 33$$

where $(\varepsilon_t)$ is a Gaussian white noise.
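To illustrate what such a model implies, the sketch below takes the MA(1) form reported for the differenced Cor_A (1965–2005) series, X_t = ε_t − θ ε_{t−1} with θ = 0.8976 and noise variance 0.534, and compares the theoretical variance σ²(1 + θ²) and lag-1 autocorrelation −θ/(1 + θ²) with a long simulated path. The function names, the seed, and the simulation length are our own assumptions; this is not the fitting procedure used in the study.

```python
import numpy as np

def ma1_theory(theta, sigma2):
    """Variance and lag-1 autocorrelation of X_t = e_t - theta * e_{t-1}."""
    variance = sigma2 * (1.0 + theta**2)
    rho1 = -theta / (1.0 + theta**2)
    return variance, rho1

def simulate_ma1(theta, sigma2, n, seed=0):
    """Simulate n values of the MA(1) process driven by Gaussian white noise."""
    rng = np.random.default_rng(seed)
    e = rng.normal(scale=np.sqrt(sigma2), size=n + 1)
    return e[1:] - theta * e[:-1]

variance, rho1 = ma1_theory(theta=0.8976, sigma2=0.534)
x = simulate_ma1(theta=0.8976, sigma2=0.534, n=20000)
empirical_rho1 = float(np.corrcoef(x[:-1], x[1:])[0, 1])
print(round(variance, 4), round(rho1, 4), round(empirical_rho1, 2))
```

A lag-1 autocorrelation close to −0.5 is what one expects when a series that is close to white noise around a level is differenced, which is consistent with the coefficient being near 1.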

Fig. 21 GEP_model for Sul_A (1961–2007)

Fig. 20 MA(1) model for Cor_A dif 1 (1965–1997) (x-axis: time (year); y-axis: temperature (°C); legend: Cor_A dif 1, MA(1) for Cor_A dif1)




4.2.2 GEP models

In what follows, the charts presented in Figs. 21, 22, and 23 (corresponding to the Sul_A series) and Figs. 24, 25, and 26 (corresponding to the Tul_A series) show how well the GEP models fit the original time series. Except for the GEP model obtained for Sul_A (1961–1997), for which the residual is dependent, the residuals are Gaussian, independent, and identically distributed. Their variances are, respectively, 0.307, 0.244, and 0.01037 for Sul_A and 0.405, 0.323, and 0.00381 for Tul_A. So, the values are smaller than the corresponding ones in the ARIMA models. The models presented were the best among all experiments performed for a given time series (an experiment consists of 50 independent GEP runs for each window size).

The solutions obtained by GEP are extremely complex. As reported in the literature, they are useful to fit the time series, but usually shed little light on the nature of the relationship between the variables involved. For example, the model function derived for the series Tul_A (1961–2007) has the expression given below:

$$\frac{1}{49}\sin\left(x_{t-4}^{2}\right)\cdot x_{t-4} + \sin\left(x_{t-2} + x_{t-3}\right)\cdot\sin\left(x_{t-3}\cdot x_{t-2}\right) + \sin\left(\frac{10 + x_{t-2}}{10}\right) + \sin(\sin(\sin(\sin(\sin(x_{t-4}))))) + x_{t-3}.$$

The MSEs of the GEP solutions presented in the paper are reported in Table 4. For statistical consistency reasons, we also report the mean MSE and the standard deviation (SD) of the MSE over the 50 different runs of the experiment that reported the respective solutions.

It is interesting to note that, in every case, GEP proves to be reliable over the independent runs (the mean is close to the best value, and the SD is small).

We report the iteration number in which the solution was encountered. Also, the last column contains the mean number of iterations needed to obtain the solution over the 50 independent runs of each experiment.
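The summary statistics reported per experiment can be computed as in the following minimal sketch. The helper names and the run values are hypothetical, and the use of the sample standard deviation (`ddof=1`) is our assumption, since the text does not state which estimator was used.

```python
import numpy as np

def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean((o - p) ** 2))

def summarize_runs(run_mses):
    """Best, mean, and SD of the MSE over independent runs of one experiment."""
    m = np.asarray(run_mses, dtype=float)
    return {
        "best": float(m.min()),
        "mean": float(m.mean()),
        "sd": float(m.std(ddof=1)),  # sample SD: an assumption, see lead-in
        "best_run": int(m.argmin()),
    }

# Hypothetical MSE values from three runs (an experiment in the text has 50)
summary = summarize_runs([0.5, 0.3, 0.4])
print(summary)
```

A mean close to the best value together with a small SD is the pattern the text describes as reliability over the independent runs.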

Fig. 25 GEP_model for Tul_A (1961–1997)



Table 4 Mean square errors of GEP models over the time series

| Series | w | MSE | Mean MSE | SD of MSE | Iteration | Mean no. of iterations |
| --- | --- | --- | --- | --- | --- | --- |
| Sul_M | 12 | 3.18 | 5.21 | 1.79 | 465 | 400 |
| Tul_M | 12 | 5.43 | 6.24 | 0.24 | 441 | 400 |
| Sul_A (1961–2007) | 3 | 0.31 | 0.42 | 0.05 | 454 | 406 |
| Sul_A (1961–1997) | 2 | 0.24 | 0.33 | 0.03 | 492 | 393 |
| Sul_A (1998–2007) | 2 | 0.01 | 0.05 | 0.02 | 497 | 423 |
| Tul_A (1961–2007) | 4 | 0.40 | 0.49 | 0.02 | 484 | 404 |
| Tul_A (1961–1997) | 2 | 0.32 | 0.39 | 0.03 | 499 | 398 |
| Tul_A (1998–2007) | 2 | 0.0038 | 0.053 | 0.027 | 490 | 368 |


It is interesting to note that the solution is encountered in the second half of the experiment in the majority of cases. This indicates that the search process improves over generations in the evolutionary process, and that it reaches a good enough solution before the maximum number of generations is reached. Although it exceeds the purpose of our study, it would be interesting to experiment with different GEP settings: for example, to find the appropriate number of generations necessary for the algorithm to run until it reaches convergence. For now, we use GEP settings as recommended in the literature.

5 Conclusions

This study presents the GEP algorithm as a fair competitor of classical methods for the problem of modeling meteorological time series. For the classical models, the residuals were not always homoscedastic or normal, while the errors obtained by GEP models satisfied the conditions of normality, homoscedasticity, and the absence of correlation. Better results were obtained by GEP on time series of smaller size. A straightforward explanation is that weather data constantly change their characteristics, which agrees with our intuition that there exist points in meteorological time series at which the underlying process changes.

Our results come to support the idea that combining statistical tests for detecting change points with both heuristic methods, such as GEP, and classical Box–Jenkins methods leads to overall better models. The Box–Jenkins models are easier to grasp, while the GEP solutions are much more complex. The advantages of using the GEP metaheuristic are the creativity of the models and, most of all, the lack of constraints imposed on the solution. The functional form of the models obtained with GEP is very complex and hard to understand. Still, for modeling purposes and for the accuracy of the prediction, they prove to be useful.

The analytical expression for the time series shows that nonlinear combinations of the original variables achieve a good level of error reduction. Computation times are longer than for traditional methods and are not to be neglected when performing time series modeling with GEP, especially for larger data sets.

The evolutionary approach is an attractive alternative to classical methods. The algorithm performs a heuristic search of the best model composed of a set of functions and a set of explanatory variables, guided by an error-based fitness measure. It does not require in advance the specification of the functional form of the model, this way enabling the algorithm to discover alternative nonlinear models that best fit the data set.

In a further study, we shall focus on developing an adaptive method to set parameters for the GEP algorithm. It would be interesting to devise a hybrid method that would combine a heuristic method (e.g., a GA) with statistical procedures to identify the change points. Combined with the constraint-free modeling provided by GEP, the results may be better for long dynamic time series than those obtained by classical approaches.

Acknowledgements This paper was supported by grant ID_262 and grant PNCDI2 NatCOMP 11028/2007.

References

Aksoy H, Gedikli A, Erdem Unal N, Kehagias A (2008) Fast segmentation algorithms for long hydrometeorological time series. Hydrol Process 22(23):4600–4608

Baeck T, Fogel DB, Michalewicz Z (1997) Handbook of evolutionary computation. CRC, Boca Raton

Baltagi BH (2008) Econometrics. Springer, Berlin

Bărbulescu A, Băutu E (2009) ARIMA and GEP models for climate variation. Int J Math Comput (in press)

Bărbulescu A, Ciobanu C (2007) Mathematical characterization of the signals that determine the erosion by cavitation. International Journal Mathematical Manuscripts 1(1):42–48

Brockwell P, Davis R (2002) Introduction to time series. Springer, New York

Buishand TA (1984) Tests for detecting a shift in the mean of hydrological time series. J Hydrol 73:51–69

Charles SP, Bates BC, Smith IN, Hughes JP (2004) Statistical downscaling of observed and modeled atmospheric fields. Hydrol Process 18(8):1373–1394

De Gooijer JG, Hyndman RJ (2006) Twenty five years of time series forecasting. Int J Forecast 22(3):443–473

Ferreira C (2001) Gene expression programming: a new adaptive algorithm for solving problems. Complex Syst 13(2):87–129

Ferreira C (2006) Gene expression programming: mathematical modeling by an artificial intelligence. Springer, Berlin

Gourieroux C, Monfort A (1990) Series temporelles et modeles dynamiques. Economica, Paris

Hubert P, Carbonnel JP (1993) Segmentation des series annuelles de debits de grands fleuves africains. Bull Liaison Com Interafr Étud Hydraul 92:3–10

IPCC (2008) Climate Change and Water, IPCC Technical Paper VI, June 2008, Bates, B.C., Z.W. Kundzewicz, S. Wu and J.P. Palutikof, Eds., IPCC Secretariat, Geneva, 210 pp., Available at http://www.ipcc.ch/ipccreports/technical-papers.htm

Jones PD, Wigley TML, Wright PB (1986) Global temperature variations between 1861 and 1984. Nature 322:430–434

Jones PD (1988) Hemispheric surface air temperature variations: recent trends and an update to 1987. J Climate 1:654–660

Jones PD, Horton EB, Folland CK, Hulme M, Parker DE (1999) The use of indices to identify changes in climatic extremes. Clim Change 42:131–149

Khronostat 1.1 software. Available at http://www.hydrosciences.org

Lee AFS, Heghinian SM (1977) A shift of the mean level in a sequence of independent normal random variables: a Bayesian approach. Technometrics 19(4):503–506

Koza JR (1992) Genetic programming: on the programming of computers by means of natural selection. MIT, Cambridge

Maftei C, Gherghina C, Bărbulescu A (2007) A computer program for statistical analyzes of hydro-meteorological data. International Journal Mathematical Manuscripts 1(1):95–103



Maftei C, Bărbulescu A (2008) Statistical analysis of the climate evolution in Dobrudja region. Lecture Notes in Engineering and Computer Sciences II:1082–1087

Michalewicz Z, Schmidt M, Michalewicz M, Chiriac C (2006) Adaptive business intelligence. Springer, New York

Pettitt AN (1979) A non-parametric approach to the change-point problem. Appl Stat 28(2):126–135

Sheskin DJ (2007) Handbook of parametric and nonparametric statistical procedures. CRC, Boca Raton

Taylor W (2000) Change-Point Analyzer 2.0 shareware program. Taylor Enterprises, Libertyville. Available at http://www.variation.com/cpa

Wagner N, Michalewicz Z, Khouja M, McGregor RR (2007) Time series forecasting for dynamic environments: the DyFor genetic program model. IEEE Trans Evol Comput 11(4):433–452

