a nonparametric method for ascertaining change points in
TRANSCRIPT
Georgia Southern University
Digital Commons@Georgia Southern
Electronic Theses and Dissertations Graduate Studies, Jack N. Averitt College of
Spring 2010
A Nonparametric Method for Ascertaining Change Points in Regression Regimes Alfreda N. Rogers
Follow this and additional works at: https://digitalcommons.georgiasouthern.edu/etd
Recommended Citation Rogers, Alfreda N., "A Nonparametric Method for Ascertaining Change Points in Regression Regimes" (2010). Electronic Theses and Dissertations. 664. https://digitalcommons.georgiasouthern.edu/etd/664
This thesis (open access) is brought to you for free and open access by the Graduate Studies, Jack N. Averitt College of at Digital Commons@Georgia Southern. It has been accepted for inclusion in Electronic Theses and Dissertations by an authorized administrator of Digital Commons@Georgia Southern. For more information, please contact [email protected].
A NONPARAMETRIC METHOD FOR ASCERTAINING
CHANGE POINTS IN REGRESSION REGIMES
by
ALFREDA N. ROGERS
(Under the direction of Patricia Humphrey)
Abstract
Of interest is the specific model called the joinpoint two regime regression or broken
line model composed of one regression line and a horizontal ray. This is a very restricted
but highly useful subset of the well-researched change point problem. The usual approach
to a more general model was first presented by Quandt (1958) who found the maximum
likelihood estimates of the slope, intercept and joinpoint by assuming that the error terms
are generated under the usual assumptions, that is, from a normal distribution with
constant variance and are uncorrelated. We develop a method that does not rely on this
assumption, demonstrate its use on an example of proximity indexes of whale cow and
calf pairs, and compare the new method to the Quandt estimates in a simulation study
showing this new method performs adequately.
INDEX WORDS: Maximum likelihood, Change point, Moment match
A NONPARAMETRIC METHOD FOR ASCERTAINING
CHANGE POINTS IN REGRESSION REGIMES
by
ALFREDA N. ROGERS
B.S., Armstrong Atlantic State University, 2007
A Thesis Submitted to the Graduate Faculty of Georgia Southern University in Partial
Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
STATESBORO, GEORGIA
2010
iii
© 2010
Alfreda N. Rogers
All Rights Reserved
iv
A NONPARAMETRIC METHOD FOR ASCERTAINING
CHANGE POINTS IN REGRESSION REGIMES
By
ALFREDA N. ROGERS
Major Professor: Patricia Humphrey
Committee:
Martha Abell
Greg Knofczynski
Electronic Version Approved:
May 2010
v
vi
ACKNOWLEDGEMENTS
What started out as a simple question and blossomed into this thesis, the initial
length and quality was greatly underestimated. Without the help of many people along
the way, this thesis would not have been possible.
First and foremost, I would like to thank my thesis committee, Dr. Martha Abell,
Georgia Southern University; Dr. Patricia Humphrey, Georgia Southern University, and
Dr. Greg Knofczynski, Armstrong Atlantic State University. Special thanks are given to
Dr. Lorrie Hoffman, Armstrong Atlantic State University. Their efforts and guidance
helped me tremendously to endure and complete this thesis. If it were not for their
patience, questioning methods, and willingness to work with my mathematical flaws, I
would not have made it to this end. I would also like to thank Jaree Hudson, Heather
King, and Steve Clark for their previous work using the Hinde methods using actual data
from whales in Sea World. Their finished work helped open the door for the question
this thesis will answer.
vii
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ……………………………………………………...….…....v
LIST OF TABLES ……………………………………………………………................vii
CHAPTER
1 INTRODUCTION……………………………………………………………....1
2 PIECE-WISE REGRESSION ASSUMPTIONS…………………………..........8
3 METHODOLOGY…………………………………………………………….19
4 MAIN RESULTS……………………………………………………………...25
5 CONCLUSION……………………………………………......................…….31
APPENDICES
A SAS code ….…………..………………………………………..………..32
REFERENCES ………………………..………………………………………..……….41
viii
LIST OF TABLES AND GRAPHS
Table 1.1: Values of the Hinde Index for Various Levels of Whale
Activity…………………............................................................................5
Table 2.1: Actual Weekly (t) Whale Hinde Index (y) with Simulated x Values……........9
Table 2.2: Parameter Estimates, Standard Errors, and Calculated Log Likelihood Value
Using Quandt’s Method Under Assumption 1……….....................................10
Figure 2.1: Graph of the Fitted Model Using Quandt’s Method Under
Assumption 1..................................................................................................10
Table 2.3: Parameter Estimates, Standard Errors, and Calculated Log Likelihood Value
Using Quandt’s Method Under Assumption 2.................................................11
Figure 2.2: Graph of the Fitted Model Using Quandt’s Method Under
Assumption 2.................................................................................................12
Figure 2.3: Graph of the Fitted Model Using Quandt’s Method Under
Assumption 3..................................................................................................13
Figure 2.4: Graph of the Log Likelihood Function from Use of Quandt’s Method Under
Assumption 4..................................................................................................14
Figure 2.5: Using Whale Hinde Indices to Show Quandt Iterations at t0=3.....................15
Figure 2.6: Using Whale Hinde Indices to Show Quandt Iterations at t0=4.....................15
Figure 2.7: Using Whale Hinde Indices to Show Quandt Iterations at t0=5.....................16
Table 2.4: Values of the Log Likelihood Function from Use of Quandt’s Method Under
Assumption 4..................................................................................................17
Figure 2.8: Graph of Time versus Log Likelihood Function from Use of Quandt’s
Method Under Assumption 4.........................................................................18
ix
Table 4.1: All Possible Slopes Between All Points of Sea World Data Set.....................25
Table 4.2: SAS Values for Quandt and MMNPR Values.................................................26
Figure 4.1: Graph of Quandt Method versus MMNPR Method.......................................27
Table 4.3: SAS Generated Values for Parameters for Quandt and MMNPR Methods....28
1
CHAPTER 1
INTRODUCTION
Somewhere, in the ocean, a mother whale cow has just given birth to a baby whale
calf. In the initial stages, one can find the cow closely monitoring her calf. She does so
to not only nourish the calf, but also protect it from the many dangers that the calf is too
young and naïve to escape. As time passes and the calf gets older, the mother
understands that the calf needs to learn how to fend for itself. During a time period one
can see that the cow weans the calf slowly, but surely, so it can take its place in the
underwater ecosystem. But when is the best time for the mother to start weaning her
calf? Is there a point in time that one can expect a cow would completely wean her calf?
There is a natural appeal for a model that relies on a regression equation with a
discontinuous derivative function. Thus, the volume of researchers and the totality of
their papers number in the hundreds. An annotated bibliography by Khodadadi and
Asgharian (2008) includes many of the papers pertinent to our research interest. Ciuperca
(2009) has more recent citations in the introduction of his paper. The model we study is a
small but useful subset of the “regime(d)” regression, a phrase first used by Quandt
(1958) in economic literature, or more currently referred to as a broken-line regression
(Gill, 2004). Our interest is in the estimation of all of the straight-line model parameters
when exactly two regimes are assumed. The extensive literature discusses solutions
pertaining to two or more response functions (either linear or not) defined over their
companion independent sets (either univariate or multivariate) with connected or
disconnected transition or change point(s), a term first appearing in the literature in the
late 1960’s, in particular in the 1969 Dagenais paper. More recently for connected
2
transition points the term “joinpoints” is used (Ghosh, Basu and Tiwari, 2009).
Researchers like Koul (2000) study models with either no or few assumptions about error
terms. The error terms of the model can be assumed to be normal or non-normal and
uncorrelated or correlated. A large number of the papers confine themselves to finding
the number of, estimation of, and/or distribution of the change point(s) itself whereas we
want to characterize the estimates of all the parameters of our model, a model having two
straight-line regression equations with exactly one change point.
Many applications (and here we mention several current ones) lend themselves to
analysis under this type of model: quality control (Mahmoud et al., 2006), medical
research (Yu et al., 2007), climatology (DeGaetano, 2006) and many other areas.
Applications account for the largest percentage of the literature. The balance of the
papers propose and/or investigate estimation and testing methods (including maximum
likelihood, semi-parametric, non-parametric, bootstrap, Bayesian, wavelet) and their
goodness (convergence rates, bias, consistency). We present estimation methods we have
derived that are distribution-free and demonstrate our method with an example in the
field of biology on a set of paired mother-calf whale data proximity indices collected and
aggregated weekly. We first consider the parametric procedure described by Quandt
(1958) and updated by him (Quandt, 1960; Quandt, 1972; Goldfeld and Quandt, 1972)
since we found assurances by recently published author Gill (2004) of the stable behavior
of these estimates for the slope, intercept, and change point for our particular model of
interest. The Quandt method has no closed form solutions and is iterative so the process
of finding these statistics is computer intensive but due to improvements in processing
power it is still in use currently (DeGaetano, 2006). Although the literature offers some
3
competing approaches to the iterative Quandt method, we found that no authors consider
a direct moment-matching scheme coupled with slope estimation via a nonparametric
analysis. We develop such an approach and label it MMNPR. Our estimation procedure
computes values quickly and performs in a favorable manner when compared to the
Quandt method.
Through personal communication with his research colleague L. Hoffman, we are
aware that in an extension of his work, Clark (2000) noticed that during the early stages
there was a close proximity between the Killer Whale cow/calf dyads. However, as time
passed and the mother wanted the child to become more independent, there was a shift in
positions between the two. As the cow began to wean its calf, three notions were made:
1) the calf changed how it positioned itself to its mother’s dorsal fin, 2) the mother started
to move clockwise from the calf, and 3) the distance between the cow and calf started to
increase. Numerical and alphabetical representations for each criterion were given to
devise a method of measuring and analyzing the data. In this paper, we concentrate on
the last of these phenomena, the distances. Using the measurements, the Hinde Index
was used and we found linear relationships in order to determine when a cow completely
weans her calf. The time when this phenomenon would occur would be called the change
point.
The Hinde Index (Hinde, 1970) measures the “approaches” that are the narrowing
of the distance between the pair and “leaves” which are a widening between a mother and
child. The Hinde Index definition is as follows:
m m
m i m i
A L
A A L L
4
where = approaches of the mother, = leaves of the mother, = approaches of the
infant, and = leaves of the infant. In the case where the total number of leaves by both
the mother and infant is the same as the number of approaches by both mother and infant,
this equation can be simplified. Therefore, the Hinde Index becomes:
m m m
m i m i m i
m
m i
LA L A
A A L L A A A A
m m
m i
A L
A A
This expression is easily interpreted since it is positive when the mother’s approaches
exceed her leaves and negative when she leaves more than she approaches her calf.
In other cases consider the following. Since Am < Am+ Ai, then 0 1m
m i
A
A A. Using the
same notion, 0 1m
m i
L
L L. Therefore we know that 1 1m m
m i
A L
A A. Using this
index, one may wonder how to tell where the index should be the greatest and least.
Table 1.1 below gives a short synopsis of what the trend should look like.
5
Table 1.1: Values of the Hinde Index for Various Levels of Whale Activity
mA iA
im
m
AA
A mL iL
m
m i
L
L L m m
m i m i
A L
A A L L
High High 5. High High 5. 0
High High 5. High Low 1 5.
High High 5. Low High 0 5.
High High 5. Low Low 5. 0
High Low 1 High High 5. 5.
High Low 1 High Low 1 0
High Low 1 Low High 0 1 High Low 1 Low Low 5. 5.
Low High 0 High High 5. 5.
Low High 0 High Low 1 1 Low High 0 Low High 0 0
Low High 0 Low Low 5. 5.
Low Low 5. High High 5. 0
Low Low 5. High Low 1 5.
Low Low 5. Low High 0 5.
Low Low 5. Low Low 5. 0
Using the information in Table 1.1, the Hinde index is largest, or close to 1, when
the mother approaches the calf more often than it leaves the calf and the calf leaves more
often than it approaches the cow. This also means that although the calf may be trying to
become independent, the mother is still keeping up with her calf. On the other hand, the
index is smallest when the mother leaves more than she approaches her calf and the calf
is approaching the cow more often than it leaves. This lends itself to the notion that the
mother is trying to wean the calf to become independent, though the calf is not ready to
go on its own.
Using this information, whales, as well as various other animals, are tracked for
weeks at a time and a time versus Hinde index plot is formed. Using the plot, one can do
an eyeball estimate of the best fit line or lines for the data. However, this method does
not lend itself to a mathematical estimate of the true t0 (the change point, i.e., when one
6
can expect a whale to have weaned its calf). However, there is another method that
would better suit this notion.
We want to find a t0 for the regime. We need a method that lends itself to finding
true change points. The Quandt (1958) method takes into account that one needs to
separate the regime into two different groups. The groups must each have at least three
observations in order to do a regression analysis. The approach is iterative. Let the
number of observations in the first group correspond to the value of t0. For example, if
the first group has 4 observations, then t0 = 4. Hence, one would find the best fit line for
the first group ( 0tt ) and for the second group (t > t0) up through the total number of
time points T. Using the usual regression estimates for finding standard deviations, one
can find the standard deviation for both groups. The standard deviations are then
substituted into Quandt’s log likelihood function
0 1 0 2( ) log 2 log ( ) log2
TL t T t T t
where T = total number of observations and t0 = the number of data points in the first of
the two groups. The value of t0 yielding the largest value for L(t) would be considered
the true change point. In our case, t0 would represent the true time point where one can
expect the mother to have weaned her calf.
Quandt’s (1958) method is a good start; however, it has its flaws. Quandt
relies on the assumption that the data come from two separate normal distributions and
forms the likelihood function as a product of normal probability density functions. But
deriving a closed form function for all five parameters (slopes, intercepts and the change
point) presents an intractable problem. Instead, he suggests conditioning on each
possible change point and then evaluating L(t). The process of dividing the sample into
7
two groups depending on the size of t0 is time consuming. This method may be easy to
execute when it comes to small sample sizes, but what about larger ones? Quandt’s
method uses the MLE (Maximum Likelihood Estimation) method for estimating a true t0.
Unfortunately, MLE may yield biased estimates when it comes to small samples. It is a
known fact that as the sample size increases, the need to estimate parameters is
minimized and the biasedness of the MLE approaches zero. Since Quandt’s method is
best used for smaller samples, the estimated t0 would not necessarily be an accurate
representation of the true t0. Additionally, the method assumes the data come from an
underlying normal distribution. What other methods can we use that will minimize the
error of estimating the true change point? Can a method be developed that has less
distributional assumptions? Would a new method prove to be better than Quandt’s
method?
8
CHAPTER 2
PIECE-WISE REGRESSSION ASSUMPTIONS
The model introduced by Quandt in 1958 used the assumption that when (xt,yt) are
bivariate normal then
0 1
2 3
( )
( )
t t
t t
E y B B x
E y B B x
0
0
t t
t t
where = the original proximity propensity for the first regression, = the rate of
decrease to steady state for the first regression , = the original proximity of propensity
for the second regression, and = the rate of decrease to steady state for the second
regression. We will refer to these as assumption 1.
To better understand this assumption and others to come, real data taken from
whales in Sea World (t's and y's) shown in Table 2.1. In this case, t = time in weeks and
yt = index of proximity (Hinde Index). We use data acquired from mother-calf pairs of
Killer Whales born at one of two SeaWorld parks (California or Florida) between 1999
and 2002. The vagueness preserves the anonymity of the mammals (a directive from
SeaWorld administrators). The collection process was immense and covered a total of
66,606 hours with observers stationed both above and below the pools viewing spatial
relationship data every 15 minutes via the use of an instantaneous sampling technique
(Clark and Odell, 1999). In the initial stages after birth the mother cow closely monitors
her calf, to nourish and protect it. As the calf gets older, the mother encourages the calf
to begin to fend for itself. For illustrative use of the original Quandt method, the x's were
generated in Microsoft Excel to represent an x value at time t. We need to identify
variables for use with the Quandt method. Since we have three different variables but
9
wish to rely on simple linear regression, then we use different linear models for the
different groupings of the ( , ) defined by the value .
Table 2.1: Actual Weekly (t) Whale Hinde Index (y) with Simulated x Values
t x y
1 -1.00 1.00
2 0.25 0.50
3 0.45 0.20
4 1.00 0.11
5 0.60 0.15
6 -0.05 0.21
7 0.04 0.16
8 0.09 0.12
9 0.30 0.11
10 0.36 0.09
11 -1.00 0.04
12 1.00 0.29
13 -0.60 0.00
14 -0.05 0.11
15 1.00 0.05
16 0.60 0.00
17 0.36 0.02
18 0.12 0.15
19 0.50 0.10
20 0.10 0.09
Quandt’s Method used the maximum likelihood estimate approach to inspect all L(t) for
t = 1 to T. There is no assumption that xi < xi+1 for i = 1, 2,…T, but a grouping of the x’s
occurs once t0 is identified.
In assumption 1, we need to use generated x’s versus proximity (Hinde Index).
Table 2.2 yields the analysis given using Quandt’s method.
10
Table 2.2: Parameter Estimates, Standard Errors, and Calculated Log Likelihood Value
Using Quandt’s Method Under Assumption 1
t0 B0 B1 B2 B3 s1 s2 L(t)
3 0.517 -0.498 0.095 0.041 0.143 0.074 3.786
4 0.534 -0.463 0.096 0.046 0.110 0.076 3.758
5 0.516 -0.479 0.095 0.043 0.104 0.079 3.504
6 0.454 -0.443 0.085 0.051 0.173 0.074 2.431
7 0.411 -0.426 0.078 0.054 0.192 0.074 1.769
8 0.378 -0.417 0.075 0.056 0.200 0.076 1.028
9 0.364 -0.424 0.073 0.055 0.192 0.080 0.528
10 0.353 -0.432 0.074 0.056 0.184 0.085 0.072
11 0.265 -0.215 0.067 0.069 0.258 0.090 -2.098
12 0.277 -0.171 0.066 -0.002 0.254 0.060 -1.074
13 0.242 -0.117 0.107 -0.086 0.263 0.048 -1.182
14 0.232 -0.113 0.107 -0.086 0.255 0.053 -2.022
15 0.229 -0.123 0.134 -0.184 0.246 0.053 -2.437
16 0.221 -0.134 0.121 -0.116 0.240 0.060 -3.160
17 0.213 -0.138 0.124 -0.045 0.235 0.043 -3.187
According to the table, the estimated true change point would be when = 3. Using ,
, ,and , the graph appears as illustrated in Figure 2.1.
Figure 2.1: Graph of the Fitted Model Using Quandt’s Method Under Assumption 1
11
For assumption 1, it is not assumed that the two regressions must join, nor does the
second regression have to have a horizontal slope.
For a more simple model, we assume that t serves as the x variable, i.e. we ignore
the information provided by x, the hypothetical respiratory rate. We now look at time
versus proximity. The simplified model has the assumptions
tBByE
tBByE
t
t
32
10
)(
)(
0
0
tt
tt
and will be referred to as assumption 2. Using Table 2.1, the parameter estimates,
standard errors, and calculated log likelihood values are shown in Table 2.3.
Table 2.3: Parameter Estimates, Standard Errors, and Calculated Log Likelihood Value
Using Quandt’s Method Under Assumption 2
t0 B0 B1 B2 B3 s1 s2 L(t)
3 1.367 -0.400 0.171 -0.005 0.082 0.072 4.724
4 1.195 -0.297 0.186 -0.006 0.145 0.073 3.522
5 1.019 -0.209 0.188 -0.007 0.200 0.076 2.292
6 0.871 -0.145 0.156 -0.005 0.231 0.077 1.454
7 0.783 -0.113 0.135 -0.003 0.230 0.079 0.814
8 0.724 -0.093 0.127 -0.003 0.223 0.083 0.221
9 0.674 -0.078 0.120 -0.002 0.216 0.087 -0.346
10 0.635 -0.067 0.127 -0.003 0.210 0.092 -0.864
11 0.609 -0.061 0.210 -0.008 0.202 0.095 -1.146
12 0.541 -0.045 -0.112 0.011 0.220 0.053 0.094
13 0.534 -0.044 -0.041 0.007 0.210 0.056 -0.418
14 0.507 -0.038 -0.247 0.018 0.207 0.050 -0.564
15 0.492 -0.035 -0.396 0.026 0.200 0.053 -1.121
16 0.483 -0.034 -0.206 0.016 0.194 0.060 -1.712
17 0.470 -0.032 0.683 -0.030 0.189 0.016 -0.324
We find that the estimate of is 3. Using , , ,and , the corresponding graph is
illustrated in Figure 2.2.
12
Figure 2.2: Graph of the Fitted Model Using Quandt’s Method Under Assumption 2
Once again, we do not assume that the two regressions must join, nor does the second
regression have to have a horizontal slope.
As the calf grows older, we assume that the proximity between the mother and
calf decreases over time. Because of this, we expect that until a certain time, t0, there
would be a steady decrease in proximity. After we reach t0, there would be a steady state
of proximity, i.e. a horizontal line, between cow and calf. This lends itself to the next
model, which will be referred to as assumption 3, (an even more simplified one),
0 1
2
( )
( )
t
t
E y B Bt
E y B
0
0
t t
t t
In this case, we only look at t0 = 3, B0, B1, and B2. Our graph for assumption 3 is
illustrated using Figure 2.3.
13
Figure 2.3: Graph of the Fitted Model Using Quandt’s Method Under Assumption 3
Notice in the graph, the two lines are disjoint. However, we assume that the second
regression must have a horizontal slope, which represents the steady state. For an
intersection to occur, we would need to assume that for each t > t0, the expected value
would have to be the same for each t. Hence, our most simplified model is illustrated in
Figure 2.4.
14
Figure 2.4: Graph of the Fitted Model Using Quandt’s Method Under Assumption 4
The model for the graph, referred to as assumption 4, would be
0 1
0 1 0
( )
( )
t
t
E y B Bt
E y B Bt
0
0
t t
t t
With this data set, we find that all assumptions using Quandt’s method yields that the
estimated change point would be at t0 = 3.
Though we have estimated that in this case t0 = 3, graphically, how would it
compare to other t0’s? Let us use Figure 2.5, Figure 2.6, and Figure 2.7 to visualize what
would happen.
15
Y1 = -.400t + 1.367 SSE1 = 0.007
Y2 = -.167 SSE2 = 0.152 L(t) = 20.857
Figure 2.5: Using Whale Hinde Indices to Show Quandt Iterations at t0=3
Y1 = -.297t + 1.195 SSE1 = 0.042
Y2 = .007 SSE2 = 0.245 L(t) = 14.158
Figure 2.6: Using Whale Hinde Indices to Show Quandt Iterations at t0=4
16
Y1 = -.209t + 1.019 SSE1 = 0.119
Y2 = -.026 SSE2 = 0.336 L(t) = 9.451
Figure 2.7: Using Whale Hinde Indices to Show Quandt Iterations at t0=5
Notice that more of the data points are closest to the two regressions when t0 = 3. When
using the Quandt method, one must use the L(t) that is largest. By comparing the L(t) to
the sum of squared errors, one finds for this case that the largest L(t) out of the three
possible change points has the smallest error.
Quandt’s method assumes a normal distribution for a data set with constant
variance and uncorrelated error terms for a data set, but real data may not conform to
these assumptions. Since most of the points from the data are actually captured by the
two lines, this means that the Quandt method does a good job of finding the estimated
change point for this data set that is assumed well behaved. But what if there was a set in
which most of the points do not follow the regressions because the data set is non-
normal? Or the data has non-constant variance? Or the data is correlated over time?
17
Other data, if not from a normal distribution, could not be modeled as a product of two
normal probability distribution functions.
Also, Quandt uses the largest L(t) to determine what would be the estimated
change point. Unfortunately, there can be more than one local maxima using the L(t).
For example, the Table 2.4 and Figure 2.8 both illustrate graphically the points (t, L(t))
exhibit multiple local maxima.
Table 2.4: Parameter Estimates, Standard Errors, and Calculated Log Likelihood Values
Using Quandt’s Method Under Assumption 4
t0 B0 B1 sse1 sse2 L(t)
3 1.367 -0.400 0.007 0.152 20.857
4 1.195 -0.297 0.042 0.245 14.158
5 1.019 -0.209 0.119 0.336 9.451
6 0.871 -0.145 0.214 0.207 11.137
7 0.783 -0.113 0.264 0.187 10.657
8 0.724 -0.093 0.297 0.204 9.239
9 0.674 -0.078 0.328 0.210 8.314
10 0.635 -0.067 0.353 0.220 7.422
11 0.609 -0.061 0.367 0.267 6.155
12 0.541 -0.045 0.485 0.055 10.820
13 0.534 -0.044 0.487 0.096 7.990
14 0.507 -0.038 0.512 0.071 8.093
15 0.492 -0.035 0.523 0.075 7.282
16 0.483 -0.034 0.527 0.095 6.422
17 0.470 -0.032 0.536 0.099 6.117
18
Figure 2.8: Graph of Time versus Log Likelihood Function from Use of Quandt’s
Method Under Assumption 4
As one can see, there are exactly four local maxima at t0 = 3, t0 = 6, t0 =12, and t0 = 14.
Therefore, anyone of these maximums could be mistaken for the true change point if all
iterations are not completed. Therefore, we seek to determine a better method for
estimating the true t0 value independent of the data distribution.
19
CHAPTER 3
METHODOLOGY
One estimation method that does not rely on distributional assumptions utilizes
moment matching in order to find the joint estimate for the slope and t0. This is
accomplished by using the sample data to find the expected value for an unknown
parameter. Prior to doing so, an intuitive study of the behavior of the change point in this
“broken-line” model led to the insight that points early in time and furthest from the
change point were more likely to lie on the regression line rather than on the horizontal
ray; thus we chose to make a distributional assumption about the range of the initial
regression line (first “regime”) in that the probability that the expectation is B0 + B1t,
t < t0 , is proportional to the distance from the unknown actual value t0 .
Since we are dealing with three unknowns (B0, B1, and t0), it would be nice to
eliminate one of the unknowns from our model. First consider the E(yt) as modeled in
assumption 4.
1 0 1
0 1
3 0 1
100
0 10
0 1
2
.
.
.
1 0
0
.
.
.
0
( )
( ) 2
( ) 3
( ) ( 1)
( )
( )
t
t
T
E y B B
E y B B
E y B B
E y B t B
E y B t B
E y B t B
20
To eliminate B0, let wt= yt+1-y1 for t=1,…,T-1. This yields
0
0
0
2 1 0 1 0 1 1
3 1 0 1 0 1 1
1 0 0 1 0 1 0 1
1 1 0 0 1 0 1 0 1
2 1 0 0 1 0 1 0 1
1 0 0 1 0 1
( ) ( 2 ) ( )
( ) ( 3 ) ( ) 2
.
.
.
( ) ( ) ( ) ( 1)
( ) ( ) ( ) ( 1)
( ) ( ) ( ) ( 1)
.
.
.
( ) ( ) (
t
t
t
T
E y y B B B B B
E y y B B B B B
E y y B t B B B t B
E y y B t B B B t B
E y y B t B B B t B
E y y B t B B B 0 1) ( 1)t B
We then can conclude that
1 0
1 0 0
( ) 1,..., 1
( ) ( 1) ,..., 1
t
t
E w B t t t
E w B t t t T
Most data is assumed to come from a normal distribution; however, we are not
making the assumption that the data we use is normal. Hence, we will conclude that wt
has a mean of E(wt). We can see that this model has two parameters to estimate: B1 and
t0. It seems reasonable to assume that wt follows a distribution that is related to the
distance from the actual change point.
We include weights that are proportional to t0 - t. We will use the assumption that
when t > t0 , the probability is zero. To determine what probability is associated with
each time point before t0 we can discern k by knowing the sum must be 1, i.e.
0 1
0
1 0
1t
t
t tk
t
21
Therefore,
0 01 1
0
1 10 0
1t t
t t
t t tk k
t t
0 00
0
00
0
( 1)( )1( 1)
2
1( 1)
2
1
2
t tk t
t
tk t
tk
Solving the equation 0 11
2
tk , we find that k =
0
2
1t and, for example the
probability associated with w1 is 0
2
1t
0
0 0
1 2t
t t. Hence the expected wt, weighted by
these probabilities 0
0 0
2i.e.
1
t t
t tat each time t is
1 0
0 0 01 1 1
0 0 0 0 0 0 0 0
2( 1) 2( 2) 2( 3) 2(1)2 3 ... ( 1)
( 1) ( 1) ( 1) ( 1)
t t tB B B B t
t t t t t t t t
10 0 0 0
0 0
2( 1) 2( 2) 3( 3) ... ( 1)(1)
( 1)
Bt t t t
t t
0
1
0
0 0
1
1
3 2
0 0 0 0 010
0 0
1 0
2( )
( 1)
( 1)( ) ( 1) ( 1) 12
( 1) 2 3 2 6
( 1)
3
t
i
Bi t i
t t
t t t t tBt
t t
B t
22
Combining this with the remaining (T - t0)’s B1(t0 - 1) yields
1 00 0 1 01
1
( 1)( 1) ( ) ( 1)
3( )
1
T
t
i
B tt T t B t
E wT
221 0 1
1 0 1 0 1 0 1
2 2
1 0 0 0 0
2
1 0 0
3
1
1 3 3 3 3
3( 1)
2 (3 3) (1 3 )
3( 1)
B t BTB t B t B t B T
T
B t Tt t t T
T
B t T t T
T
Setting
2
1 0 02 (3 3) (1 3 )
3( 1)
B t T t Tw
T, we then have an expression for t0 and B1.
Hence
2
0 0
1
(3( 1))2 (3 3) (1 3 )
w Tt T t T
B
or
2
0 0
1
(3( 1))0 2 (3 3) (1 3 )
w Tt T t T
B
Combining the expectations from both the regression line and the horizontal line
in a properly weighted manner, results in this formula for the moment match. By re-
writing the equation we do produce an estimate of B1 as a function of t0. This is helpful as
shown in the next section because it provides a restriction on the candidates for the slope
estimate. The slope estimate is derived via usual nonparametric regression where we
compute all pair wise slopes and rely on the median as the best estimator, but can be
23
restricted to only a small subset of feasible values. In conducting a nonparametric
regression procedure the number of potential estimates number ( 1)
2
T Tsince all possible
pairs of points are examined. Using ,w we can also solve for B1
11 0
0 0 1 0
1
( 1)( 1) ( ) ( 1)
3
1 1
T
t
t
tt T t tW
T T
1
11 2
0 0
3
2 3( 1) (3 1)
T
t
t
W
t T t T
and we are able to create a lower and upper bound for the estimate of B1 so that the sort
needed to reveal the median of the sample slopes is conducted only on a small subset of
the slopes generated from all possible point pairings.
To conduct a non-parametric regression, the slopes of all possible pairings of
points slopes are found and the following bounds are used. For T > 14, the interval
needed to compare all slopes is
1 2
3( 1) 24( 1)
6 1 (3 1)
T TB
T T
The median of the slopes that are within the interval is used as the estimate for the
parameter B1. Using the equation above, we are now able to calculate t0 after finding a
good estimate for B1 and calculating .w
24
CHAPTER 4
MAIN RESULTS
We will illustrate the MMNPR using the Sea World data set. Referring back to Table
2.1, we needed to find all possible pairs of slopes and find the interval for the most
influential slopes. Since there are 20 points in the data set, then 20C2 =20!
(20 2)!2!= 190.
Below in Table 4.1, the chart of the 190 slopes that were found using Excel. SAS was
used to find the feasible interval based on bounds developed from the equations in
Chapter 3. In this case, there are 10 slopes (which are highlighted) within the interval
–.45 < B1 < –0.11376.
25
26
The median of the 10 slopes is -0.20375. We use this value as the slope B1 and use
1
1
T
t
t
W = -16.5 to calculate the corresponding t0 from the equation
1
11 2
0 0
3
2 3( 1) (3 1)
T
t
t
W
t T t T
2
0 0
3( 16.5).20375
2 3(21) (61)t t
2
0 0242.945 2 63 61t t
2
0 00 2 63 303.945t t
Using the quadratic formula and using the smaller root inside [3, 17], we get t0 = 5.947.
For reporting, we have elected to truncate this to the integer value of 5. Therefore we
have estimated t0 and B1 and with moment matching, we would use the average of the y’s
to find B0.
Along with calculating it by hand, a SAS program was used to find all of the values.
Table 4.2 shows what values were calculated for both Quandt’s method and the MMNPR
method.
Table 4.2: SAS Values for Quandt and MMNPR Values
B0 B1 t0q MMNPB0 MMNPB1 MMNPt0
1.36667 -0.400 3.000 1.237 -0.204 5.000
0
0 1 0 0 1 0
1
( ) ( )( )t
t
t T t t
yT
27
Figure 4.1 below is a visual comparison of the results using the Quandt and MMNPR
methods.
Figure 4.1: Graph of Quandt Method versus MMNPR Method
In this case, Quandt looks to have done a better job of estimating parameters, resulting in
regressions having more points closer to the lines. However, we must remember that this
sample was a small sample size and this only one case. If we investigate different sample
sizes and many more replications, we may observe different results. Using a SAS
program, the following Table 4.3 was the output comparing Quandt’s method to the
MMNPR method.
Table 4.3: SAS Generated Values for Parameters for Quandt and MMNPR Methods
28
Simulation runs (k=250) comparing Quandt MLE to the MNPR estimates
CASE 1 2 3 4 5 6 7 8 N 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 t0 5.00 5.00 5.00 5.00 10.00 10.00 10.00 10.00
0 0.90 0.90 0.50 0.50 0.90 0.90 0.50 0.50
1 -0.20 -0.20 -0.20 -0.20 -0.10 -0.10 -0.10 -0.10
0.05 0.03 0.05 0.03 0.03 0.02 0.03 0.02
MLE-t0 4.992
(.237) 4.996
(.063) 5.020
(.228) 5.000
(.000) 9.984
(.390) 9.960
(.196) 9.996
(.375) 9.980
(.166)
MLE- 0 0.900
(.051) 0.901
(.032) 0.495
(.054) 0.500
(.031) 0.901
(.022) 0.901
(.013) 0.501
(.021) 0.501
(.014)
MLE- 1 -0.201
(.050) -0.200
(.010) -0.199
(.017) -0.200
(.009) -0.100
(.004) -0.100
(.002) -0.100
(.004) -0.100
(.002) MMNPR- t0
6.224
(.645) 5.956
(.450) 6.280
(.596) 5.940
(.421) 10.556
(.664) 10.896
(.557) 10.620
(.661) 10.932
(.552) MMNPR-
0 0.891
(.059) 0.896
(.038) 0.489
(.065) 0.494
(.040) 0.983
(.041) 0.986
(.026) 0.589
(.040) 0.588
(.029) MMNPR-
1 -0.156
(.017) -0.162
(.013) -0.154
(.017) -0.162
(.013) -0.104
(.004) -0.102
(.002) -0.104
(.004) -0.102
(.003)
CASE 9 10 11 12 13 14 15 16 N 30.00 30.00 30.00 30.00 30.00 30.00 30.00 30.00 t0 10.00 10.00 10.00 10.00 15.00 15.00 15.00 15.00
0 0.90 0.90 0.50 0.50 0.90 0.90 0.50 0.50
1 -0.10 -0.10 -0.10 -0.10 -0.07 -0.07 -0.07 -0.07
0.03 0.02 0.03 0.02 0.02 0.01 0.02 0.01
MLE-t0 10.004
(.304) 9.992
(.089) 9.996
(.304) 9.996
(.110) 14.980
(.290) 15.000
(.000) 14.992
(.297) 15.000
(.000)
MLE- 0 0.899
(.021) 0.901
(.014) 0.502
(.022) 0.500
(.013) 0.900
(.011) 0.900
(.006) 0.501
(.011) 0.500
(.006)
MLE- 1 -0.100
(.004) -0.100
(.002) -0.100
(.003) -0.100
(.002) -0.070
(.001) -0.070
(.001) -0.070
(.001) -0.070
(.001) MMNPR- t0
10.932
(.646) 10.764
(.503) 10.916
(.571) 10.784
(.492) 16.900
(.729) 17.660
(.546) 16.796
(.762) 17.644
(.578) MMNPR-
0 0.948
(.036) 0.946
(.025) 0.550
(.036) 0.550
(.022) 1.003
(.027) 1.009
(.016) 0.602
(.029) 0.609
(.017) MMNPR-
1 -0.095
(.004) -0.096
(.002) -0.095
(.004) -0.096
(.002) -0.072
(.002) -0.071
(.001) -0.072
(.002) -0.071
(.001)
These simulations are based on 250 runs each. The values in the table are the
means with their standard deviations in parenthesis. Cases 1-8 are based on sample sizes
of 20 and cases 9-16 are based on sample sizes of 30. With the usual normal error
29
regression assumptions, when comparing the values for Quandt’s method (MLE) to the
MMNPR method, one finds that both are accurate at estimating the true change point
slope and intercept. The MMNPR’s estimate of t0 tends to be larger than the true value of
t0. However, the MMNPR is within 2.5 standard deviations from the true value of t0.
Another advantage of using the MMNPR method is its speed. When running a computer
program to find the needed values, Quandt’s method takes 2 CPU seconds per
replication; the MMNPR uses 1/10 CPU seconds.
30
CHAPTER 5
CONCLUSION
We have developed a good quick method to find the parameter estimates for the
intercept, slope, and change point in the broken line regression model. We illustrated its
use by modeling data from mother and calf whale proximity data, comparing our
estimates to estimators generated by Quandt (1958). We were able to detect a change in
the pairs behavior around week 5 (with our MMNPR method) and week 3 with Quandt’s
method.
This discovery would indicate that need for early bonding due to nursing declines
quite rapidly in Killer Whales. Additionally, we wanted to gain insight into whether our
quick MMNPR method was computationally adequate and so we compared our method
to the Quandt MLE method. We expected the Quandt method to be superior since our
simulations were set up to mimic the assumptions he used to derive his estimators. Yet,
the MMNPR performed adequately. Our simulations reveal a repeated small over-
estimation of the location of the change point that needs investigation. Further research
will involve simulations and include cases where the error terms for the broken line
regression are not well behaved. Additionally, we will devise testing procedures and look
at their performance under null and alternative hypotheses as suggested in Gregoire and
Hamrouni (2002).
31
APPENDIX A
SAS code
/* When using this program please reference via the following citation:
Hoffman, L.L., Knofczynski, G., Clark, S., Rogers, A., Hudson, J., King, H., and Reiss,
E. 2009. A Nonparametric Method for the
Estimation of All Parameters in the Joinpoint Two Regime Regression Model. In JSM
Proceedings, Alexandria, VA: American Statistical
Association.
*/
data onehinde;
input time index;
cards;
1 1
2 .5
3 .2
4 .11
5 .15
6 .21
7 .16
8 .12
9 .11
10 .09
11 .04
32
12 .29
13 0
14 .11
15 .05
16 0
17 .02
18 .15
19 .1
20 .09
;
%let n=20;
data allruns;
/* creating the macro to fit first and second line */
%macro regreg;
/* grabs the first &j set of points */
data first;
set onehinde;
if _n_ gt &j then delete;
/* grabs the second &j set of points */
data second;
set onehinde;
if _n_ le &j then delete;
/* quandt approach */
33
/* regression on first &j points to find SSE and b0 and b1 */
proc glm data=first;
model index=time; output out=quandt1 p=yhat r=yresid;
data qvals1;
set quandt1;
retain b10 b0 b1 sse1 sse2;
if _n_=1 then do; b10=yhat; sse1=0;end;
if _n_=2 then do; b1=yhat-b10; b0=b10-b1; end;
sse2=sse1+yresid*yresid;
sse1=sse2;
sseI=sse2;
if _n_ ne &j then delete;
data clnqv1;
set qvals1;
keep b0 b1 sseI;
proc print;
/* finding SSE of the last points on horizontal line */
data qvals2;
merge second clnqv1;
retain bhorz sse1 sse2;
if _n_=1 then do; sse1=0; bhorz=b0+b1*(&j); end;
sse2=sse1+(index-bhorz)* (index-bhorz);
sse1=sse2;
34
sseII=sse2;
if _n_ ne &jend then delete;
data clnqv2;
set qvals2;
keep bhorz sseII;
proc print;
/* creating a dataset with all possible &j divisions */
data meshqval;
merge clnqv1 clnqv2;
data allruns;
set allruns meshqval;
data results;
set allruns;
if _n_ = 1 then delete;
ssetot=sseI+sseII;
/*find likelihood value */
Lt=-(&n)*log(sqrt(2*3.1415962))-(_n_+1)*log(sqrt(sseI/(_n_+1)))
-(&n-(_n_+1))*log(sqrt(sseII/(&n-(_n_+1))))-(&n/2);
t0q=_n_+1;
proc print;
/* keeping only the answers from largest Lt */
proc sort data=results; by descending Lt ;
data quntparm; set results; if _n_ > 1 then delete;
35
/* keeping the estimates */
data qntcln;
set quntparm;
keep b0 b1 t0q;
%mend regreg;
/* this is the loop over &j from 3 to n-3 */
%macro loop;
%let nend = &n - 3;
%do j=3 %to &nend;
%let jend=&n - &j;
%regreg;
%end;
%mend loop;
%loop;
/* MMNPR approach */
/* finding sum of w's */
data prestats;
set onehinde;
array yyval(50);
retain sum y1 yyval;
if _n_=1 then y1=index;
%let m=_n_;
if _n_=1 then sum=0;
36
sum=sum+(index-y1);
yyval(&m)=index;
output;
/* nonparametric estimation of b1 slope by creating all
possible slopes and only saving those within feasible bounds */
data stats;
set prestats;
array yyval(50);
array sloper(2500);
retain sloper loopcount;
if _n_ eq &n then loopcount=0;
if _n_ eq &n then do;
do iclr=1 to 2500;
sloper(iclr)=99999;
end;
end;
if _n_<&n then delete;
/* calculating bounds */
maxb=8*(3*sum)/((3*&n-1)*(3*&n-1));
minb=(3*sum)/(6*&n-10);
midb=(3*sum)/((&n-4)*(&n+7));
put 'key values sum= ' sum ' minb = ' minb
' midb = ' midb ' maxb = ' maxb;
37
/* calculating slopes */
kkk=1;endn=&n-1;
do k=1 to endn;
kkbeg=k+1;
do kk=kkbeg to &n;
sloper(kkk)=(yyval(kk)-yyval(k))/(kk-k);
loopcount=loopcount+1;
if sloper(kkk)>minb & sloper(kkk)<maxb then
if sloper(kkk)>minb & sloper(kkk)<maxb then kkk=kkk+1;
end;
end;
sloper(kkk)=99999;
output;
/* finding nonmissing slopes */
data sortslop;
set stats;
array sloper(2500);
keep slopcand;
do k=1 to 2500;
if sloper(k) ne 99999 then slopcand=sloper(k);
if sloper(k) ne 99999 then output;
end;
/* finding median etc. of slopes */
38
proc print;
proc means mean median min max q1 q3 n var data=sortslop;
output out=slopeout median=med;
/* singling out sum of w's from previous data */
data feedft0;
set prestats;
if _n_ ne &n then delete;
/* making a dataset with sum of w's and median slope in it
to calculate t0 */
data findt0;
merge slopeout feedft0;
retain med sum;
ourt0 = (.25*(3.*med*&n+3.*med+sqrt(9.*med*med*&n*&n
-6.*med*med*&n+med*med-24.*med*sum)))/med;
ourb1=med;
output;
data findb0;
set findt0;
ourb0=((sum+&n*y1)-(.5*med*ourt0*(ourt0+1))-(&n*med*ourt0)
+(med*ourt0*ourt0))/&n;
output;
/* keeping the estimates */
data ourscln;
39
set findb0;
keep ourb0 ourb1 ourt0;
data twoests;
merge qntcln ourscln;
data twoest2;
merge qvals2 twoests ;
keep b0 b1 t0q ourb1 ourb0 ourt0 ourt0r;
ourt0r=floor(ourt0); output;
proc print;
run;
40
REFERENCES
Clark, S.T., Odell, D.K., and Lacinak, C.T. (2000), Aspects of Gorwth in Captive
Killer Whales (Orcinus Orca), Marine Mammal Science, 16(1), 110-123.
Clark, S. and Odell, D. (1999), “Nursing parameters in captive killer whales (Orcinus
orca),” Zoo Biology, 18, 373-384.
Chen, X. R. (1987), “Testing and Interval Estimation in a Change-point Model Allowing
at most One Change,” Technical Report 87-25, Center for Multivariate Analysis,
University of Pittsburgh.
Ciuperca, G. (2009), “S-Estimator in Change-Point Random Model with Long Memory,”
pre-print from http://arxiv.org/abs/0906.1710v1, last accessed on September 23,
2009.
Dagenais, M.G. (1969), “A threshold regression model,” Econometrica, 37, 193-203.
DeGaetano, A. T. (2006), “Attributes of Several Methods for Detecting Discontinuities in
Mean Temperature Series,” Journal of Climate, 19, 838-853.
Ghosh, P., Basu, S., and Tiwari, R.C. (2009), “Bayesian Analysis of Cancer Rates From
SEER Program Using Parametric and Semiparametric Joinpoint Regression
Models,” Journal of the American Statistical Association, 104, 439-452.
Gijbels, I., and Goderniaux, A. (2004), “Bandwidth Selection for Changepoint Estimation
in Nonparametric Regression,” Technometrics, 46, 76-86.
Gill, R. (2004), “Maximum Likelihood Estimation in Generalized Broken-Line
Regression,” The Canadian Journal of Statistics, 32, 227-238.
Goldfeld, S.M. and Quandt, R.E. (1972), Nonlinear Methods in Econometrics,
Amsterdam: North-Holland Pub. Co.
41
Gregoire, G., and Hamrouni, Z. (2002), “Two Non-Parametric Tests for Change-Point
Problems,” Nonparametric Statistics, 14, 87-112.
Hinde, R. A. and Atkinson, S., (1970), “Assessing the Roles of Social Partners in
Maintaining Mutual Proximity, as Exemplified by Mother-Infant Relations in
Rhesus Monkeys,” Animal Behaviour, 18, 169-176.
Jandhyala, V.K., and Fotopoulos, S.B. (1999), “Capturing the Distributional Behavior of
the Maximum Likelihood Estimator of a Changepoint,” Biometrika, 86, 129-140.
Khodadadi, A., and Asgharian, M. (November, 2008), “Change-point problem and
Regression: An Annotated Bibliography,” COBRA Preprint Series, Article 44,
http://biostats.bepress.com/cobra/ps/art44, last accessed on September 13, 2009.
Koul, Hira L. (2000), “Fitting a two phase linear regression model,” Journal of the Indian
Statistical Association, 38, 331-353.
Krishnaiah, P.R., and Maio, B. Q. (1988), “Review About Estimation of Change Points,”
in Handbook of Statistics, Volume 7 (Quality Control and Reliability), Eds.
Krishnaiah, P.R., and Rao, C.R., Amsterdam: Elsevier Science Pub.
Mahmoud, M.A., Parker, P.A., Woodall, W.H., and Hawkins, D.M. (2006), “A Change
Point Method for Linear Profile Data,” Journal of Quality and Reliability
Engineering International, 23, 247-268.
Quandt, R. E. (1958), "The Estimation of the Parameters of a Linear-Regression System
Obeying Two Separate Regimes," Journal of the American Statistical
Association, 53, 873-880.
42
Quandt, R. E. (1960), "Tests of the Hypothesis That a Linear-Regression System Obeys
Two Separate Regimes," Journal of the American Statistical Association, 55, 324-
330.
Quandt, R. E. (1972), "New Approach to Estimating Switching Regressions," Journal of
the American Statistical Association, 67, 306-317.
Yu, B., Barrett, M.J., Kim, H-J, Feuer, E.J. (2007), “Estimating joinpoints in continuous
time scale for multiple change-point models,” Computational Statistics & Data
Analysis, 51, 2420-2427.