Polygon 2016
DESCRIPTION

Polygon is a tribute to the scholarship and dedication of the faculty at Miami Dade College in interdisciplinary areas.


Editorial Note:

Polygon is MDC Hialeah's academic journal. It is a multi-disciplinary online publication whose purpose is to display the academic work produced by faculty and staff. We, the editorial committee of Polygon, are pleased to publish the 2016 Spring issue, the tenth consecutive issue of Polygon. It includes five regular research papers, presenting work from a diverse array of fields written by faculty from across the college. The editorial committee of Polygon is thankful to the Miami Dade College President, Dr. Eduardo J. Padrón; the Miami Dade College District Board of Trustees; the Hialeah Campus Academic Dean, Professor Joaquin G. Martinez; the Chairperson of Hialeah Campus Liberal Arts and Sciences, Dr. Caridad Castro; the Chairperson of Hialeah Campus English, Communications and World Languages, Dr. Victor McGlone; the Director of Hialeah Campus Administrative Services, Ms. Andrea M. Forero; and all staff and faculty of Hialeah Campus and Miami Dade College in general, for their continued support and cooperation in the publication of Polygon.

Sincerely,

The Editorial Committee of Polygon: Dr. M. Shakil (Mathematics), Dr. Jaime Bestard (Mathematics), and Professor Victor Calderin (English)

Patrons:

Professor Joaquin Martinez, Dean of Academic and Student Affairs
Dr. Caridad Castro, Chair of Liberal Arts and Sciences
Dr. Jon Mcglone, Chair of World Language

Miami Dade College District Board of Trustees:

Helen Aguirre Ferré, Chair

Armando J. Bucelo Jr. Benjamin León III

Marili Cancio Jose K. Fuentes

Armando J. Olivera Bernie Navarro

Eduardo J. Padrón, College President

Mission of Miami Dade College

The mission of the College is to provide accessible, affordable, high-quality education that keeps the learner's needs at the center of the decision-making process.


CONTENTS

ARTICLES AND AUTHOR(S)

Dynamic Stability Analysis of Tumor-Host Interactions
Dr. Keysner Boet

A comparison between TRON and Levenberg-Marquardt methods and their relationship to Tikhonov's Regularization Method in Nonlinear Parameter Estimation
Dr. Justina L. Castellanos and Dr. Angel Pérez

SURVEY OF STUDENTS' FAMILIARITY WITH DEVELOPMENTAL MATHEMATICS - A STATISTICAL ANALYSIS
Dr. M. Shakil

Item Analysis Statistics and Their Uses: An Overview
Dr. M. Shakil

Testing the Goodness of Fit of Continuous Probability Distributions to Some Flood Data
Dr. M. Shakil

Comments about Polygon: http://www.mdc.edu/hialeah/Polygon2013/docs2013b/Comments_About_Polygon.pdf


Previous Editions

Polygon, 2008: http://issuu.com/polygon5/docs/polygon2008
Polygon, 2009: http://issuu.com/polygon5/docs/polygon2009
Polygon, 2010: http://issuu.com/polygon5/docs/polygon_2010
Polygon, 2011: http://issuu.com/polygon5/docs/polygon_2011
Polygon, 2012: http://issuu.com/polygon5/docs/polygon_2012
Polygon, 2013: http://issuu.com/polygon5/docs/polygon2013
Polygon, 2014: http://issuu.com/polygon5/docs/polygon_2014

Disclaimer: The views and perspectives presented in the articles published in Polygon do not represent those of Miami Dade College.


A comparison between TRON and Levenberg-Marquardt methods and their relationship to Tikhonov's Regularization Method in Nonlinear Parameter Estimation

Justina L. Castellanos∗  Angel Pérez†

Abstract

Parameter estimation problems are usually solved by minimizing a nonlinear or linear least squares function (NLS or LLS). For the nonlinear case, the Levenberg-Marquardt method (L-M) has long been the method of choice. The connection of this method to Tikhonov's Regularization method is described. We also show the possibility of using a Trust Region Newton's Method (TRON), which deals with bound constraints for NLS problems, and that it performs like the L-M method.

Key words: Inverse Problem, Newton method, Trust Region strategy, Tikhonov method

1 Introduction

The parameter estimation problem for nonlinear models is of great interest not only for mathematicians but for many specialists in other applied areas such as engineering, biology, and so forth. It is usually posed as the solution of the following Nonlinear Least Squares (NLS) problem

$$\text{Minimize } F(x) \;=\; \sum_{i=1}^{m}\left(\varphi(x;t_i) - y_i^{\mathrm{obs}}\right)^2 \;=\; \frac{1}{2}\,\|f(x)\|_2^2 \qquad \text{s.t. } l \le x \le u,\quad l, x, u \in \mathbb{R}^n \tag{1}$$

$$f(x) = \left(f_1(x), \ldots, f_m(x)\right)^t, \qquad f_i(x) = \varphi(x;t_i) - y_i^{\mathrm{obs}},\quad i = 1, \ldots, m,$$

where $\varphi(x;t)$ represents the desired model function, with $t$ an independent variable where the data $\{y_i^{\mathrm{obs}}\}$ are measured, which may be subject to experimental error. The independent variables $\{x_j\}$, $j = 1, \ldots, n$, can be interpreted as parameters of the problem that are to be manipulated in order to adjust the model

∗Miami Dade College, e-mail: [email protected]
†Woolton Inc., e-mail: [email protected]


to the data. If the model is to have any validity, we can expect that $\|f(x^*)\|_2$ (with $x^*$ being the solution of (1)) will be "small" and that $m$, the number of data points, will be much greater than $n$. The vector function $f$ is called the residual vector, and the vectors $u$ and $l$ are the upper and lower bounds on the unknown vector of parameters $x$, respectively.

Although problem (1) can be minimized by any general nonlinear optimization method, in most circumstances the properties of the function $F$ make it worthwhile to use methods designed specifically for the least squares problem. In particular, the gradient and Hessian matrix of $F$ have a special structure. If $J = [\nabla f_1^t(x)\ \nabla f_2^t(x)\ \cdots\ \nabla f_m^t(x)]^t$ is the $m \times n$ Jacobian matrix of $f(x)$, then $g(x) = \nabla F(x) = J(x)^t f(x)$, and the $n \times n$ Hessian matrix is

$$\nabla^2 F(x) = J(x)^t J(x) + \sum_{i=1}^{m} f_i(x)\,\nabla^2 f_i(x).$$

If $\|f(x^*)\|_2$ is sufficiently small, then the Hessian matrix can be approximated by $J(x)^t J(x)$. One of these methods is the well-known Levenberg-Marquardt Method (L-M), which, on the other hand, can be viewed as the iterative solution of the approximation of the nonlinear problem (1) by a linear least squares problem, as we will see ahead.

The Linear Least-Squares problem (LLS) is the solution of

$$\min_{x \in \mathbb{R}^n}\ \frac{1}{2}\,\|Ax - b\|_2^2, \qquad A:\ m \times n \text{ matrix}, \quad b:\ m\text{-data vector} \tag{2}$$

where the difficulty in finding a reasonable approximate solution comes from the usual ill-posedness of the problem, which is reflected in the ill-conditioning of the matrix $A$. This is caused by the quasi-linear dependency of its columns and may produce solutions quite far from the correct one because of small errors in the data. Tikhonov's regularization method solves this problem by penalizing the LLS function in order to keep the solution vector from growing too large:

$$\min_{x \in \mathbb{R}^n}\ \frac{1}{2}\,\|Ax - b\|_2^2 + \lambda\,\|x\|_2^2 \tag{3}$$

The difficulty is to select the scalar $\lambda$, which is problem dependent (see [7]). In NLS problems, the approximation $\nabla^2 F(x_k) \approx J(x_k)^t J(x_k)$ is used at each iteration and a Linear Least-Squares problem like (2) is solved, but with the Jacobian matrix $J(x_k)$ and residual vector $f(x_k)$ at the current iteration being $A$ and $b$, respectively. The minimization process seeks the descent direction $s$ that solves

$$\min_{s \in \mathbb{R}^n}\ \frac{1}{2}\,\|J(x_k)s + f(x_k)\|_2^2 \tag{4}$$

The ill-conditioning of the Jacobian matrix can cause the iteration of the NLS method to generate solutions to the associated LLS subproblem that are quite far from the exact one. The resulting solution to the NLS problem might then also be far from the expected one. Thus a Tikhonov regularization would be useful in order to avoid this situation.
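As a small numerical illustration (not part of the original paper), the Tikhonov solution of (3) can be obtained from its normal equations; for the objective as written, with the factor 1/2 on the data term only, these read $(A^t A + 2\lambda I)x = A^t b$. The matrix, data, and value of $\lambda$ below are hypothetical choices.

```python
import numpy as np

def tikhonov(A, b, lam):
    """Minimizer of 0.5*||Ax - b||^2 + lam*||x||^2 via the normal
    equations (A^t A + 2*lam*I) x = A^t b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ b)

# Ill-conditioned example: two nearly dependent columns, slightly noisy data.
A = np.array([[1.0, 1.0000],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = np.array([2.0, 2.0001, 1.9999]) + 1e-4 * np.random.default_rng(1).standard_normal(3)

print(np.linalg.lstsq(A, b, rcond=None)[0])  # unregularized: large, unstable components
print(tikhonov(A, b, lam=1e-3))              # penalized: smaller, more stable solution
```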


In this paper, the relationship between the iteration of the Levenberg-Marquardt method and Tikhonov's regularization method is presented to explain the good performance of the former. Afterwards, the use of a Trust Region Newton's method (TRON) [5] to solve the NLS problem, taking the Hessian matrix as the approximation $J(x_k)^t J(x_k)$ at each iteration, is shown. This method can deal with bound constraints, and its relationship to the Levenberg-Marquardt iteration guarantees its good behavior for NLS problems. Our main goal is to point out these relationships among the methods and, in a forthcoming paper, to use these ideas in practical applications.

The paper is organized as follows: the connection of the linear iteration of the Levenberg-Marquardt method to Tikhonov's regularization method is made in Section 2. Section 3 describes the TRON method, pointing out its similarity to the L-M method when applied to NLS using only first-order information in the approximation of the Hessian matrix, and thus the corresponding connection of the former method to Tikhonov's regularization method. Section 4 presents the way the Levenberg-Marquardt and TRON methods solve the linear subproblem (4) and compute the regularization parameter. Section 5 is devoted to the conclusions.

In what follows we will use the following notation: $f_k$ for $f(x_k)$, $g_k$ for $g(x_k)$, and $J_k$ for $J(x_k)$. The norm used here will be the Euclidean norm, so we omit the subscript.

2 Relationship between the Levenberg-Marquardt and Tikhonov's Regularization methods

The Levenberg-Marquardt method (L-M) [2] has been used to solve the NLS problem with great success, since it takes advantage of the specific form of the function to be minimized; that is to say, it exploits the particular form of the gradient and the possibility of approximating the true Hessian matrix at each iteration, under the assumption of small residuals, by $J_k^t J_k$, as was noticed in the introduction to this work.

The search direction is defined as the solution of the equations

$$(J_k^t J_k + \lambda_k I)\,s_k = -J_k^t f_k,$$

where $\lambda_k$ is a non-negative scalar. A unit step is always taken along $s_k$, giving $x_{k+1} = x_k + s_k$. It can be shown that, for some scalar $\Delta_k$ related to $\lambda_k$, the vector $s_k$ is the solution of the constrained subproblem

$$\min\ \frac{1}{2}\,\|J_k s + f_k\|^2 \qquad \text{subject to } \|s\| \le \Delta_k, \tag{5}$$

which is equivalent to the solution of the unconstrained minimization of the Lagrangian function for problem (5):


$$\min_{s \in \mathbb{R}^n}\ \frac{1}{2}\,\|J_k s + f_k\|^2 + \lambda_k\left(\|s\|^2 - \Delta_k^2\right).$$

This is also equivalent to Tikhonov's Regularization (3) for LLS problems, except that, in that case, the bound $\Delta_k$ on the size of the descent direction is not known and therefore $\lambda_k$ must be fixed. In the L-M iteration, $\Delta_k$ is given in some way by the major iteration, and then $\lambda_k$ can be chosen as explained in Section 4.

This relationship between the L-M method and Tikhonov's regularization is the reason for the good behavior of the L-M method on noisy problems, since, in a certain way, it prevents the size of the iteration vector from growing too much.
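To make the preceding discussion concrete, here is a minimal sketch of the L-M iteration for a small curve-fitting problem. The exponential model, the synthetic data, and the fixed damping parameter are hypothetical illustrations; a practical implementation would update $\lambda_k$ (or $\Delta_k$) adaptively, as described in Section 4.

```python
import numpy as np

def lm_step(J, f, lam):
    """Solve (J^t J + lam*I) s = -J^t f: the Tikhonov-regularized
    (damped) Gauss-Newton step of the Levenberg-Marquardt method."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ f)

# Hypothetical model phi(x; t) = x1 * exp(x2 * t), fitted to noisy data.
t = np.linspace(0.0, 1.0, 20)
y_obs = 2.0 * np.exp(-1.5 * t) + 0.01 * np.random.default_rng(0).standard_normal(t.size)

x = np.array([1.0, -1.0])                               # initial guess
for _ in range(25):
    f = x[0] * np.exp(x[1] * t) - y_obs                 # residual vector f(x)
    J = np.column_stack([np.exp(x[1] * t),              # d f_i / d x1
                         x[0] * t * np.exp(x[1] * t)])  # d f_i / d x2
    x = x + lm_step(J, f, lam=1e-2)                     # unit step along s_k
print(x)  # approaches (2.0, -1.5)
```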

3 TRON: Trust Region Newton’s method

TRON [5] is a routine that uses a trust region version of Newton's method for general nonlinear minimization of bound constrained problems, i.e.,

$$\min\ F(x) \qquad \text{s.t. } l \le x \le u,$$

where $F$ is a nonlinear smooth scalar function in $\mathbb{R}^n$, and $l$ and $u$ are the lower and upper bounds, respectively. The basic iteration of the method is

$$x_{k+1} = x_k + s_k, \qquad s_k = \arg\min\left\{\, m_k(s) = \frac{1}{2}\,s^t \nabla^2 F_k\, s + s^t g_k \,\right\} \quad \text{s.t. } \|s\| \le \Delta_k.$$

When applied to the problem of parameter estimation (1) with the true Hessian matrix approximated using first-order information ($\nabla^2 F_k \approx J_k^t J_k$), this method becomes a Levenberg-Marquardt method, since at each iteration the nonlinear least squares problem is approximated by the solution of the associated linear least-squares problem. The basic iteration now becomes

$$x_{k+1} = x_k + s_k, \qquad s_k = \arg\min\left\{\, m_k(s) = \frac{1}{2}\,\|J_k s + f_k\|_2^2 \,\right\} \quad \text{s.t. } \|s\| \le \Delta_k. \tag{6}$$

To solve the constrained LLS subproblem (6), the vector $s_k$ is obtained using a Linear Preconditioned Conjugate Gradient method (LPCG). This is the essential part of the TRON method in which we are interested in this work. For more information about the treatment of the bound constraints, see [5].

As the TRON method solves the same subproblem (6) as the L-M method when applied to a nonlinear least-squares problem using the approximate Hessian matrix $J^t J$, the equivalence to Tikhonov's regularization method is also valid for this method.


4 Relationship between the solution of the Linear Least Squares subproblem by the L-M and TRON methods

The L-M algorithm is of the trust region type, and a "good" value of $\lambda_k$ (or $\Delta_k$) must be chosen in order to ensure descent. If $\lambda_k$ is zero, $s_k$ is the Gauss-Newton direction; as $\lambda_k \to \infty$, $\Delta_k \to 0$ and $\|s_k\| \to 0$, and $s_k$ becomes parallel to the steepest-descent direction. The difficulty in this approach is an appropriate strategy for choosing $\Delta_k$, which must rely on heuristic considerations. Most standard strategies (see Dennis and Schnabel [1], More [2]) have originally been developed to "globalize" the convergence of the Gauss-Newton iteration for well-posed minimization problems, so the parameter $\lambda_k$ is chosen once the corresponding $\Delta_k$ has been fixed by some criterion on the agreement between the nonlinear model and the linear one. If the Gauss-Newton direction ($s_k = -(J_k^t J_k)^{-1} J_k^t f_k$) satisfies the constraint on the norm and the matrix $J_k^t J_k$ is non-singular, then $\lambda_k$ is set to zero; but if one of these conditions fails, an algorithm is used to compute the scalar $\lambda_k$ that is the root of the equation $\|s_k(\lambda)\| - \Delta_k = 0$, where $s_k(\lambda) = -(J_k^t J_k + \lambda I)^{-1} J_k^t f_k$.

The TRON method solves (5) using an LPCG algorithm that begins at $s_k^0 = 0$, stopping the iteration whenever

$$\|s_k^i\| > \Delta_k, \tag{7}$$

$$\|J_k^t J_k s_k^i + g_k\| \le rtol\,\|g_k\|, \text{ and} \tag{8}$$

$$p_i^t\, J_k^t J_k\, p_i = 0, \tag{9}$$

where $p_i$ is the conjugate gradient direction, the subscript $i$ is the counter of the LPCG iterations, and $rtol$ is a tolerance less than one. These criteria solve the problem of the size of the descent direction $s_k$ and the possible singularity of the approximate Hessian matrix. If the $n$ iterations of the LPCG are made, or (8) is satisfied with a sufficiently low value of $rtol$ or $\|g_k\|$ near zero, then an approximate solution to the Gauss-Newton equations is obtained. If $rtol$ is not sufficiently small or $\|g_k\|$ is large, then the iteration is stopped very early and a direction close to the steepest descent is accepted. If (7) or (9) is satisfied, then the LPCG finds a scalar $\tau > 0$ such that $\|s_k\| = \|s_k^i + \tau p_i\| = \Delta_k$. As the LPCG iteration starts at $s_k^0 = 0$, the iteration vector moves from the right-hand side of the equations ($-g_k$) to their complete solution (the Gauss-Newton direction); hence the vector $s_k$ is the same as in the L-M iteration, and criterion (7) controls the size of the vector. The regularizing effect (see [4]) of the LPCG iterations assures that the approximate solution to the linear subproblem will be a regularized solution, where the regularization parameter $\lambda_k$ is implicitly given by the number of the iteration at which the LPCG stops.
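The following sketch shows an unpreconditioned simplification of this inner loop with the three stopping tests (7)-(9); it is an illustration of the idea, not the LPCG implementation of [5], and all names are chosen for exposition.

```python
import numpy as np

def to_boundary(s, p, delta):
    """Positive tau such that ||s + tau*p|| = delta (root of a quadratic)."""
    a, b, c = p @ p, 2.0 * (s @ p), s @ s - delta ** 2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(J, g, delta, rtol=0.1):
    """Approximately minimize 0.5*s^t(J^t J)s + s^t g subject to ||s|| <= delta.

    Starts at s = 0; returns early on the residual test (8), truncates the
    step at the trust-region boundary on test (7), and likewise on
    (near-)zero curvature, test (9)."""
    n = g.size
    s = np.zeros(n)
    r = -g                      # residual of (J^t J) s = -g at s = 0
    p = r.copy()
    for _ in range(n):
        Hp = J.T @ (J @ p)
        curv = p @ Hp
        if curv <= 1e-14:       # test (9): zero (or negative) curvature
            return s + to_boundary(s, p, delta) * p
        alpha = (r @ r) / curv
        s_new = s + alpha * p
        if np.linalg.norm(s_new) > delta:   # test (7): left the trust region
            return s + to_boundary(s, p, delta) * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) <= rtol * np.linalg.norm(g):  # test (8)
            return s_new
        beta = (r_new @ r_new) / (r @ r)
        s, r = s_new, r_new
        p = r + beta * p
    return s
```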


Thus, the TRON and L-M methods solve the same subproblem (5) by applying a Tikhonov regularization to the linearized problem. The difference between them is in the method for computing the scalar $\lambda_k$, which in both cases depends on the trust region radius. In the former method, $\lambda_k$ is given implicitly by the iteration number of the LPCG method; in the L-M method, an algorithm explicitly computes the root of the univariate equation $\|s_k(\lambda)\| - \Delta_k = 0$.

5 Conclusions

The Levenberg-Marquardt Method is a standard method that gives very good results when applied to Nonlinear Least-Squares Problems. Its good behavior relies on the fact that at each iteration a Tikhonov regularization is applied to the associated linear least-squares subproblem, so that the iterates do not grow too far outside a "permissible" region, where the regularization parameter is chosen by an optimization criterion on the objective function.

The use of the TRON method with the approximate Hessian $J^t J$ is an interesting alternative, since it deals with bounds on the variables and its iterations have the nice properties of the L-M method, as was shown in this paper. Also, the use of an LPCG in the inner iteration to compute the descent direction guarantees the regularization of the associated LLS subproblem in the presence of errors, because of the regularizing effect of this latter method. The use of the TRON method in nonlinear parameter estimation problems will be presented in a forthcoming paper.

References

[1] Dennis J.E., Schnabel R.B., Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM, Philadelphia, 1996.

[2] More J.J., The Levenberg-Marquardt algorithm: implementation and theory, in Numerical Analysis (G.A. Watson, ed.), Lecture Notes in Mathematics 630, Springer-Verlag, pp. 105-116, 1977.

[3] Hanke M., Regularizing properties of a Truncated Newton-CG algorithm for Nonlinear Inverse Problems, Numer. Funct. Anal. Optim., 18, pp. 971-993, 1997.

[4] Hansen P.C., Rank-Deficient and Discrete Ill-posed Problems, SIAM, Philadelphia, 1998.

[5] More J.J., Chih-Jen L., Newton's Method for large-scale bound constrained optimization problems, SIAM Journal on Opt., Vol. 9, No. 4, pp. 1100-1127, 1999.

[6] Nocedal J., Stephen J.W., Numerical Optimization, Springer, 1999.


[7] Tikhonov, A.N., Solution of incorrectly formulated problems and the regularization method, Soviet Math. Dokl., 4, pp. 1035-1038, 1963; English translation of Dokl. Akad. Nauk. SSSR, 151, pp. 501-504, 1963.


SURVEY OF STUDENTS’ FAMILIARITY WITH DEVELOPMENTAL MATHEMATICS – A STATISTICAL ANALYSIS

M. Shakil, Ph.D.

Professor of Mathematics

Department of Liberal Arts and Sciences

Miami Dade College, Hialeah Campus

FL 33012, USA

E-mail: [email protected]

ABSTRACT

In recent years, there has been great interest in developmental mathematics and college students' familiarity with it at all levels. In this paper, students' familiarity with developmental mathematics has been studied from a statistical point of view. By administering a survey on developmental mathematics in some math classes, data were collected and analyzed statistically, which shows some interesting results. It is hoped that the findings of the paper will be quite useful for researchers in various disciplines.

KEYWORDS: ANOVA, Developmental Mathematics, Hypothesis Testing, Shannon's Diversity Index.

1. INTRODUCTION

The importance of college students' familiarity with developmental mathematics in the present-day instruction of mathematics at various levels cannot be overlooked. It appears from the literature that not much work has been done on the problem of students' familiarity with developmental mathematics. Motivated by its importance, in this paper, students' familiarity with developmental mathematics has been statistically investigated and analyzed. The interested readers are referred to Shakil et al. (2010) and references therein, where the authors conducted similar studies to analyze students' familiarity with the grammar and mechanics of the English language from an exploratory point of view. Please also see Shannon (1951) and Siromoney (1964), among others. The organization of this paper is as follows. Section 2 discusses the methodology. The results are given in Section 3. The discussion and conclusion are provided in Section 4.

2. METHODOLOGY

A survey consisting of 20 multiple-choice questions on developmental mathematics (see Appendix I) was constructed to test students' familiarity with developmental mathematics. It was administered in six different math courses in the spring semester of 2016. The courses were MAC 1147, MAC 2233 (two sections), and STA 2023 (three sections), which will be referred to as MAC2233-A, MAC2233-B, STA2023-A, STA2023-B, and STA2023-C. The survey was administered online in Blackboard by the instructor in each of these courses. A total of 126 students (out of 151 enrolled students) participated in the survey, the details of which are provided in Table 1 below.

Table 1: Surveyed Courses

Discipline | Courses | Respondents
STA | STA2023 (three sections) | 67
MAC | MAC1147, MAC2233 (two sections) | 59
Total | 6 | 126


3. RESULTS

3.1 MASTERY REPORT

The total number of questions in the survey was 20, with each question assigned 1 point, so the possible points in the survey were 20. The score unit was assumed to be percent. There was no passing or failing score in the survey. However, it was expected that students at the above level of courses would achieve 100% (that is, 20 out of 20 points) or at least 75% (that is, 15 out of 20 points) in the developmental mathematics survey. Thus, the minimum passing score on the developmental mathematics survey was assumed to be 75% for a satisfactory knowledge of developmental mathematics. Two students in the STA2023 courses scored 5 (25%) and 8 (40%) out of 20 points, respectively, and so were discarded from the analysis. The mastery report of the 124 survey participants (excluding the above two students) is provided in Table 2 and Figure 1 below.

Table 2: Mastery Report - Proportion of Students

Total Number of Students Surveyed: 124
Total Number of Survey Questions: 20
Points Assigned Per Question: 1
Minimum % Passing Score: 75%

% of Students Scoring 20 Points (100%) | % of Students Scoring 15-19 Points (75-95%)
40.30% | 59.70%

Figure 1: Mastery Report

3.2 PERFORMANCE ANALYSIS

For the performance analysis of students in the developmental mathematics survey, the participants were divided into different categories as follows:

Category (A):
i. MAC Group: MAC1147, MAC2233 (Two Sections);
ii. STA Group: STA2023 (Three Sections).


Category (B):

(i) MAC1147; (ii) MAC2233 (Two Sections); (iii) STA2023 (Three Sections).

Category (C):

(i) MAC1147; (ii) MAC2233 (Section-1); (iii) MAC2233 (Section-2); (iv) STA2023 (Section-1) ; (v) STA2023 (Section-2) ; (vi) STA2023 (Section-3).

The descriptive statistics of the performance of Categories (A) and (B) in the survey are provided in Tables 3 and 4, respectively, below. For the descriptive statistics of the performance of Category (C), please see Table 7 in Sub-section 3.3 below.

Table 3: Descriptive Statistics of Category (A)

Group | Respondents | Mean | Median | St. Dev. | Coeff. of Var. | Min. Score | Max. Score | 1Q | 2Q | 3Q
MAC Group | 59 | 18.73 | 19 | 1.26 | 6.71% | 15 | 20 | 18 | 19 | 20
STA Group | 65 | 18.85 | 19 | 1.30 | 6.91% | 15 | 20 | 18 | 19 | 20

Table 4: Descriptive Statistics of Category (B)

Group | Respondents | Mean | Median | St. Dev. | Coeff. of Var. | Min. Score | Max. Score | 1Q | 2Q | 3Q
MAC1147 | 23 | 18.70 | 19 | 1.30 | 6.92% | 16 | 20 | 18 | 19 | 20
MAC2233 (Two Sections) | 36 | 18.75 | 19 | 1.25 | 6.67% | 15 | 20 | 18 | 19 | 20
STA2023 (Three Sections) | 65 | 18.85 | 19 | 1.30 | 6.91% | 15 | 20 | 18 | 19 | 20

3.3 HYPOTHESIS TESTING: INFERENCES ABOUT MEAN SCORES

This section discusses the hypothesis testing and draws inferences about the mean scores of different independent samples. The results of these tests of hypotheses are provided below.

(I) INFERENCES ABOUT MEAN SCORES OF CATEGORY (A): MAC AND STA PARTICIPANTS

Here we discuss the hypothesis testing and draw inferences about the mean scores of two independent samples, the MAC and STA groups, defined as follows:

Category (A):
i. MAC Group: MAC1147, MAC2233 (Two Sections);
ii. STA Group: STA2023 (Three Sections).


For the descriptive statistics of the MAC and STA Groups, please see Table 3 above. Following the procedure on pages 474-475 of Triola (2010) for unequal variances with no pooling, a hypothesis test was conducted for these two independent groups using the statistical software package STATDISK. The results of the hypothesis test about the mean scores of the MAC and STA Groups are provided in Table 5 and Figure 2 below.

Table 5: Hypothesis Testing about Mean Scores of MAC and STA Groups

Assumption: Not Equal Variances (No Pooling); Alpha = 0.05
Let µ1 = Mean Score of MAC Group and µ2 = Mean Score of STA Group.
Claim (Null Hypothesis): µ1 = µ2
Alternative Hypothesis: µ1 ≠ µ2
Test Statistic, t: -0.5217
Critical t: ±1.979685
P-Value: 0.6028
Degrees of Freedom: 121.4640
95% Confidence Interval: -0.5753641 < µ1 - µ2 < 0.3353641
Decision: Fail to Reject the Null Hypothesis. There is not enough evidence to warrant rejection of the claim that µ1 = µ2.
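For readers who prefer a scriptable check, the same Welch-type test ("not equal variances: no pool") can be reproduced with SciPy; the score arrays below are hypothetical placeholders, since the individual scores are not published here.

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder scores; the real data are 59 MAC and 65 STA values.
mac_scores = np.array([20, 19, 18, 19, 20, 17, 19, 18, 20, 19])
sta_scores = np.array([19, 20, 18, 20, 19, 18, 20, 19, 17, 20])

# equal_var=False requests Welch's test (no pooling of variances).
t_stat, p_value = stats.ttest_ind(mac_scores, sta_scores, equal_var=False)
print(t_stat, p_value)  # a p-value above 0.05 fails to reject mu1 = mu2
```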

Figure 2: Hypothesis Testing about Mean Scores of MAC and STA Groups

(II) ANALYSIS OF VARIANCE (ANOVA): INFERENCES ABOUT MEAN SCORES OF CATEGORY (B): MAC1147, MAC2233 (TWO SECTIONS), AND STA2023 (THREE SECTIONS) PARTICIPANTS

For the descriptive statistics of Category (B): MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections) Groups, please see Table 4 above. Following the procedure on pages 628-631 of Triola (2010), we discuss here the ANOVA for testing the hypothesis of the equality of the mean scores of three independent groups based on the courses, that is, MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections). The results of the ANOVA are provided in Table 6 and Figure 3 below.


Table 6: ANOVA: Hypothesis Testing About Equality of Mean Scores

ANOVA of MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections); Alpha = 0.05
Claim (Null Hypothesis): Equality of the mean scores of the three independent groups based on the courses
Alternative Hypothesis: The mean scores are not all equal

Source | DF | SS | MS | Test Stat, F | Critical F | P-Value
Treatment | 2 | 0.467283 | 0.233642 | 0.141296 | 3.071137 | 0.868375
Error | 121 | 200.081104 | 1.653563 | | |
Total | 123 | 200.548387 | | | |

Decision: Fail to Reject the Null Hypothesis. There is enough evidence to support the claim of the equality of the mean scores of the three independent groups based on the courses, that is, MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections).
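An equivalent one-way ANOVA can be run with SciPy as a cross-check of the STATDISK output; again, the three score lists are hypothetical placeholders for the course groups.

```python
from scipy import stats

# Hypothetical placeholder score lists for the three course-based groups.
mac1147 = [19, 20, 18, 17, 19, 20, 18]
mac2233 = [18, 19, 20, 19, 18, 19, 20]
sta2023 = [20, 19, 18, 20, 19, 18, 19]

f_stat, p_value = stats.f_oneway(mac1147, mac2233, sta2023)
print(f_stat, p_value)  # a large p-value supports equality of the group means
```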

Figure 3: ANOVA: Hypothesis Testing About Equality of Mean Scores

(III) ANALYSIS OF VARIANCE (ANOVA): INFERENCES ABOUT MEAN SCORES OF CATEGORY (C): MAC1147, MAC2233 (SECTION-1), MAC2233 (SECTION-2), STA2023 (SECTION-1), STA2023 (SECTION-2), AND STA2023 (SECTION-3) PARTICIPANTS

For the descriptive statistics of Category (C): MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3) Groups, please see Table 7 below.


Table 7: Descriptive Statistics of Category (C)

Group | Respondents | Mean | Median | St. Dev. | Coeff. of Var. | Min. Score | Max. Score | 1Q | 2Q | 3Q
MAC1147 | 23 | 18.70 | 19 | 1.30 | 6.92% | 16 | 20 | 18 | 19 | 20
MAC2233 (Section-1) | 18 | 18.44 | 18.5 | 1.20 | 6.50% | 15 | 20 | 18 | 18.5 | 19
MAC2233 (Section-2) | 18 | 19.06 | 20 | 1.26 | 6.61% | 17 | 20 | 18 | 20 | 20
STA2023 (Section-1) | 23 | 18.83 | 20 | 1.61 | 8.57% | 15 | 20 | 17 | 20 | 20
STA2023 (Section-2) | 21 | 18.76 | 19 | 1.14 | 6.05% | 17 | 20 | 18 | 19 | 20
STA2023 (Section-3) | 21 | 18.95 | 19 | 1.12 | 5.89% | 17 | 20 | 18 | 19 | 20

Following the procedure on Pages 628 - 631 in Triola (2010), we discuss here the ANOVA for testing the hypothesis of the equality of the mean scores of six independent groups based on the courses, that is, Category C: MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3) Groups. The results of ANOVA are provided in Table 8 and Figure 4 below.

Table 8: ANOVA: Hypothesis Testing About Equality of Mean Scores

ANOVA of Category (C): MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3) Groups; Alpha = 0.05
Claim (Null Hypothesis): Equality of the mean scores of the Category (C) groups
Alternative Hypothesis: The mean scores are not all equal

Source | DF | SS | MS | Test Stat, F | Critical F | P-Value
Treatment | 4 | 1.704643 | 0.426161 | 0.25042 | 2.461696 | 0.9088
Error | 101 | 171.880262 | 1.701785 | | |
Total | 105 | 173.584906 | | | |

Decision: Fail to Reject the Null Hypothesis. There is enough evidence to support the claim of the equality of the mean scores of the Category (C) groups: MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3).


Figure 4: ANOVA: Hypothesis Testing About Equality of Mean Scores

3.4 DIVERSITY ANALYSIS

This sub-section discusses the diversity analysis for testing the hypothesis of the evenness ratio of respondents (two independent samples: MAC and STA groups) based on gender. All these analyses were carried out using the statistical software packages STATDISK and EXCEL.

(I) Respondent Performance Based on Gender (Two Independent Samples: MAC and STA Groups Based on Gender): The performance of respondents (two independent samples: MAC and STA groups) based on gender is provided in Table 9 and Figure 5 below.

Table 9: Respondent Performance Based on Gender (Two Independent Samples: MAC and STA Groups Based on Gender)

Group-Gender | % of Students (out of 124) Scoring 20 Points (100%) | % of Students (out of 124) Scoring 15-19 Points (75-95%)
MAC Group-Male | 5.65% | 16.13%
STA Group-Male | 7.26% | 8.87%
MAC Group-Female | 12.10% | 13.71%
STA Group-Female | 15.32% | 20.97%


Figure 5: Respondent Performance Based on Gender

(II) Diversity Analysis: This sub-section discusses the diversity analysis for testing the hypothesis of the evenness ratio of respondent performance based on gender belonging to two independent samples: the MAC and STA groups. For the diversity analysis, we first compute the proportion (p) of the male and female student population scoring 20 and 15-19 points out of 20 points, respectively, belonging to the MAC and STA groups, thus making eight different categories, which are provided in Table 10 below.

Table 10: Diversity Analysis Based on Gender

Group | Gender | Proportion (p) of Students Scoring 20 Out of 20 Points | Proportion (p) of Students Scoring 15-19 Out of 20 Points
MAC Group | Male | 0.0565 | 0.1613
STA Group | Male | 0.0726 | 0.0887
MAC Group | Female | 0.1210 | 0.1371
STA Group | Female | 0.1532 | 0.2097

Hypothesis: Does the respondent performance (that is, the proportion (p) of male and female student population scoring 20 and 15-19 points out of 20 points respectively belonging to the eight different categories as given in Table 10 above) suggest more diversity in the groups’ familiarity with developmental mathematics? The above hypothesis can be analyzed by applying Shannon’s Measure of Diversity Index (or, entropy) (Shannon, 1948), which is a measure of the diversity of a population, as given below.

Shannon’s Diversity Index: For a discrete random variable associated with n (countable) possible outcomes iE ’s, where

ii pEP , and npppP ,,, 21 , Shannon’s diversity index (or, entropy), PHn , or, simply, H , is defined by the

following formula:

in

i

i ppH ln1

(1)

It can be easily verified that Shannon's diversity index (or entropy), $H$, satisfies the following conditions:

(i) $H$ is maximum when $p_1 = p_2 = \cdots = p_n = \frac{1}{n}$.

(ii) $H$ is minimum when $p_i = 1$ and $p_j = 0$ for $j \ne i$, $i = 1, 2, \ldots, n$; that is, $H$ is minimum when one of the probabilities is unity and all others are zero.

(iii) From (i) and (ii), it follows that, for the discrete case, $0 \le H \le \ln n$.

Further, the largest value of Shannon's diversity index, $H_{\max}$, is given by the following formula:

$$H_{\max} = \ln S, \tag{2}$$

where $S$ denotes the number of categories in the population.

Evenness Ratio: The evenness ratio, $E_H$, is given by the following formula:

$$E_H = \frac{H}{H_{\max}}, \tag{3}$$

where $0 \le E_H \le 1$. Note that if $E_H = 1$, there is complete evenness.

Now, using the values of the proportion (p) from Table 10 in Equations (1), (2), and (3), the values of Shannon's diversity index $H$, the largest value of Shannon's diversity index $H_{\max}$, and the evenness ratio $E_H$ are computed as follows:

$$H = 2.004878, \qquad H_{\max} = 2.079442, \qquad E_H = 0.96414237.$$

Since $E_H = 0.96414237 \approx 1$, there appears to be complete evenness in the respondent performance (that is, in the proportion (p) of the male and female student population scoring 20 and 15-19 points out of 20 points, respectively, belonging to the eight different categories as given in Table 10 above).
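The computation of Equations (1)-(3) from Table 10 is easy to verify directly; a minimal sketch in Python:

```python
import math

# Proportions from Table 10 (the eight gender-by-group-by-score categories).
p = [0.0565, 0.1613, 0.0726, 0.0887, 0.1210, 0.1371, 0.1532, 0.2097]

H = -sum(pi * math.log(pi) for pi in p)  # Shannon's diversity index, Eq. (1)
H_max = math.log(len(p))                 # largest possible value ln(S), Eq. (2)
E_H = H / H_max                          # evenness ratio, Eq. (3)
print(H, H_max, E_H)  # approximately 2.0049, 2.0794, 0.9641, as reported above
```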

4. CONCLUSIONS

This paper discussed students' familiarity with developmental mathematics from a statistical point of view. A survey consisting of 20 multiple-choice questions on developmental mathematics was constructed to test students' familiarity with developmental mathematics in six different courses, that is, MAC 1147, MAC 2233 (two sections), and STA 2023 (three sections), during the spring semester of 2016, and was administered online in Blackboard by the instructor. A total of 126 students (out of 151 enrolled students) participated in the survey. Two students in the STA2023 courses scored 5 (25%) and 8 (40%) out of 20 points, respectively, and so were discarded from the analysis. The mastery report of the 124 survey participants (excluding the above two students) is provided in Table 2 and Figure 1 in Sub-section 3.1. The minimum passing score was assumed to be 75%. Out of the 124 survey participants considered in this research project, 40.30% of students scored 20 out of 20 points, whereas 59.70% scored 15-19 out of 20 points. Based on the hypothesis testing, the following inferences were drawn about the survey participants:

- There was sufficient evidence to support the claim that the mean scores of the MAC group and STA group participants were the same.

- There was sufficient evidence to support the claim of the equality of the mean scores of the three independent groups of participants based on the courses, that is, MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections).

- There was sufficient evidence to support the claim of the equality of the mean scores of the Category (C) groups: MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3).

- There appeared to be complete evenness in the respondent performance, that is, in the proportion (p) of the male and female student population scoring 20 and 15-19 points out of 20 points, respectively, across the eight categories.

It is hoped that the findings of the paper will be quite useful for researchers in various disciplines.

ACKNOWLEDGMENT

The author would like to express his sincere gratitude and indebtedness to his students in the courses MAC 1147, MAC 2233 (two sections), and STA 2023 (three sections) in the spring semester of 2016 for their cooperation in participating in the survey. Further, the author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. He would also like to acknowledge his sincere indebtedness to the works of the various authors and resources on the subject which he consulted during the preparation of this research project. The author is thankful to his wife for her patience and perseverance during the period in which this paper was prepared. The author would like to dedicate this paper to his late parents, brothers, and sisters. Last but not least, the author is thankful to Miami Dade College for the opportunity to serve this college, without which it would have been impossible to conduct this research.

REFERENCES

[1] Shakil, M., Calderin, V., and Pierre-Philippe, L. (2010). Survey of Students' Familiarity with Grammar and Mechanics of English Language - An Exploratory Analysis. Polygon, Vol. 4, pp. 43-55.

[2] Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, pp. 379-423; 623-656.

[3] Shannon, C.E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30, pp. 50 - 64.

[4] Siromoney, G. (1964). An Information-theoretical Test for Familiarity with a Foreign Language. Journal of Psychological Researches, viii, pp. 267 – 272.

[5] Triola, M. F. (2010). Elementary Statistics. Addison-Wesley, N. Y.


APPENDIX I

Spring 16

"Survey of Student's Familiarity with Developmental Mathematics"

Name: GENDER:

Current GPA: Major:

Course Name/Reference: Term/Year:

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Simplify.

1) Write "forty-one thousand five hundred forty-three" in standard form.
A) 410,543 B) 415,043 C) 41,543 D) 401,543

Solve. Write the answer in simplest form.

2) Mary is saving 2/19 of her monthly income of $5358 for retirement. How much money is she setting aside each month for retirement?
A) $50,901 B) $282 C) $141 D) $564

Solve.

3) Subtract 9 from 54.
A) 46 B) 55 C) 44 D) 45

Round the decimal to the indicated place value.

4) 10.849, nearest tenth
A) 10.9 B) 10.8 C) 10.85 D) 10.7

Perform the indicated operations. Round the result to the nearest thousandth, if necessary.

5) 85.42 + 79.65 + 15.475
A) 180.645 B) 181.545 C) 180.555 D) 180.545

Simplify the expression.

6) -12 - (-7)
A) -5 B) 19 C) 5 D) -19

7) 5(-11)
A) -55 B) -60 C) -550 D) -155

Perform the indicated operations. Round the result to the nearest thousandth, if necessary.

8) A country reports total exports of $4,771 million last year. Write this number using standard notation.
A) $4,771,000 B) $4,771,000,000 C) $4,771 D) $4,771,000,000,000

Write the decimal as a percent.

9) 0.41
A) 410% B) 0.041% C) 4.1% D) 41%


(Figure for Question 11: a triangle with two given angles measuring 48° and 66°.)

Solve.

10) In a survey of 100 people, 47 preferred ketchup on their hot dogs. What percent preferred ketchup?
A) 47% B) 0.47% C) 47/100 % D) 4.7%

The following plane figure is called a triangle. The sum of the three angles of a triangle is always 180°. Find the measure of the missing angle in the figure.

11)
A) 58° B) 66° C) 76° D) 48°

Fill in the blank with one of the words or phrases listed below:
equivalent, < or >, least common denominator, mixed number, least common multiple, like

12) The symbol ___ means "is greater than."
A) equivalent B) < C) > D) like

Find the GCF for the list.

13) 36, 15
A) 6 B) 1 C) 15 D) 3

Simplify the radical. Indicate if the radical is not a real number. Assume that x represents a positive real number.

14) √625
A) -25 B) 312 C) 25 D) Not a real number

15) In the following figure, the sum of the angles x and 30° is 90°; that is, the angles x and 30° are complementary to each other. Find the measure of x.

(Figure for Question 15: an angle of 30° adjacent to the angle x.)

A) 55° B) 115° C) 150° D) 60°

Insert <, >, or = to make the statement true.

16) -6 ___ -3
A) = B) < C) >


Evaluate the expression for the given replacement values.

17) x² + y² for x = 5 and y = -2
A) 29 B) 100 C) 14 D) 20

Simplify the expression.

18) 7x + 2 - 3x + 1
A) 4x + 1 B) 7x C) 4x + 3 D) 10x + 3

Solve the equation. Don't forget to first simplify each side of the equation, if possible.

19) 6x - 5x + 4 = 4
A) 4 B) -4 C) 0 D) 8

The bar graph shows the number of students who flunk Dr. Jones' class each year.

20) During which year(s) did Dr. Jones have more than 10 students flunk his class?
A) 1998, 1999 B) 2002 C) 1998, 1999, 2000 D) 1998, 2002


Item Analysis Statistics and Their Uses: An Overview

M. Shakil, Ph.D.

Professor of Mathematics

Department of Liberal Arts and Sciences

Miami Dade College, Hialeah Campus

FL 33012, USA

E-mail: [email protected]

Abstract

In this paper, we present an overview of some item analysis statistics that are available in the ParSCORE™ analysis report. The uses of item analysis statistics in some multiple-choice math examinations have been investigated. It is hoped that the present study will be quite useful in recognizing the most critical pieces of test item data and in evaluating whether or not a test item needs revision. The methods discussed in this project can be used to describe the relevance of test item analysis to classroom tests.

Keywords: Item Analysis Statistics, Multiple-Choice Examinations, ParSCORETM Analysis.

1. Introduction

An item analysis involves many statistics that can provide useful information for determining the validity and improving the quality and accuracy of multiple-choice or true/false items. These statistics are used to measure the ability levels of examinees from their responses to each item. The ParSCORETM item analysis report, when a Multiple-Choice Exam is machine scored, consists of three types of reports, that is, a summary of test statistics, a test frequency table, and item statistics. The test statistics summary and frequency table describe the distribution of test scores. The item analysis statistics evaluate class-wide performance on each test item. The ParSCORETM report on item analysis statistics gives an overall view of the test results and evaluates each test item, which are also useful in comparing the item analysis for different test forms. The organization of this paper is as follows. In Section 2, descriptions of some useful, common item analysis statistics, that is, item difficulty, item discrimination, distractor analysis, and reliability, are presented. For the sake of completeness, in Section 2, definitions of some test statistics as reported in the ParSCORETM analysis report are also provided. Section 3 contains the uses of item analysis statistics to some multiple-choice math examinations. The concluding remarks are presented in Section 4.

2. Item Analysis Statistics

In what follows, we shall present some commonly used item analysis statistics available in the ParSCORE™ report when a multiple-choice exam is machine scored. For details on these, the interested readers are referred to Wood (1960), Lord & Novick (1968), Henrysson (1971), Nunally (1978), Thompson and Levitov (1985), Crocker and Algina (1986), Ebel and Frisbie (1986), Suen (1990), Thorndike et al. (1991), DeVellis (1991), Millman and Greene (1993), Haladyna (1999), Tanner (2001), Haladyna et al. (2002), and Mertler (2003), among others.

(I) Item Difficulty: Item difficulty is a measure of the difficulty of an item. For items (that is, multiple-choice questions) with one correct alternative worth a single point, the item difficulty (also known as the item difficulty index, the difficulty level index, the difficulty factor, the item facility index, the item easiness index, or the p-value) is defined as the proportion of respondents (examinees) selecting the answer to the item correctly, and is given by

$$p = \frac{c}{n},$$

where $p$ = the difficulty factor, $c$ = the number of respondents selecting the correct answer to an item, and $n$ = the total number of respondents. Item difficulty is relevant for determining whether students have learned the concept being tested. It also plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not. Note that

(i) $0 \le p \le 1$.

(ii) A higher value of $p$ indicates a low difficulty level, that is, the item is easy. A lower value of $p$ indicates a high difficulty level, that is, the item is difficult. In general, an ideal test should have an overall item difficulty of around 0.5; however, it is acceptable for individual items to have higher or lower facility (ranging from 0.2 to 0.8). In a criterion-referenced test (CRT), with emphasis on mastery-testing of the topics covered, the optimal value of $p$ for many items is expected to be 0.90 or above. On the other hand, in a norm-referenced test (NRT), with emphasis on discriminating between different levels of achievement, it is given by $p = 0.50$. For details on these, see, for example, Chase (1999), among others.

(iii) To maximize item discrimination, the ideal (or moderate, or desirable) item difficulty level, denoted $p_M$, is defined as the point midway between the probability of success, denoted $p_S$, of answering the multiple-choice item correctly (that is, 1.00 divided by the number of choices) and a perfect score (that is, 1.00) for the item, and is given by

$$p_M = \frac{p_S + 1}{2}.$$

(iv) Thus, using the above formula in (iii), the ideal (or moderate, or desirable) item difficulty levels for multiple-choice items can be easily calculated; they are provided in the following table (for details, see, for example, Lord, 1952, among others).


Number of Alternatives | Probability of Success ($p_S$) | Ideal Item Difficulty Level ($p_M$)
2 | 0.50 | 0.75
3 | 0.33 | 0.67
4 | 0.25 | 0.63
5 | 0.20 | 0.60

(Ia) Mean Item Difficulty (or Mean Item Easiness): The mean item difficulty is the average difficulty of all test items. It is an overall measure of the test difficulty and ideally ranges between 60% and 80% (that is, $0.60 \le p \le 0.80$) for classroom achievement tests. Lower numbers indicate a difficult test, while higher numbers indicate an easy test.
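As a quick illustration of the two formulas above, consider the following sketch; the counts in the usage lines are hypothetical.

```python
def item_difficulty(c, n):
    """Difficulty index p = c / n: proportion answering the item correctly."""
    return c / n

def ideal_difficulty(n_choices):
    """Ideal difficulty p_M = (p_S + 1) / 2 with p_S = 1 / (number of choices)."""
    p_s = 1.0 / n_choices
    return (p_s + 1.0) / 2.0

print(item_difficulty(90, 120))  # 0.75: a fairly easy item
print(ideal_difficulty(4))       # 0.625, which rounds to the 0.63 in the table
```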

(II) Item Discrimination: The item discrimination (or the item discrimination index) is a basic measure of the validity of an item. It is defined as the discriminating power or the degree of an item's ability to discriminate (or differentiate) between high achievers (that is, those who scored high on the total test) and low achievers (that is, those who scored low), which are determined on the same criterion, that is, (1) internal criterion, for example, test itself; and (2) external criterion, for example, intelligence test or other achievement test. Further, the computation of the item discrimination index assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right or wrong dichotomy of a student’s performance on an item. For details on the item discrimination index, see, for example, Kelly (1939), Wood (1960), Henrysson (1971), Nunally (1972), Ebel (1979), Popham (1981), Ebel & Frisbie (1986), Weirsma & Jurs (1990), Glass & Hopkins (1995), Brown (1996), Chase (1999), Haladyna (1999), Nitko (2001), Tanner (2001), Oosterhof (2001), Haladyna et al. (2002), and Mertler (2003), among others. There are several ways to compute the item discrimination, but, as shown on the ParSCORETM item analysis report and also as reported in the literature, the following formulas are most commonly used indicators of item’s discrimination effectiveness.

(a) Item Discrimination Index (or Item Discriminating Power, or D-Statistic), D: Let the students' test scores be rank-ordered from lowest to highest. Let

$$p_U = \frac{\text{No. of students in the upper 25\%--30\% group answering the item correctly}}{\text{Total number of students in the upper group}}$$

and

$$p_L = \frac{\text{No. of students in the lower 25\%--30\% group answering the item correctly}}{\text{Total number of students in the lower group}}.$$

The ParSCORE™ item analysis report considers the upper 27% and the lower 27% as the analysis groups. The item discrimination index, D, is given by

$$D = p_U - p_L.$$

Note that

(i) $-1 \le D \le 1$.

(ii) Items with positive values of D are known as positively discriminating items, and those with negative values of D are known as negatively discriminating items.

(iii) If $D = 0$, that is, $p_U = p_L$, there is no discrimination between the upper and lower groups.

(iv) If $D = 1.00$, that is, $p_U = 1.00$ and $p_L = 0$, there is perfect discrimination between the two groups.

(v) If $D = -1.00$, that is, $p_U = 0$ and $p_L = 1.00$, it means that all members of the lower group answered the item correctly and all members of the upper group answered the item incorrectly. This indicates the invalidity of the item, that is, the item has been miskeyed and needs to be rewritten or eliminated.

(vi) A guideline for the value of the item discrimination index is provided in the following table; see, for example, Chase (1999) and Mertler (2003), among others.

Item Discrimination Index, D | Quality of the Item
$D \ge 0.50$ | Very Good Item; Definitely Retain
$0.40 \le D \le 0.49$ | Good Item; Very Usable
$0.30 \le D \le 0.39$ | Fair Quality; Usable Item
$0.20 \le D \le 0.29$ | Potentially Poor Item; Consider Revising
$D < 0.20$ | Potentially Very Poor; Possibly Revise Substantially, or Discard

(b) Mean Item Discrimination Index, $\bar{D}$: This is the average discrimination index for all test items combined. A large positive value (above 0.30) indicates good discrimination between the upper- and lower-scoring students. Tests that do not discriminate well are generally not very reliable and should be reviewed.

(c) Point-Biserial Correlation (or Item-Total Correlation, or Item Discrimination) Coefficient, $r_{pbis}$: The point-biserial correlation coefficient is another item discrimination index for assessing the usefulness (or validity) of an item as a measure of individual differences in knowledge, skill, ability, attitude, or personality characteristics. It is defined as the correlation between student performance on an item (correct or incorrect) and the overall test score, and is given by either of the following two equations (which are mathematically equivalent).

(i) Suen (1990); DeVellis (1991); Haladyna (1999):

$$r_{pbis} = \frac{\bar{X}_C - \bar{X}_T}{s}\,\sqrt{\frac{p}{q}},$$

where $r_{pbis}$ = the point-biserial correlation coefficient; $\bar{X}_C$ = the mean total score for examinees who have answered the item correctly; $\bar{X}_T$ = the mean total score for all examinees; $p$ = the difficulty value of the item; $q = 1 - p$; and $s$ = the standard deviation of the total exam scores.

(ii) Brown (1996):

$$r_{pbis} = \frac{m_p - m_q}{s}\,\sqrt{p\,q},$$

where $r_{pbis}$ = the point-biserial correlation coefficient; $m_p$ = the mean total score for examinees who have answered the item correctly; $m_q$ = the mean total score for examinees who have answered the item incorrectly; $p$ = the difficulty value of the item; $q = 1 - p$; and $s$ = the standard deviation of the total exam scores.

Note that

(i) The interpretation of the point-biserial correlation coefficient, $r_{pbis}$, is the same as that of the D-statistic.

(ii) It assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right or wrong dichotomy of a student's performance on an item.

(iii) It is mathematically equivalent to the Pearson (product moment) correlation coefficient, which can be shown by assigning two distinct numerical values to the dichotomous variable (test item), that is, incorrect = 0 and correct = 1.

(iv) $-1 \le r_{pbis} \le 1$.

(v) $r_{pbis} = 0$ means little correlation between the score on the item and the score on the test.

(vi) A high positive value of $r_{pbis}$ indicates that the examinees who answered the item correctly also received higher scores on the test than those examinees who answered the item incorrectly.

(vii) A negative value indicates that the examinees who answered the item correctly received low scores on the test and those examinees who answered the item incorrectly did better on the test. It is advisable that an item with $r_{pbis} = 0$ or with a large negative value of $r_{pbis}$ should be eliminated or revised. Also, an item with a low positive value of $r_{pbis}$ should be revised for improvement.

(viii) Generally, the value of $r_{pbis}$ for an item may be put into two categories, as provided in the following table.

Point-Biserial Correlation Coefficient, $r_{pbis}$ | Quality
$r_{pbis} \ge 0.30$ | Acceptable Range
$r_{pbis} \to 1$ | Ideal Value

(ix) The statistical significance of the point-biserial correlation coefficient, $r_{pbis}$, may be determined by applying the Student's t test; see, for example, Triola (2007), among others.

Remark: It should be noted that the use of the point-biserial correlation coefficient, $r_{pbis}$, is more advantageous than that of the item discrimination index statistic, D, because every student taking the test is taken into consideration in the computation of $r_{pbis}$, whereas only 54% of the test-takers (that is, the upper 27% + the lower 27% groups) are used to compute D.
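A small sketch computing both indices from a 0/1 item-response vector and the examinees' total scores may make the comparison concrete; the 27% grouping convention follows the text, while the array values are hypothetical.

```python
import numpy as np

def discrimination_index(item, total, frac=0.27):
    """D = p_U - p_L, using upper/lower 27% groups as in the ParSCORE report."""
    order = np.argsort(total)              # examinees ranked by total score
    k = max(1, int(round(frac * len(total))))
    lower, upper = order[:k], order[-k:]
    return item[upper].mean() - item[lower].mean()

def point_biserial(item, total):
    """r_pbis = ((m_p - m_q) / s) * sqrt(p*q), Brown's (1996) form."""
    p = item.mean()
    q = 1.0 - p
    s = total.std(ddof=0)  # SD of total scores (population form: an assumption)
    m_p = total[item == 1].mean()          # mean total score, answered correctly
    m_q = total[item == 0].mean()          # mean total score, answered incorrectly
    return (m_p - m_q) / s * np.sqrt(p * q)

# Hypothetical 0/1 responses to one item and the examinees' total scores.
item = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
total = np.array([28, 25, 14, 22, 12, 27, 24, 15, 26, 13])
print(discrimination_index(item, total), point_biserial(item, total))
```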

(d) Mean Item-Total Correlation Coefficient, $\bar{r}_{pbis}$: This is defined as the average correlation of all the test items with the total score. It is a measure of overall test discrimination. A large positive value indicates good discrimination between students.

(III) Internal Consistency Reliability Coefficient (Kuder-Richardson 20, $KR_{20}$, Reliability Estimate): The statistic that measures the test reliability of inter-item consistency, that is, how well the test items are correlated with one another, is called the internal consistency reliability coefficient of the test. For a test having multiple-choice items that are scored correct or incorrect, and that is administered only once, the Kuder-Richardson formula 20 (also known as KR-20) is used to measure the internal consistency reliability of the test scores; see, for example, Nunally (1972) and Haladyna (1999), among others. The KR-20 is also reported in the ParSCORE™ item analysis. It is given by the following formula:

$$KR_{20} = \frac{n}{n-1}\left(1 - \frac{\sum_{i=1}^{n} p_i\, q_i}{s^2}\right),$$

where $KR_{20}$ = the reliability index for the total test; $n$ = the number of items in the test; $s^2$ = the variance of test scores; $p_i$ = the difficulty value of item $i$; and $q_i = 1 - p_i$.

Note that

(i) $0.0 \le KR_{20} \le 1.0$.

(ii) $KR_{20} = 0$ indicates a weaker relationship between test items, that is, the overall test score is less reliable. A large value of $KR_{20}$ indicates high reliability.

(iii) Generally, the value of $KR_{20}$ for a test may be put into the categories provided in the table below.

$KR_{20}$ | Quality
$KR_{20} \ge 0.60$ | Acceptable Range
$KR_{20} \ge 0.75$ | Desirable
$0.80 \le KR_{20} \le 0.85$ | Better
$KR_{20} \to 1$ | Ideal Value

(iv) Remarks: The reliability of a test can be improved as follows:

a) By increasing the number of items in the test, for which the following Spearman-Brown prophecy formula is used (Mertler, 2003):

$$r_{est} = \frac{n\,r}{1 + (n-1)\,r},$$

where $r_{est}$ = the estimated new reliability coefficient; $r$ = the original $KR_{20}$ reliability coefficient; and $n$ = the number of times the test is lengthened.

b) Or, by using items that have high discrimination values in the test.

c) Or, by performing an item-total statistical analysis as described above.
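Both formulas are straightforward to compute from a scored response matrix; a minimal sketch with a hypothetical 6-examinee, 5-item matrix follows (whether the score variance is taken in sample or population form is an assumption here, as the text does not specify).

```python
import numpy as np

def kr20(responses):
    """KR-20 = (n/(n-1)) * (1 - sum(p_i*q_i) / s^2) for a 0/1 response
    matrix (rows = examinees, columns = items)."""
    n_items = responses.shape[1]
    p = responses.mean(axis=0)            # per-item difficulty values
    q = 1.0 - p
    totals = responses.sum(axis=1)        # each examinee's total score
    s2 = totals.var(ddof=1)               # variance of test scores (assumption)
    return (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / s2)

def spearman_brown(r, n_times):
    """Estimated reliability if the test is lengthened n_times:
    r_est = n*r / (1 + (n - 1)*r)."""
    return n_times * r / (1.0 + (n_times - 1.0) * r)

# Hypothetical 6-examinee, 5-item response matrix.
X = np.array([[1, 1, 1, 0, 1],
              [1, 0, 1, 1, 1],
              [0, 1, 0, 0, 1],
              [1, 1, 1, 1, 1],
              [0, 0, 1, 0, 0],
              [1, 1, 0, 1, 1]])
r = kr20(X)
print(r, spearman_brown(r, 2))  # reliability now, and if the test were doubled
```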

(IV) Standard Error of Measurement ($SE_m$): This is another important component of test item analysis used to measure the internal consistency reliability of a test; see, for example, Nunally (1972) and Mertler (2003), among others. It is given by the following formula:

$$SE_m = s\,\sqrt{1 - KR_{20}}, \qquad 0.0 \le KR_{20} \le 1.0,$$

where $SE_m$ = the standard error of measurement; $s$ = the standard deviation of test scores; and $KR_{20}$ = the reliability coefficient for the total test.

Note that

(i) $SE_m = 0$ when $KR_{20} = 1$.

(ii) $SE_m$ attains its maximum value, $s$, when $KR_{20} = 0$.

(iii) A small value of $SE_m$ (e.g., less than 3) indicates high reliability, whereas a large value of $SE_m$ indicates low reliability.

(iv) Remark: A higher reliability coefficient (i.e., $KR_{20}$ close to 1) and a smaller standard deviation for a test indicate a smaller standard error of measurement. This is considered to be the more desirable situation for classroom tests.

(V) Test Item Distractor Analysis: This is an important and useful component of test item analysis. A test item distractor is defined as an incorrect response option in a multiple-choice test item. According to the research, there is a relationship between the quality of the distractors in a test item and student performance on the test item, which also affects the student's total test score. The performance of these incorrect item response options can be determined through the test item distractor analysis frequency table, which contains the frequency, or number of students, that selected each incorrect option. The test item distractor analysis is also provided in the ParSCORE™ item analysis report. For details on test item distractor analysis, see, for example, Thompson & Levitov (1985), DeVellis (1991), Milman & Greene (1993), Haladyna (1999), and Mertler (2003), among others. A general guideline for the item distractor analysis is provided in the following table:

Item Response Options | Item Difficulty, p | Item Discrimination Index, D or $r_{pbis}$
Correct Response | $0.35 \le p \le 0.85$ (Better) | $D \ge 0.30$ or $r_{pbis} \ge 0.30$ (Better)
Distractors | $p \ge 0.02$ (Better) | $D \le 0$ or $r_{pbis} \le 0$ (Better)

(VI) Mean: The mean is a measure of central tendency and gives the average test score of a sample of respondents (examinees). It is given by

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},$$

where $x_i$ = an individual test score and $n$ = the number of respondents.

(VII) Median: If all scores are ranked from lowest to highest, the median is the middle score. Half of the scores will be lower than the median. The median is also known as the 50th percentile or the 2nd quartile.

(VIII) Range of Scores: The range is defined as the difference between the highest and lowest test scores. It is a basic measure of variability.

(IX) Standard Deviation: For a sample of $n$ examinees, the standard deviation, denoted by $s$, of the test scores is given by the following equation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n-1}},$$

where $x_i$ = an individual test score and $\bar{x}$ = the average test score. The standard deviation is a measure of variability, or the spread of the score distribution. It measures how far the scores deviate from the mean. If the scores are grouped closely together, the test will have a small standard deviation. A test with a large standard deviation is considered better at discriminating between student performance levels.

(ix) Variance: For a sample of n examinees, the variance, denoted by s², of test scores is defined as the square of the standard deviation, and is given by the following equation:

s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1).

(x) Skewness: For a sample of n examinees, the skewness, denoted by α_3, of the distribution of the test scores is given by the following equation:

α_3 = [ n / ((n − 1)(n − 2)) ] Σ_{i=1}^{n} ((x_i − x̄) / s)³,

where x_i = individual test score, x̄ = average test score, and s = standard deviation of test scores. It measures the lack of symmetry of the distribution. The skewness is 0 for a symmetric distribution, and is negative or positive depending on whether the distribution is negatively skewed (has a longer left tail) or positively skewed (has a longer right tail).

(xi) Kurtosis: For a sample of n examinees, the kurtosis, denoted by α_4, of the distribution of the test scores is given by the following equation:

α_4 = [ n(n + 1) / ((n − 1)(n − 2)(n − 3)) ] Σ_{i=1}^{n} ((x_i − x̄) / s)⁴ − 3(n − 1)² / ((n − 2)(n − 3)) + 3,

where x_i = individual test score, x̄ = average test score, and s = standard deviation of test scores. It measures the tail-heaviness (the amount of probability in the tails). For the normal distribution, α_4 = 3. Thus, depending on whether α_4 > 3 or α_4 < 3, a distribution is heavier tailed or lighter tailed than the normal distribution.
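The following minimal Python sketch computes the sample statistics defined in (v)–(xi) above, applied to a hypothetical vector of test scores (not the exam data analyzed below); the adjusted formulas for α_3 and α_4 match the equations above (note that Minitab's descriptive-statistics output usually reports excess kurtosis, i.e., α_4 − 3):

    import numpy as np

    x = np.array([12.0, 15.0, 17.0, 14.0, 19.0, 22.0, 13.0])   # hypothetical scores
    n = len(x)
    mean = x.mean()
    s = x.std(ddof=1)                  # sample standard deviation
    z = (x - mean) / s                 # standardized deviations

    skew = n / ((n - 1) * (n - 2)) * (z**3).sum()
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * (z**4).sum()
            - 3 * (n - 1)**2 / ((n - 2) * (n - 3)) + 3)        # normal -> 3
    print(mean, s, skew, kurt)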


3. Use of Item Analysis Statistics

This section provides some uses of item analysis statistics for two multiple-choice math examinations (which we call MAT0000-Version A and MAT0000-Version B). It consists of three parts, which are described below.

3.1. Test Item Analysis of MAT0000-Version A and MAT0000-Version B Exams

An item analysis of the data obtained from my MAT0000-Version A and MAT0000-Version B exam items is presented here based upon classical test theory (CTT). Various test item statistics and relevant statistical graphs (for both test forms, Versions A and B), computed using the ParSCORETM item analysis report and the Minitab software, are summarized in Tables 1 – 5 below. Each version consisted of 30 items, and there were two different groups of 7 students for each version. It appears from these statistical analyses that the large value of KR_20 = 0.90 for Version B indicates its high reliability in comparison to Version A. This is also substantiated by the large positive values of Mean D = 0.450 (≥ 0.30) and Mean Pt.-Bis. r = 0.4223, the small value of the standard error of measurement (SEM = 1.82), and an ideal value of the mean (19.57, above the passing score of 18) for Version B. These analyses are also evident from the bar charts and scatter plots drawn for the various test item statistics using Minitab, that is, the item difficulty (p), item discrimination index (D), and point-biserial correlation coefficient (r_pbis), which are presented below in Figures 1 and 2.
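A minimal Python sketch of how these three item statistics can be computed from a scored 0/1 response matrix is given below; the matrix and the simple half-split into upper and lower groups are hypothetical simplifications (ParSCORETM's grouping rules and its corrected point-biserial, if any, may differ):

    import numpy as np

    # Hypothetical scored responses: rows = examinees, columns = items
    X = np.array([[1, 1, 0, 1],
                  [1, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 1, 1, 1],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1]])
    totals = X.sum(axis=1)
    order = np.argsort(totals)
    half = len(X) // 2
    lower, upper = order[:half], order[-half:]       # bottom/top halves by total

    p = X.mean(axis=0)                               # item difficulty
    D = X[upper].mean(axis=0) - X[lower].mean(axis=0)   # discrimination, PU - PL
    r_pbis = np.array([np.corrcoef(X[:, j], totals)[0, 1]
                       for j in range(X.shape[1])])  # point-biserial correlation
    print(p, D, r_pbis)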

Table 1

A Comparison of MAT0000-Version A and MAT0000-Version B Exam Test Items

Exam Version   KR_20 Reliability   Mean    SD     SEM    p < 0.3   0.3 ≤ p ≤ 0.7   p > 0.7   D ≥ 0.2
A              0.53                17.14   2.80   1.92   8         10              12        14
B              0.90                19.57   5.75   1.82   1         15              14        20

Exam Version   Mean D   Mean Pt.-Bis. r
A              0.233    0.2060
B              0.450    0.4223

Page 36: Polygon 2016

11

Table 2 (MAT0000-Version A - Data Display)

Row   PU    PL    Disc. Ind. (D)   Difficulty (p)   Difficulty (p) %   Pt-Bis (r)
 1    1.0   0.0    1.0             0.4286            42.86              0.78
 2    1.0   1.0    0.0             0.8571            85.71              0.02
 3    1.0   0.5    0.5             0.8571            85.71              0.46
 4    1.0   0.0    1.0             0.5714            57.14              0.66
 5    1.0   0.0    1.0             0.5714            57.14              0.77
 6    1.0   0.0    1.0             0.7143            71.43              0.82
 7    0.5   0.0    0.5             0.5714            57.14              0.56
 8    1.0   1.0    0.0             1.0000           100.00              0.00
 9    0.0   0.5   -0.5             0.1429            14.29             -0.46
10    0.5   0.5    0.0             0.4286            42.86              0.27
11    0.5   0.5    0.0             0.4286            42.86             -0.15
12    1.0   1.0    0.0             1.0000           100.00              0.00
13    1.0   1.0    0.0             1.0000           100.00              0.00
14    0.0   0.0    0.0             0.0000             0.00              0.00
15    1.0   0.5    0.5             0.5714            57.14              0.25
16    1.0   0.5    0.5             0.7143            71.43              0.37
17    1.0   0.5    0.5             0.8571            85.71              0.60
18    1.0   1.0    0.0             1.0000           100.00              0.00
19    1.0   1.0    0.0             1.0000           100.00              0.00
20    1.0   0.5    0.5             0.8571            85.71              0.46
21    1.0   0.5    0.5             0.8571            85.71              0.46
22    0.5   0.5    0.0             0.5714            57.14             -0.16
23    0.0   0.5   -0.5             0.1429            14.29             -0.46
24    0.5   1.0   -0.5             0.5714            57.14             -0.27
25    0.0   0.0    0.0             0.2857            28.57              0.08
26    0.0   0.0    0.0             0.1429            14.29             -0.02
27    1.0   0.5    0.5             0.4286            42.86              0.37
28    0.5   0.0    0.5             0.1429            14.29              0.71
29    0.5   0.0    0.5             0.2857            28.57              0.53
30    0.0   0.5   -0.5             0.1429            14.29             -0.46

Table 3

Descriptive Statistics: MAT0000-Version A

Variable           Mean     SE Mean   StDev    Variance   Minimum       Q1
Disc. Ind. (D)     0.2333   0.0821    0.4498    0.2023    -0.5000       0.000000000
Difficulty (p)     0.5714   0.0573    0.3139    0.0985     0.000000000  0.2857
Difficulty (p) %  57.14     5.73     31.39    985.11       0.000000000  28.57
Pt-Bis (r)         0.2063   0.0703    0.3850    0.1482    -0.4600      -0.00500

Variable           Median        Q3       Maximum
Disc. Ind. (D)     0.000000000   0.5000   1.0000
Difficulty (p)     0.5714        0.8571   1.0000
Difficulty (p) %  57.14         85.71   100.00
Pt-Bis (r)         0.1650        0.5375   0.8200


[Figure 1 here: Minitab bar charts and scatter plots for Version A — Chart of Difficulty (p) %; Chart of Disc. Ind. (D); Chart of Pt-Bis (r); Scatterplot of Disc. Ind. (D) vs Difficulty (p) %; Chart of Disc. Ind. (D) vs Difficulty (p) %; Scatterplot of Pt-Bis (r) vs Difficulty (p) %; Chart of Pt-Bis (r) vs Difficulty (p) %]

Figure 1

(Bar Charts and Scatter Plots for p, D, and r_pbis, Version A)


Table 4 (MAT0000-Version B - Data Display)

Row   PU    PL    Disc. Ind. (D)   Difficulty (p)   Difficulty (p) %   Pt-Bis (r)
 1    1.0   1.0    0.0             1.0000           100.00              0.00
 2    1.0   1.0    0.0             0.7143            71.43              0.06
 3    1.0   1.0    0.0             1.0000           100.00              0.00
 4    1.0   1.0    0.0             0.8571            85.71              0.11
 5    1.0   0.5    0.5             0.8571            85.71              0.54
 6    1.0   0.5    0.5             0.7143            71.43              0.67
 7    1.0   0.0    1.0             0.4286            42.86              0.92
 8    1.0   0.5    0.5             0.4286            42.86              0.37
 9    0.5   0.5    0.0             0.4286            42.86              0.42
10    1.0   0.0    1.0             0.4286            42.86              0.92
11    1.0   0.5    0.5             0.5714            57.14              0.69
12    1.0   1.0    0.0             1.0000           100.00              0.00
13    1.0   0.5    0.5             0.8571            85.71              0.32
14    0.5   0.0    0.5             0.4286            42.86              0.37
15    1.0   0.5    0.5             0.5714            57.14              0.54
16    0.5   0.0    0.5             0.5714            57.14              0.34
17    1.0   0.0    1.0             0.5714            57.14              0.69
18    1.0   1.0    0.0             1.0000           100.00              0.00
19    1.0   1.0    0.0             1.0000           100.00              0.00
20    1.0   0.5    0.5             0.8571            85.71              0.54
21    0.5   1.0   -0.5             0.8571            85.71             -0.39
22    1.0   0.5    0.5             0.7143            71.43              0.67
23    0.5   0.0    0.5             0.1429            14.29              0.67
24    1.0   0.0    1.0             0.4286            42.86              0.92
25    1.0   0.0    1.0             0.5714            57.14              0.44
26    1.0   0.0    1.0             0.4286            42.86              0.67
27    1.0   0.5    0.5             0.7143            71.43              0.06
28    0.5   0.0    0.5             0.1429            14.29              0.67
29    1.0   0.5    0.5             0.8571            85.71              0.54
30    1.0   0.0    1.0             0.4286            42.86              0.92

Table 5

Descriptive Statistics: MAT0000-Version B

Variable           Mean     SE Mean   StDev    Variance   Minimum   Q1
Disc. Ind. (D)     0.4500   0.0733    0.4015    0.1612    -0.5000   0.000000000
Difficulty (p)     0.6524   0.0458    0.2508    0.0629     0.1429   0.4286
Difficulty (p) %  65.24     4.58     25.08    628.81      14.29    42.86
Pt-Bis (r)         0.4223   0.0628    0.3440    0.1183    -0.3900   0.0600

Variable           Median   Q3       Maximum
Disc. Ind. (D)     0.5000   0.6250   1.0000
Difficulty (p)     0.6429   0.8571   1.0000
Difficulty (p) %  64.29    85.71   100.00
Pt-Bis (r)         0.4900   0.6700   0.9200


[Figure 2 here: Minitab bar charts and scatter plots for Version B — Chart of Difficulty (p) %; Chart of Disc. Ind. (D); Chart of Pt-Bis (r); Scatterplot of Disc. Ind. (D) vs Difficulty (p) %; Chart of Disc. Ind. (D) vs Difficulty (p) %; Scatterplot of Pt-Bis (r) vs Difficulty (p) %; Chart of Pt-Bis (r) vs Difficulty (p) %]

Figure 2

(Bar Charts and Scatter Plots for p, D, and r_pbis, Version B)

3.2. A Comparison of MAT0000-Version A and MAT0000-Version B Exams Performance

A Two-Sample T-Test: To identify whether there is a significant difference between the MAT0000-Version A and MAT0000-Version B exam performance of the students, a two-sample T-test was conducted using the Minitab and Statdisk software. First, the assumption of normality was checked for both groups using the Anderson-Darling test, and the normality requirements were met. The results are provided in Tables 6 and 7. Moreover, at the significance level of α = 0.05, the two-sample T-test fails to reject the claim that μ_A = μ_B; that is, the sample does not provide enough evidence to reject the claim.
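The Welch (unequal-variance) test in Table 7 below can be reproduced from the reported summary statistics alone, for example with scipy (a sketch; scipy uses the fractional Welch degrees of freedom, so the P-value differs slightly from Minitab's integer-df value):

    from scipy import stats

    t, p = stats.ttest_ind_from_stats(mean1=17.14, std1=3.02, nobs1=7,
                                      mean2=19.57, std2=6.21, nobs2=7,
                                      equal_var=False)    # Welch's t-test
    print(t, p)   # t ≈ -0.93, p ≈ 0.38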


Table 6

Descriptive Statistics: MAT0000-Version A and MAT0000-Version B Exams

Variable   Total Count   N   Mean    SE Mean   StDev   Variance   Minimum   Q1      Median
MAT0000A   7             7   17.14   1.14      3.02     9.14      13.00     14.00   17.00
MAT0000B   7             7   19.57   2.35      6.21    38.62      12.00     15.00   18.00

Variable   Q3      Maximum   Skewness   Kurtosis
MAT0000A   19.00   22.00      0.16      -0.03
MAT0000B   25.00   29.00      0.40      -1.31

Table 7

Two-Sample T-Test and CI: MAT0000-Version A and MAT0000-Version B

(Assume Unequal Variances)

Two-sample T for MAT0000-Version A vs MAT0000-Version B

          N   Mean    StDev   SE Mean
MAT0000A  7   17.14   3.02    1.1
MAT0000B  7   19.57   6.21    2.3

Difference = mu (MAT0000A) - mu (MAT0000B)
Estimate for difference: -2.42857
95% CI for difference: (-8.45211, 3.59497)
T-Test of difference = 0 (vs not =): T-Value = -0.93  P-Value = 0.380  DF = 8

Two-Sample T-Test and CI: MAT0000-Version A and MAT0000-Version B
(Assume Equal Variances)

Two-sample T for MAT0000-Version A vs MAT0000-Version B

          N   Mean    StDev   SE Mean
MAT0000A  7   17.14   3.02    1.1
MAT0000B  7   19.57   6.21    2.3

Difference = mu (MAT0000A) - mu (MAT0000B)
Estimate for difference: -2.42857
95% CI for difference: (-8.11987, 3.26273)
T-Test of difference = 0 (vs not =): T-Value = -0.93  P-Value = 0.371  DF = 12
Both use Pooled StDev = 4.8868


3.3. A Comparison of MAT0000 Classroom Test Average (Pre) vs Final Exam (Post) Performance

A Paired Samples T-Test: To identify whether there is a significant gain in the MAT0000 posttest (state exit exam) compared to the pretest (classroom test average) performance of the students, a paired samples T-test was conducted using the Minitab and Statdisk software. First, to check whether the normality assumption for a paired samples T-test is met, hypothesis tests for the gain scores were conducted using Minitab; these suggested that the normality requirements were met, the distribution being close to normal. The results are provided in Tables 8 – 10 and Figure 5 below. Moreover, at the significance level of α = 0.05, the paired samples T-test fails to reject the claim that μ_Pre = μ_Post; that is, the sample does not provide enough evidence to reject the claim.

STATDISK OUTPUT: MAT0000

Paired T-Test and CI: MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)

Figure 5 (Paired Samples T-Test: MAT0000 Pre vs Post Exams)

Table 8


MINITAB OUTPUT

Data Display: MAT0000 MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)

Row   20071-Pre   20071-Post   Gain
 1    69.4        56.7        -12.7
 2    63.2        50.0        -13.2
 3    54.8        60.0          5.2
 4    78.0        83.3          5.3
 5    75.6        76.7          1.1
 6    66.8        63.3         -3.5
 7    51.8        46.7         -5.1
 8    44.6        40.0         -4.6
 9    72.6        56.7        -15.9
10    68.4        60.0         -8.4
11    67.2        50.0        -17.2
12    76.6        96.7         20.1
13    82.6        73.3         -9.3
14    49.0        43.3         -5.7

Table 9

MAT0000

Descriptive Statistics: MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)

Variable     Total Count   N    Mean    SE Mean   StDev   Variance   Minimum   Q1       Median
20071-Post   14            14   61.19   4.33      16.21   262.62     40.00     49.18    58.35
20071-Pre    14            14   65.76   3.12      11.66   136.01     44.60     54.05    67.80
Gain         14            14   -4.56   2.67      10.01   100.14    -17.20    -12.83    -5.40

Variable     Q3      Maximum   Range   IQR     Skewness   Kurtosis
20071-Post   74.15   96.70     56.70   24.98    0.84       0.22
20071-Pre    75.85   82.60     38.00   21.80   -0.51      -0.80
Gain          2.13   20.10     37.30   14.95    1.10       1.56

Table 10

MAT0000

Paired T-Test and CI: MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)

Paired T for 20071-Post - 20071-Pre

            N    Mean       StDev      SE Mean
20071-Post  14   61.1929    16.2056    4.3311
20071-Pre   14   65.7571    11.6622    3.1169
Difference  14   -4.56429   10.00704   2.67450

95% CI for mean difference: (-10.34218, 1.21361)
T-Test of mean difference = 0 (vs not = 0): T-Value = -1.71  P-Value = 0.112
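The paired test above can be reproduced directly from the pre/post scores in the data display (Table 8), for example with scipy:

    from scipy import stats

    pre  = [69.4, 63.2, 54.8, 78.0, 75.6, 66.8, 51.8,
            44.6, 72.6, 68.4, 67.2, 76.6, 82.6, 49.0]
    post = [56.7, 50.0, 60.0, 83.3, 76.7, 63.3, 46.7,
            40.0, 56.7, 60.0, 50.0, 96.7, 73.3, 43.3]

    t, p = stats.ttest_rel(post, pre)   # paired samples t-test on the gains
    print(t, p)                         # t ≈ -1.71, p ≈ 0.112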


4. Concluding Remarks and Recommendation for Future Research

This paper has discussed some item analysis statistics which are available in the ParSCORETM item analysis report, and the uses of these statistics for some multiple-choice math examinations have been investigated. It is hoped that the present study will be helpful in recognizing the most critical pieces of the state exit test item data, and in evaluating whether or not a test item needs revision. The methods discussed in this project can be used to describe the relevance of test item analysis to classroom tests. These procedures can also be used or modified to measure, describe and improve tests or surveys such as college mathematics placement exams (that is, CPT), mathematics study skills, attitude surveys, test anxiety, information literacy, and other general education learning outcomes. Further, research based on Bloom's cognitive taxonomy of test items, the applicability of beta-binomial models and Bayesian analysis of test items, and item response theory (IRT) using the 1-parameter logistic model (also known as the Rasch model), the 2- and 3-parameter logistic models, plots of the item characteristic curves (ICCs) of different test items, and other characteristics of IRT measurement instruments may also be investigated.

Acknowledgments

The author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. He would also like to acknowledge his sincere indebtedness to the works of various authors and resources on the subject which he consulted during the preparation of this research project. The author is thankful to his wife for her patience and perseverance during the period in which this paper was prepared. The author would like to dedicate this paper to his late parents, brothers and sisters. Last but not least, the author is thankful to Miami Dade College for giving him an opportunity to serve this college, without which it would have been impossible to conduct his research.

References

Brown, J. D. (1996). Testing in language programs. Prentice Hall, Upper Saddle River, NJ.

Chase, C. I. (1999). Contemporary assessment for educators. Longman, New York.

Crocker, L. and Algina, J. (1986). Introduction to classical and modern test theory. Holt, Rinehart and Winston, New York.

DeVellis, R. F. (1991). Scale development: Theory and applications. Sage Publications, Newbury Park.

Ebel, R. L. (1979). Essentials of educational measurement (3rd ed). Prentice Hall, Englewood Cliffs, NJ.

Ebel, R. L. and Frisbie, D. A. (1986). Essentials of educational measurement. Prentice-Hall, Inc, Englewood Cliffs, NJ.

Glass, G. V. and Hopkins, K. D. (1995). Statistical Methods in Education and Psychology, 3rd edition. Allyn & Bacon, Boston.

Haladyna, T. M. (1999). Developing and validating multiple-choice exam items, 2nd ed. Lawrence Erlbaum Associates, Mahwah, NJ.

Haladyna, T. M., Downing, S. M. and Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational Measurement (p. 141). American Council on Education, Washington DC.

Kelley, T. L. (1939). The selection of upper and lower groups for the validation of test items. J. Ed. Psych., 30, 17-24.

Lord, F. M. and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley, Reading, MA.

Mertler, C. A. (2003). Classroom Assessment – A Practical Guide for Educators. Pyrczak Publishing, Los Angeles, CA.

Millman, J. and Greene, J. (1993). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (pp. 335-366). Oryx Press, Phoenix, AZ.

Nitko, A. J. (2001). Educational assessment of students (3rd edition). Prentice Hall, Upper Saddle River, NJ.

Nunnally, J. C. (1972). Educational measurement and evaluation (2nd ed). McGraw-Hill, New York.

Nunnally, J. C. (1978). Psychometric Theory, Second Edition. McGraw-Hill, New York.

Oosterhof, A. (2001). Classroom applications for educational measurement. Merrill Prentice Hall, Upper Saddle River, NJ.

Popham, W. J. (1981). Modern educational measurement. Prentice-Hall, Englewood Cliffs, NJ.

Suen, H. K. (1990). Principles of exam theories. Lawrence Erlbaum Associates, Hillsdale, NJ.

Tanner, D. E. (2001). Assessing academic achievement. Allyn & Bacon, Boston.

Thompson, B. and Levitov, J. E. (1985). Using microcomputers to score and evaluate test items. Collegiate Microcomputer, 3, 163-168.

Thorndike, R. M., Cunningham, G. K., Thorndike, R. L. and Hagen, E. P. (1991). Measurement and evaluation in psychology and education (5th ed). MacMillan, New York.

Triola, M. F. (2006). Elementary Statistics. Pearson Addison-Wesley, New York.

Wiersma, W. and Jurs, S. G. (1990). Educational measurement and testing (2nd ed). Allyn and Bacon, Boston, MA.

Wood, D. A. (1960). Test construction: Development and interpretation of achievement tests. Charles E. Merrill Books, Inc, Columbus, OH.


Testing the Goodness of Fit of Continuous Probability Distributions

to Some Flood Data

M. Shakil, Ph.D.

Professor of Mathematics

Department of Liberal Arts and Sciences

Miami Dade College, Hialeah Campus

FL 33012, USA

E-mail: [email protected]

Abstract

In this paper, we have tested the goodness of fit of the Cauchy, generalized extreme value, Laplace,

log-Pearson 3, logistic, and normal probability distributions to ordered differences in flood

heights for two stations on the Fox River in Wisconsin for 33 years, as reported in Best et al.

(2008). It was found that the generalized extreme value distribution was the best fit amongst the

six continuous probability distributions for the ordered differences in flood heights data based on

both Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests. On the other hand, it was

found that log-Pearson 3 distribution fitted reasonably well to the ordered differences in flood

heights data based on Chi-Squared tests goodness of fit test. Since fitting of a probability

distribution to flood data may be helpful in predicting the probability or forecasting the

frequency of occurrence of the flood during monsoon and hurricanes, and planning beforehand,

it is hoped that this study will be quite useful in many problems of business and economic

planning, hydrological processes and designs, and other applied research.

2010 Mathematics Subject Classifications: 62C12, 62F03, 62N02, 62N03, 62-07.

Keywords: Flood data, Goodness of fit test, Hurricane, Monsoon, Probability distribution.

1. Introduction

According to Wikipedia, “a flood is an overflow of water that submerges land which is

usually dry. The European Union (EU) Floods Directive defines a flood as a covering by water

of land not normally covered by water. Flooding may occur as an overflow of water from water

bodies, such as a river, lake, or ocean, in which the water overtops or breaks levees, resulting in

some of that water escaping its usual boundaries, or it may occur due to an accumulation of

rainwater on saturated ground in an areal flood. Floods can also occur in rivers when the flow

rate exceeds the capacity of the river channel, particularly at bends or meanders in the waterway.

Floods often cause damage to homes and businesses if they are in the natural flood plains of

rivers. Some floods develop slowly, while others such as flash floods, can develop in just a few

minutes and without visible signs of rain,” (https://en.wikipedia.org/wiki/Flood). The rainfall or


other types of precipitation produced by hurricanes also cause widespread flooding in the affected areas, due to which people face extensive damage and destruction of their property, including loss of life, resulting in great socio-economic problems. The statistical analysis of

flood data is therefore very crucial, and plays an important role in many studies of hydrological

processes and designs. Many researchers have investigated the statistical analysis of flood data,

see, for example, Pericchi and Rodríguez-Iturbe (1985), Opere et al. (2006), Yiou et al. (2006),

Van Bladeren et al. (2007), Ghorbani et al. (2011), Win and Win (2014), and Ahn et al. (2014),

and references therein. Since fitting of a probability distribution to flood data may be helpful in

predicting the probability or forecasting the frequency of occurrence of the flood during

monsoon and hurricanes, and planning beforehand, it is hoped that this study will be useful in

many problems of business and economic planning, hydrological processes and designs, and

other applied research. Therefore, the statistical analysis of flood data is very necessary and

important. Motivated by the importance of the study of flood data in many problems of

hydrological processes and designs, we have investigated in this paper the goodness of fit (GOF) of the Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions to ordered differences in flood heights for two stations on the Fox River in Wisconsin for 33 years, as reported in Best et al. (2008), to determine their applicability and best fit to these data based on the GOF tests, namely, the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared tests. Other researchers have also investigated statistical analyses of these data; see, for example, Bain and Engelhardt (1973), Puig and Stephens (2000), Meintanis (2004), Krishnamoorthy (2006), and Gulati

(2011). For the applications of the log Pearson type-3 distribution in hydrology, see, for example,

Phien and Ajirajah (1984). Also, for a discussion of the GOF tests, the interested readers are

referred to Massey (1951), Stephens (1974), Conover (1999), Blischke and Murthy (2000), Hogg

and Tanis (2006), and Ahsanullah et al. (2014), among others.

The organization of this paper is as follows. Section 2 contains the Methodology, along with the

description of the ordered differences in flood heights for two stations on the Fox River in

Wisconsin for 33 years, as reported in Best et al. (2008). Also, in Section 2, we have provided

the continuous probability distributions considered in this paper, namely, Cauchy,

generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions.

In Section 3, we have presented the results and discussions of our findings. Some

concluding remarks are given in Section 4.

2. Methodology

In this section, we test the goodness of fit (GOF) of the Cauchy, generalized extreme value,

Laplace, log-Pearson 3, logistic, and normal probability distributions to ordered differences in

flood heights for two stations on the Fox River in Wisconsin for 33 years, as reported in Best

et al. (2008) to determine their applicability and best fit to these data based on the goodness of fit

(GOF) tests, namely, the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared tests. For the sake of completeness, the ordered differences in flood heights for two

stations on the Fox River in Wisconsin for 33 years are provided in Table 1 below. In Table 2,

we have provided the probability density functions and parameters of the Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions considered in this paper.

Table 1

(Source: Best et al., 2008)

Ordered differences in flood heights for two stations on the Fox River

in Wisconsin for 33 years

1.96, 1.96, 3.60, 3.80, 4.79, 5.66, 5.76, 5.78, 6.27, 6.30, 6.76, 7.65, 7.84, 7.99, 8.51,

9.18, 10.13, 10.24, 10.25, 10.43, 11.45, 11.48, 11.75, 11.81, 12.34, 12.78, 13.06, 13.29,

13.98, 14.18, 14.40, 16.22 and 17.06.

Table 2

(Continuous Probability Distributions Used in Flood Data Analysis)

Sl. No.   Distribution   Probability Density Function f(x) and Parameters

1  Cauchy
   f(x) = 1 / { πσ [ 1 + ((x − μ)/σ)² ] },
   where σ > 0 is the scale parameter, μ (real) is the location parameter, and −∞ < x < ∞.

2  Generalized Extreme Value
   f(x) = (1/σ) (1 + k z)^(−1 − 1/k) exp[ −(1 + k z)^(−1/k) ]  for k ≠ 0,
   f(x) = (1/σ) exp( −z − exp(−z) )  for k = 0,  where z = (x − μ)/σ,
   where k is the shape parameter, σ > 0 is the scale parameter, μ (real) is the location parameter, and 1 + k (x − μ)/σ > 0 for k ≠ 0, while −∞ < x < ∞ for k = 0.

3  Laplace
   f(x) = (λ/2) exp( −λ |x − μ| ),
   where λ > 0 is the inverse scale parameter, μ (real) is the location parameter, and −∞ < x < ∞.

4  Log-Pearson III (LP3)
   f(x) = [ 1 / ( x |β| Γ(α) ) ] [ (ln x − γ)/β ]^(α−1) exp[ −(ln x − γ)/β ],
   where α > 0, β ≠ 0, and γ (real), with 0 < x ≤ e^γ when β < 0, and e^γ ≤ x < ∞ when β > 0.

5  Logistic
   f(x) = exp[ −(x − μ)/σ ] / { σ ( 1 + exp[ −(x − μ)/σ ] )² },
   where σ > 0 is the scale parameter, μ (real) is the location parameter, and −∞ < x < ∞.

6  Normal
   f(x) = [ 1 / ( σ √(2π) ) ] exp[ −(x − μ)² / (2σ²) ],
   where σ > 0 is the scale parameter, μ (real) is the location parameter, and −∞ < x < ∞.
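As a quick sanity check of the generalized extreme value density in Table 2, the following Python sketch evaluates it directly and compares it against scipy (note that scipy's genextreme uses the opposite sign convention for the shape, c = −k); the parameter values used are the fitted ones reported later in Table 5:

    import numpy as np
    from scipy import stats

    k, sigma, mu = -0.32444, 4.2124, 7.9779   # fitted GEV parameters (Table 5)

    def gev_pdf(x, k, sigma, mu):
        z = (x - mu) / sigma
        t = 1 + k * z                 # support requires 1 + k*z > 0 for k != 0
        return (1 / sigma) * t**(-1 - 1 / k) * np.exp(-t**(-1 / k))

    x = 9.35
    print(gev_pdf(x, k, sigma, mu))
    print(stats.genextreme.pdf(x, c=-k, loc=mu, scale=sigma))   # should agree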

Fitting of the above-mentioned distributions to the flood data is carried out as follows. As a first step, using

Easyfit software, we have computed the descriptive statistics of the flood data as given in Table

3. Also, using the Statdisk software, we have tested the normality of the flood data by the Ryan-Joiner test (similar to the Shapiro-Wilk test), along with drawing a histogram of the data, which

are given in Figure 1 and Table 4.

We have tested the fitting of Cauchy, generalized extreme value, Laplace, log-Pearson 3,

logistic, and normal probability distributions to ordered differences in flood heights for two

stations on the Fox River in Wisconsin for 33 years (Table 1). For this, we have used the Easyfit

software for estimating the parameters of these distributions, and for the goodness of fit (GOF) tests, namely, the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared tests, which are provided in Tables 5 and 6 below. For the parameters estimated in Table 5, the

Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability

distributions respectively have been superimposed on the histogram of the ordered differences in

flood heights, which is provided in Figure 2 below. For these distributions, we have provided the

cumulative distribution function, survival function, hazard function, cumulative hazard function,

P-P plot, Q-Q plot and probability difference in Figures 3 – 9 respectively as given below.
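The same fit-and-test workflow can be sketched with open tools in place of Easyfit: maximum-likelihood fits followed by a Kolmogorov-Smirnov statistic for each candidate. The sketch below omits log-Pearson 3, which has no direct scipy equivalent (a common workaround is fitting pearson3 to the logarithms of the data), and its MLE fits need not coincide exactly with Easyfit's estimates:

    import numpy as np
    from scipy import stats

    flood = np.array([1.96, 1.96, 3.60, 3.80, 4.79, 5.66, 5.76, 5.78, 6.27,
                      6.30, 6.76, 7.65, 7.84, 7.99, 8.51, 9.18, 10.13, 10.24,
                      10.25, 10.43, 11.45, 11.48, 11.75, 11.81, 12.34, 12.78,
                      13.06, 13.29, 13.98, 14.18, 14.40, 16.22, 17.06])

    for name in ["cauchy", "genextreme", "laplace", "logistic", "norm"]:
        dist = getattr(stats, name)
        params = dist.fit(flood)                      # maximum-likelihood fit
        ks = stats.kstest(flood, name, args=params)   # KS goodness-of-fit
        print(f"{name:10s}  D = {ks.statistic:.4f}")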

Table 3

(Descriptive Statistics)

Statistic Value

Sample Size 33

Range 15.1

Mean 9.3533

Variance 16.169

Std. Deviation 4.0211

Coef. of Variation 0.42991

Std. Error 0.69999

Skewness -0.07331

Excess Kurtosis -0.79828

Percentile Value

Min 1.96

5% 1.96

10% 3.68

25% (Q1) 6.025

50% (Median) 10.13

75% (Q3) 12.56

90% 14.312

95% 16.472

Max 17.06


Figure 1: Normality Assessment of Flood Data

Table 4

(Ryan-Joiner Test of Normality Assessment)

Ryan-Joiner Test

Test statistic, Rp: 0.9925

Critical value for 0.05 significance level: 0.9666

Critical value for 0.01 significance level: 0.9528

Fail to reject normality with a 0.05 significance level.

Fail to reject normality with a 0.01 significance level.

Possible Outliers

Number of data values below Q1 by more than 1.5 IQR: 0

Number of data values above Q3 by more than 1.5 IQR: 0
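The Ryan-Joiner statistic itself is not available in common open-source packages; as a close analogue, a Shapiro-Wilk test on the same data should lead to the same qualitative conclusion here (normality not rejected):

    from scipy import stats

    flood = [1.96, 1.96, 3.60, 3.80, 4.79, 5.66, 5.76, 5.78, 6.27, 6.30, 6.76,
             7.65, 7.84, 7.99, 8.51, 9.18, 10.13, 10.24, 10.25, 10.43, 11.45,
             11.48, 11.75, 11.81, 12.34, 12.78, 13.06, 13.29, 13.98, 14.18,
             14.40, 16.22, 17.06]

    w, p = stats.shapiro(flood)
    print(w, p)   # a p-value above 0.05 means normality is not rejected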

Table 5

Fitting Results

#  Distribution          Parameters
1  Cauchy                σ = 2.8118, μ = 9.6936
2  Gen. Extreme Value    k = −0.32444, σ = 4.2124, μ = 7.9779
3  Laplace               λ = 0.3517, μ = 9.3533
4  Log-Pearson 3         α = 2.9931, β = −0.31728, γ = 3.065
5  Logistic              σ = 2.217, μ = 9.3533
6  Normal                σ = 4.0211, μ = 9.3533


Table 6

Goodness of Fit – Summary

#  Distribution          Kolmogorov-Smirnov    Anderson-Darling     Chi-Squared
                         Statistic   Rank      Statistic   Rank     Statistic   Rank
1  Cauchy                0.11607     5         0.85112     5        1.3161      3
2  Gen. Extreme Value    0.07953     1         0.18391     1        1.3503      4
3  Laplace               0.15476     6         1.0484      6        5.5995      6
4  Log-Pearson 3         0.09304     3         0.22833     2        1.0865      1
5  Logistic              0.1142      4         0.4503      4        2.0981      5
6  Normal                0.0929      2         0.2467      3        1.1639      2

[Figure 2 here: probability density functions of the fitted Cauchy, Laplace, Logistic, Normal, Gen. Extreme Value, and Log-Pearson 3 distributions superimposed on the histogram of the flood data]

Figure 2: Fitting of Probability Density Functions to the Flood Data


[Figure 3 here: empirical CDF of the flood data with the fitted Cauchy, Laplace, Logistic, Normal, Gen. Extreme Value, and Log-Pearson 3 cumulative distribution functions]

Figure 3: Fitting of Cumulative Distribution Functions to the Flood Data

[Figure 4 here: survival functions of the fitted distributions for the flood data]

Figure 4: Survival Functions of Distributions for the Flood Data


[Figure 5 here: hazard functions of the fitted distributions for the flood data]

Figure 5: Hazard Functions of Distributions for the Flood Data

[Figure 6 here: cumulative hazard functions of the fitted distributions for the flood data]

Figure 6: Cumulative Hazard Functions of Distributions for the Flood Data

[Figure 7 here: P-P plot, P(Empirical) vs P(Model), for the fitted distributions]

Figure 7: P-P Plot of Distributions for the Flood Data

[Figure 8 here: Q-Q plot, sample values vs model quantiles, for the fitted distributions]

Figure 8: Q-Q Plot of Distributions for the Flood Data


[Figure 9 here: probability differences between the empirical and fitted CDFs for the flood data]

Figure 9: Probability Differences of Distributions for the Flood Data

3. Results and Discussions

The descriptive statistics of the ordered differences in flood heights for two stations on the Fox

River in Wisconsin for 33 years, as reported in Best et al. (2008) (see Table 1), are

provided in Table 3 above. Also, we have tested the normality of the flood data by the Ryan-Joiner test (similar to the Shapiro-Wilk test), along with drawing a histogram of the data, which are given in Figure 1 and Table 4. The following are the observations based on the Ryan-Joiner test of normality of the flood data, which are also confirmed by the skewness of the flood data as computed in Table 3:

(a) Fail to reject normality with a 0.05 significance level,

(b) Fail to reject normality with a 0.01 significance level.

Further, we have tested the fitting of Cauchy, generalized extreme value, Laplace, log-Pearson 3,

logistic, and normal probability distributions to ordered differences in flood heights for two

stations on the Fox River in Wisconsin for 33 years. The estimates of parameters of Cauchy,

generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions

for the flood data are given in Table 5. For the parameters estimated in Table 5, the probability

density functions of Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and

normal probability distributions respectively have been superimposed on the histogram of the

flood data, which is provided in Figure 2. The goodness of fit (GOF) of the Cauchy,

generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions


to the flood data by Kolmogorov-Smirnov, Anderson-Darling, and the Chi-Squared GOF tests is

summarized in Table 6 above. Further, for these distributions, we have provided the cumulative

distribution function, survival function, hazard function, cumulative hazard function, P-P plot, Q-

Q plot and probability difference in Figures 3 – 9 respectively as given above. From the

Kolmogorov-Smirnov and Anderson-Darling GOF tests as provided in Table 6 and Figure 2

above, we observed that the generalized extreme value distribution is the best fit amongst the six

continuous probability distributions to the ordered differences in flood heights for two stations

on the Fox River in Wisconsin for 33 years. On the other hand, the log-Pearson 3 distribution was found to be the best fit for these data by the Chi-Squared goodness of fit test (Table 6). The

graphs of cumulative distribution function, survival function, hazard function, cumulative hazard

function, P-P plot, Q-Q plot and probability difference as provided in Figures 3 – 9 respectively

also confirm these results.

4. Concluding Remarks

In many problems of hydrological processes and designs, fitting of a probability distribution to

the flood data may be helpful in predicting the probability or forecasting the frequency of

occurrence of the flood, and planning beforehand. Motivated by the importance of the study of

flood data in many problems of hydrological processes and designs, and planning beforehand, in

this paper, we have tested the goodness of fit of Cauchy, generalized extreme value, Laplace,

log-Pearson 3, logistic, and normal probability distributions to the ordered differences in flood

heights for two stations on the Fox River in Wisconsin for 33 years, as reported in Best et al.

(2008). It was found that the generalized extreme value distribution was the best fit for these

flood data by both the Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests, whereas the log-Pearson 3 distribution was found to be the best fit by the Chi-Squared goodness of fit test. It is hoped that this study will be quite helpful in many problems of

hydrological research.

Acknowledgment

The author would like to thank the Editorial Committee of Polygon for accepting this paper for

publication in Polygon. Also, the author would like to thank Professor M. Ahsanullah, Rider

University, New Jersey, USA, and Professor B. M. Golam Kibria, FIU, Miami, USA, for their

valuable and helpful suggestions, which improved the quality and presentation of the paper.

Also, the author is thankful to his wife for her patience and perseverance for the period during

which this paper was prepared. The author would like to dedicate this paper to his late parents,

brothers and sisters. Last but not least, the author is thankful to the Miami Dade College for

giving an opportunity to serve this college, without which it was impossible to conduct his

research.


References

Ahsanullah, M., Kibria, B. M. G., and Shakil, M. (2014). Normal and Student´s t Distributions

and Their Applications. Atlantis Press, Paris, France.

Ahn, J., Cho, W., Kim, T., Shin, H., & Heo, J. H. (2014). Flood frequency analysis for the annual

peak flows simulated by an event-based rainfall-runoff model in an urban drainage basin. Water,

6(12), 3841 - 3863.

Bain, L., and Engelhardt, M. (1973). Interval estimation for the two parameter double

exponential distribution. Technometrics, 15, 875 – 887.

Best, D., Rayner, J., and Thas, O. (2008). Comparison of some tests of fit for the Laplace

distribution. Computational Statistics and Data Analysis, 52, 5338 – 5343.

Blischke, W. R., and Murthy, D. N. P. (2000). Reliability, Modeling, Prediction, and

Optimization. John Wiley & Sons, New York.

Conover, W. J. (1999). Practical Nonparametric Statistics, John Wiley & Sons, New York.

Ghorbani, M. A., Ruskeepaa, H., Singh, V. P., and Sivakumar, B. (2011). Flood frequency

analysis using Mathematica. Turkish Journal of Engineering and Environmental Sciences, 34(3),

171 - 188.

Gulati, S. (2011). Goodness of fit test for the Rayleigh and the Laplace distributions.

International Journal of Applied Mathematics and Statistics™, 24(SI-11A), 74 - 85.

Hogg, R. V., and Tanis, E. A. (2006). Probability and Statistical Inference. Pearson/Prentice

Hall, NJ.

Krishnamoorthy, K. (2006). Handbook of Statistical Distributions with Applications. Chapman

and Hall, CRC, Boca Raton.

Massey, F. J. (1951). The Kolmogorov-Smirnov test for goodness of fit. Journal of the American

Statistical Association, 6, 68 - 78.

Meintanis, S. G. (2004). A class of omnibus tests for the Laplace distribution based on the

empirical characteristic function. Communications in Statistics, Theory and Methods, 33(4), 925

– 948.

Opere, A. O., Mkhandi, S., and Willems, P. (2006). At site flood frequency analysis for the Nile

Equatorial basins. Physics and Chemistry of the Earth, Parts A/B/C, 31(15), 919 - 927.

Pericchi, L. R., and Rodríguez-Iturbe, I. (1985). On the statistical analysis of floods. In A

celebration of statistics (pp. 511-541). Springer, New York.


Phien, H. N., and Ajirajah, T. J. (1984). Applications of the log Pearson type-3 distribution in

hydrology. Journal of Hydrology, 73(3), 359 - 372.

Puig, P., and Stephens, M. A. (2000). Tests of fit for the Laplace distribution, with applications.

Technometrics, 42(4), 417 – 424.

Stephens, M. A. (1974). EDF statistics for goodness-of-fit, and some comparisons. Journal of the

American Statistical Association, 69, 730 – 737.

Van Bladeren, D., Zawada, P. K., and Mahlangu, D. (2007). Statistical Based Regional

Flood Frequency Estimation Study for South Africa Using Systematic, Historical and

Palaeoflood Data: Pilot Study, Catchment Management Area 15. Water Research Commission.

Win, N. L., and Win, K. M. (2014). Comparative Study of Flood Frequency Analysis on

Selected Rivers in Myanmar. In InCIEC 2013 (pp. 287-299). Springer Singapore.

Wikipedia. https://en.wikipedia.org/wiki/Flood

Yiou, P., Ribereau, P., Naveau, P., Nogaj, M., and Brázdil, R. (2006). Statistical analysis of

floods in Bohemia (Czech Republic) since 1825. Hydrological Sciences Journal, 51(5), 930 -

945.