Applied Statistics Using SPSS, STATISTICA, MATLAB and R
With 195 Figures and a CD
Joaquim P. Marques de Sá
Typesetting: by the editors
Production: Integra Software Services Pvt. Ltd., India
Cover design: WMX design, Heidelberg
Printed on acid-free paper SPIN: 11908944 42/3100/Integra 5 4 3 2 1 0
Library of Congress Control Number: 2007926024
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
ISBN 978-3-540-71971-7 Springer Berlin Heidelberg New York
Prof. Dr. Joaquim P. Marques de Sá
Universidade do Porto
Fac. Engenharia
Rua Dr. Roberto Frias s/n
4200-465 Porto
Portugal
e-mail: [email protected]
To Wiesje and Carlos.
Contents
Preface to the Second Edition
Preface to the First Edition
Symbols and Abbreviations
1 Introduction
1.1 Deterministic Data and Random Data
1.2 Population, Sample and Statistics
1.3 Random Variables
1.4 Probabilities and Distributions
1.4.1 Discrete Variables
1.4.2 Continuous Variables
1.5 Beyond a Reasonable Doubt...
1.6 Statistical Significance and Other Significances
1.7 Datasets
1.8 Software Tools
1.8.1 SPSS and STATISTICA
1.8.2 MATLAB and R
2 Presenting and Summarising the Data
2.1 Preliminaries
2.1.1 Reading in the Data
2.1.2 Operating with the Data
2.2 Presenting the Data
2.2.1 Counts and Bar Graphs
2.2.2 Frequencies and Histograms
2.2.3 Multivariate Tables, Scatter Plots and 3D Plots
2.2.4 Categorised Plots
2.3 Summarising the Data
2.3.1 Measures of Location
2.3.2 Measures of Spread
2.3.3 Measures of Shape
2.3.4 Measures of Association for Continuous Variables
2.3.5 Measures of Association for Ordinal Variables
2.3.6 Measures of Association for Nominal Variables
Exercises
3 Estimating Data Parameters
3.1 Point Estimation and Interval Estimation
3.2 Estimating a Mean
3.3 Estimating a Proportion
3.4 Estimating a Variance
3.5 Estimating a Variance Ratio
3.6 Bootstrap Estimation
Exercises
4 Parametric Tests of Hypotheses
4.1 Hypothesis Test Procedure
4.2 Test Errors and Test Power
4.3 Inference on One Population
4.3.1 Testing a Mean
4.3.2 Testing a Variance
4.4 Inference on Two Populations
4.4.1 Testing a Correlation
4.4.2 Comparing Two Variances
4.4.3 Comparing Two Means
4.5 Inference on More than Two Populations
4.5.1 Introduction to the Analysis of Variance
4.5.2 One-Way ANOVA
4.5.3 Two-Way ANOVA
Exercises
5 Non-Parametric Tests of Hypotheses
5.1 Inference on One Population
5.1.1 The Runs Test
5.1.2 The Binomial Test
5.1.3 The Chi-Square Goodness of Fit Test
5.1.4 The Kolmogorov-Smirnov Goodness of Fit Test
5.1.5 The Lilliefors Test for Normality
5.1.6 The Shapiro-Wilk Test for Normality
5.2 Contingency Tables
5.2.1 The 2×2 Contingency Table
5.2.2 The r×c Contingency Table
5.2.3 The Chi-Square Test of Independence
5.2.4 Measures of Association Revisited
5.3 Inference on Two Populations
5.3.1 Tests for Two Independent Samples
5.3.2 Tests for Two Paired Samples
5.4 Inference on More Than Two Populations
5.4.1 The Kruskal-Wallis Test for Independent Samples
5.4.2 The Friedman Test for Paired Samples
5.4.3 The Cochran Q Test
Exercises
6 Statistical Classification
6.1 Decision Regions and Functions
6.2 Linear Discriminants
6.2.1 Minimum Euclidian Distance Discriminant
6.2.2 Minimum Mahalanobis Distance Discriminant
6.3 Bayesian Classification
6.3.1 Bayes Rule for Minimum Risk
6.3.2 Normal Bayesian Classification
6.3.3 Dimensionality Ratio and Error Estimation
6.4 The ROC Curve
6.5 Feature Selection
6.6 Classifier Evaluation
6.7 Tree Classifiers
Exercises
7 Data Regression
7.1 Simple Linear Regression
7.1.1 Simple Linear Regression Model
7.1.2 Estimating the Regression Function
7.1.3 Inferences in Regression Analysis
7.1.4 ANOVA Tests
7.2 Multiple Regression
7.2.1 General Linear Regression Model
7.2.2 General Linear Regression in Matrix Terms
7.2.3 Multiple Correlation
7.2.4 Inferences on Regression Parameters
7.2.5 ANOVA and Extra Sums of Squares
7.2.6 Polynomial Regression and Other Models
7.3 Building and Evaluating the Regression Model
7.3.1 Building the Model
7.3.2 Evaluating the Model
7.3.3 Case Study
7.4 Regression Through the Origin
7.5 Ridge Regression
7.6 Logit and Probit Models
Exercises
8 Data Structure Analysis
8.1 Principal Components
8.2 Dimensional Reduction
8.3 Principal Components of Correlation Matrices
8.4 Factor Analysis
Exercises
9 Survival Analysis
9.1 Survivor Function and Hazard Function
9.2 Non-Parametric Analysis of Survival Data
9.2.1 The Life Table Analysis
9.2.2 The Kaplan-Meier Analysis
9.2.3 Statistics for Non-Parametric Analysis
9.3 Comparing Two Groups of Survival Data
9.4 Models for Survival Data
9.4.1 The Exponential Model
9.4.2 The Weibull Model
9.4.3 The Cox Regression Model
Exercises
10 Directional Data
10.1 Representing Directional Data
10.2 Descriptive Statistics
10.3 The von Mises Distributions
10.4 Assessing the Distribution of Directional Data
10.4.1 Graphical Assessment of Uniformity
10.4.2 The Rayleigh Test of Uniformity
10.4.3 The Watson Goodness of Fit Test
10.4.4 Assessing the von Misesness of Spherical Distributions
10.5 Tests on von Mises Distributions
10.5.1 One-Sample Mean Test
10.5.2 Mean Test for Two Independent Samples
10.6 Non-Parametric Tests
10.6.1 The Uniform Scores Test for Circular Data
10.6.2 The Watson Test for Spherical Data
10.6.3 Testing Two Paired Samples
Exercises
Appendix A - Short Survey on Probability Theory
A.1 Basic Notions
A.1.1 Events and Frequencies
A.1.2 Probability Axioms
A.2 Conditional Probability and Independence
A.2.1 Conditional Probability and Intersection Rule
A.2.2 Independent Events
A.3 Compound Experiments
A.4 Bayes’ Theorem
A.5 Random Variables and Distributions
A.5.1 Definition of Random Variable
A.5.2 Distribution and Density Functions
A.5.3 Transformation of a Random Variable
A.6 Expectation, Variance and Moments
A.6.1 Definitions and Properties
A.6.2 Moment-Generating Function
A.6.3 Chebyshev Theorem
A.7 The Binomial and Normal Distributions
A.7.1 The Binomial Distribution
A.7.2 The Laws of Large Numbers
A.7.3 The Normal Distribution
A.8 Multivariate Distributions
A.8.1 Definitions
A.8.2 Moments
A.8.3 Conditional Densities and Independence
A.8.4 Sums of Random Variables
A.8.5 Central Limit Theorem
Appendix B - Distributions
B.1 Discrete Distributions
B.1.1 Bernoulli Distribution
B.1.2 Uniform Distribution
B.1.3 Geometric Distribution
B.1.4 Hypergeometric Distribution
B.1.5 Binomial Distribution
B.1.6 Multinomial Distribution
B.1.7 Poisson Distribution
B.2 Continuous Distributions
B.2.1 Uniform Distribution
B.2.2 Normal Distribution
B.2.3 Exponential Distribution
B.2.4 Weibull Distribution
B.2.5 Gamma Distribution
B.2.6 Beta Distribution
B.2.7 Chi-Square Distribution
B.2.8 Student’s t Distribution
B.2.9 F Distribution
B.2.10 Von Mises Distributions
Appendix C - Point Estimation
C.1 Definitions
C.2 Estimation of Mean and Variance
Appendix D - Tables
D.1 Binomial Distribution
D.2 Normal Distribution
D.3 Student’s t Distribution
D.4 Chi-Square Distribution
D.5 Critical Values for the F Distribution
Appendix E - Datasets
E.1 Breast Tissue
E.2 Car Sale
E.3 Cells
E.4 Clays
E.5 Cork Stoppers
E.6 CTG
E.7 Culture
E.8 Fatigue
E.9 FHR
E.10 FHR-Apgar
E.11 Firms
E.12 Flow Rate
E.13 Foetal Weight
E.14 Forest Fires
E.15 Freshmen
E.16 Heart Valve
E.17 Infarct
E.18 Joints
E.19 Metal Firms
E.20 Meteo
E.21 Moulds
E.22 Neonatal
E.23 Programming
E.24 Rocks
E.25 Signal & Noise
E.26 Soil Pollution
E.27 Stars
E.28 Stock Exchange
E.29 VCG
E.30 Wave
E.31 Weather
E.32 Wines
Appendix F - Tools
F.1 MATLAB Functions
F.2 R Functions
F.3 Tools EXCEL File
F.4 SCSize Program
References
Index
Preface to the Second Edition
Four years have passed since the first edition of this book. During this time I have had the opportunity to apply it in classes, obtaining feedback from students and inspiration for improvements. I have also benefited from many comments by users of the book. For the present second edition large parts of the book have undergone major revision, although the basic concept (a concise but sufficiently rigorous mathematical treatment with emphasis on computer applications to real datasets) has been retained.
The second edition improvements are as follows:
• Inclusion of R as an application tool. As a matter of fact, R is a free software product which has nowadays reached a high level of maturity and is being increasingly used by many people as a statistical analysis tool.
• Chapter 3 has an added section on bootstrap estimation methods, which have gained great popularity in practical applications.
• A revised explanation and treatment of tree classifiers in Chapter 6 with the inclusion of the QUEST approach.
• Several improvements of Chapter 7 (regression), namely: details concerning the meaning and computation of multiple and partial correlation coefficients, with examples; a more thorough treatment and exemplification of the ridge regression topic; more attention dedicated to model evaluation.
• Inclusion in the book CD of additional MATLAB functions as well as a set of R functions.
• Extra examples and exercises have been added in several chapters.
• The bibliography has been revised and new references added. I have also tried to improve the quality and clarity of the text as well as notation.
Regarding notation, in this second edition I follow the more widespread practice of denoting random variables with italicised capital letters, instead of the small cursive font used in the first edition. Finally, I have also paid much attention to correcting errors, misprints and obscurities of the first edition.
J.P. Marques de Sá
Porto, 2007
Preface to the First Edition
This book is intended as a reference book for students, professionals and research workers who need to apply statistical analysis to a large variety of practical problems using STATISTICA, SPSS and MATLAB. The book chapters provide a comprehensive coverage of the main statistical analysis topics (data description, statistical inference, classification and regression, factor analysis, survival data, directional statistics) that one faces in practical problems, discussing their solutions with the mentioned software packages.
The only prerequisite for using the book is an undergraduate level of knowledge of mathematics. While it is expected that most readers of the book will already have some knowledge of elementary statistics, no previous course in probability or statistics is needed in order to study and use the book. The first two chapters introduce the basic notions of probability and statistics that are needed. In addition, the first two Appendices provide a short survey on Probability Theory and Distributions for the reader needing further clarification on the theoretical foundations of the statistical methods described.
The book is partly based on tutorial notes and materials used in data analysis disciplines taught at the Faculty of Engineering, Porto University. One of these disciplines is attended by students of a Master’s Degree course on information management. The students in this course have a variety of educational backgrounds and professional interests, which generated and brought about datasets and analysis objectives which are quite challenging concerning the methods to be applied and the interpretation of the results. The datasets used in the book examples and exercises were collected from these courses as well as from research. They are included in the book CD and cover a broad spectrum of areas: engineering, medicine, biology, psychology, economy, geology, and astronomy.
Every chapter explains the relevant notions and methods concisely, and is illustrated with practical examples using real data, presented with the distinct intention of clarifying sensible practical issues. The solutions presented in the examples are obtained with one of the software packages STATISTICA, SPSS or MATLAB; therefore, the reader has the opportunity to closely follow what is being done. The book is not intended as a substitute for the STATISTICA, SPSS and MATLAB user manuals. It does, however, provide the necessary guidance for applying the methods taught without having to delve into the manuals. This includes, for each topic explained in the book, a clear indication of which STATISTICA, SPSS or MATLAB tools to be applied. These indications appear in specific “Commands” frames together with a complementary description on how to use the tools, whenever necessary. In this way, a comparative perspective of the capabilities of those software packages is also provided, which can be quite useful for practical purposes.
STATISTICA, SPSS or MATLAB do not provide specific tools for some of the statistical topics described in the book. These range from such basic issues as the choice of the optimal number of histogram bins to more advanced topics such as directional statistics. The book CD provides these tools, including a set of MATLAB functions for directional statistics.
I am grateful to many people who helped me during the preparation of the book. Professor Luís Alexandre provided help in reviewing the book contents. Professor Willem van Meurs provided constructive comments on several topics. Professor Joaquim Góis contributed with many interesting discussions and suggestions, namely on the topic of data structure analysis. Dr. Carlos Felgueiras and Paulo Sousa gave valuable assistance in several software issues and in the development of some software tools included in the book CD. My gratitude also to Professor Pimenta Monteiro for his support in elucidating some software tricks during the preparation of the text files. A lot of people contributed with datasets. Their names are mentioned in Appendix E. I express my deepest thanks to all of them. Finally, I would also like to thank Alan Weed for his thorough revision of the texts and the clarification of many editing issues.
J.P. Marques de Sá
Porto, 2003
Symbols and Abbreviations
Sample Sets
A event
A set (of events)
{A1, A2,…} set constituted of events A1, A2,…
Ā complement of {A}
A ∪ B union of {A} with {B}
A ∩ B intersection of {A} with {B}
E set of all events (universe)
φ empty set
Functional Analysis
∃ there is
∀ for every
∈ belongs to
∉ does not belong to
≡ equivalent to
|| || Euclidian norm (vector length)
⇒ implies
→ converges to
ℜ real number set
ℜ+ [0, +∞[
[a, b] closed interval between and including a and b
]a, b] interval between a and b, excluding a
[a, b[ interval between a and b, excluding b
]a, b[ open interval between a and b (excluding a and b)
$\sum_{i=1}^{n}$ sum for index i = 1,…, n
$\prod_{i=1}^{n}$ product for index i = 1,…, n
$\int_a^b$ integral from a to b
k! factorial of k, k! = k(k−1)(k−2)…2·1
$\binom{n}{k}$ combinations of n elements taken k at a time
| x | absolute value of x
⌊x⌋ largest integer smaller than or equal to x
$g_X(a)$ function g of variable X evaluated at a
$\frac{dg}{dX}$ derivative of function g with respect to X
$\left.\frac{d^n g}{dX^n}\right|_a$ derivative of order n of g evaluated at a
ln(x) natural logarithm of x
log(x) logarithm of x in base 10
sgn(x) sign of x
mod(x,y) remainder of the integer division of x by y
Vectors and Matrices
x vector (column vector), multidimensional random vector
x' transpose vector (row vector)
[x1 x2…xn] row vector whose components are x1, x2,…,xn
xi i-th component of vector x
xk,i i-th component of vector xk
∆x vector x increment
x'y inner (dot) product of x and y
A matrix
aij i-th row, j-th column element of matrix A
A' transpose of matrix A
A⁻¹ inverse of matrix A
|A| determinant of matrix A
tr(A) trace of A (sum of the diagonal elements)
I unit matrix
λi eigenvalue i
Probabilities and Distributions
X random variable (with value denoted by the same lower case letter, x)
P(A) probability of event A
P(A|B) probability of event A conditioned on B having occurred
P(x) discrete probability of random vector x
P(ωi|x) discrete conditional probability of ωi given x
f(x) probability density function f evaluated at x
f(x |ωi) conditional probability density function f evaluated at x given ωi
X ~ f X has probability density function f
X ~ F X has probability distribution function (is distributed as) F
Pe probability of misclassification (error)
Pc probability of correct classification
df degrees of freedom
xdf,α α-percentile of X distributed with df degrees of freedom
bn,p binomial probability for n trials and probability p of success
Bn,p binomial distribution for n trials and probability p of success
u uniform probability or density function
U uniform distribution
gp geometric probability (Bernoulli trial with probability p)
Gp geometric distribution (Bernoulli trial with probability p)
hN,D,n hypergeometric probability (sample of n out of N with D items)
HN,D,n hypergeometric distribution (sample of n out of N with D items)
pλ Poisson probability with event rate λ
Pλ Poisson distribution with event rate λ
nµ,σ normal density with mean µ and standard deviation σ
Nµ,σ normal distribution with mean µ and standard deviation σ
ελ exponential density with spread factor λ
Ελ exponential distribution with spread factor λ
wα,β Weibull density with parameters α, β
Wα,β Weibull distribution with parameters α, β
γa,p Gamma density with parameters a, p
Γa,p Gamma distribution with parameters a, p
βp,q Beta density with parameters p, q
Βp,q Beta distribution with parameters p, q
$\chi^2_{df}$ Chi-square density with df degrees of freedom
$\mathrm{X}^2_{df}$ Chi-square distribution with df degrees of freedom
tdf Student’s t density with df degrees of freedom
Tdf Student’s t distribution with df degrees of freedom
$f_{df_1,df_2}$ F density with df1, df2 degrees of freedom
$F_{df_1,df_2}$ F distribution with df1, df2 degrees of freedom
Statistics
x̂ estimate of x
Ε[X] expected value (average, mean) of X
V[X] variance of X
Ε[x | y] expected value of x given y (conditional expectation)
mk central moment of order k
µ mean value
σ standard deviation
σXY covariance of X and Y
ρ correlation coefficient
µ mean vector
Σ covariance matrix
x̄ arithmetic mean
v sample variance
s sample standard deviation
xα α-quantile of X ($F_X(x_\alpha) = \alpha$)
med(X) median of X (same as x0.5)
S sample covariance matrix
α significance level (1−α is the confidence level)
xα α-percentile of X
ε tolerance
Abbreviations
FNR False Negative Ratio
FPR False Positive Ratio
iff if and only if
i.i.d. independent and identically distributed
IRQ inter-quartile range
pdf probability density function
LSE Least Square Error
ML Maximum Likelihood
MSE Mean Square Error
PDF probability distribution function
RMS Root Mean Square Error
r.v. Random variable
ROC Receiver Operating Characteristic
SSB Between-group Sum of Squares
SSE Error Sum of Squares
SSLF Lack of Fit Sum of Squares
SSPE Pure Error Sum of Squares
SSR Regression Sum of Squares
SST Total Sum of Squares
SSW Within-group Sum of Squares
TNR True Negative Ratio
TPR True Positive Ratio
VIF Variance Inflation Factor
Tradenames
EXCEL Microsoft Corporation
MATLAB The MathWorks, Inc.
SPSS SPSS, Inc.
STATISTICA StatSoft, Inc.
WINDOWS Microsoft Corporation